Recently on Kaggle, the Census Bureau released an enormous data file of every response to the American Community Survey. As you may know, the ACS is one of the gold standards in sociological research. Dozens of questions on every facet of American life. ~630 mb and broken into two csv files, it has 1611965 observations of 284. It’s a big boy. There’s so much to be found in here, and I was eager to do something, so I skimmed the codebook and found something neat. I zeroed in on two variables: NAICSP and JWAP. The former is a code for the industry that the respondent works in. It’s fairly atomized, with 268 unique job codes on board. JWAP, on the other hand, is the respondant’s work start time, broken into 5 minute intervals throughout the day. I thought it’d be pretty interesting to take a look at the average start time of each of the industries included and see which ones are the earliest risers and the latest starters.
Up front, I’m expecting the early categories to be dominated by police and fire, agricultural workers, military, and maybe bankers? But we won’t know until we look!
I usually do my work in the knitr package before I upload it (which I can’t recommend enough) but one of the pluses and minueses of that is that you have to run your whole script through knitr from start to finish to generate a report. I tried it once, but because the initial dataset is so large my computer got really angry at me, so here’s what we’re gonna do. Pretend that I already did all this:
acs2 <- read.csv("~/Desktop/Data_Science/ACS/Data/ss14pusb.csv") acs <- read.csv("~./Desktop/Data_Science/ACS/Data/ss14pusa.csv”) acsComp <- data.frame(rbind(acs,acs2)) startTime <- data.frame(acsCompJWAP) startTime <- startTime[which(complete.cases(startTime) == TRUE),]
Basically just loading the two separate .csv files, binding them, cutting out only the start time and occupation variables, and keeping only complete cases. I did that in actuality, but to save myself some memory I’m going to save startTime as its own .csv, then reload it.
startTime <- read.csv("~/Desktop/Data_Science/ACS/Data/startTime.csv") colnames(startTime) <- c("Job.ID","Time.ID")
After that, we copy paste the description from the codebook for JWAP, put it into a text file, delete page numbers,and give it a header (“ID”, “Times”, “x1”, “x2”, “x3”, “x4”). The last three are because for the sake of brevity and clarity, we’re going to round everyone’s start time up. By doing this, we can let read.delim do some of the work for us in terms of data cleaning. Here’s why. In the code book, you’ll note the observations are made like so:
001 .12:00 a.m. to 12:04 a.m.
If we just make the argument to our “sep =” command in read.delim a space, we break everything into 6 easy columns ([1,] = “001”, “.12:00”, “a.m.”, “to”, “11:59”, “p.m”) Paste 2 and 3 together, drop the last three, and baby, you’ve got a stew goin.
times <- read.delim("~/Desktop/Data_Science/ACS/Data/times.txt", sep=" ", header=TRUE) times <- times[,-c(4:6)] library(dplyr)
## ## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats': ## ## filter, lag
## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union
cleanTimes <- transmute(times, paste(Times, x1, sep= " ")) times <- data.frame(times$ID,cleanTimes) colnames(times) <- c("Time.ID", "Times") times$Times <- substring(times$Times,2) knitr::kable(head(times), caption="Times Table")
Clean as a whistle. Now we have to do the same for the names of all the job occupations. Pretty similar idea here, except the typical format for the data looks like this
111 .AGR-CROP PRODUCTION
In this case, each job has a unique ID, but in its description it’s lead by a three character code for the general industry the job falls under. We’ll probably find that useful later so, I want to preserve both. What I figured we’ll do here is first use the . as our sep to read in the data, and then I’ll use a regular expression to make a new column out of the industry codes. That required me to manually go through and remove all wayward periods in the codebook that weren’t where I wanted to break a column (“928110P2 .MIL-U.S. AIR FORCE”, for example), and to give the Unemployed an industry ID (“XXX”).
|112||ANIMAL PRODUCTION AND AQUACULTURE||AGR|
|113M||FORESTRY EXCEPT LOGGING||AGR|
|114||FISHING, HUNTING, AND TRAPPING||AGR|
|115||SUPPORT ACTIVITIES FOR AGRICULTURE AND FORESTRY||AGR|
Beautiful. At this point I’m actually pretty amazed at how easy it’s been. Props to the Census for making this so easy to use. Back to the big dataset.
startTime <- left_join(startTime, times, by="Time.ID") ## Join wasn't working for jobs and startTime. After a little investigation we find that the levels for jobs$Job.ID have white space. Tricky tricky. jobsLevels <- attr(jobs$Job.ID, which = "levels") trim.trailing <- function (x) sub("\\s+$", "", x) jobsLevels <- trim.trailing(jobsLevels) attr(jobs$Job.ID, which = "levels") <- jobsLevels startTime <- left_join(startTime,jobs, by="Job.ID")
## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining ## factors with different levels, coercing to character vector
knitr::kable(head(startTime), caption="Fully Joined Table")
|8111Z||73||6:15 a.m.||AUTOMOTIVE REPAIR AND MAINTENANCE||SRV|
|45211||94||8:00 a.m.||DEPARTMENT AND DISCOUNT STORES||RET|
|3261||95||8:05 a.m.||PLASTICS PRODUCTS||MFG|
|3219ZM||96||8:10 a.m.||MISCELLANEOUS WOOD PRODUCTS||MFG|
|3399M||44||3:50 a.m.||SPORTING AND ATHLETIC GOODS, AND DOLL, TOY, AND GAME MANUFACTURING||MFG|
|31M||65||5:35 a.m.||KNITTING FABRIC MILLS, AND APPAREL KNITTING MILLS||MFG|
Nice. Now let’s run some numbers. We’ll need to get our times into POSIXlt so we can work with them.
startTime$Times <- strptime(startTime$Times, format = "%H:%M") earliestRisers <- aggregate(startTime$Times, by=list(startTime$Job), mean) colnames(earliestRisers) <- c("Job", "Start.Times") earliestRisers <- arrange(earliestRisers, Start.Times) topTen <- earliestRisers[1:10,] gs.pal <- colorRampPalette(c("lightblue","black", "forestgreen", "steelblue", "forestgreen"),bias=.01,space="rgb") legend_ord <- levels(with(topTen, reorder(Job,Start.Times))) boxPlot <- startTime[which(startTime$Job %in% topTen$Job),]
ggplot(aes(x=Job,y=Times, color=Job), data=boxPlot) + geom_boxplot(ymin=1,ymax=1, outlier.colour = "red", outlier.shape = 1) + coord_flip() + theme(panel.background = element_rect(fill = 'grey95'), plot.title = element_text(size=12,lineheight=.8, vjust=1,family="Calibri Light"), axis.title = element_text(size=10,lineheight=.8, vjust=1,family="Calibri Light"), panel.grid.major = element_line(size=.1, color="black"), panel.grid.minor = element_line(size=0, color ="white"), axis.ticks.y=element_blank(), axis.text.y=element_blank(), axis.ticks.x=element_blank()) + labs(x="", y="Average Start Time") + ggtitle("Ten Earliest Rising Jobs in America")
Well, we weren’t too far off in our assumptions. Agriculture is heavily represented , as are the military and classic industrial work. Interesting find! We’ll be digging into this in more depth at a later time.