# Mining the American Community Survey Pt.1: Sorry I Can’t, I’ve Got Work in The Morning

Recently on Kaggle, the Census Bureau released an enormous data file of every response to the American Community Survey. As you may know, the ACS is one of the gold standards in sociological research: dozens of questions on every facet of American life. At ~630 MB, broken into two csv files, with 1,611,965 observations of 284 variables, it's a big boy. There's so much to be found in here, and I was eager to do something, so I skimmed the codebook and found something neat. I zeroed in on two variables: NAICSP and JWAP. The former is a code for the industry the respondent works in. It's fairly atomized, with 268 unique job codes on board. JWAP, on the other hand, is the respondent's work start time, broken into 5-minute intervals throughout the day. I thought it'd be pretty interesting to take a look at the average start time in each industry and see which ones are the earliest risers and the latest starters.

Up front, I’m expecting the early categories to be dominated by police and fire, agricultural workers, military, and maybe bankers? But we won’t know until we look!

I usually do my work in R Markdown with the knitr package before I upload it (which I can't recommend enough), but one of the pluses and minuses of that is that you have to run your whole script through knitr from start to finish to generate a report. I tried it once, but because the initial dataset is so large my computer got really angry at me. So here's what we're gonna do: pretend that I already did all this:

```r
acs  <- read.csv("~/Desktop/Data_Science/ACS/Data/ss14pusa.csv")
acs2 <- read.csv("~/Desktop/Data_Science/ACS/Data/ss14pusb.csv")

acsComp <- data.frame(rbind(acs, acs2))

startTime <- data.frame(acsComp$NAICSP, acsComp$JWAP)

startTime <- startTime[complete.cases(startTime), ]
```

Basically: load the two separate .csv files, bind them, cut the data down to just the start-time and occupation variables, and keep only complete cases. I did all that in actuality, but to save myself some memory I'm going to save startTime as its own .csv, then reload it.
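The save step itself isn't shown; a minimal sketch of what that looks like (the path just mirrors the one used below):

```r
# Write the pared-down table to disk so the huge raw frames can be dropped.
write.csv(startTime, "~/Desktop/Data_Science/ACS/Data/startTime.csv", row.names = FALSE)
rm(acs, acs2, acsComp)  # free the memory the full dataset was holding
```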

Later on, I make some text files out of information from the dataset’s codebook to help wrangle the data. Those are saved here and here, in case you want to try this at home once you’re done reading.

```r
startTime <- read.csv("~/Desktop/Data_Science/ACS/Data/startTime.csv")
colnames(startTime) <- c("Job.ID", "Time.ID")
```

After that, we copy-paste the description of JWAP from the codebook into a text file, delete the page numbers, and give it a header ("ID", "Times", "x1", "x2", "x3", "x4"). The x columns exist because, for the sake of brevity and clarity, we're going to round everyone's start time down to the beginning of its interval. By doing this, we can let read.delim do some of the data-cleaning work for us. Here's why. In the codebook, you'll note the observations are formatted like so:

001 .12:00 a.m. to 12:04 a.m.

If we just make the argument to `sep =` in read.delim a space, each row breaks into 6 easy columns ([1,] = "001", ".12:00", "a.m.", "to", "12:04", "a.m."). Paste columns 2 and 3 together, drop the last three, and baby, you've got a stew goin'.
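You can sanity-check the splitting idea on a single codebook line before committing to it:

```r
# One sample line from the JWAP codebook, split on spaces.
line  <- "001 .12:00 a.m. to 12:04 a.m."
parts <- strsplit(line, " ")[[1]]
length(parts)            # six pieces, as promised
paste(parts[2], parts[3])  # ".12:00 a.m." -- the interval's start time
```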

```r
times <- read.delim("~/Desktop/Data_Science/ACS/Data/times.txt", sep = " ", header = TRUE)
times <- times[, -c(4:6)]

library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

cleanTimes <- transmute(times, paste(Times, x1, sep = " "))
```
```r
times <- data.frame(times$ID, cleanTimes)
colnames(times) <- c("Time.ID", "Times")
times$Times <- substring(times$Times, 2)  # drop the leading "."
knitr::kable(head(times), caption = "Times Table")
```

| Time.ID | Times      |
|--------:|------------|
| 1       | 12:00 a.m. |
| 2       | 12:05 a.m. |
| 3       | 12:10 a.m. |
| 4       | 12:15 a.m. |
| 5       | 12:20 a.m. |
| 6       | 12:25 a.m. |

Clean as a whistle. Now we have to do the same for the names of all the job occupations. Pretty similar idea here, except the typical format for the data looks like this:

111 .AGR-CROP PRODUCTION

In this case, each job has a unique ID, but its description is led by a three-character code for the general industry the job falls under. We'll probably find that useful later, so I want to preserve both. What I figured we'd do here is first use the period as our sep to read in the data, and then use a regular expression to make a new column out of the industry codes. That required me to manually go through and remove all wayward periods in the codebook that weren't where I wanted to break a column ("928110P2 .MIL-U.S. AIR FORCE", for example), and to give the Unemployed an industry ID ("XXX").

| Job.ID | Job                                             | Industry |
|--------|-------------------------------------------------|----------|
| 111    | CROP PRODUCTION                                 | AGR      |
| 112    | ANIMAL PRODUCTION AND AQUACULTURE               | AGR      |
| 1133   | LOGGING                                         | AGR      |
| 113M   | FORESTRY EXCEPT LOGGING                         | AGR      |
| 114    | FISHING, HUNTING, AND TRAPPING                  | AGR      |
| 115    | SUPPORT ACTIVITIES FOR AGRICULTURE AND FORESTRY | AGR      |

Beautiful. At this point I'm actually pretty amazed at how easy it's been. Props to the Census for making this so easy to use. Back to the big dataset.

```r
startTime <- left_join(startTime, times, by = "Time.ID")
```

The join wasn't working for jobs and startTime, though. After a little investigation, we find that the levels of jobs$Job.ID have trailing white space. Tricky tricky.
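The code that actually builds the `jobs` table didn't make it into the post; here's a sketch of the approach described above (the file name `jobs.txt` and the `Description` column name are my own stand-ins):

```r
# Read the occupation codebook, splitting on the "." before each description.
jobs <- read.delim("~/Desktop/Data_Science/ACS/Data/jobs.txt", sep = ".", header = TRUE)
colnames(jobs) <- c("Job.ID", "Description")

# The first three characters of each description are the industry code:
# "AGR-CROP PRODUCTION" -> industry "AGR", job "CROP PRODUCTION".
jobs$Industry <- substr(jobs$Description, 1, 3)
jobs$Job      <- sub("^...-", "", jobs$Description)
jobs <- jobs[, c("Job.ID", "Job", "Industry")]
```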
```r
jobsLevels <- attr(jobs$Job.ID, which = "levels")
trim.trailing <- function(x) sub("\\s+$", "", x)  # base R's trimws() would also do the job
jobsLevels <- trim.trailing(jobsLevels)
```
```r
attr(jobs$Job.ID, which = "levels") <- jobsLevels
startTime <- left_join(startTime, jobs, by = "Job.ID")
## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector
knitr::kable(head(startTime), caption = "Fully Joined Table")
```

| Job.ID | Time.ID | Times     | Job                                                                | Industry |
|--------|--------:|-----------|--------------------------------------------------------------------|----------|
| 8111Z  | 73      | 6:15 a.m. | AUTOMOTIVE REPAIR AND MAINTENANCE                                  | SRV      |
| 45211  | 94      | 8:00 a.m. | DEPARTMENT AND DISCOUNT STORES                                     | RET      |
| 3261   | 95      | 8:05 a.m. | PLASTICS PRODUCTS                                                  | MFG      |
| 3219ZM | 96      | 8:10 a.m. | MISCELLANEOUS WOOD PRODUCTS                                        | MFG      |
| 3399M  | 44      | 3:50 a.m. | SPORTING AND ATHLETIC GOODS, AND DOLL, TOY, AND GAME MANUFACTURING | MFG      |
| 31M    | 65      | 5:35 a.m. | KNITTING FABRIC MILLS, AND APPAREL KNITTING MILLS                  | MFG      |

Nice. Now let's run some numbers. We'll need to get our times into POSIXlt so we can work with them.

```r
startTime$Times <- strptime(startTime$Times, format = "%H:%M")

earliestRisers <- aggregate(startTime$Times, by = list(startTime$Job), mean)
colnames(earliestRisers) <- c("Job", "Start.Times")
earliestRisers <- arrange(earliestRisers, Start.Times)
topTen <- earliestRisers[1:10, ]

gs.pal <- colorRampPalette(c("lightblue", "black", "forestgreen", "steelblue", "forestgreen"),
                           bias = .01, space = "rgb")
legend_ord <- levels(with(topTen, reorder(Job, Start.Times)))
boxPlot <- startTime[which(startTime$Job %in% topTen$Job), ]
```
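One caveat worth flagging: `%H:%M` silently ignores the "a.m."/"p.m." suffix, so afternoon times parse as their morning twins. A sketch of a stricter parse (the helper name is mine, and it assumes a locale where `%p` matches "AM"/"PM"):

```r
# Normalize "8:05 p.m." to a form %I:%M %p can parse, then convert.
parseClock <- function(x) {
  x <- toupper(gsub("\\.", "", x))  # "8:05 p.m." -> "8:05 PM"
  strptime(x, format = "%I:%M %p")
}
```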
```r
library(ggplot2)

ggplot(aes(x = Job, y = Times, color = Job), data = boxPlot) +
  geom_boxplot(ymin = 1, ymax = 1, outlier.colour = "red", outlier.shape = 1) +
  coord_flip() +
  theme(panel.background = element_rect(fill = "grey95"),
        plot.title = element_text(size = 12, lineheight = .8,
                                  vjust = 1, family = "Calibri Light"),
        axis.title = element_text(size = 10, lineheight = .8,
                                  vjust = 1, family = "Calibri Light"),
        panel.grid.major = element_line(size = .1, color = "black"),
        panel.grid.minor = element_line(size = 0, color = "white"),
        axis.ticks.y = element_blank(), axis.text.y = element_blank(),
        axis.ticks.x = element_blank()) +
  labs(x = "", y = "Average Start Time") +
  ggtitle("Ten Earliest Rising Jobs in America")
```


Well, we weren’t too far off in our assumptions. Agriculture is heavily represented, as are the military and classic industrial work. Interesting find! We’ll be digging into this in more depth at a later time.