415fc4ea-354a-428a-a00a-d50ebb3ddcf9-map

Mining the American Community Survey Pt.1: Sorry I Can’t, I’ve Got Work in The Morning

Recently on Kaggle, the Census Bureau released an enormous data file of every response to the American Community Survey. As you may know, the ACS is one of the gold standards in sociological research. Dozens of questions on every facet of American life. ~630 mb and broken into two csv files, it has 1611965 observations of 284. It’s a big boy. There’s so much to be found in here, and I was eager to do something, so I skimmed the codebook and found something neat. I zeroed in on two variables: NAICSP and JWAP. The former is a code for the industry that the respondent works in. It’s fairly atomized, with 268 unique job codes on board. JWAP, on the other hand, is the respondant’s work start time, broken into 5 minute intervals throughout the day. I thought it’d be pretty interesting to take a look at the average start time of each of the industries included and see which ones are the earliest risers and the latest starters.

Up front, I’m expecting the early categories to be dominated by police and fire, agricultural workers, military, and maybe bankers? But we won’t know until we look!

I usually do my work in the knitr package before I upload it (which I can’t recommend enough) but one of the pluses and minueses of that is that you have to run your whole script through knitr from start to finish to generate a report. I tried it once, but because the initial dataset is so large my computer got really angry at me, so here’s what we’re gonna do. Pretend that I already did all this:

acs2 <- read.csv("~/Desktop/Data_Science/ACS/Data/ss14pusb.csv") 

acs <- read.csv("~./Desktop/Data_Science/ACS/Data/ss14pusa.csv”) 

acsComp <- data.frame(rbind(acs,acs2))

startTime <- data.frame(acsCompNAICSP,acsCompNAICSP,acsCompJWAP) 

startTime <- startTime[which(complete.cases(startTime) == TRUE),]

Basically just loading the two separate .csv files, binding them, cutting out only the start time and occupation variables, and keeping only complete cases. I did that in actuality, but to save myself some memory I’m going to save startTime as its own .csv, then reload it.

Later on, I make some text files out of information from the dataset’s codebook to help wrangle the data. Those are saved here and here, in case you want to try this at home once you’re done reading.

startTime <- read.csv("~/Desktop/Data_Science/ACS/Data/startTime.csv")
colnames(startTime) <- c("Job.ID","Time.ID")

After that, we copy paste the description from the codebook for JWAP, put it into a text file, delete page numbers,and give it a header (“ID”, “Times”, “x1”, “x2”, “x3”, “x4”). The last three are because for the sake of brevity and clarity, we’re going to round everyone’s start time up. By doing this, we can let read.delim do some of the work for us in terms of data cleaning. Here’s why. In the code book, you’ll note the observations are made like so:

001 .12:00 a.m. to 12:04 a.m.

If we just make the argument to our “sep =” command in read.delim a space, we break everything into 6 easy columns ([1,] = “001”, “.12:00”, “a.m.”, “to”, “11:59”, “p.m”) Paste 2 and 3 together, drop the last three, and baby, you’ve got a stew goin.

times <- read.delim("~/Desktop/Data_Science/ACS/Data/times.txt", sep=" ", header=TRUE)
times <- times[,-c(4:6)]
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
cleanTimes <- transmute(times, paste(Times, x1, sep= " "))
times <- data.frame(times$ID,cleanTimes)
colnames(times) <- c("Time.ID", "Times")
times$Times <- substring(times$Times,2)
knitr::kable(head(times), caption="Times Table")
Times Table
Time.ID Times
1 12:00 a.m.
2 12:05 a.m.
3 12:10 a.m.
4 12:15 a.m.
5 12:20 a.m.
6 12:25 a.m.

Clean as a whistle. Now we have to do the same for the names of all the job occupations. Pretty similar idea here, except the typical format for the data looks like this

111 .AGR-CROP PRODUCTION

In this case, each job has a unique ID, but in its description it’s lead by a three character code for the general industry the job falls under. We’ll probably find that useful later so, I want to preserve both. What I figured we’ll do here is first use the . as our sep to read in the data, and then I’ll use a regular expression to make a new column out of the industry codes. That required me to manually go through and remove all wayward periods in the codebook that weren’t where I wanted to break a column (“928110P2 .MIL-U.S. AIR FORCE”, for example), and to give the Unemployed an industry ID (“XXX”).

Jobs Table
Job.ID Job Industry
111 CROP PRODUCTION AGR
112 ANIMAL PRODUCTION AND AQUACULTURE AGR
1133 LOGGING AGR
113M FORESTRY EXCEPT LOGGING AGR
114 FISHING, HUNTING, AND TRAPPING AGR
115 SUPPORT ACTIVITIES FOR AGRICULTURE AND FORESTRY AGR

Beautiful. At this point I’m actually pretty amazed at how easy it’s been. Props to the Census for making this so easy to use. Back to the big dataset.

startTime <- left_join(startTime, times, by="Time.ID")
## Join wasn't working for jobs and startTime. After a little investigation we find that the levels for jobs$Job.ID have white space. Tricky tricky.
jobsLevels <- attr(jobs$Job.ID, which = "levels")
trim.trailing <- function (x) sub("\\s+$", "", x)
jobsLevels <- trim.trailing(jobsLevels)
attr(jobs$Job.ID, which = "levels") <- jobsLevels
startTime <- left_join(startTime,jobs, by="Job.ID")
## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector
knitr::kable(head(startTime), caption="Fully Joined Table")
Fully Joined Table
Job.ID Time.ID Times Job Industry
8111Z 73 6:15 a.m. AUTOMOTIVE REPAIR AND MAINTENANCE SRV
45211 94 8:00 a.m. DEPARTMENT AND DISCOUNT STORES RET
3261 95 8:05 a.m. PLASTICS PRODUCTS MFG
3219ZM 96 8:10 a.m. MISCELLANEOUS WOOD PRODUCTS MFG
3399M 44 3:50 a.m. SPORTING AND ATHLETIC GOODS, AND DOLL, TOY, AND GAME MANUFACTURING MFG
31M 65 5:35 a.m. KNITTING FABRIC MILLS, AND APPAREL KNITTING MILLS MFG

Nice. Now let’s run some numbers. We’ll need to get our times into POSIXlt so we can work with them.

startTime$Times <- strptime(startTime$Times, format = "%H:%M")
earliestRisers <- aggregate(startTime$Times, by=list(startTime$Job), mean)

colnames(earliestRisers) <- c("Job", "Start.Times")
earliestRisers <- arrange(earliestRisers, Start.Times)
topTen <- earliestRisers[1:10,]
gs.pal <- colorRampPalette(c("lightblue","black", "forestgreen", "steelblue", "forestgreen"),bias=.01,space="rgb")
legend_ord <- levels(with(topTen, reorder(Job,Start.Times)))
boxPlot <- startTime[which(startTime$Job %in% topTen$Job),] 
ggplot(aes(x=Job,y=Times, color=Job), data=boxPlot) 
+ geom_boxplot(ymin=1,ymax=1, outlier.colour = "red", outlier.shape = 1) 
+ coord_flip() 
+ theme(panel.background = element_rect(fill = 'grey95'), 
  plot.title = element_text(size=12,lineheight=.8, 
  vjust=1,family="Calibri Light"), 
  axis.title = element_text(size=10,lineheight=.8, 
               vjust=1,family="Calibri Light"), 
  panel.grid.major = element_line(size=.1, color="black"), 
  panel.grid.minor = element_line(size=0, color ="white"), 
  axis.ticks.y=element_blank(), axis.text.y=element_blank(), 
  axis.ticks.x=element_blank()) 
+ labs(x="", y="Average Start Time") 
+ ggtitle("Ten Earliest Rising Jobs in America") 

erlRise.png

Well, we weren’t too far off in our assumptions. Agriculture is heavily represented , as are the military and classic industrial work. Interesting find! We’ll be digging into this in more depth at a later time.

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s