Predicting Earnings from Census Data
The United States government, like the governments of many other countries, periodically collects demographic information by conducting a census. In this assignment, we use census information about an individual to predict how much that person earns, specifically whether the person earns more than $50,000 per year. The data come from the file [login to view URL], which contains 1994 census data for 31,978 individuals in the United States (you can download the file from the same folder as the R code used in class). The dataset includes the following 13 variables:
• age = the age of the individual in years
• workclass = the classification of the individual's working status (does the person work for the federal government, work for the local government, work without pay, and so on)
• education = the level of education of the individual (e.g., 5th-6th grade, high school graduate, PhD, and so on)
• maritalstatus = the marital status of the individual
• occupation = the type of work the individual does (e.g., administrative/clerical work, farming/fishing, sales, and so on)
• relationship = the relationship of the individual to his or her household
• race = the individual's race
• sex = the individual's sex
• capitalgain = the capital gains of the individual in 1994 (from selling an asset such as a stock or bond for more than the original purchase price)
• capitalloss = the capital losses of the individual in 1994 (from selling an asset such as a stock or bond for less than the original purchase price)
• hoursperweek = the number of hours the individual works per week
• nativecountry = the native country of the individual
• over50k = whether or not the individual earned more than $50,000 in 1994
Before building a random forest model, down-sample the training set. While some modern personal computers can build a random forest model on the entire training set, others might run out of memory, since random forests are much more computationally intensive than logistic regression. For this reason, before continuing, define a new training set for the random forest model that contains 2000 randomly selected observations from the original training set. Do this by running the following commands in your R console (assuming your training set is called "censusTrain"):
set.seed(1)
trainSmall = censusTrain[sample(nrow(censusTrain), 2000), ]
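The down-sampling step can be tried on a small scale first. The sketch below is only an illustration: the toy data frame, its column values, the class labels, and the sample size of 20 are hypothetical stand-ins, and the seed value follows the assignment's instruction to use a seed of 1.

```r
# Hypothetical toy stand-in for censusTrain (values and labels are made up).
censusTrain <- data.frame(
  age     = 18:67,
  over50k = factor(rep(c("<=50K", ">50K"), 25))
)

set.seed(1)                                  # make the random draw reproducible
n_small <- 20                                # the assignment uses 2000 here
rows <- sample(nrow(censusTrain), n_small)   # row indices, drawn without replacement
trainSmall <- censusTrain[rows, ]

nrow(trainSmall)        # 20
anyDuplicated(rows)     # 0, because no row is selected twice
```

Because sample() draws without replacement by default, each observation appears at most once in trainSmall.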
Now build a random forest model to predict "over50k", using "trainSmall" as the training data. Set the seed to 1 again right before building the model, and use all of the other variables in the dataset as independent variables. (If you get an error that random forest "cannot handle categorical predictors with more than 32 categories", rebuild the model without the nativecountry variable.)
Then, make predictions using this model on the entire test set. What is the accuracy of the model on the test set, using a threshold of 0.5?
(Remember that you don't need a "type" argument when making predictions with a random forest model if you want to use a threshold of 0.5).
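One possible shape for the random forest steps is sketched below, assuming the randomForest package is installed. The toy data generator, the class labels, and the names censusTest, censusRF, predRF, confRF and accRF are hypothetical, chosen only so the example runs on its own. With the real data, trainSmall and your own test set would take the place of the toy frames.

```r
library(randomForest)   # assumption: the randomForest package is installed

# Hypothetical toy data standing in for trainSmall and the test set.
make_toy <- function(n) {
  age   <- sample(18:70, n, replace = TRUE)
  hours <- sample(10:60, n, replace = TRUE)
  score <- 0.05 * age + 0.08 * hours + rnorm(n)
  data.frame(
    age          = age,
    hoursperweek = hours,
    over50k      = factor(ifelse(score > 6, ">50K", "<=50K"),
                          levels = c("<=50K", ">50K"))
  )
}
set.seed(1)
trainSmall <- make_toy(300)
censusTest <- make_toy(200)

set.seed(1)                                   # seed again right before fitting
censusRF <- randomForest(over50k ~ ., data = trainSmall)

# With a factor outcome, predict() returns class labels by default,
# so no "type" argument is needed here.
predRF <- predict(censusRF, newdata = censusTest)

# Confusion matrix: rows are actual classes, columns are predicted classes.
confRF <- table(actual = censusTest$over50k, predicted = predRF)
accRF  <- sum(diag(confRF)) / sum(confRF)     # share of correct predictions
confRF
accRF
```

Accuracy is the sum of the diagonal of the confusion matrix divided by the total number of test observations.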
Apply logistic regression, using the code that you've written previously, to the same training and test sets. Again use a threshold of p = 0.5. What is the accuracy of the model on the test set?
(You might see a warning message when you make predictions on the test set - you can safely ignore it.)
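For comparison, the logistic regression steps might look roughly like the following sketch, which uses only base R. As before, the toy data, the class labels, and the names censusTest, censusGLM, probGLM, predGLM, confGLM and accGLM are hypothetical. Your previously written code and the real training and test sets would replace them.

```r
# Hypothetical toy data standing in for the training and test sets.
make_toy <- function(n) {
  age   <- sample(18:70, n, replace = TRUE)
  hours <- sample(10:60, n, replace = TRUE)
  score <- 0.05 * age + 0.08 * hours + rnorm(n)
  data.frame(
    age          = age,
    hoursperweek = hours,
    over50k      = factor(ifelse(score > 6, ">50K", "<=50K"),
                          levels = c("<=50K", ">50K"))
  )
}
set.seed(1)
trainSmall <- make_toy(300)
censusTest <- make_toy(200)

# Logistic regression on all other variables.
censusGLM <- glm(over50k ~ ., data = trainSmall, family = binomial)

# type = "response" returns the probability of the second factor level (">50K").
probGLM <- predict(censusGLM, newdata = censusTest, type = "response")

# Apply the 0.5 threshold to turn probabilities into class labels.
predGLM <- factor(ifelse(probGLM > 0.5, ">50K", "<=50K"),
                  levels = levels(censusTest$over50k))

confGLM <- table(actual = censusTest$over50k, predicted = predGLM)
accGLM  <- sum(diag(confGLM)) / sum(confGLM)
confGLM
accGLM
```

Unlike the random forest case, predict() on a glm object needs type = "response" to return probabilities, and the threshold is applied by hand.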
Please submit a report that includes two confusion matrices (one for each method), two numbers (the test-set accuracy of the random forest and the test-set accuracy of the logistic regression), and your implementation of the random forest algorithm.