Stat Assignment

Predicting Earnings from Census Data

Like many other countries, the United States periodically collects demographic information by conducting a census. In this assignment, we are going to use census information about an individual to predict how much that person earns, in particular whether the person earns more than $50,000 per year. The data comes from the file [login to view URL], which contains 1994 census data for 31,978 individuals in the United States (you can download the file from the same folder as the R code used in class). The dataset includes the following 13 variables:

• age = the age of the individual in years

• workclass = the classification of the individual's working status (does the person work for the federal government, work for the local government, work without pay, and so on)

• education = the level of education of the individual (e.g., 5th-6th grade, high school graduate, PhD)

• maritalstatus = the marital status of the individual

• occupation = the type of work the individual does (e.g., administrative/clerical work, farming/fishing, sales)

• relationship = the relationship of the individual to his or her household

• race = the individual's race

• sex = the individual's sex

• capitalgain = the capital gains of the individual in 1994 (from selling an asset such as a stock or bond for more than the original purchase price)

• capitalloss = the capital losses of the individual in 1994 (from selling an asset such as a stock or bond for less than the original purchase price)

• hoursperweek = the number of hours the individual works per week

• nativecountry = the native country of the individual

• over50k = whether or not the individual earned more than $50,000 in 1994

Part 1

Before building a random forest model, down-sample the training set. While some modern personal computers can build a random forest model on the entire training set, others might run out of memory during training, since random forests are much more computationally intensive than logistic regression. For this reason, before continuing, define a new training set for the random forest model that contains 2,000 randomly selected observations from the original training set. Do this by running the following commands in your R console (assuming your training set is called "censusTrain"):

[login to view URL](1)

trainSmall = censusTrain[sample(nrow(censusTrain), 2000), ]
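The down-sampling step can be tried out on made-up data before touching the real file. The sketch below is only an illustration: the first command in the posting is elided, so the `set.seed(1)` call here is an assumption based on the later instruction to "set the seed to 1 again", and the small synthetic `censusTrain` is a stand-in with invented values.

```r
# Synthetic stand-in for censusTrain (assumption: the real census file
# is not loaded here; the values are invented for illustration).
set.seed(1)
censusTrain <- data.frame(
  age = sample(17:90, 5000, replace = TRUE),
  hoursperweek = sample(1:99, 5000, replace = TRUE)
)

# Assumed seed call, so the same 2000 rows are drawn on every run.
set.seed(1)

# sample() draws 2000 distinct row indices (no replacement by default).
trainSmall <- censusTrain[sample(nrow(censusTrain), 2000), ]

dim(trainSmall)
```

Because `sample()` draws without replacement by default, no observation appears twice in `trainSmall`.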

Now build a random forest model to predict "over50k", using the dataset "trainSmall" as the data used to build the model. Set the seed to 1 again right before building the model, and use all of the other variables in the dataset as independent variables. (If you get an error that random forest "cannot handle categorical predictors with more than 32 categories", re-build the model without the nativecountry variable as one of the independent variables.)
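A minimal sketch of this model-building step, assuming the `randomForest` package is installed. The data frame below is a small synthetic stand-in for `trainSmall` (column names follow the assignment, values are invented), and the object names `censusForest`, `noCountry`, and `censusForestNoCountry` are hypothetical.

```r
library(randomForest)  # assumes the randomForest package is installed

# Synthetic stand-in for trainSmall; values are invented for illustration.
set.seed(1)
n <- 300
trainSmall <- data.frame(
  age = sample(17:90, n, replace = TRUE),
  hoursperweek = sample(10:80, n, replace = TRUE),
  sex = sample(c("Female", "Male"), n, replace = TRUE),
  nativecountry = sample(c("United-States", "Mexico", "India"), n, replace = TRUE)
)
noise <- rnorm(n, sd = 15)
trainSmall$over50k <- factor(
  ifelse(trainSmall$age + trainSmall$hoursperweek + noise > 95, ">50K", "<=50K")
)

# randomForest fits a classifier only when the outcome is a factor, and
# character predictors typically need converting to factors first.
for (col in names(trainSmall)[vapply(trainSmall, is.character, logical(1))]) {
  trainSmall[[col]] <- factor(trainSmall[[col]])
}

# Seed set right before fitting, then use every other column as a predictor.
set.seed(1)
censusForest <- randomForest(over50k ~ ., data = trainSmall)

# Fallback for the "more than 32 categories" error: drop nativecountry, refit.
noCountry <- trainSmall[, setdiff(names(trainSmall), "nativecountry")]
set.seed(1)
censusForestNoCountry <- randomForest(over50k ~ ., data = noCountry)
```

Dropping the column from the data frame, rather than editing the formula, keeps the `over50k ~ .` formula identical in both fits.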

Then, make predictions using this model on the entire test set. What is the accuracy of the model on the test set, using a threshold of 0.5?

(Remember that you don't need a "type" argument when making predictions with a random forest model if you want to use a threshold of 0.5.)
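One way to turn predicted labels into a confusion matrix and an accuracy figure is sketched below. The helper `confusion_accuracy` is a hypothetical name, not part of any package, and the label vectors are toy data. With a fitted forest, the predicted labels would come from a call such as the one shown in the comment, where `censusForest` and `censusTest` are assumed object names.

```r
# Hypothetical helper: cross-tabulate actual vs predicted labels and
# compute the share of observations that were classified correctly.
confusion_accuracy <- function(predicted, actual) {
  cm <- table(actual = actual, predicted = predicted)
  acc <- mean(as.character(predicted) == as.character(actual))
  list(matrix = cm, accuracy = acc)
}

# With a real model (assumed names), labels would come from:
#   predicted <- predict(censusForest, newdata = censusTest)
# Toy labels stand in for that output here.
actual    <- c("no", "no",  "yes", "yes", "no")
predicted <- c("no", "yes", "yes", "yes", "no")

res <- confusion_accuracy(predicted, actual)
res$matrix
res$accuracy  # 0.8, since 4 of the 5 toy labels match
```

Computing accuracy as the share of matching labels avoids relying on the diagonal of the table, which would be misleading if one class were never predicted and the table were not square.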

Part 2

Apply logistic regression, using the code that you've written previously, to the same training and testing datasets. Again use a threshold of p = 0.5. What is the accuracy of the model on the test set?

(You might see a warning message when you make predictions on the test set - you can safely ignore it.)
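For comparison with the random forest, the logistic regression step could look roughly like the sketch below. It uses base R's `glm` on synthetic stand-in data (invented values and only two predictors), and the names `make_data`, `censusGlm`, `probs`, and `predicted` are hypothetical. Unlike the random forest case, `predict` here needs `type = "response"` to return probabilities, which are then cut at 0.5.

```r
# Synthetic stand-ins for the training and test sets (invented values).
make_data <- function(n) {
  d <- data.frame(
    age = sample(17:90, n, replace = TRUE),
    hoursperweek = sample(10:80, n, replace = TRUE)
  )
  noise <- rnorm(n, sd = 15)
  d$over50k <- factor(
    ifelse(d$age + d$hoursperweek + noise > 95, ">50K", "<=50K")
  )
  d
}
set.seed(1)
censusTrain <- make_data(400)
censusTest  <- make_data(200)

# Logistic regression with all other columns as independent variables.
censusGlm <- glm(over50k ~ ., data = censusTrain, family = binomial)

# type = "response" gives the predicted probability of the second
# factor level of over50k (glm treats the first level as "failure").
probs <- predict(censusGlm, newdata = censusTest, type = "response")

# Apply the 0.5 threshold to turn probabilities into class labels.
lv <- levels(censusTrain$over50k)
predicted <- ifelse(probs > 0.5, lv[2], lv[1])

cm <- table(actual = censusTest$over50k, predicted = predicted)
accuracy <- mean(predicted == as.character(censusTest$over50k))
cm
accuracy
```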

Please submit a report that includes two confusion matrices (one for each method), two numbers (the test-set accuracy of the random forest and the test-set accuracy of the logistic regression), and your implementation of the random forest algorithm.

Skills: R Programming Language, Statistics, Statistical Analysis, Mathematics, Data Processing


About the Employer:
( 0 reviews ) Pune, India

Project ID: #24494739
