Assignment 5

You may discuss homework problems with other students, but you have to prepare the written assignments yourself.

Please combine all your answers, the computer code and the figures into one file, and submit a copy in your dropbox on Gradescope.

Due date: 11:59 PM, June 5, 2024.

Grading

The goal here is to have you make an effort to do a complete data analysis with what you've learned. Students who put in a good effort will receive 100%. Some effort will receive 70%.

Building PDF

If you have not installed LaTeX on your computer, run the commands below (once is enough). After that, either the Quarto or the RMarkdown format should be sufficient to build directly to PDF.

install.packages('tinytex', repos='http://cloud.r-project.org')
tinytex::install_tinytex()

Download

RStudio: RMarkdown, Quarto
Jupyter

Project

The data for your final project is based on real estate sales in Ames, Iowa in the years 2006-2010. A description of the dataset can be found here.

I have created a subsample of 2000 cases for you to use after fixing some missing values.

Split your data

To begin, randomly split the data into two sets of equal size: 1000 cases for selecting a model, and a final 1000 for validation and for reporting confidence intervals for the final effects. For reproducibility of your results, pick and store an integer seed to use for the split and for any subsequent randomization, such as cross-validation. For simplicity, I'll choose the seed 1 here. You need not use the same seed; choose one and make this line the first line of your analysis.

set.seed(1)
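One way the split could be done is sketched below. The data frame `ames` here is a simulated stand-in (its columns and values are made up, since the file name of the subsample is not given above); replace it with the real 2000-row subsample. The names `selection` and `validation` are also just illustrative choices.

```r
set.seed(1)  # the seed chosen above; any stored integer works

# Hypothetical stand-in for the 2000-case subsample (values are simulated).
ames <- data.frame(
  id        = 1:2000,
  SalePrice = exp(rnorm(2000, mean = 12, sd = 0.4))
)

# Draw half of the row indices at random for model selection;
# every remaining row is held out for validation.
n <- nrow(ames)
selection_idx <- sample(n, size = n / 2)
selection  <- ames[selection_idx, ]
validation <- ames[-selection_idx, ]
```

Because the seed is set before `sample()` is called, rerunning the script reproduces the same two halves.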

Project description

Your task is to build a model to predict log(SalePrice) based on the remaining variables. In addition, we want estimated effects for:

Total square feet: Gr.Live.Area + Total.Bsmt.SF
Number of bedrooms: Bedroom
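The total square footage is not a column on its own, so it has to be built from the two columns named above. A minimal sketch, using three made-up rows in place of the real data; the new column name `Total.SF` is an assumption chosen for illustration, and `Bedroom` is the column name as written above (check it against the names in your copy of the data):

```r
# Toy rows standing in for the Ames subsample (all values are made up).
ames <- data.frame(
  Gr.Live.Area  = c(1500, 1200, 2000),
  Total.Bsmt.SF = c(800, 0, 1000),
  Bedroom       = c(3, 2, 4)
)

# Total square feet as defined above: above-ground living area plus basement area.
ames$Total.SF <- ames$Gr.Live.Area + ames$Total.Bsmt.SF
```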

Beware: the data set is large enough that simple stepwise model-building procedures may be very slow.

The project should have the following parts:

The study: In this section, you should give a description of the study underlying your dataset.

An initial model: In this section, you should develop a first-pass model for the data that will allow you to answer some of the specific goals of the study. Fit the model on the 1000 points reserved for selection. Record the performance of your model on the remaining 1000 points for validation by computing MSE=sum(resid(initial.model)^2), with the residuals evaluated on the validation points.
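As a rough sketch of this step: `resid()` applied to a fitted `lm` returns the residuals for the rows used in fitting, so residuals on the held-out rows can be formed from `predict()` with `newdata`. The data below are simulated, and the column `Total.SF` and the two-predictor formula are assumptions for illustration only, not a recommended initial model.

```r
set.seed(1)

# Simulated stand-in for the 2000-case subsample (not the real Ames data).
n <- 2000
ames <- data.frame(
  Total.SF = runif(n, 800, 4000),
  Bedroom  = sample(1:5, n, replace = TRUE)
)
ames$SalePrice <- exp(11 + 0.0004 * ames$Total.SF + 0.02 * ames$Bedroom +
                      rnorm(n, sd = 0.2))

idx        <- sample(n, n / 2)
selection  <- ames[idx, ]
validation <- ames[-idx, ]

# Fit on the selection half only.
initial.model <- lm(log(SalePrice) ~ Total.SF + Bedroom, data = selection)

# Residuals on the validation half: observed log price minus predicted log price.
validation_resid <- log(validation$SalePrice) -
  predict(initial.model, newdata = validation)

SSE_validation <- sum(validation_resid^2)   # the sum of squares written above
MSE_validation <- mean(validation_resid^2)  # the same quantity per validation point
```

The sum and the mean differ only by the factor 1000, so either one ranks two models the same way as long as it is used consistently.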

Selected model: In this section, you can look at partial residual or added variable plots to search for non-linear effects. You can also consider any potential interactions. Use some of the diagnostic tools seen in class to assess the quality of fit of your selected model. Record the performance of your model on the remaining 1000 points for validation by computing MSE=sum(resid(selected.model)^2), with the residuals evaluated on the validation points. Is it much better than your initial model?
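A partial residual plot can be drawn with base R alone, as sketched below. The data are simulated with a deliberately curved effect of the hypothetical column `Total.SF`, so the example shows what a non-linear pattern looks like; the log transform used for `selected.model` is only one possible response to such a pattern, not the expected answer for the real data.

```r
set.seed(1)

# Simulated selection half with a curved (logarithmic) effect of Total.SF.
n <- 1000
selection <- data.frame(
  Total.SF = runif(n, 800, 4000),
  Bedroom  = sample(1:5, n, replace = TRUE)
)
selection$SalePrice <- exp(6 + 0.8 * log(selection$Total.SF) +
                           0.02 * selection$Bedroom + rnorm(n, sd = 0.15))

initial.model <- lm(log(SalePrice) ~ Total.SF + Bedroom, data = selection)

# Partial residuals: one column per model term (residual + fitted term).
pr <- residuals(initial.model, type = "partial")

# Plot them against the predictor; a bend in the smooth suggests non-linearity.
plot(selection$Total.SF, pr[, "Total.SF"],
     xlab = "Total.SF", ylab = "Partial residual")
lines(lowess(selection$Total.SF, pr[, "Total.SF"]), col = "red")

# One candidate refinement, compared on the same response by AIC.
selected.model <- lm(log(SalePrice) ~ log(Total.SF) + Bedroom, data = selection)
AIC(initial.model, selected.model)

# Standard diagnostics: residuals vs fitted values and a normal Q-Q plot.
plot(selected.model, which = 1:2)
```

`termplot(initial.model, partial.resid = TRUE)` produces similar plots for every term in one call.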

Effects of interest: Report an estimate of the effect of square footage and the number of bedrooms in your selected model. Use the 1000 points held for validation to report this effect. Include some measure of variability of the estimate, such as a confidence interval. Would you have gotten a very different answer if you had used your initial model instead of the selected model to estimate the effects?
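One possible reading of this step is to keep the formula chosen on the selection half, refit it on the validation half, and report coefficients with confidence intervals from that refit. The sketch below does this on simulated data; `Total.SF`, the formula, and the 95% level are assumptions for illustration.

```r
set.seed(1)

# Simulated stand-in for the 2000-case subsample (not the real Ames data).
n <- 2000
ames <- data.frame(
  Total.SF = runif(n, 800, 4000),
  Bedroom  = sample(1:5, n, replace = TRUE)
)
ames$SalePrice <- exp(11 + 0.0004 * ames$Total.SF + 0.02 * ames$Bedroom +
                      rnorm(n, sd = 0.2))

idx        <- sample(n, n / 2)
selection  <- ames[idx, ]
validation <- ames[-idx, ]

# Model chosen using the selection half (placeholder formula).
selected.model <- lm(log(SalePrice) ~ Total.SF + Bedroom, data = selection)

# Same formula, refit on the validation half only.
validation.fit <- update(selected.model, data = validation)

# Point estimates with 95% confidence intervals for the two effects of interest.
effects <- cbind(estimate = coef(validation.fit),
                 confint(validation.fit, level = 0.95))
effects[c("Total.SF", "Bedroom"), ]
```

Running the same two lines with the initial model's formula gives the comparison asked for in the last question. Since the response is log(SalePrice), each coefficient is a change in log price per unit of the predictor.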

References

Dean De Cock. Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education, 19(3), 2011.