You may discuss homework problems with other students, but you have to prepare the written assignments yourself.
Please combine all your answers, the computer code and the figures into one file, and submit a copy in your dropbox on Gradescope.
Due date: 11:59 PM, June 5, 2024.
The goal here is to have you make an effort to do a complete data analysis with what you've learned. Students who put in a good effort will receive 100%. Some effort will receive 70%.
If you have not installed LaTeX on your computer, run the commands below (once is enough). After that, either the Quarto or RMarkdown format should be sufficient to build directly to PDF.
install.packages('tinytex', repos='http://cloud.r-project.org')
tinytex::install_tinytex()
The data for your final project is based on real estate sales in Ames, Iowa in the years 2006-2010. A description of the dataset can be found here.
I have created a subsample of 2000 cases for you to use after fixing some missing values.
To begin, randomly split the data into two sets of equal size:
1000 for selecting a model, and a final 1000 for validation and
reporting confidence intervals for the final effects. For
reproducibility of your results, pick and store an integer seed to use
for the split and for any subsequent randomization, such as in
cross-validation. For simplicity, I'll choose the seed 1
here. You need not use the same seed; choose one and make this line
the first line in your analysis.
set.seed(1)
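A minimal sketch of one way to do the split follows. The data frame `ames` here is a simulated stand-in (the real file name and columns are not given above), and the names `selection` and `validation` are hypothetical.

```r
# Simulated stand-in for the 2000-case subsample (assumption: replace with the real data).
set.seed(1)
ames <- data.frame(id = 1:2000,
                   SalePrice = exp(rnorm(2000, mean = 12, sd = 0.4)))

# Draw 1000 row indices at random for model selection; the rest are held for validation.
n <- nrow(ames)
selection.idx <- sample(n, size = n / 2)
selection  <- ames[selection.idx, ]
validation <- ames[-selection.idx, ]

# Sanity checks: two halves of equal size with no shared rows.
c(nrow(selection), nrow(validation))
length(intersect(selection$id, validation$id))
```

Because the seed is set before `sample()`, rerunning the script from the top reproduces the same two halves.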
Your task is to build a model to predict log(SalePrice)
based on the remaining variables. In addition,
we want estimated effects for:
Total square feet: Gr.Live.Area + Total.Bsmt.SF
Number of bedrooms: Bedroom
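As a rough illustration of how these two effects could enter a model, the sketch below builds a total-square-feet predictor and fits a linear model for log(SalePrice). The data are simulated, the coefficients used to generate them are arbitrary, and `Total.SF` is a hypothetical name for the derived column.

```r
set.seed(1)
n <- 200

# Simulated stand-in using the column names mentioned above (assumption, not the real data).
sim <- data.frame(
  Gr.Live.Area  = runif(n, 600, 3000),
  Total.Bsmt.SF = runif(n, 0, 1500),
  Bedroom       = sample(1:5, n, replace = TRUE)
)
sim$SalePrice <- exp(11 + 0.0004 * (sim$Gr.Live.Area + sim$Total.Bsmt.SF) +
                       0.02 * sim$Bedroom + rnorm(n, sd = 0.1))

# Total square feet as a single derived predictor.
sim$Total.SF <- sim$Gr.Live.Area + sim$Total.Bsmt.SF

# One coefficient for total square feet, one for bedrooms.
fit <- lm(log(SalePrice) ~ Total.SF + Bedroom, data = sim)
coef(fit)
```

An equivalent choice is to write `I(Gr.Live.Area + Total.Bsmt.SF)` directly in the formula instead of creating a new column.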
Beware: the data set is large enough that simple stepwise model building procedures may be very slow.
The project should have the following parts:
study underlying their dataset.
the data that will allow them to answer some of the specific goals of
the study. Fit the model on the first half
of your 1000 points reserved for selection. Record the performance of your model on the remaining 1000 points
for validation by computing MSE=sum(resid(initial.model)^2).
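One point worth sketching: `resid()` returns residuals on the data used to fit the model, so an error measured on the held-out points needs `predict()` with `newdata`. The example below is a sketch on simulated data, assuming the model is fit on the selection half; the names `selection`, `validation`, `sse.valid` and `mse.valid` are hypothetical.

```r
set.seed(1)
n <- 2000

# Simulated stand-in data (assumption: not the real Ames subsample).
sim <- data.frame(Total.SF = runif(n, 600, 4500))
sim$SalePrice <- exp(11 + 0.0004 * sim$Total.SF + rnorm(n, sd = 0.2))

idx <- sample(n, n / 2)
selection  <- sim[idx, ]
validation <- sim[-idx, ]

initial.model <- lm(log(SalePrice) ~ Total.SF, data = selection)

# In-sample sum of squared residuals: this uses the fitting data only.
sse.train <- sum(resid(initial.model)^2)

# Out-of-sample errors on the 1000 held-out points.
pred <- predict(initial.model, newdata = validation)
err  <- log(validation$SalePrice) - pred
sse.valid <- sum(err^2)    # sum of squared errors
mse.valid <- mean(err^2)   # the same quantity divided by the number of points
c(sse.train = sse.train, sse.valid = sse.valid, mse.valid = mse.valid)
```

Since both models are scored on the same 1000 points, comparing sums or means of squared errors ranks them identically.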
variable plots to search for non-linear effects. You can also consider
any potential interactions. Use some of the diagnostic tools seen in class to assess
the quality of fit of your selected model. Record the performance of your model on the remaining 1000 points
for validation by computing MSE=sum(resid(selected.model)^2). Is it much better than your initial
model?
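For the plots used to search for non-linear effects, here is a sketch of an added-variable plot built by hand in base R on simulated data. The variables `x1`, `x2` and `y` are hypothetical, and the class may have used a packaged plotting function instead.

```r
set.seed(1)
n <- 500

# Simulated data where y depends on x2 non-linearly (assumption for illustration).
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + 0.5 * sim$x2^2 + rnorm(n, sd = 0.3)

full <- lm(y ~ x1 + x2, data = sim)

# Added-variable plot for x2: residuals of y given the other predictors
# against residuals of x2 given the other predictors.
e.y  <- resid(lm(y ~ x1, data = sim))
e.x2 <- resid(lm(x2 ~ x1, data = sim))
av.fit <- lm(e.y ~ e.x2)

plot(e.x2, e.y, xlab = "x2 | others", ylab = "y | others")
abline(av.fit)

# The slope of this residual-on-residual fit equals the x2 coefficient in the full model.
c(av.slope = unname(coef(av.fit)[2]), full.coef = unname(coef(full)["x2"]))
```

Curvature in the scatter around the fitted line, as in this simulated case, suggests trying a transformation or polynomial term for that predictor.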
on your selected model. Use the 1000 points held for validation to report this effect. Include some measure of variability of the estimate, such as a confidence interval. Would you have gotten a very different answer if you had used your initial model instead of the selected model to estimate the effects?
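One possible way to report the effects on the held-out data is to refit the selected formula on the validation half and read off confidence intervals. This is a sketch on simulated data; `final.model` and the predictor `Total.SF` are hypothetical names, and the selected formula shown is only a placeholder.

```r
set.seed(1)
n <- 2000

# Simulated stand-in data (assumption: not the real Ames subsample).
sim <- data.frame(Total.SF = runif(n, 600, 4500),
                  Bedroom  = sample(1:5, n, replace = TRUE))
sim$SalePrice <- exp(11 + 0.0004 * sim$Total.SF + 0.02 * sim$Bedroom +
                       rnorm(n, sd = 0.2))

idx <- sample(n, n / 2)
selection  <- sim[idx, ]
validation <- sim[-idx, ]

# Model chosen using the selection half only.
selected.model <- lm(log(SalePrice) ~ Total.SF + Bedroom, data = selection)

# Refit the same formula on the validation half, then compute 95% intervals.
final.model <- update(selected.model, data = validation)
ci <- confint(final.model, parm = c("Total.SF", "Bedroom"), level = 0.95)
ci
```

Refitting on data that played no part in selecting the model keeps the intervals from being distorted by the selection step. Repeating the last two lines with the initial model's formula gives the comparison asked for above.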
Dean De Cock. Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education, 19(3), 2011.