In this exercise, you will analyze the houses dataset from the wiki server of the Cal Poly Computer Science Department labs. It contains a collection of real estate listings in and around San Luis Obispo County. The CSV file of the dataset is uploaded here. After downloading the file, you can load it into R using the command
house <- read.csv("RealEstate.csv")
A description of the column variables is available on this webpage. Take some time to explore the data.
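A first look at the data might go along these lines. This is only a sketch: the small data frame is a made-up stand-in for RealEstate.csv so that the code runs on its own, and only the column names Price, Size, Bedrooms and Bathrooms are taken from the exercise.

```r
# Toy stand-in for RealEstate.csv: made-up values, NOT the real listings.
set.seed(42)
Size <- round(runif(60, 800, 4000))
house <- data.frame(
  Price     = 50000 + 250 * Size + rnorm(60, sd = 80000),
  Size      = Size,
  Bedrooms  = pmax(1, round(Size / 900)),
  Bathrooms = pmax(1, round(Size / 1200 + rnorm(60, sd = 0.5)))
)

str(house)             # column types and a preview of the values
summary(house)         # min, quartiles, mean and max of each column
head(house)            # first six rows
colSums(is.na(house))  # number of missing values per column
```

With the real file, replace the toy data frame by the read.csv() call above.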
a. Fit a linear model predicting the house price (in dollars) from size (in square feet), number of bedrooms and number of bathrooms. Enter the intercept, the three slopes, the residual standard error and R² to 4 significant figures.
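One way to fit the model and read off the requested quantities is sketched below. The data frame is a made-up stand-in for the real file, so the numbers it produces are not the answers to the exercise.

```r
# Toy stand-in for RealEstate.csv: made-up values, NOT the real listings.
set.seed(42)
Size <- round(runif(60, 800, 4000))
house <- data.frame(
  Price     = 50000 + 250 * Size + rnorm(60, sd = 80000),
  Size      = Size,
  Bedrooms  = pmax(1, round(Size / 900)),
  Bathrooms = pmax(1, round(Size / 1200 + rnorm(60, sd = 0.5)))
)

fit <- lm(Price ~ Size + Bedrooms + Bathrooms, data = house)
s   <- summary(fit)

signif(coef(fit), 4)    # intercept and the three slopes
signif(s$sigma, 4)      # residual standard error
signif(s$r.squared, 4)  # multiple R-squared
```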
b. Based on the p-values calculated by the lm() function, are all slopes (Size, Bedrooms and Bathrooms) statistically significant? (yes/no)
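The p-values can also be pulled out of the coefficient table rather than read from the printed summary. The sketch below uses made-up data and assumes the conventional 0.05 threshold that appears later in this exercise.

```r
# Toy stand-in for RealEstate.csv: made-up values, NOT the real listings.
set.seed(42)
Size <- round(runif(60, 800, 4000))
house <- data.frame(
  Price     = 50000 + 250 * Size + rnorm(60, sd = 80000),
  Size      = Size,
  Bedrooms  = pmax(1, round(Size / 900)),
  Bathrooms = pmax(1, round(Size / 1200 + rnorm(60, sd = 0.5)))
)

fit <- lm(Price ~ Size + Bedrooms + Bathrooms, data = house)

# Coefficient table: estimate, standard error, t value and p-value per term
tab    <- coef(summary(fit))
slopes <- tab[c("Size", "Bedrooms", "Bathrooms"), "Pr(>|t|)"]

slopes < 0.05       # which slopes are significant at the 5% level
all(slopes < 0.05)  # TRUE only if every slope is significant
```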
Make several residual plots and you will see that there are at least two outliers in the data. These outliers appear to be expensive houses. Let's focus on the cheaper houses, those priced below 2 million dollars. Create a subset of the houses data using the command
house2 <- house[house$Price < 2e6,]
c. Compare the number of rows in house2 and house. How many houses have been removed?
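Counting the removed rows comes down to comparing nrow() of the two data frames. A minimal sketch on six made-up listings:

```r
# Six made-up listings, NOT the real data; two are priced above 2 million.
house <- data.frame(
  Price = c(350000, 420000, 2500000, 610000, 5200000, 299000),
  Size  = c(1400, 1800, 5200, 2300, 7100, 1100)
)

house2 <- house[house$Price < 2e6, ]

nrow(house)                 # rows before filtering
nrow(house2)                # rows after filtering
nrow(house) - nrow(house2)  # number of houses removed, here 2
```

If Price contained missing values, this kind of logical subsetting would add rows of NA, so it is worth checking sum(is.na(house$Price)) first.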
d. Fit a linear model predicting the house price from Size, number of bedrooms and number of bathrooms using the house2 dataset. Which of the following statements are true? (Select all that apply)
- The slope for 'Bathrooms' is no longer statistically significant (i.e. its p-value > 0.05).
- R² is larger than that in part (a).
- The residual standard error is larger than that in part (a).
- The regression coefficients (intercept and slopes) do not change much (< 1%) compared to those in part (a).
- At least one slope is statistically significant according to the F statistic.
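To compare the two fits side by side, it can help to collect the relevant statistics in a small table. The sketch below uses made-up data with two artificially expensive houses, and fit_stats is a hypothetical helper name. The p-value of the overall F test is not stored in the summary object, so it is computed from the stored F statistic with pf().

```r
# Toy stand-in for RealEstate.csv: made-up values, NOT the real listings.
set.seed(42)
Size <- round(runif(60, 800, 4000))
house <- data.frame(
  Price     = 50000 + 250 * Size + rnorm(60, sd = 80000),
  Size      = Size,
  Bedrooms  = pmax(1, round(Size / 900)),
  Bathrooms = pmax(1, round(Size / 1200 + rnorm(60, sd = 0.5)))
)
house$Price[1:2] <- c(3.5e6, 5e6)  # two artificial expensive outliers
house2 <- house[house$Price < 2e6, ]

model <- Price ~ Size + Bedrooms + Bathrooms
fit1  <- lm(model, data = house)   # as in part (a)
fit2  <- lm(model, data = house2)  # as in part (d)

# Hypothetical helper: residual standard error, R-squared, F-test p-value
fit_stats <- function(fit) {
  s  <- summary(fit)
  fs <- s$fstatistic  # named vector: value, numdf, dendf
  c(sigma    = s$sigma,
    r2       = s$r.squared,
    F.pvalue = unname(pf(fs["value"], fs["numdf"], fs["dendf"],
                         lower.tail = FALSE)))
}

rbind(part_a = fit_stats(fit1), part_d = fit_stats(fit2))

# Relative change of each coefficient from part (a) to part (d), in percent
100 * (coef(fit2) - coef(fit1)) / coef(fit1)
```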
Make a few residual plots for the second linear model. You will discover that there is one point with an unusually large negative residual. You can easily track down the point using the which.min() function. Looking at this particular data point, you should see that this house is relatively large (Size = 6098 square feet), but its price per square foot seems a little low. Type house2[house2$Size > 5000,] to look at houses with Size > 5000 square feet. You will discover that this house indeed has an unusually low price per square foot compared to other big houses. Perhaps this house is very old or in very bad condition? It is hard to tell from the data alone, but this is obviously an outlier.
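The steps above can be sketched as follows. The data are made up: the last row is given Size = 6098 as in the text, with an artificially low price chosen for illustration, and PricePerSqFt is a hypothetical helper column.

```r
# Toy stand-in for house2: made-up values, NOT the real listings.
set.seed(42)
Size <- c(round(runif(57, 800, 4000)), 5200, 5600, 6098)
house2 <- data.frame(
  Price     = 50000 + 250 * Size + rnorm(60, sd = 80000),
  Size      = Size,
  Bedrooms  = pmax(1, round(Size / 900)),
  Bathrooms = pmax(1, round(Size / 1200 + rnorm(60, sd = 0.5)))
)
house2$Price[60] <- 400000  # artificially cheap large house

fit2 <- lm(Price ~ Size + Bedrooms + Bathrooms, data = house2)

# Residuals vs fitted values; drawn only in an interactive session
if (interactive()) {
  plot(fitted(fit2), resid(fit2),
       xlab = "Fitted price", ylab = "Residual")
  abline(h = 0, lty = 2)
}

i <- which.min(resid(fit2))  # position of the most negative residual
house2[i, ]                  # the corresponding listing

# Price per square foot among the large houses
big <- house2[house2$Size > 5000, ]
big$PricePerSqFt <- big$Price / big$Size
big
```

The position returned by which.min() matches the row position in house2 only when lm() has not dropped any rows because of missing values.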
e. Remove this outlier from the data frame and save the result as house3. Then fit a linear model predicting the house price from Size, number of bedrooms and number of bathrooms using the house3 dataset. Which of the following statements are true? (Select all that apply)
- At least one slope is statistically significant according to the F statistic.
- The residual standard error is larger than that in part (d).
- R² is larger than that in part (d).
- The slope for 'Bathrooms' is no longer statistically significant (i.e. its p-value > 0.05).
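One possible way to drop the outlier is negative indexing with the position found by which.min(). This is a sketch on made-up data in which the cheap 6098 square foot house is planted artificially.

```r
# Toy stand-in for house2: made-up values, NOT the real listings.
set.seed(42)
Size <- c(round(runif(57, 800, 4000)), 5200, 5600, 6098)
house2 <- data.frame(
  Price     = 50000 + 250 * Size + rnorm(60, sd = 80000),
  Size      = Size,
  Bedrooms  = pmax(1, round(Size / 900)),
  Bathrooms = pmax(1, round(Size / 1200 + rnorm(60, sd = 0.5)))
)
house2$Price[60] <- 400000  # artificially cheap large house

model <- Price ~ Size + Bedrooms + Bathrooms
fit2  <- lm(model, data = house2)

i      <- which.min(resid(fit2))  # row with the most negative residual
house3 <- house2[-i, ]            # every row except that one
fit3   <- lm(model, data = house3)

s2 <- summary(fit2)
s3 <- summary(fit3)
c(sigma_d = s2$sigma,     sigma_e = s3$sigma)      # residual standard errors
c(r2_d    = s2$r.squared, r2_e    = s3$r.squared)  # R-squared values
coef(s3)[, "Pr(>|t|)"]                             # p-values in part (e)
```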
f. Fill in the following table with the correlation matrix of the 4 variables (Price, Size, Bedrooms and Bathrooms) in the house3 data. Enter the coefficients to 4 decimal places. Here "sym" means that the value can be inferred from the symmetry of the correlation matrix.
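The correlation matrix can be computed with cor() on the relevant columns. The sketch below assumes the four variables are Price, Size, Bedrooms and Bathrooms and uses made-up data in place of house3.

```r
# Toy stand-in for house3: made-up values, NOT the real listings.
set.seed(42)
Size <- round(runif(60, 800, 4000))
house3 <- data.frame(
  Price     = 50000 + 250 * Size + rnorm(60, sd = 80000),
  Size      = Size,
  Bedrooms  = pmax(1, round(Size / 900)),
  Bathrooms = pmax(1, round(Size / 1200 + rnorm(60, sd = 0.5)))
)

vars <- c("Price", "Size", "Bedrooms", "Bathrooms")
cm   <- round(cor(house3[, vars]), 4)  # Pearson correlations, 4 decimals
cm
```

The matrix is symmetric with ones on the diagonal, so only the entries above (or below) the diagonal need to be read off.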
g. Fit a linear model predicting the house price from Size and number of bedrooms using the house3 dataset (i.e. remove 'Bathrooms' from the model). Which of the following statements are true? (Select all that apply)
- R² is almost the same as that in part (e).
- The slope for 'Bedrooms' is no longer significant (i.e. its p-value > 0.05).
- All slopes are significant according to the p-values.
- The residual standard error is almost the same as that in part (e).
- At least one slope is statistically significant according to the F statistic.
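A sketch of the comparison between the full and the reduced model, again on made-up data. The anova() call at the end is an addition not requested by the exercise: it runs a partial F test of whether dropping Bathrooms makes the fit significantly worse.

```r
# Toy stand-in for house3: made-up values, NOT the real listings.
set.seed(42)
Size <- round(runif(60, 800, 4000))
house3 <- data.frame(
  Price     = 50000 + 250 * Size + rnorm(60, sd = 80000),
  Size      = Size,
  Bedrooms  = pmax(1, round(Size / 900)),
  Bathrooms = pmax(1, round(Size / 1200 + rnorm(60, sd = 0.5)))
)

full    <- lm(Price ~ Size + Bedrooms + Bathrooms, data = house3)  # part (e)
reduced <- update(full, . ~ . - Bathrooms)                         # part (g)

# Residual standard error and R-squared of both models side by side
sapply(list(full = full, reduced = reduced), function(m) {
  s <- summary(m)
  c(sigma = s$sigma, r2 = s$r.squared, adj.r2 = s$adj.r.squared)
})

coef(summary(reduced))[, "Pr(>|t|)"]  # p-values in the reduced model

anova(reduced, full)  # partial F test for the Bathrooms term
```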
Remark: One of the purposes of this exercise is to show you that regression results can be greatly affected by a few outliers. In real-life data analysis, outliers must be handled with care. We should first investigate their origin: are they caused by recording mistakes or incorrect information, or are they real data? Sometimes the outliers are more interesting than the rest of the data. Then we need to decide whether or not to keep them, which depends largely on the goal of the analysis. If we decide to remove the outliers, we should state the reasons and document the process in detail in the data analysis report. Usually, we report the results both with and without the outliers.