

Housing Prices

In this exercise, you will analyze the houses dataset from the wiki server of the Cal Poly Computer Science Department Labs. It contains real estate listings from San Luis Obispo County and the surrounding area. The CSV file of the dataset is uploaded here. After downloading the file, you can load it into R using the command

house <- read.csv("RealEstate.csv")

A description of the column variables is available on this webpage. Take some time to explore the data.

a. Fit a linear model predicting the house price (in dollars) from size (in square feet), number of bedrooms, and number of bathrooms. Enter the intercept, slopes, residual standard error, and R² to 4 significant figures.

Coefficients
Intercept
Size
Bedrooms
Bathrooms

Residual standard error =           R² =
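
As a minimal sketch of how the quantities in part (a) can be pulled out of a fitted model, the helper below fits the three-predictor model and rounds each value to 4 significant figures. The name summarize_fit is a hypothetical helper introduced here for illustration; the column names Price, Size, Bedrooms and Bathrooms are the ones used elsewhere in this exercise.

```r
# Hypothetical helper (not part of the exercise): fit the model from part (a)
# and collect the intercept, slopes, residual standard error and R-squared.
summarize_fit <- function(data) {
  fit <- lm(Price ~ Size + Bedrooms + Bathrooms, data = data)
  s <- summary(fit)
  list(
    fit = fit,
    coefficients = signif(coef(fit), 4),  # intercept and the three slopes
    sigma = signif(s$sigma, 4),           # residual standard error
    r_squared = signif(s$r.squared, 4)    # multiple R-squared
  )
}

# Runs only if `house` has already been loaded with read.csv().
if (exists("house")) {
  print(summarize_fit(house)[c("coefficients", "sigma", "r_squared")])
}
```

The same numbers appear in the printed output of summary() on the fitted model; extracting them by name simply avoids copying errors.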


b. From the p-values calculated by the lm() function, are all slopes (Size, Bedrooms and Bathrooms) statistically significant?
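
The p-values asked about in part (b) are stored in the coefficient table returned by summary(). The sketch below extracts them for the slopes only; slope_p_values is a hypothetical helper name, and the 0.05 cutoff is the one this exercise uses in later parts.

```r
# Hypothetical helper: return the t-test p-value of every slope
# (all rows of the coefficient table except the intercept).
slope_p_values <- function(fit) {
  tab <- summary(fit)$coefficients
  tab[rownames(tab) != "(Intercept)", "Pr(>|t|)"]
}

# Runs only if `house` has already been loaded with read.csv().
if (exists("house")) {
  fit1 <- lm(Price ~ Size + Bedrooms + Bathrooms, data = house)
  p <- slope_p_values(fit1)
  print(p)
  print(p < 0.05)  # TRUE marks slopes significant at the 5% level
}
```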



[Figure: residuals vs. predicted price]

Make several residual plots and you will see that there are at least two outliers in the data. These outliers appear to be expensive houses. Let's focus on cheaper houses, those priced below 2 million dollars. Create a subset of the house data using the command

house2 <- house[house$Price < 2e6,]

c. Compare the number of rows (houses) in house2 and house. How many houses have been removed?
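
One way to check the count in part (c) is to tally the rows on each side of the price cutoff directly. This is only a sketch: count_removed is a hypothetical helper, and it ignores rows with a missing Price rather than counting them in either group.

```r
# Hypothetical helper: count rows kept and removed by the Price < cutoff filter.
count_removed <- function(data, cutoff = 2e6) {
  keep <- data$Price < cutoff
  c(total = nrow(data),
    kept = sum(keep, na.rm = TRUE),      # rows that stay in the subset
    removed = sum(!keep, na.rm = TRUE))  # rows at or above the cutoff
}

# Runs only if `house` has already been loaded with read.csv().
if (exists("house")) {
  print(count_removed(house))
}
```

If house and house2 already exist, nrow(house) - nrow(house2) gives the same answer as long as Price has no missing values.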


d. Using the house2 dataset, fit a linear model predicting the house price from Size, number of bedrooms, and number of bathrooms. Which of the following statements are true? (Select all that apply)
R² is larger than that in part (a).
At least one slope is statistically significant according to the F statistic.
The slope for 'Bathrooms' is no longer statistically significant (i.e. its p-value > 0.05).
The residual standard error is larger than that in part (a).
The regression coefficients (intercept and slopes) do not change much (< 1%) compared to those in part (a).
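
The statements in part (d) all compare two fitted models, so it can help to put the relevant quantities side by side. The sketch below does that under the assumption that both models use the same predictors; compare_fits is a hypothetical helper name. The overall F-test p-value is computed from the F statistic stored by summary().

```r
# Hypothetical helper: compare an earlier fit (fit_old) with a refit (fit_new).
compare_fits <- function(fit_old, fit_new) {
  s_old <- summary(fit_old)
  s_new <- summary(fit_new)
  f <- s_new$fstatistic  # named vector: value, numdf, dendf
  list(
    r_squared = c(old = s_old$r.squared, new = s_new$r.squared),
    sigma = c(old = s_old$sigma, new = s_new$sigma),
    # p-value of the overall F test for the new fit
    f_p_value = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)),
    # t-test p-values of the slopes in the new fit (intercept row dropped)
    slope_p_values = s_new$coefficients[-1, "Pr(>|t|)"],
    # percent change of each coefficient relative to the old fit
    pct_change = 100 * (coef(fit_new) - coef(fit_old)) / abs(coef(fit_old))
  )
}

# Runs only if `house` and `house2` already exist in the session.
if (exists("house") && exists("house2")) {
  fit1 <- lm(Price ~ Size + Bedrooms + Bathrooms, data = house)
  fit2 <- lm(Price ~ Size + Bedrooms + Bathrooms, data = house2)
  print(compare_fits(fit1, fit2))
}
```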


Make a few residual plots for the second linear model. You will find one point with an unusually large negative residual. You can track it down with the which.min() function. This house is relatively large (Size = 6098 square feet), but its price per square foot seems low. Type house2[house2$Size > 5000,] to look at houses with Size > 5000 square feet. You will see that this house indeed has an unusually low price per square foot compared to other big houses. Perhaps it is very old or in very bad condition? It is hard to tell from the data alone, but the point is clearly an outlier.
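
The residual plot and the which.min() lookup described above might look like the following sketch. The helper most_negative_residual is a hypothetical name; it matches the residual back to the data frame by row name, which stays correct for a subset such as house2 whose row names are inherited from house.

```r
# Hypothetical helper: return the row of `data` with the most negative residual.
most_negative_residual <- function(fit, data) {
  worst <- names(which.min(residuals(fit)))  # row name of the smallest residual
  data[worst, , drop = FALSE]
}

# Runs only if `house2` already exists in the session.
if (exists("house2")) {
  fit2 <- lm(Price ~ Size + Bedrooms + Bathrooms, data = house2)
  # Residuals against predicted price; the dashed line marks zero residual.
  plot(fitted(fit2), resid(fit2), xlab = "Predicted price", ylab = "Residual")
  abline(h = 0, lty = 2)
  print(most_negative_residual(fit2, house2))
}
```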

e. Remove this outlier from the data frame and save the result as house3. Then, using the house3 dataset, fit a linear model predicting the house price from Size, number of bedrooms, and number of bathrooms. Which of the following statements are true? (Select all that apply)
The slope for 'Bathrooms' is no longer statistically significant (i.e. its p-value > 0.05).
At least one slope is statistically significant according to the F statistic.
R² is larger than that in part (d).
The residual standard error is larger than that in part (d).
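
For part (e), one possible way to build house3 is to drop the row whose residual is the most negative in the house2 fit. This is a sketch under that assumption, with drop_most_negative_residual as a hypothetical helper name; filtering on the Size value quoted above would be an alternative, provided no other house shares that size.

```r
# Hypothetical helper: return `data` without the row that has the
# most negative residual in `fit`.
drop_most_negative_residual <- function(fit, data) {
  worst <- names(which.min(residuals(fit)))  # row name of the outlier
  data[rownames(data) != worst, , drop = FALSE]
}

# Runs only if `house2` already exists in the session.
if (exists("house2")) {
  fit2 <- lm(Price ~ Size + Bedrooms + Bathrooms, data = house2)
  house3 <- drop_most_negative_residual(fit2, house2)
  fit3 <- lm(Price ~ Size + Bedrooms + Bathrooms, data = house3)
  print(summary(fit3))
}
```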


f. Fill in the following table with the correlation matrix of the 4 variables in the house3 data. Enter the coefficients to 4 decimal places. Here "sym" means that the value can be inferred from the symmetry of the correlation matrix.

            Size   Bedrooms   Bathrooms   Price
Size        1      ____       ____        ____
Bedrooms    sym    1          ____        ____
Bathrooms   sym    sym        1           ____
Price       sym    sym        sym         1
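
The correlation matrix for part (f) can be computed with cor() on the four columns, in the same order as the table. A minimal sketch, where price_correlations is a hypothetical helper name:

```r
# Hypothetical helper: correlation matrix of the four variables,
# rounded to 4 decimal places and ordered as in the table.
price_correlations <- function(data) {
  vars <- c("Size", "Bedrooms", "Bathrooms", "Price")
  round(cor(data[, vars]), 4)
}

# Runs only if `house3` already exists in the session.
if (exists("house3")) {
  print(price_correlations(house3))
}
```

The result is symmetric with ones on the diagonal, so only the upper triangle needs to be read off.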

g. Fit a linear model predicting the house price from Size and number of bedrooms from the house3 dataset (i.e. remove 'Bathrooms' from the model). Which of the following statements are true? (Select all that apply)
The residual standard error is almost the same as that in part (e).
R² is almost the same as that in part (e).
The slope for 'Bedrooms' is no longer significant (i.e. its p-value > 0.05).
At least one slope is statistically significant according to the F statistic.
All slopes are significant according to the p-values.
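
To judge the statements in part (g), the full and reduced models can be fitted on the same data and compared directly. The sketch below also runs a partial F test with anova(), which is an addition not asked for in the exercise; for a single dropped term its p-value equals the t-test p-value of that term in the full model. The name compare_without_bathrooms is a hypothetical helper.

```r
# Hypothetical helper: compare the model with and without Bathrooms.
compare_without_bathrooms <- function(data) {
  full <- lm(Price ~ Size + Bedrooms + Bathrooms, data = data)
  reduced <- lm(Price ~ Size + Bedrooms, data = data)
  s_full <- summary(full)
  s_red <- summary(reduced)
  list(
    sigma = c(full = s_full$sigma, reduced = s_red$sigma),
    r_squared = c(full = s_full$r.squared, reduced = s_red$r.squared),
    # t-test p-values of the slopes in the reduced model
    slope_p_values = s_red$coefficients[-1, "Pr(>|t|)"],
    # partial F test: does adding Bathrooms improve the fit?
    partial_f_p_value = anova(reduced, full)[2, "Pr(>F)"]
  )
}

# Runs only if `house3` already exists in the session.
if (exists("house3")) {
  print(compare_without_bathrooms(house3))
}
```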


Remark: One purpose of this exercise is to show that regression results can be greatly affected by a few outliers. In real-life data analysis, outliers must be handled with care. We should first investigate their origin. Are they caused by recording mistakes or incorrect information, or are they real data? Sometimes the outliers are more interesting than the rest of the data. We then need to decide whether to keep them, which depends largely on the goal of the analysis. If we decide to remove the outliers, we should state the reasons and document the process in detail in the data analysis report. Usually, we report the results both with and without the outliers.