Take a look at the 'anscombe' dataset that comes with R. The data can be loaded with the command data(anscombe). The data frame anscombe has 11 rows and 8 columns. The first 4 columns are labeled x1, x2, x3 and x4. The last 4 columns are labeled y1, y2, y3 and y4.
data(anscombe)
anscombe
a. Use the lm() function to fit a linear model predicting y1 from x1. Fill in the following table.
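A minimal sketch of part (a), assuming the built-in anscombe data frame; the values noted in the comment are approximate and should match what you fill into the table.

```r
# Fit a simple linear regression predicting y1 from x1
data(anscombe)
fit1 <- lm(y1 ~ x1, data = anscombe)
summary(fit1)
# Intercept is roughly 3.0, slope roughly 0.5, R-squared roughly 0.67
```

summary() reports the coefficients with their standard errors, t-values and p-values, plus the residual standard error and R2 needed for the table.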
b. Fit 3 more linear models: (1) predicting y2 from x2, (2) predicting y3 from x3, (3) predicting y4 from x4. For each model, compare the coefficients and other quantities returned by lm() to those in the table in part (a). What do you see?

- The regression coefficients, t-values and p-values are different from those in the table in (a), but the residual standard error and R2 are very close.
- Only the regression coefficients (intercept and slope) are very close to those in the table in (a). All other values are different.
- All of the values are very close to those in the table in (a).
- All of the values are very different from those in the table in (a).
- The regression coefficients, t-values and p-values are very close to those in the table in (a), but the residual standard error and R2 are different.
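One way to sketch the comparison in part (b) is to fit all four models in a loop, building each formula with paste0(); this is just one convenient approach, not the only one.

```r
# Fit all four regressions and line up their results for comparison
data(anscombe)
fits <- lapply(1:4, function(i) {
  lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
})
sapply(fits, coef)                                   # intercepts and slopes, side by side
sapply(fits, function(f) summary(f)$r.squared)       # R-squared for each model
```

Putting the numbers side by side makes it easy to see which quantities agree across the four data sets.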
c. Now, do what you should have done in the first place: make plots. Plot y1 versus x1 and then add the regression line on the plot. Do the same for the other 3 data sets. What do you see?
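The plots in part (c) can be sketched as follows, using a 2-by-2 grid so all four data sets are visible at once (the panel titles are illustrative).

```r
# Plot each (xi, yi) pair and overlay its fitted regression line
data(anscombe)
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main = paste0("Data set ", i),
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(lm(y ~ x), col = "red")   # add the regression line
}
par(op)
```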
As you have learned in Stat 200, 3 main assumptions of a linear model are linearity, independence and homoscedasticity. That is to say, the data points (xi, yi) can be described by the relationship yi = β0 + β1 xi + εi, where the εi scatter randomly with zero mean and uniform variance independent of x. Which data set(s) best satisfies these assumptions?

- None of them, since R2 ≠ 1 for all of them.
- All except the (x4, y4) data.
- Only the (x1, y1) data.
- Only the (x1, y1) and (x2, y2) data.
- All 4 data sets, since the lm() function returns essentially the same values for all quantities.
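One direct way to examine these assumptions is to plot the residuals against x for each data set; under the stated model the residuals should scatter randomly around zero with roughly constant spread. A sketch:

```r
# Residuals vs x for each of the four data sets
data(anscombe)
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  fit <- lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe)
  plot(anscombe[[paste0("x", i)]], resid(fit),
       xlab = paste0("x", i), ylab = "residuals",
       main = paste0("Data set ", i))
  abline(h = 0, lty = 2)   # reference line at zero
}
par(op)
```

Any curvature, changing spread, or a single dominant point in these plots signals a violation of the assumptions.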