Browsing resource, all submissions are temporary.
In a typical data analysis project, it is very rare that all questions can be answered using data from a single source. It is very often that data scattered across different places need to be combined. For example, the Stat 100 course data in the previous two exercises were created by combining data from 4 different places: assignment, bonus and exam scores from two Compass 2G sites (one for the online class and the other for the in-person classes), and class registration data from two excel files (one for the online class and the other for in-person classes) downloaded from a secured university website. In addition, assignment and some of the bonus points were stored in Lon Capa and needed to be uploaded to the Compass sites every week. iClicker points were stored in a separate file and needed to be uploaded to the in-person Compass site after each class. Those processes also involved combining data from two data sets, although they were not done using R.
Multiple sets of related data are called relational data since it is the relations between the data sets that are important. Combining a pair of related data sets requires a key, which is a variable (or a set of variables) that can be used to uniquely identify an observation. For example, a student's netid (or UIN) can be used as a key to uniquely identify a student in the Stat 100 data sets. A student's name cannot be used as a key since there can be more than one student sharing the same name.
In this exercise, you are going to learn the simplest technique of dealing with relational data using dplyr: combining data from two data sets using a single-variable key. One data set is the same as the Stat 100 data set you worked on in the previous two exercises. The other is a made-up data set created just for this exercise.
dplyr
As mentioned in the tibble exercise, students in the ONL section did not attend class in person but watched the recorded lecture videos. These videos were available to the online students within a few hour after each lecture. Students in the L1 and L2 sections did not have access to the videos until one week before the three exams and Final. All of the Stat 100 lecture videos were hosted on the Illinois Media Space with links on the Compass 2G site for the ONL section (and later on the Compass site for the in-person classes). Students had to login to Compass 2G to watch the videos and Illinois Media Space tracked how much time students spent watching each video. The tracking data for each video could be downloaded. These files could then be combined using R to create a single file summarizing the percentage of videos watched by the students. In this exercise, however, you won't be asked to combine these files. Instead, we made up the combined tracking data for this exercise to keep things simple.
tibble
This exercise involves combining two data files using dplyr to study how the online student's performance on the course is correlated with the time they spent watching the videos. One file is the Stat 100 grade data, which you can download here and then load to R using the following commands.
library(tidyverse) tib <- read_csv("stat100.csv")
The second file is a made-up data summarizing the percentage of lecture videos watched by the students. It can be downloaded here and then loaded to R using the following command.
video <- read_csv("videos_watched097.csv") %>% rename(videos_watched = `% of videos watched`)
This csv file has only two columns. The first column is the student's netid; the second column is the percentage of lecture videos watched by the student. Note that the name of the second column is changed to "videos_watched" from the original name "% of videos watched" by the rename() command.
rename()
Before attempting the following questions, make sure you have gone through the previous dplyr exercise to create a new tibble named stat100 with all the new columns in that exercise. If you did not store the tibble stat100, that's great because this is a good chance for you to practice the dplyr commands by re-creating the stat100 tibble. The following lists the things you have to do to re-create stat100.
stat100
tib
required_tot
total
exam_avg
Class
Try to do all these operations in one go using the pipe operator (%>%), and type "show answer" in the following box to see the answer. Also try to do these without peeking at the previous exercise, using only ? to find the syntax of the commands you need. I consider myself as a slow learner. In general, when I first learn a set of commands, with a little practice I could do this kind of exercise without peeking usually in the next day, but I wouldn't be able to do it without peeking after a few days without practicing the commands. I would need about a dozen of practice sections spread out over a few weeks to be able to consolidate the knowledge to a long-term memory.
Enter "show answer" to see the answer.
The stat100 tibble should contain 1510 observations and 66 columns. The video tibble, on the other hand, has less observations. This is because most students in the in-person class didn't watch any one of the videos and so wouldn't appear in the video tibble. We first want to add the percentage of videos watched by the students to a new column in the stat100 tibble. This can be done using the following dplyr command.
video
combined <- left_join(stat100, video, by="netid")
The function left_join(x,y, by="key_column") returns a data frame (tibble) containing all rows from x and all columns in x and y using the column named "key_column" as the key to identify the observations. Rows in x with no match in y will have NA values in the new columns. Rows in y but not in x are excluded. If you type ?left_join, you will see a number of other related functions that are used to combine two data frames by various different methods. Read Chapter 13 of R for Data Science if you are interesed to learn the detail of these different methods.
left_join(x,y, by="key_column")
?left_join
The new tibble combined should have 67 columns, and you should see a number of NAs in the new column "videos_watched". The function left_join() is used because we only want to keep the students in stat100 and just add a new column.
combined
left_join()
Consider the following 4 tibbles obtained by combining stat100 and video:
combined1 <- left_join(video, stat100, by="netid") combined2 <- right_join(video, stat100, by="netid") combined3 <- right_join(stat100, video, by="netid") combined4 <- inner_join(stat100, video, by="netid") combined5 <- full_join(stat100, video, by="netid")
Which of them contain identical data as the tibble combined? By identical we mean they have exactly the same rows and columns but the rows and columns may scramble in a different order. To find out, we can use the command all.equal(). For example, all.equal(combined, combined1).
all.equal()
all.equal(combined, combined1)
There are observations in video that are not in stat100. They correspond to students who dropped the course. We can create a new tibble containing only these dropped students using the following dplyr command.
video_dropped <- anti_join(video, stat100, by="netid")
The dplyr function anti_join(x,y) returns all rows from x where there are not matching values in y, keeping just columns from x. So video_dropped contains the same two columns as video, but with students found in stat100 removed. There is a related function called semi_join(x,y), which returns all rows from x where there are matching values in y, keeping just columns from x.
anti_join(x,y)
video_dropped
semi_join(x,y)
In the following questions you will study how students watched the lecture videos. Since the video data are made-up, don't pay attention to the numbers. Focus on the methods to get the numbers instead.
Number of dropped students in the tibble video =
Average percentage of videos watched by these students = % (round to two decimal places).
Number of in-person students who watched the videos = .
Average percentage of videos watched by the in-person students (only count those who watched > 0%) = % (round to 2 decimal palces).
videos_watched
Number of online students who watched 100% of videos = .
Minimum value of videos_watched for online students = .
Mean of videos_watched for online students = (round to 2 decimal palces).
Divide the online students into three groups: 1) those who watched more than 80% of lecture videos, 2) those who watched 50%-80% of videos, and 3) those who watched 50% or less. We can use the cut() function to divide the students into the three groups. You've seen the cut() function being used in Section 13.10 of Peng's textbook. You will see cut() again in Weeks 13 and 14. The function turns a numeric vector into a factor according to intervals specified by the break points in the function call. For example, cut(1:5, c(0,2,6)) returns a factor vector consisting of (0,2], (0,2], (2,6], (2,6] and (2,6]. The factor has two levels (0,2] and (2,6], created according to the break points c(0,2,6) specified in function call. Since 1, 2, 3, 4, 5 are in the intervals (0,2], (0,2], (2,6], (2,6] and (2,6] respectively, the returned factor vector contains these values. So cut(videos_watched, c(-Inf, 50, 80, Inf)) will turn videos_watched into a factor with 3 levels corresponding to the 3 groups: (-∞, 50], (50,80] and (80,∞].
cut()
cut(1:5, c(0,2,6))
c(0,2,6)
cut(videos_watched, c(-Inf, 50, 80, Inf))
HW_avg