Relational Data

In a typical data analysis project, it is rare that all questions can be answered using data from a single source. Very often, data scattered across different places need to be combined. For example, the Stat 100 course data in the previous two exercises were created by combining data from 4 different places: assignment, bonus and exam scores from two Compass 2G sites (one for the online class and the other for the in-person classes), and class registration data from two Excel files (one for the online class and the other for the in-person classes) downloaded from a secured university website. In addition, assignment scores and some of the bonus points were stored in LON-CAPA and needed to be uploaded to the Compass sites every week. iClicker points were stored in a separate file and needed to be uploaded to the in-person Compass site after each class. Those processes also involved combining data from two data sets, although they were not done using R.

Multiple sets of related data are called relational data since it is the relations between the data sets that are important. Combining a pair of related data sets requires a key, which is a variable (or a set of variables) that can be used to uniquely identify an observation. For example, a student's netid (or UIN) can be used as a key to uniquely identify a student in the Stat 100 data sets. A student's name cannot be used as a key since there can be more than one student sharing the same name.

In this exercise, you are going to learn the simplest technique of dealing with relational data using dplyr: combining data from two data sets using a single-variable key. One data set is the same as the Stat 100 data set you worked on in the previous two exercises. The other is a made-up data set created just for this exercise.

As mentioned in the tibble exercise, students in the ONL section did not attend class in person but watched the recorded lecture videos. These videos were available to the online students within a few hours after each lecture. Students in the L1 and L2 sections did not have access to the videos until one week before each of the three exams and the Final. All of the Stat 100 lecture videos were hosted on Illinois Media Space, with links on the Compass 2G site for the ONL section (and later on the Compass site for the in-person classes). Students had to log in to Compass 2G to watch the videos, and Illinois Media Space tracked how much time students spent watching each video. The tracking data for each video could be downloaded. These files could then be combined using R to create a single file summarizing the percentage of videos watched by the students. In this exercise, however, you won't be asked to combine these files. Instead, we made up the combined tracking data to keep things simple.

This exercise involves combining two data files using dplyr to study how the online students' performance in the course correlates with the time they spent watching the videos. One file is the Stat 100 grade data, which you can download here and then load into R using the following commands.

library(tidyverse)
tib <- read_csv("stat100.csv")

The second file is a made-up data set summarizing the percentage of lecture videos watched by the students. It can be downloaded here and then loaded into R using the following command.

video <- read_csv("videos_watched097.csv") %>% 
       rename(videos_watched = `% of videos watched`)

This csv file has only two columns. The first column is the student's netid; the second column is the percentage of lecture videos watched by the student. Note that the name of the second column is changed to "videos_watched" from the original name "% of videos watched" by the rename() command.

Before attempting the following questions, make sure you have gone through the previous dplyr exercise to create a new tibble named stat100 with all the new columns in that exercise. If you did not store the tibble stat100, that's great because this is a good chance for you to practice the dplyr commands by re-creating the stat100 tibble. The following lists the things you have to do to re-create stat100.

  1. Create a new tibble stat100 from the tibble tib by renaming `Exam 1` to Exam1, `Exam 2` to Exam2, and `Exam 3` to Exam3.
  2. Create two new columns in the stat100 tibble corresponding to the weighted total of the required work and the course total using the following formulas.
    weighted total of the required work = 0.2*(Exam1 + Exam2 + Exam3) + 0.15*HW_avg + 0.25*Final
    course total = 100*(weighted total of the required work + bonus_tot*0.25) / (100 + bonus_tot*0.25)
    I used required_tot to name the column for the weighted total of the required work, and total to name the column for the course total in the previous exercise.
  3. Create a new column named exam_avg that stores the weighted exam average according to the formula
    exam_avg = ( 20*(Exam1+Exam2+Exam3)+ 25*Final )/85
  4. Create a new column named Class in the stat100 tibble. Set Class to "online" if Section is "ONL" and set Class to "in-person" if Section is "L1" or "L2".

Try to do all these operations in one go using the pipe operator (%>%), and type "show answer" in the following box to see the answer. Also try to do this without peeking at the previous exercise, using only ? to find the syntax of the commands you need. (One possible sketch is also shown after the answer box below.) I consider myself a slow learner. In general, when I first learn a set of commands, a little practice lets me do this kind of exercise without peeking the next day, but I wouldn't be able to do it without peeking after a few days away from the commands. I need about a dozen practice sessions spread out over a few weeks to consolidate the knowledge into long-term memory.

Enter "show answer" to see the answer.

 Tries 0/2
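
For reference, here is one possible sketch of the whole pipeline (not necessarily the graded answer; it assumes the column names HW_avg, bonus_tot, Final and Section used in the previous exercise):

stat100 <- tib %>% 
  rename(Exam1 = `Exam 1`, Exam2 = `Exam 2`, Exam3 = `Exam 3`) %>%                 # step 1
  mutate(required_tot = 0.2*(Exam1 + Exam2 + Exam3) + 0.15*HW_avg + 0.25*Final,    # step 2
         total = 100*(required_tot + bonus_tot*0.25) / (100 + bonus_tot*0.25),
         exam_avg = (20*(Exam1 + Exam2 + Exam3) + 25*Final)/85,                    # step 3
         Class = if_else(Section == "ONL", "online", "in-person"))                 # step 4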

The stat100 tibble should contain 1510 observations and 66 columns. The video tibble, on the other hand, has fewer observations. This is because most students in the in-person classes didn't watch any of the videos and so don't appear in the video tibble. We first want to add the percentage of videos watched by each student as a new column in the stat100 tibble. This can be done using the following dplyr command.

combined <- left_join(stat100, video, by="netid")

The function left_join(x, y, by="key_column") returns a data frame (tibble) containing all rows from x and all columns of x and y, using the column named "key_column" as the key to identify the observations. Rows in x with no match in y will have NA values in the new columns. Rows in y with no match in x are excluded. If you type ?left_join, you will see a number of other related functions that combine two data frames in various ways. Read Chapter 13 of R for Data Science if you are interested in learning the details of these different methods.

The new tibble combined should have 67 columns, and you should see a number of NAs in the new column "videos_watched". The function left_join() is used because we only want to keep the students in stat100 and just add a new column.
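
As a quick illustration of this NA-filling behavior, here is a toy example (made-up mini tibbles, not the course data):

x <- tibble(netid = c("aaa1", "bbb2", "ccc3"), score = c(90, 85, 70))
y <- tibble(netid = c("aaa1", "ccc3", "ddd4"), watched = c(100, 40, 60))
left_join(x, y, by = "netid")
# returns 3 rows (all rows of x) and 3 columns (netid, score, watched);
# bbb2 has no match in y, so its watched value is NA,
# and the ddd4 row of y is dropped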

Consider the following 5 tibbles obtained by combining stat100 and video:

combined1 <- left_join(video, stat100, by="netid")
combined2 <- right_join(video, stat100, by="netid")
combined3 <- right_join(stat100, video, by="netid")
combined4 <- inner_join(stat100, video, by="netid")
combined5 <- full_join(stat100, video, by="netid")

Which of them contain identical data as the tibble combined? By identical we mean they have exactly the same rows and columns, although the rows and columns may be arranged in a different order. To find out, we can use the command all.equal(). For example, all.equal(combined, combined1).

  1. Use all.equal() to determine which of the following tibbles contain identical data as the tibble combined. (Select all that apply)
    combined1
    combined2
    combined3
    combined4
    combined5
  2.  Tries 0/3

    There are observations in video that are not in stat100. They correspond to students who dropped the course. We can create a new tibble containing only these dropped students using the following dplyr command.

    video_dropped <- anti_join(video, stat100, by="netid")

    The dplyr function anti_join(x,y) returns all rows from x where there are no matching values in y, keeping just the columns from x. So video_dropped contains the same two columns as video, but with the students found in stat100 removed. There is a related function called semi_join(x,y), which returns all rows from x where there are matching values in y, keeping just the columns from x.
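
    To see the difference, here is the same toy pair of tibbles x and y from the left_join() illustration above:

    anti_join(y, x, by = "netid")   # only the ddd4 row: in y but with no match in x
    semi_join(y, x, by = "netid")   # the aaa1 and ccc3 rows: rows of y that do have a match in x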

    In the following questions you will study how students watched the lecture videos. Since the video data are made-up, don't pay attention to the numbers. Focus on the methods to get the numbers instead.

  3. Use the tibble video_dropped to calculate the number of dropped students in the tibble video and the average percentage of lecture videos watched by these students.
  4. Number of dropped students in the tibble video =

     Tries 0/3

    Average percentage of videos watched by these students = % (round to two decimal places).

     Tries 0/5
  5. Since the videos were only available to students in the in-person classes one week before each exam and the Final, most of them didn't watch the videos. How many in-person students watched the videos? Of those who watched, what was the average percentage of videos watched? Note that we only count students who watched more than 0% of the videos; there could be students in the tibble video with videos_watched = 0.
  6. Number of in-person students who watched the videos = .

     Tries 0/3

    Average percentage of videos watched by the in-person students (only count those who watched > 0%) = % (round to 2 decimal places).

     Tries 0/5

  7. Now let's look at the online students. How many of them watched 100% of the videos? What is the minimum value of videos_watched for the online students? What is the mean of videos_watched for the online students?
  8. Number of online students who watched 100% of videos = .

     Tries 0/3

    Minimum value of videos_watched for online students = .

     Tries 0/5

    Mean of videos_watched for online students = (round to 2 decimal places).

     Tries 0/5

    Divide the online students into three groups: 1) those who watched more than 80% of the lecture videos, 2) those who watched 50%-80% of the videos, and 3) those who watched 50% or less. We can use the cut() function to divide the students into the three groups. You've seen the cut() function used in Section 13.10 of Peng's textbook. You will see cut() again in Weeks 13 and 14. The function turns a numeric vector into a factor according to intervals specified by the break points in the function call. For example, cut(1:5, c(0,2,6)) returns a factor vector consisting of (0,2], (0,2], (2,6], (2,6] and (2,6]. The factor has two levels, (0,2] and (2,6], created according to the break points c(0,2,6) specified in the function call. Since 1, 2, 3, 4, 5 are in the intervals (0,2], (0,2], (2,6], (2,6] and (2,6] respectively, the returned factor vector contains these values. So cut(videos_watched, c(-Inf, 50, 80, Inf)) will turn videos_watched into a factor with 3 levels corresponding to the 3 groups: (-∞, 50], (50,80] and (80,∞].
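
    As a minimal sketch of how cut() behaves (toy numbers, not the course data):

    pct <- c(10, 50, 65, 80, 95)
    groups <- cut(pct, c(-Inf, 50, 80, Inf))
    levels(groups)   # three levels: (-Inf,50], (50,80], (80,Inf]
    table(groups)    # counts per interval: 2, 2 and 1 for this toy vector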

  9. (10 points) Calculate the number of students, the mean of HW_avg, the mean of exam_avg, and the mean of course total in each of the three groups of online students. Round the means to 2 decimal places.

  10. % of videos watched | Num. of students | mean(HW_avg) | mean(exam_avg) | mean(course total)
      ≤ 50%               |                  |              |                |
      50% – 80%           |                  |              |                |
      > 80%               |                  |              |                |
     Tries 0/5