1. LON-CAPA Logo
  2. Help
  3. Log In
 

Introduction

In this programming exercise, you will learn how to use R to process data by writing an R code. If this is the first time you ever write a computer code, don't panic! It won't be difficult if you read through this notes. In this exercise, you only need to write a few lines of R code.

This exercise was inspired by a real-life problem. Fall 2015 semester was the first time all of the Stat 200 Homework assignments were put on Lon Capa. At the end of the semester, the overall HW grade for each student had to be calculated according to the grading scheme described in the course syllabus. Unfortunately, the grading system on Lon Capa is quite complicated. The simplest solution was to download the HW grades from Lon Capa, do the calculation using software like Excel and then upload the result to Compass 2G. It sounds easy, right?

Here is a csv file created from a data file downloaded from Lon Capa. The file contains 22 HW scores from Fall 2015 semester's Stat 200 (there was no 'Mon HW 4' and 'Mon HW 9' was not graded for credits). The usernames have been altered to protect privacy. In addition, scores from each HW set have been randomly shuffled so each observation does not correspond to the actual HW scores of any Stat 200 student. The file contains 45 columns. The first column is student's (fake) username. The second column contains the (fake) scores of the first HW; the third column is the maximum score of the first HW; the fourth column is the (fake) scores of the second HW; the fifth column is the maximum score of the second HW, … and so on.

The first task is to convert each HW score to percentage by dividing the raw score by the maximum score and multiplying 100. The second task is to drop the lowest two HW percents and then calculate the average HW percent. That is, for each student add up the highest 20 HW percents and then divide by 20. The third task is to calculate the Bonus HW points: each dropped HW is worth 10 possible points. If students got 100% on their lowest two HW's, they got 20 pts. If they got 60% on one and 70% on the other, they'd get 13 pts. In other words, it's the sum of the their lowest two HW percents, divided by 10.

How did we accomplish these tasks? The first idea was to use Excel, but we found it incredibly tedious to do these calculations with Excel. Since we would be doing an R independence study the following semester, we tried using R. It turns out that we could do the calculations with about 20 lines of R code.

The Exercise

The programming exercise is to write an R code to

  1. read the HW score data from this csv file,
  2. calculate the average HW percentage for each student with lowest two HW percents dropped,
  3. calculate the Bonus HW points,
  4. round all scores and percents to 2 decimal places (make sure you do the rounding after all calculations have been done), and
  5. output the HW's percentage scores, average HW percentages after dropping the lowest two percents, and the Bonus HW points to a new csv file.

The csv file you generate should have the first row as header. It should contain 25 columns: column 1 is student's username, columns 2-23 are the percentages of the 22 HW, column 24 is the average HW percent, and column 25 is the Bonus HW points (this csv file could be uploaded to compass 2G). You don't need to submit your code nor the csv file. In the next Lon Capa problem, you will be asked to use your code to do calculations.

For some people, this might seem daunting at first. In this exercise, I will guide you through the process. I will demonstrate below how to calculate the HW average without dropping any HW scores. You can finish the rest of the calculations based on the code and template I provide below.

Calculate HW Average Without Dropping Scores

Before starting to write a code, I first have to decide if I want it to solve the problem I have at hand or if I want to write a more general code that can handle problems I will encounter again in the future. For this particular case, I want the code to be more general. I want it to be able to apply to the HW data in the next semester's Stat 200. As a result, I don't want to hardwire certain parameters that are specific to the data I have. For example, in the next semester, the total number of HW may not be 22. So I don't want to hardwire the number of HW to be 22 in my code, but instead set it according to the number of columns in the csv file.

It is convenient to set parameters you might change later in the first few lines of the code since you will know exactly where to find the parameters. In this particular case, there are only two parameters: the input file name and the output file name. When you first write the code, you probably don't know exactly what the free parameters will be, that's OK as you can always add them later as you go along. So the first few lines of my code are

# Set input and output filenames
inputfile <- 'Stat200-fake_hw_scores.csv'
outputfile <- 'Stat200-fake_hw_average.csv'

Now download the csv file to your R's working directory. Load the data to R using the read.csv() command:

# Load data from inputfile
hw <- read.csv(inputfile,check.name=FALSE)

I use the check.name=FALSE option to preserve the column names in the csv files. Next let's check for missing values:

sum(is.na(hw))
[1] 263

This shows that there are 263 missing values in the data frame. Using View(hw) to take a look at the data frame and comparing it with the csv file, I find that the NAs correspond to the blank fields. What happens is that for columns containing numerical values, R turns blank fields to NAs. A blank field arises when a student did not submit a HW before the deadline. So all the blank fields should be set to 0. This can be done using the command

# Set blank fields to 0 instead of NAs
hw[is.na(hw)] <- 0

If you forget how this command works, review last week's Lon Capa problem on Logic and Subsetting, part (c).

Let's check that all NAs have been removed:

sum(is.na(hw))
[1] 0

Next let's determine the number of HW based on the number of columns in the csv file. The first column is username; columns 2 and 3 are the first HW's data; columns 4 and 5 are second HW's data and so on. If there are n HW, the number of columns is 2n+1. It follows that number of HW = (number of columns -1)/2. Let's store it to a variable called n_hw. This is done by

# Determine the number of HW from the number of columns
n_hw <- (ncol(hw)-1)/2

Let's see if I get it right:

n_hw
[1] 22

The next step is to convert the HW scores to percentages. I want to create new columns to the data frame to store these numbers, and I want to set the names for these new columns based on the names of the HW. Let's first extract the name of each HW:

# Extract hw names from columns 2, 4, 6, ..., 2*n_hw
hw_names <- names(hw)[seq(2,2*n_hw,2)]

hw_names
 [1] "Mon HW 1"                    "Wed HW 1"                   
 [3] "Mon HW 2"                    "Wed HW 2"                   
 [5] "Mon HW 3"                    "Wed HW 3"                   
 [7] "Wed HW 4"                    "Mon HW 5"                   
 [9] "Wed HW 5"                    "Mon HW 6"                   
[11] "Wed HW 6"                    "Mon HW 7"                   
[13] "Wed HW 7"                    "Mon HW 8"                   
[15] "Wed HW 8"                    "Wed HW 9"                   
[17] "Mon HW 10"                   "Wed HW 10"                  
[19] "Mon HW 11"                   "Wed HW 11 "                 
[21] "Mon HW 12 (Mon after break)" "Wed HW 12"                  

The new names could be the HW names with the string '[Total Pts: 100]' added to the end to indicate that the scores represent percentage. This can be done by the paste() function. If you forget the paste() function, revisit Lesson 4 of the R programming in swirl or type ?paste.

With the names of the new columns in mind, let's convert the HW scores to percentages: loop over each HW, divide the score by the maximum and multiply by 100. The ith HW score is at column 2i and the ith maximum HW score is at column 2i+1. Thus the commands are

# Calculate % of hw scores
for (i in 1:n_hw) {
  newcol_name <- paste(hw_names[i],'[Total Pts: 100]')
  hw[[newcol_name]] <- hw[,2*i]/hw[,2*i+1]*100
}

There should be 22 columns added to the data frame, so the total number of columns should be 45 + 22 = 67. Let's see if this is correct:

ncol(hw)
[1] 67

I am now almost done. In this demonstration, I simply calculate the average HW percent without dropping any score. So I take the mean of the HW percents I just computed above. These percent scores are appended to the last n_hw columns of the data frame, from column n1=2*n_hw+2 to column n2=3*n_hw+1. In the following code chuck, the average of these columns is computed in each row and the result is stored to a new column named 'HW Average':

# Calculate HW average
n1 <- 2*n_hw+2   # col index of the first HW % scores
n2 <- 3*n_hw+1   # col index of the last HW % scores
for (i in 1:nrow(hw)) {
  hw[["HW Average"]][i] <- sum(hw[i,n1:n2])/n_hw
}

In your programming exercise, you will have to modify the above code chunk to drop the lowest 2 HW percents before taking the average, and also compute the Bonus HW points. The easiest way is to first use the sort() function to sort the HW percents for each student. The lowest two HW scores can then be extracted from the sorted scores and be used to perform the required calculations. It should only take extra few lines to compute the HW average and Bonus points.

Before outputting the scores to a file, I want to round them to 2 decimal places. Note that the rounding is done in the very last step, after all calculations have been done. I only need to output the last n_hw+1 columns, from column n1 to column n2+1. So I round the numbers in these columns:

# Round scores to 2 decimal places
hw[,n1:(n2+1)] <- round(hw[,n1:(n2+1)],2)

The final step is to export the first column (username) and the last n_hw+1 columns to a csv file:

# Output to csv file
write.csv(hw[,c(1,n1:(n2+1))], outputfile, row.names=FALSE)

In your code, you will have to calculate the Bonus HW points in addition to the HW average, so you will have to output columns c(1, n1:(n2+2)) to the csv file.

The Complete Code

Let's gather all the commands needed for the calculations in the following code chunk:

# Set input and output filenames
inputfile <- 'Stat200-fake_hw_scores.csv'
outputfile <- 'Stat200-fake_hw_average.csv'

# Load data to a data frame called hw
hw <- read.csv(inputfile,check.name=FALSE)

# Set blank fields to 0 instead of NAs
hw[is.na(hw)] <- 0

# Determine the number of HW from the number of columns
n_hw <- (ncol(hw)-1)/2

# Extract hw names from columns 2, 4, 6, ..., 2*n_hw
hw_names <- names(hw)[seq(2,2*n_hw,2)]

# Calculate % of hw scores
for (i in 1:n_hw) {
  newcol_name <- paste(hw_names[i],'[Total Pts: 100]')
  hw[[newcol_name]] <- hw[,2*i]/hw[,2*i+1]*100
}

# Calculate HW average
n1 <- 2*n_hw+2 # col index of the first HW % scores
n2 <- 3*n_hw+1 # col index of the last HW % scores
for (i in 1:nrow(hw)) {
  hw[["HW Average"]][i] <- sum(hw[i,n1:n2])/n_hw
}

# Round scores to 2 decimal places
hw[,n1:(n2+1)] <- round(hw[,n1:(n2+1)],2)

# Output to csv file
write.csv(hw[,c(1,n1:(n2+1))], outputfile, row.names=FALSE)

There are 17 lines of R code in the above code chunk, excluding comments and blank lines. The blank lines are added to help visualize the structure of the code. Comments are extremely useful for everyone, the programmer as well as for other people who would like to use the code. I might want to use the code and modify it to do some other calculations in the future. By then I will have forgotten what I have done here. If there are no comments, I will have to figure out what the code does line by line. For this particular code, it won't be difficult since there are only 17 lines. But you can imagine how much time you have to spend for a code with thousands of lines. It is therefore a good habit to include detail comments in your codes, especially for codes you are going to use and modify later.

Since I am going to use the code again, I save it to a text file named 'calculate_hw_average_demo.R'. The next time I want to run the code again, I simply put the file to my R's working directory and type

source('calculate_hw_average_demo.R')

R will execute every line in the file. In fact, when I am developing an R code, especially a longer code, I always open an R script editor in RStudio to add lines to the code, so that I don't have to trace back to gather all the useful commands I type in the R console. (Watch this this youtube video on how to use an R script editor)

Simplicity of Vectorized Operations

One of the reasons that R can perform complicated operations with simple commands is due to its support of vectorized operations. Vectorized operations are also supported by some other computer languages, e.g. Matlab, and can greatly simply the coding. For example, without vectorized operations, the command hw[is.na(hw)] <- 0 would have to be changed to two nested for-loops over the entire data frame:

for (i in 1:nrow(hw)) {
  for (j in 1:ncol(hw)) {
    if (is.na(hw[i,j])) {
      hw[i,j] <- 0
    }
  }
}

The command hw[,n1:(n2+1)] <- round(hw[,n1:(n2+1)],2) would have to be replaced by

for (i in 1:nrow(hw)) {
  for (j in n1:(n2+1)) {
    hw[i,j] <- round(hw[i,j],2)
  }
}

The HW average calculation would become

for (i in 1:nrow(hw)) {
  sum_scores <- 0
  for (j in n1:n2) {
    sum_scores <- sum_scores + hw[i,j]
  }
  hw[["HW Average"]][i] <- sum_scores/n_hw
}

Actually, I have not taken full advantage of vectorized operations in my calculation of HW average. For example, instead of looping over rows, I could have simply used the one-line command

hw[["HW Average"]] <- rowMeans(hw[,n1:n2])

to calculate the average (type ?rowMeans to look at its usage). The reason why I didn't do that is to set it up so that it can be easily modified to incorporate dropping the lowest 2 HW scores. It turns out that if we want to drop the lowest 2 HW scores before taking the average and to compute the Bonus HW points, it is easier to loop over the rows.

The bottom line is that you should try to vectorize your calculations as much as possible. It not only makes the code cleaner and easier to read, but also reduces the chance of making mistakes. It will also make your code run faster since some of the vectorized operations are highly optimized for heavy computations. For this particular exercise, speed is not an issue since the data size is small. But it could make a big difference if you are dealing with big data.

Programming Exercise Template and Code Validation

Here is a template for this programming exercise:

# Set input and output filenames
inputfile <- 'Stat200-fake_hw_scores.csv'
outputfile <- 'Stat200-fake_hw_grades.csv'

# Load data to a data frame called hw
hw <- read.csv(inputfile,check.name=FALSE)

# Set blank fields to 0 instead of NAs
hw[is.na(hw)] <- 0

# Determine the number of HW from the number of columns
n_hw <- (ncol(hw)-1)/2

# Extract hw names from columns 2, 4, 6, ..., 2*n_hw
hw_names <- names(hw)[seq(2,2*n_hw,2)]

# Calculate % of hw scores
for (i in 1:n_hw) {
  newcol_name <- paste(hw_names[i],'[Total Pts: 100]')
  hw[[newcol_name]] <- hw[,2*i]/hw[,2*i+1]*100
}

# Calculate HW average after dropping lowest 2 percent scores,
# and calculate the Bonus HW points by summing the dropped HW %
# then divide by 10
n1 <- 2*n_hw+2 # col index of the first HW % scores
n2 <- 3*n_hw+1 # col index of the last HW % scores
for (i in 1:nrow(hw)) {
  ##### Fill in calculations below. #####
  ##### It should be a few lines only. #####

  #########################
}

# Round scores to 2 decimal places
n3 <- n2+2   # 2 cols added: HW avg & Bonus HW points
hw[,n1:n3] <- round(hw[,n1:n3],2)

# Output to csv file
write.csv(hw[,c(1,n1:n3)], outputfile, row.names=FALSE)

Go through the code and fill in the require calculations. You should only need to write a few lines. Although you are not required to use the template for the exercise, we strongly suggest you use it. It is a good exercise. Pretend that you are working in a company, and you are asked to modify an existing code to perform a new analysis.

Tutorial Video: Confused? We made a tutorial video explaining the logic. If you still have questions after watching the video, stop by 23 Illini Hall during our office hours.

After you finished the coding, save it and put it in your R's working directory. Then type source('filename') to execute the code. The next step is to validate it. In my opinion, code validation is as important as, if not more important than, writing the code itself. I never trust any code I write without testing it to make sure there are no mistakes. Silly mistakes happen all the time, even in codes written by experienced programmers. Mistakes in a code can have serious consequences depending what the code is used for. We are entering the era of “Big Data”, where more and more researches rely on using complex codes to perform sophisticated data analysis. It becomes increasingly important to emphasize the importance of code validation.

There are several methods to test a code. In Week 2, you learned that there are at least 7 methods to check your calculations. You have learned the sanity check, consistency check and comparison with independently written codes. They are especially useful to check obvious mistakes. For this particular code, what I did to validate my code was to randomly choose a student and did the calculations using a calculator and compared my results to those generated by the code. This is the strategy "comparison with different methods". Since this is the first programming exercise and for some people this may be your first programming experience, I am going to make it very easy for you. I am telling you that the HW average (after dropping the lowest 2 scores) for the first 5 students are 95.99, 94.57, 93.51, 86.03, 97.20. The Bonus HW points for the first 5 students are 0, 8.99, 0, 0, 8.33. The maximum HW average of these 185 students is 99.35 and the minimum is 77.82. The maximum HW Bonus point is 18.56.

The easiest way to check these numbers is to run your code using the source() command. Ignore the csv file generated by your code since it is easier to look at the data frame hw directly. If you follow the template, hw should have 69 columns after the code is executed. The 68th column is the HW average and the 69th column is the Bonus HW points. To see the HW average of the first five students, type hw[1:5,68]. To see the Bonus HW points for the first 5 students, type hw[1:5,69]. To find the maximum of the HW average, use the command max(hw[,68]), … and so on. Make sure your code produces the same numbers given above. After your code passes this test, you can go to the next Lon Capa problem.

Legacy

We thought we'd use this code in the future for Stat 200 HW grade calculation. It turns out that this code has only been used once and it's likely that it won't be used ever again. In the subsequent semesters, we downloaded the Stat 200 HW grades from Lon Capa and then uploaded them to Campass 2G every week. We'd manually enter the total points of each HW on Compass and it would convert the scores to percentages. All the subsequent calculations can be done easily on Campass. This is better for students as they can look at their HW grades more easily. It takes much more time in total, but since we're doing it week by week it doesn't feel tedious each time.

The legacy of this code is this programming exercise.