2.4 Improving Coding Efficiencies

This training module was developed by Elise Hickman, Alexis Payton, Kyle Roell, and Julia E. Rager.

All input files (script, data, and figures) can be downloaded from the UNC-SRP TAME2 GitHub website.

Introduction to Training Module

In this module, we’ll explore how to improve coding efficiency. Coding efficiency involves performing a task in as few lines as possible and can…

  • Shorten code by eliminating redundancies
  • Reduce the number of typos
  • Help other coders understand the script more easily

Specific approaches that we will discuss in this module include loops, functions, and list operations, which can all be used to make code more succinct. A loop is employed when we want to perform a repetitive task, while a function contains a block of code organized together to perform one specific task. List operations, in which the same function is applied to a list of dataframes, can also be used to code more efficiently.

Training Module’s Environmental Health Questions

This training module was specifically developed to answer the following environmental health questions:

  1. Are there statistically significant differences in drinking water arsenic, cadmium, and chromium between normal weight (BMI < 25) and overweight (BMI \(\geq\) 25) subjects?

  2. Are there statistically significant differences in drinking water arsenic, cadmium, and chromium between underweight (BMI < 18.5) and non-underweight (BMI \(\geq\) 18.5) subjects?

  3. Are there statistically significant differences in drinking water arsenic, cadmium, and chromium between non-obese (BMI < 29.9) and obese (BMI \(\geq\) 29.9) subjects?

We will demonstrate how this analysis can be approached using for loops, functions, or list operations. We will introduce the syntax and structure of each approach first, followed by application of the approach to our data. First, let’s prepare the workspace and familiarize ourselves with the dataset we are going to use.

Data Import and Workspace Preparation

Installing required packages

If you already have these packages installed, you can skip this step, or you can run the code below, which checks installation status for you. We will be using the tidyverse package for data manipulation steps and the rstatix package for statistical tests, as it provides pipe-friendly adaptations of the base R statistical tests and returns results in a dataframe rather than a list format, making results easier to access. This brings up an important aspect of coding efficiency: sometimes there is already a package with functions designed to help you execute your desired analysis efficiently, so you don’t need to write custom functions yourself! So, don’t forget to explore packages relevant to your analysis before spending a lot of time developing custom solutions (although sometimes this is necessary).

if (!requireNamespace("tidyverse"))
  install.packages("tidyverse")
if (!requireNamespace("rstatix"))
  install.packages("rstatix")
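As a small preview of the loop-style thinking used later in this module, the same installation check can be written once for a whole vector of package names. This is only a sketch: the names required_packages, is_installed, and missing_packages are our own, and the install.packages() call is left commented out so that running the check has no side effects.

```r
# Packages used in this module
required_packages <- c("tidyverse", "rstatix")

# Check each package once; vapply() returns a named logical vector
is_installed <- vapply(required_packages, requireNamespace, logical(1), quietly = TRUE)

# Keep only the names of packages that are not yet installed
missing_packages <- required_packages[!is_installed]

if (length(missing_packages) > 0) {
  message("Not yet installed: ", paste(missing_packages, collapse = ", "))
  # install.packages(missing_packages)  # uncomment to install
}
```

Adding another package then only requires adding its name to required_packages, rather than copying the if statement again.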

Loading required packages

library(tidyverse)
library(rstatix)

Setting your working directory

setwd("/file path to where your input files are")

Importing example dataset

The first example dataset contains subject demographic data, and the second dataset contains corresponding chemical data. Familiarize yourself with these data used previously in TAME 2.0 Module 2.3 Data Manipulation and Reshaping.

# Load the demographic data
demographic_data <- read.csv("Module2_4_Input/Module2_4_InputData1.csv")

# View the top of the demographic dataset
head(demographic_data) 
##   ID  BMI     MAge MEdu       BW GA
## 1  1 27.7 22.99928    3 3180.058 34
## 2  2 26.8 30.05142    3 3210.823 43
## 3  3 33.2 28.04660    3 3311.551 40
## 4  4 30.1 34.81796    3 3266.844 32
## 5  5 37.4 42.68440    3 3664.088 35
## 6  6 33.3 24.94960    3 3328.988 40

# Load the chemical data
chemical_data <- read.csv("Module2_4_Input/Module2_4_InputData2.csv")

# View the top of the chemical dataset
head(chemical_data)
##   ID     DWAs     DWCd     DWCr       UAs       UCd      UCr
## 1  1 6.426464 1.292941 51.67987 10.192695 0.7537104 42.60187
## 2  2 7.832384 1.798535 50.10409 11.815088 0.9789506 41.30757
## 3  3 7.516569 1.288461 48.74001 10.079057 0.1903262 36.47716
## 4  4 5.906656 2.075259 50.92745  8.719123 0.9364825 42.47987
## 5  5 7.181873 2.762643 55.16882  9.436559 1.4977829 47.78528
## 6  6 9.723429 3.054057 51.14812 11.589403 1.6645837 38.26386

Preparing the example dataset

For ease of analysis, we will merge these two datasets before proceeding.

# Merging data
full_data <- inner_join(demographic_data, chemical_data, by = "ID")

# Previewing new data
head(full_data)
##   ID  BMI     MAge MEdu       BW GA     DWAs     DWCd     DWCr       UAs
## 1  1 27.7 22.99928    3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2  2 26.8 30.05142    3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3  3 33.2 28.04660    3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4  4 30.1 34.81796    3 3266.844 32 5.906656 2.075259 50.92745  8.719123
## 5  5 37.4 42.68440    3 3664.088 35 7.181873 2.762643 55.16882  9.436559
## 6  6 33.3 24.94960    3 3328.988 40 9.723429 3.054057 51.14812 11.589403
##         UCd      UCr
## 1 0.7537104 42.60187
## 2 0.9789506 41.30757
## 3 0.1903262 36.47716
## 4 0.9364825 42.47987
## 5 1.4977829 47.78528
## 6 1.6645837 38.26386
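To see what an inner join keeps and drops, here is a minimal sketch on made-up data. It assumes base R's merge() as a stand-in, which with its default settings behaves like an inner join; the toy dataframes and their values are hypothetical and are not taken from the module's input files.

```r
# Toy data, made up for illustration only
toy_demographics <- data.frame(ID = c(1, 2, 3), BMI = c(20, 25, 30))
toy_chemicals <- data.frame(ID = c(2, 3, 4), DWAs = c(5, 6, 7))

# With default settings, merge() keeps only IDs present in both dataframes
toy_joined <- merge(toy_demographics, toy_chemicals, by = "ID")

toy_joined
```

Only IDs 2 and 3 appear in both toy dataframes, so only those two rows are retained. If a merged dataset has fewer rows than expected, unmatched IDs are a good first thing to check.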

Continuous demographic variables, like BMI, are often dichotomized (or converted to a categorical variable with two categories representing higher vs. lower values) so that groups defined by established cutoffs can be compared directly. Keep in mind that dichotomizing discards information and generally reduces, rather than increases, statistical power, which is a particular concern for clinical data that tend to have smaller sample sizes. In our initial dataframe, BMI is a continuous or numeric variable; however, our questions require us to dichotomize BMI. We can use the following code, which relies on if/else logic (see TAME 2.0 Module 2.3 Data Manipulation and Reshaping for more information), to generate a new column representing our dichotomized BMI variable for our first environmental health question.

# Adding dichotomized BMI column
full_data <- full_data %>%
  mutate(Dichotomized_BMI = ifelse(BMI < 25, "Normal", "Overweight"))

# Previewing new data
head(full_data)
##   ID  BMI     MAge MEdu       BW GA     DWAs     DWCd     DWCr       UAs
## 1  1 27.7 22.99928    3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2  2 26.8 30.05142    3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3  3 33.2 28.04660    3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4  4 30.1 34.81796    3 3266.844 32 5.906656 2.075259 50.92745  8.719123
## 5  5 37.4 42.68440    3 3664.088 35 7.181873 2.762643 55.16882  9.436559
## 6  6 33.3 24.94960    3 3328.988 40 9.723429 3.054057 51.14812 11.589403
##         UCd      UCr Dichotomized_BMI
## 1 0.7537104 42.60187       Overweight
## 2 0.9789506 41.30757       Overweight
## 3 0.1903262 36.47716       Overweight
## 4 0.9364825 42.47987       Overweight
## 5 1.4977829 47.78528       Overweight
## 6 1.6645837 38.26386       Overweight

We can see that we have now created a new column entitled Dichotomized_BMI that we can use to perform a statistical test to assess whether there are differences in drinking water metals between normal weight and overweight subjects.
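It can be worth checking how the cutoff itself is handled. The sketch below applies the same ifelse() logic to a few made-up BMI values (toy_bmi is a hypothetical vector, not part of the dataset) to confirm that a BMI of exactly 25 is assigned to the "Overweight" group, matching the BMI \(\geq\) 25 definition in our question.

```r
# Made-up BMI values, including one exactly at the cutoff
toy_bmi <- c(18.4, 24.9, 25.0, 31.2)

# Same logic as above: strictly below 25 is "Normal", everything else is "Overweight"
toy_groups <- ifelse(toy_bmi < 25, "Normal", "Overweight")

# Viewing values next to their assigned group
data.frame(BMI = toy_bmi, Dichotomized_BMI = toy_groups)
```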


Loops

We will start with loops. There are three main types of loops in R: for, while, and repeat. We will focus on for loops in this module, but for more in-depth information on loops, including the additional types of loops, see here. Before applying loops to our data, let’s discuss how for loops work.

The basic structure of a for loop is shown here:

# Basic structure of a for loop
for (i in 1:4){
    print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4

for loops always start with for followed by a statement in parentheses. The argument in the parentheses tells R how to iterate (or repeat) through the code in the curly brackets. Here, we are telling R to iterate through the code in curly brackets 4 times. Each time, R prints the value of our iterator, i, which takes the values 1, 2, 3, and then 4. Loops can also iterate through the values in a column of a dataset. For example, we can use a for loop to print the age of each subject:

# Creating a smaller dataframe for our loop example
full_data_subset <- full_data[1:6, ]

# Finding the total number of rows or subjects in the dataset
number_of_rows <- length(full_data_subset$MAge)

# Creating a for loop to iterate from 1 to the last row
for (i in 1:number_of_rows){
    # Printing each subject age
    # Need to put `[i]` to index the correct value corresponding to the row we are evaluating
    print(full_data_subset$MAge[i])
}
## [1] 22.99928
## [1] 30.05142
## [1] 28.0466
## [1] 34.81796
## [1] 42.6844
## [1] 24.9496
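One detail to be aware of when writing 1:number_of_rows is what happens if there is nothing to iterate over. The sketch below uses a hypothetical empty vector to compare 1:length(x) with seq_along(x), a base R function that is often suggested as a safer way to build the iterator.

```r
# A vector with no elements (hypothetical, for illustration only)
empty_ages <- numeric(0)

# 1:length(x) becomes 1:0 when x is empty, so the loop body still runs (for 1 and for 0)
iterations_colon <- 0
for (i in 1:length(empty_ages)) {
  iterations_colon <- iterations_colon + 1
}

# seq_along(x) is empty when x is empty, so the loop body never runs
iterations_seq_along <- 0
for (i in seq_along(empty_ages)) {
  iterations_seq_along <- iterations_seq_along + 1
}

# Comparing the number of times each loop body ran
c(colon = iterations_colon, seq_along = iterations_seq_along)
```

For the six-row subset used above, both forms behave identically; the difference only matters when the input could be empty, for example after filtering.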

Now that we know how a for loop works, how can we apply this approach to determine whether there are statistically significant differences in drinking water arsenic, cadmium, and chromium between normal weight (BMI < 25) and overweight (BMI \(\geq\) 25) subjects?

Because our data are normally distributed and there are two groups that we are comparing, we will use a t-test applied to each metal measured in drinking water. Testing for assumptions is outside the scope of this module, but see TAME 2.0 Module 3.3 Normality Tests and Data Transformation for more information on this topic.

Running a t-test in R is very simple, which we can demonstrate by running a t-test on the drinking water arsenic data:

# Running t-test and storing results in t_test_res
t_test_res <- full_data %>% 
  t_test(DWAs ~ Dichotomized_BMI)

# Viewing results
t_test_res
## # A tibble: 1 × 8
##   .y.   group1 group2        n1    n2 statistic    df     p
## * <chr> <chr>  <chr>      <int> <int>     <dbl> <dbl> <dbl>
## 1 DWAs  Normal Overweight    96   104    -0.728  192. 0.468

We can see that our p-value is 0.468. Because this is greater than 0.05, we cannot reject the null hypothesis that normal weight and overweight subjects are exposed to the same drinking water arsenic concentrations. Although this was a very simple line of code to run, what if we have many columns we want to run the same t-test on? We can use a for loop to iterate through these columns.

Let’s break down the steps of our for loop before executing the code.

  1. First, we will define the variables (columns) we want to run our t-test on. This is different from our approach above, because in those code chunks, we were using numbers to indicate the number of iterations through the loop. Here, we are naming the specific variables instead, and R will iterate through each of these variables. Note that we could omit this step and instead use the numeric column index of our variables of interest [7:9]. However, naming the specific columns makes this approach more robust because if additional data are added to or removed from our dataframe, the numeric column index of our variables could change. Which approach you choose really depends on the purpose of your loop!

  2. Second, we will create an empty dataframe where we will store the results generated by our for loop.

  3. Third, we will actually run our for loop. This will tell R: for each variable in our vars_of_interest vector, run a t-test with that variable (and store the results in a temporary dataframe called “res”), then add those results to our final results dataframe. A row will be added to the results dataframe each time R iterates through a new variable, resulting in a dataframe that stores the results of all of our t-tests.

# Defining variables (columns) we want to run a t-test on
vars_of_interest <- c("DWAs", "DWCd", "DWCr")

# Creating an empty dataframe to store results
t_test_res_DW <- data.frame()

# Running for loop
for (i in vars_of_interest) {
  
  # Storing the results of each iteration of the loop in a temporary results dataframe
  res <- full_data %>%
    
    # Writing the formula needed for each iteration of the loop
    t_test(as.formula(paste(i, "~ Dichotomized_BMI", sep = "")))
  
  # Adding a row to the results dataframe each time the loop is iterated
  t_test_res_DW <- bind_rows(t_test_res_DW, res)
}

# Viewing our results
t_test_res_DW
##    .y. group1     group2 n1  n2  statistic       df     p
## 1 DWAs Normal Overweight 96 104 -0.7279621 192.3363 0.468
## 2 DWCd Normal Overweight 96 104 -0.5894360 196.1147 0.556
## 3 DWCr Normal Overweight 96 104  0.1102933 197.9870 0.912

With this, we can answer Environmental Health Question #1:

Are there statistically significant differences in drinking water arsenic, cadmium, and chromium between normal weight (BMI < 25) and overweight (BMI \(\geq\) 25) subjects?

Answer: No, there are not any statistically significant differences in drinking water metals between normal weight and overweight subjects.
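The loop above grows the results dataframe one row at a time, which works well for three variables. A common alternative, sketched here, is to store the result of each iteration in a list and bind the pieces together once at the end. To keep the sketch self-contained, it uses simulated data and base R's t.test() rather than the module's dataset and rstatix; toy_data, toy_vars, res_list, and toy_results are hypothetical names.

```r
# Simulated data for illustration only (not the module's dataset)
set.seed(1)
toy_data <- data.frame(
  group = rep(c("A", "B"), each = 10),
  metal_1 = rnorm(20),
  metal_2 = rnorm(20),
  stringsAsFactors = FALSE
)

# Variables to test
toy_vars <- c("metal_1", "metal_2")

# Creating a list with one empty slot per variable
res_list <- vector("list", length(toy_vars))
names(res_list) <- toy_vars

for (v in toy_vars) {
  # Base R t-test using a formula built from the variable name
  fit <- t.test(as.formula(paste(v, "~ group")), data = toy_data)

  # Storing one small dataframe per variable
  res_list[[v]] <- data.frame(variable = v,
                              statistic = unname(fit$statistic),
                              p = fit$p.value,
                              stringsAsFactors = FALSE)
}

# Binding all pieces together once, after the loop
toy_results <- do.call(rbind, res_list)
toy_results
```

Both patterns give one row per variable; the list-based version mainly pays off when a loop runs over many variables.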


Formulas and Pasting

Note the use of the code as.formula(paste(i, "~ Dichotomized_BMI", sep = "")). Let’s take a quick detour to discuss the use of the as.formula() and paste() functions, as these are important functions often used in loops and user-defined functions.

Many statistical test functions and regression functions require one argument to be a formula, which is typically formatted as y ~ x, where y is the dependent variable of interest and x is an independent variable. For some functions, additional variables can be included on the right side of the formula to represent covariates (additional variables of interest). The function as.formula() returns the argument in parentheses in formula format so that it can be correctly passed to other functions. We can demonstrate that here by assigning a dummy variable j the character string var1:

# Assigning variable
j <- "var1"

# Demonstrating output of as.formula()
as.formula(paste(j, " ~ Dichotomized_BMI", sep = ""))
## var1 ~ Dichotomized_BMI

We can use the paste() function to combine strings of characters. The paste() function takes each argument (as many arguments as are needed) and pastes them together into one character string, with the separator between arguments set by the sep = argument. When our y variable is changing with each iteration of our for loop, we can use the paste() function to write our formula correctly by telling the function to paste the variable i, followed by the rest of our formula, which stays the same for each iteration of the loop. Let’s examine the output of just the paste() part of our code:

paste(j, " ~ Dichotomized_BMI", sep = "")
## [1] "var1 ~ Dichotomized_BMI"

The paste() function is very flexible and can be useful in many other settings when you need to create one character string from arguments from different sources! Notice that the output looks different from the output of as.formula(). There is a returned index ([1]), and there are quotes around the character string. The last function we will highlight here is the noquote() function, which can be helpful if you’d like a string without quotes:

noquote(paste(j, " ~ Dichotomized_BMI", sep = ""))
## [1] var1 ~ Dichotomized_BMI

However, the result is still a character string rather than a formula (note the index in the output), so there are times when it will not allow code to execute properly (for example, when we need a formula format).
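The difference between a character string and a formula can be checked directly with class(). The sketch below also shows reformulate(), a base R function that builds a formula from variable names without pasting; using it here is our suggestion rather than part of the module's code, and the object names are hypothetical.

```r
# Assigning variable
j <- "var1"

# Building the formula by pasting, as in the module
formula_from_paste <- as.formula(paste(j, " ~ Dichotomized_BMI", sep = ""))

# Building the same formula from variable names with reformulate()
formula_from_reformulate <- reformulate("Dichotomized_BMI", response = j)

# paste() alone gives a character string; as.formula() gives a formula
class(paste(j, " ~ Dichotomized_BMI", sep = ""))
class(formula_from_paste)

# Both approaches describe the same formula
deparse(formula_from_paste) == deparse(formula_from_reformulate)
```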

Next, we will learn about functions and apply them to our dataset to answer our additional environmental health questions.


Functions

Functions are useful when you want to execute a block of code organized together to perform one specific task, and you want to be able to change parameters for that task easily rather than having to copy and paste code over and over that largely stays the same but might have small modifications in certain arguments. The basic structure of a function is as follows:

function_name <- function(parameter_1, parameter_2...){
  
    # Function body (where the code goes)
    insert_code_here  
  
    # What the function returns
    return()
}

A function requires you to name it, as we did with function_name. In parentheses, the function requires you to specify the arguments or parameters. Parameters (e.g., parameter_1) act as placeholders in the body of the function. This allows us to change the values of the parameters each time a function is called, while the majority of the code remains the same. Lastly, we have a return() statement, which specifies what object (e.g., a vector or dataframe) we want to retrieve from a function. Although a function returns the value of its last evaluated expression in the absence of a return() statement, it’s a good habit to include return() as the last expression. It is important to note that, although functions can take many input parameters and execute large code chunks, they can only return one item, whether that is a value, vector, dataframe, plot, code output, or list.

When writing your own functions, it is important to describe the purpose of the function, its input, its parameters, and its output so that others can understand what your function does and how to use it. This can be defined either in text above a code chunk if you are using R Markdown or as comments within the code itself. We’ll start with a simple function. Let’s say we want to convert temperatures from Fahrenheit to Celsius. We can write a function that takes the temperature in Fahrenheit and converts it to Celsius. Note that we have given our parameter and the variable created inside the function descriptive names (fahrenheit_temperature, celsius_temperature), which makes our code more readable than if we assigned them dummy names such as x and y.

# Function to convert temperatures in Fahrenheit to Celsius
## Parameters: temperature in Fahrenheit (input)
## Output: temperature in Celsius

fahrenheit_to_celsius <- function(fahrenheit_temperature){

    celsius_temperature <- (fahrenheit_temperature - 32) * (5/9)
    
    return(celsius_temperature)
}

Notice that the above code block was run, but there isn’t an output. Rather, running the code assigns the function code to that function. When you run code defining a function, that function will appear in your Global Environment under the “Functions” section. We can see the output of the function by providing an input value. Let’s start by converting 41 degrees Fahrenheit to Celsius:

# Calling the function
# Here, 41 is the `fahrenheit_temperature` in the function
fahrenheit_to_celsius(41)
## [1] 5

41 degrees Fahrenheit is equivalent to 5 degrees Celsius. We can also have the function convert a vector of values.

# Defining vector of temperatures
vector_of_temperatures <- c(81,74,23,65)

# Calling the function
fahrenheit_to_celsius(vector_of_temperatures)
## [1] 27.22222 23.33333 -5.00000 18.33333

Before getting back to answer our environmental health related questions, let’s look at one more example of a function. This time we’ll create a function that can calculate the circumference of a circle based on its radius in inches. Here you can also see a different style of commenting to describe the function’s purpose, inputs, and outputs.

circle_circumference <- function(radius){
    # Calculating a circle's circumference based on the radius in inches

    # :parameters: radius
    # :output: circumference
    
    # Calculating diameter first
    diameter <- 2 * radius
    
    # Calculating circumference
    circumference <- pi * diameter
    
    return(circumference)
}

# Calling function
circle_circumference(3)
## [1] 18.84956

So, if a circle had a radius of 3 inches, its circumference would be ~19 inches. What if we were interested in seeing the diameter to double check our code?

diameter
## Error: object 'diameter' not found

R throws an error, because the variable diameter was created inside the function and the function only returned the circumference variable. This is actually one of the ways that functions can improve coding efficiency - by not needing to store intermediate variables that aren’t of interest to the main goal of the code or analysis. However, there are two ways we can still see the diameter variable:

  1. Put print statements in the body of the function (print(diameter)).
  2. Have the function return a different variable, or a vector or list of variables (e.g., c(circumference, diameter)). See the section on list operations below for more on this topic.
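The second option can be sketched as follows. This variant, which we call circle_summary() here (a hypothetical name, not part of the module's code), returns both values in a single named numeric vector.

```r
circle_summary <- function(radius){
    # Calculating a circle's diameter and circumference based on the radius

    # :parameters: radius
    # :output: named numeric vector with circumference and diameter

    # Calculating diameter first
    diameter <- 2 * radius

    # Calculating circumference
    circumference <- pi * diameter

    # Returning both values together in one named vector
    return(c(circumference = circumference, diameter = diameter))
}

# Calling function and extracting one value by name
circle_3 <- circle_summary(3)
circle_3["diameter"]
```

A vector works here because both values are single numbers; when the outputs have different types or lengths, a list is usually the better container.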

We can now move on to using a more complicated function to answer all three of our environmental health questions without repeating our earlier code three times. The main difference between our three environmental health questions is the BMI cutoff used to dichotomize the BMI variable, so we can use that as one of the parameters for our function. We can also use arguments in our function to name our groups.

We can adapt our previous for loop code into a function that will take different BMI cutoffs and return statistical results by including parameters to define the parts of the analysis that will change with each unique question. For example:

  • Changing the BMI cutoff from a number (in our previous code) to our parameter name that specifies the cutoff
  • Changing the group names for assigning category (in our previous code) to our parameter names

# Function to dichotomize BMI into different categories and return results of t-test on drinking water metals between dichotomized groups

## Parameters: 
### input_data: dataframe containing BMI and drinking water metals levels
### bmi_cutoff: numeric value specifying the cut point for dichotomizing BMI
### lower_group_name: name for the group of subjects with BMIs lower than the cutoff
### upper_group_name: name for the group of subjects with BMIs at or above the cutoff
### variables: vector of variable names that statistical test should be run on 

## Output: dataframe with statistical results for each variable in the variables vector

bmi_DW_ttest <- function(input_data, bmi_cutoff, lower_group_name, upper_group_name, variables){
  
  # Creating dichotomized variable
  dichotomized_data <- input_data %>% 
    mutate(Dichotomized_BMI = ifelse(BMI < bmi_cutoff, lower_group_name, upper_group_name))
  
  # Creating an empty dataframe to store results
  t_test_res_DW <- data.frame()
  
  # Running for loop
  for (i in variables) {
    
    # Storing the results of each iteration of the loop in a temporary results dataframe
    res <- dichotomized_data %>%
      
      # Writing the formula needed for each iteration of the loop
      t_test(as.formula(paste(i, "~ Dichotomized_BMI", sep = "")))
    
    # Adding a row to the results dataframe each time the loop is iterated
    t_test_res_DW <- bind_rows(t_test_res_DW, res)
  }

  # Return results
  return(t_test_res_DW)
  
}

For the first example of using the function, we have included the name of each argument for clarity, but this isn’t necessary if you pass in the arguments in the order they were defined when writing the function.

# Defining variables (columns) we want to run a t-test on
vars_of_interest <- c("DWAs", "DWCd", "DWCr")

# Apply function for normal vs. overweight (bmi_cutoff = 25)
bmi_DW_ttest(input_data = full_data, bmi_cutoff = 25, lower_group_name = "Normal", 
             upper_group_name = "Overweight", variables = vars_of_interest)
##    .y. group1     group2 n1  n2  statistic       df     p
## 1 DWAs Normal Overweight 96 104 -0.7279621 192.3363 0.468
## 2 DWCd Normal Overweight 96 104 -0.5894360 196.1147 0.556
## 3 DWCr Normal Overweight 96 104  0.1102933 197.9870 0.912

Here, we can see the same results as above in the Loops section. We can next apply the function to answer our additional environmental health questions:

# Apply function for underweight vs. non-underweight (bmi_cutoff = 18.5)
bmi_DW_ttest(full_data, 18.5, "Underweight", "Non-Underweight", vars_of_interest)
##    .y.          group1      group2  n1 n2   statistic       df     p
## 1 DWAs Non-Underweight Underweight 166 34 -0.86947835 53.57143 0.388
## 2 DWCd Non-Underweight Underweight 166 34 -0.97359810 55.45450 0.334
## 3 DWCr Non-Underweight Underweight 166 34  0.04305105 56.08814 0.966

# Apply function for non-obese vs. obese (bmi_cutoff = 29.9)
bmi_DW_ttest(full_data, 29.9, "Non-Obese", "Obese", vars_of_interest)
##    .y.    group1 group2  n1 n2  statistic       df      p
## 1 DWAs Non-Obese  Obese 144 56 -1.9312097 86.80253 0.0567
## 2 DWCd Non-Obese  Obese 144 56  0.3431076 94.52209 0.7320
## 3 DWCr Non-Obese  Obese 144 56 -0.6878311 89.61818 0.4930

With this, we can answer Environmental Health Questions #2 & #3:

Are there statistically significant differences in drinking water arsenic, cadmium, and chromium between underweight (BMI < 18.5) and non-underweight (BMI \(\geq\) 18.5) subjects or between non-obese (BMI < 29.9) and obese (BMI \(\geq\) 29.9) subjects?

Answer: No, there are not any statistically significant differences in drinking water metals between underweight and non-underweight subjects or between non-obese and obese subjects.

Here, we were able to answer all three of our environmental health questions within relatively few lines of code by using a function to efficiently assess different variations on our analysis.
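As noted earlier, arguments can be passed by position or by name. The sketch below illustrates the difference with a small stand-in function, count_groups(), that only counts subjects per group; the function and the toy_bmi values are hypothetical and use base R only.

```r
# Made-up BMI values for illustration only
toy_bmi <- c(17.0, 22.5, 27.3, 31.8, 35.1)

# Function to dichotomize BMI and count the subjects in each group
count_groups <- function(bmi, bmi_cutoff, lower_group_name, upper_group_name){

  # Creating dichotomized variable
  groups <- ifelse(bmi < bmi_cutoff, lower_group_name, upper_group_name)

  # Counting subjects per group, keeping the lower group first
  return(table(factor(groups, levels = c(lower_group_name, upper_group_name))))
}

# Positional arguments must follow the order in the function definition
normal_vs_overweight_counts <- count_groups(toy_bmi, 25, "Normal", "Overweight")

# Named arguments can be supplied in any order
nonobese_vs_obese_counts <- count_groups(upper_group_name = "Obese", lower_group_name = "Non-Obese",
                                         bmi_cutoff = 29.9, bmi = toy_bmi)

normal_vs_overweight_counts
nonobese_vs_obese_counts
```

Naming arguments takes a little more typing, but it protects against silently swapping two arguments of the same type, such as the two group names.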

In the last section of this module, we will demonstrate how to use list operations to improve coding efficiency.


List operations

Lists are a data type in R that can store other data types (including lists, to make nested lists). This allows you to store multiple dataframes in one object and apply the same functions to each dataframe in the list. Lists can also be helpful for storing the results of a function if you would like to be able to access multiple outputs. For example, if we return to our example of a function that calculates the circumference of a circle, we can store both the diameter and circumference as list objects. The function will then return a list containing both of these values when called.

# Adding list element to our function
circle_circumference_4 <- function(radius){
    # Calculating a circle's circumference and diameter based on the radius in inches

    # :parameters: radius
    # :output: list that contains diameter [1] and circumference [2]
    
    # Calculating diameter first
    diameter <- 2 * radius
    
    # Calculating circumference
    circumference <- pi * diameter
    
    # Storing results in a named list
    results <- list("diameter" = diameter, "circumference" = circumference)
    
    # Return results
    return(results)
}

# Calling function
circle_circumference_4(10)
## $diameter
## [1] 20
## 
## $circumference
## [1] 62.83185

We can also call the results individually using the following code:

# Storing results of function
circle_10 <- circle_circumference_4(10)

# Viewing only diameter

## Method 1
circle_10$diameter
## [1] 20

## Method 2
circle_10[1]
## $diameter
## [1] 20

# Viewing only circumference

## Method 1
circle_10$circumference
## [1] 62.83185

## Method 2
circle_10[2]
## $circumference
## [1] 62.83185
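Notice that Method 2 prints the element name ($diameter) along with the value. That is because single brackets return a smaller list, while $ and double brackets return the element itself. The sketch below shows the difference on a small stand-in list (toy_circle is a hypothetical object with made-up values).

```r
# Small named list for illustration only
toy_circle <- list(diameter = 20, circumference = 62.8)

# Single brackets return a list of length one
class(toy_circle[1])

# Double brackets (or $) return the stored value itself
class(toy_circle[[1]])

# The extracted value can be used directly in calculations
radius <- toy_circle[["diameter"]] / 2
radius
```

When the extracted value will be used in further calculations, $ or double brackets are therefore usually what you want.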

In the context of our dataset, we can use list operations to clean up and combine our results from all three BMI stratification approaches. This is often necessary to prepare data to share with collaborators or for supplementary tables in a manuscript. Let’s revisit our code for producing our statistical results, this time assigning our results to a dataframe rather than viewing them.

# Defining variables (columns) we want to run a t-test on
vars_of_interest <- c("DWAs", "DWCd", "DWCr")

# Normal vs. overweight (bmi_cutoff = 25)
norm_vs_overweight <- bmi_DW_ttest(input_data = full_data, bmi_cutoff = 25, lower_group_name = "Normal", 
             upper_group_name = "Overweight", variables = vars_of_interest)

# Underweight vs. non-underweight (bmi_cutoff = 18.5)
under_vs_nonunderweight <- bmi_DW_ttest(full_data, 18.5, "Underweight", "Non-Underweight", vars_of_interest)

# Non-obese vs. obese (bmi_cutoff = 29.9)
nonobese_vs_obese <- bmi_DW_ttest(full_data, 29.9, "Non-Obese", "Obese", vars_of_interest)

# Viewing one results dataframe as an example
norm_vs_overweight
##    .y. group1     group2 n1  n2  statistic       df     p
## 1 DWAs Normal Overweight 96 104 -0.7279621 192.3363 0.468
## 2 DWCd Normal Overweight 96 104 -0.5894360 196.1147 0.556
## 3 DWCr Normal Overweight 96 104  0.1102933 197.9870 0.912

For publication purposes, let’s say we want to make the following formatting changes:

  • Keep only the comparison of interest (for example Normal vs. Overweight) and the associated p-value, removing columns that are not as useful for interpreting or sharing the results
  • Rename the .y. column so that its contents are clearer
  • Collapse all of our data into one final dataframe

We can first write a function to execute these cleaning steps:

# Function to clean results dataframes

## Parameters: 
### input_data: dataframe containing results of t-test

## Output: cleaned dataframe

data_cleaning <- function(input_data) {
  
  data <- input_data %>%
    
    # Rename .y. column
    rename("Variable" = ".y.") %>%
    
    # Merge group1 and group2
    unite(Comparison, group1, group2, sep = " vs. ") %>%
    
    # Keep only columns of interest 
    select(c(Variable, Comparison, p))
  
  return(data)
}

Then, we can make a list of the dataframes we want to clean:

# Making list of dataframes
t_test_res_list <- list(norm_vs_overweight, under_vs_nonunderweight, nonobese_vs_obese)

# Viewing list of dataframes
head(t_test_res_list)
## [[1]]
##    .y. group1     group2 n1  n2  statistic       df     p
## 1 DWAs Normal Overweight 96 104 -0.7279621 192.3363 0.468
## 2 DWCd Normal Overweight 96 104 -0.5894360 196.1147 0.556
## 3 DWCr Normal Overweight 96 104  0.1102933 197.9870 0.912
## 
## [[2]]
##    .y.          group1      group2  n1 n2   statistic       df     p
## 1 DWAs Non-Underweight Underweight 166 34 -0.86947835 53.57143 0.388
## 2 DWCd Non-Underweight Underweight 166 34 -0.97359810 55.45450 0.334
## 3 DWCr Non-Underweight Underweight 166 34  0.04305105 56.08814 0.966
## 
## [[3]]
##    .y.    group1 group2  n1 n2  statistic       df      p
## 1 DWAs Non-Obese  Obese 144 56 -1.9312097 86.80253 0.0567
## 2 DWCd Non-Obese  Obese 144 56  0.3431076 94.52209 0.7320
## 3 DWCr Non-Obese  Obese 144 56 -0.6878311 89.61818 0.4930

And we can apply the cleaning function to each of the dataframes using the lapply() function, which takes a list as the first argument and the function to apply to each list element as the second argument:

# Applying cleaning function
t_test_res_list_cleaned <- lapply(t_test_res_list, data_cleaning)

# Viewing cleaned dataframes
head(t_test_res_list_cleaned)
## [[1]]
##   Variable            Comparison     p
## 1     DWAs Normal vs. Overweight 0.468
## 2     DWCd Normal vs. Overweight 0.556
## 3     DWCr Normal vs. Overweight 0.912
## 
## [[2]]
##   Variable                      Comparison     p
## 1     DWAs Non-Underweight vs. Underweight 0.388
## 2     DWCd Non-Underweight vs. Underweight 0.334
## 3     DWCr Non-Underweight vs. Underweight 0.966
## 
## [[3]]
##   Variable          Comparison      p
## 1     DWAs Non-Obese vs. Obese 0.0567
## 2     DWCd Non-Obese vs. Obese 0.7320
## 3     DWCr Non-Obese vs. Obese 0.4930
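To make the mechanics of lapply() more concrete, here is a minimal sketch on a toy list of two small dataframes (the object names and contents are hypothetical). It also shows vapply(), a related base R function that returns a vector instead of a list when each result is a single value.

```r
# Toy list of dataframes for illustration only
toy_df_list <- list(first = data.frame(x = 1:3), second = data.frame(x = 1:5))

# lapply() applies nrow() to each element and always returns a list
row_counts_list <- lapply(toy_df_list, nrow)

# vapply() does the same but returns a vector; integer(1) declares that
# each result is expected to be a single integer
row_counts_vector <- vapply(toy_df_list, nrow, integer(1))

row_counts_list
row_counts_vector
```

Because the toy list is named, the names carry through to both results, which can make it easier to keep track of which result belongs to which dataframe.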

Last, we can collapse our list down into one dataframe using the do.call() and rbind.data.frame() functions, which together take the elements of the list and collapse them into a dataframe by binding the rows together:

t_test_res_cleaned <- do.call(rbind.data.frame, t_test_res_list_cleaned)

# Viewing final dataframe
t_test_res_cleaned
##   Variable                      Comparison      p
## 1     DWAs           Normal vs. Overweight 0.4680
## 2     DWCd           Normal vs. Overweight 0.5560
## 3     DWCr           Normal vs. Overweight 0.9120
## 4     DWAs Non-Underweight vs. Underweight 0.3880
## 5     DWCd Non-Underweight vs. Underweight 0.3340
## 6     DWCr Non-Underweight vs. Underweight 0.9660
## 7     DWAs             Non-Obese vs. Obese 0.0567
## 8     DWCd             Non-Obese vs. Obese 0.7320
## 9     DWCr             Non-Obese vs. Obese 0.4930
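The role of do.call() may be easier to see on a small example. In the sketch below, which uses made-up values and hypothetical object names, do.call() passes every element of the list to rbind() as a separate argument, so the result should match writing out the rbind() call by hand.

```r
# Toy list of one-row results dataframes (values made up for illustration)
toy_results_list <- list(
  data.frame(Variable = "metal_1", p = 0.50, stringsAsFactors = FALSE),
  data.frame(Variable = "metal_2", p = 0.20, stringsAsFactors = FALSE),
  data.frame(Variable = "metal_3", p = 0.90, stringsAsFactors = FALSE)
)

# do.call() passes each list element to rbind() as its own argument
bound_with_do_call <- do.call(rbind, toy_results_list)

# Equivalent call written out by hand
bound_by_hand <- rbind(toy_results_list[[1]], toy_results_list[[2]], toy_results_list[[3]])

# Checking that both approaches give the same dataframe
identical(bound_with_do_call, bound_by_hand)
```

The advantage of do.call() is that it works no matter how many dataframes the list contains, so the code does not need to change when a fourth comparison is added.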

The above example is just that - an example to demonstrate the mechanics of using list operations. However, there are actually a couple of even more efficient ways to execute the above cleaning steps:

  1. Build cleaning steps into the analysis function if you know you will not need to access the raw results dataframe.
  2. Bind all three dataframes together, then execute the cleaning steps.

We will demonstrate #2 below:

# Start by binding the rows of each of the results dataframes
t_test_res_cleaned_2 <- bind_rows(norm_vs_overweight, under_vs_nonunderweight, nonobese_vs_obese) %>%
  
  # Rename .y. column
  rename("Variable" = ".y.") %>%
    
  # Merge group1 and group2
  unite(Comparison, group1, group2, sep = " vs. ") %>%
    
  # Keep only columns of interest 
  select(c(Variable, Comparison, p))

# Viewing results
t_test_res_cleaned_2
##   Variable                      Comparison      p
## 1     DWAs           Normal vs. Overweight 0.4680
## 2     DWCd           Normal vs. Overweight 0.5560
## 3     DWCr           Normal vs. Overweight 0.9120
## 4     DWAs Non-Underweight vs. Underweight 0.3880
## 5     DWCd Non-Underweight vs. Underweight 0.3340
## 6     DWCr Non-Underweight vs. Underweight 0.9660
## 7     DWAs             Non-Obese vs. Obese 0.0567
## 8     DWCd             Non-Obese vs. Obese 0.7320
## 9     DWCr             Non-Obese vs. Obese 0.4930

As you can see, this dataframe is the same as the one we produced using list operations. It was produced using fewer lines of code and without the need for a user-defined function! For our purposes, this was a more efficient approach. However, we felt it was important to demonstrate the mechanics of list operations because there may be times where you do need to keep dataframes separate during specific analyses.


Concluding Remarks

This module provided an introduction to loops, functions, and list operations and demonstrated how to use them to efficiently analyze an environmentally relevant dataset. When and how you implement these approaches depends on your coding style and the goals of your analysis. Although here we were focused on statistical tests and data cleaning, these flexible approaches can be used in a variety of data analysis steps. We encourage you to implement loops, functions, and list operations in your analyses when you find the need to iterate through statistical tests, visualizations, data cleaning, or other common workflow elements!

Additional Resources


Use the same input data we used in this module to answer the following questions and produce a cleaned, publication-ready data table of results. Note that these data are normally distributed, so you can use a t-test.

  1. Are there statistically significant differences in urine metal concentrations (e.g., arsenic levels, cadmium levels, etc.) between younger (MAge < 40) and older (MAge \(\geq\) 40) mothers?
  2. Are there statistically significant differences in urine metal concentrations (e.g., arsenic levels, cadmium levels, etc.) between normal weight (BMI < 25) and overweight (BMI \(\geq\) 25) subjects?