2.3 Data Manipulation and Reshaping

This training module was developed by Kyle Roell, Alexis Payton, Elise Hickman, and Julia E. Rager.

All input files (script, data, and figures) can be downloaded from the UNC-SRP TAME2 GitHub website.

Introduction to Training Module

Data within the fields of exposure science, toxicology, and public health are very rarely prepared and ready for downstream statistical analyses and visualization code. The beginning of almost any scripted analysis includes important formatting steps that make the data easier to read and work with. This can be done in several ways, including:

Base R operations and functions, or
A collection of packages (and philosophy) known as The Tidyverse.

In this training tutorial we will review some of the most common ways you can organize and manipulate data, including:

Merging data
Filtering and subsetting data
Pivoting data wider and longer (also known as casting and melting)

These approaches will first be demonstrated using the functions available in base R. Then, the exact same approaches will be demonstrated using the functions and syntax that are part of the Tidyverse package.

We will demonstrate these data manipulation and organization methods using an environmentally relevant example data set from a human cohort. This dataset was generated by creating data distributions randomly pulled from our previously published cohorts, resulting in a unique data set for these training purposes. The dataset contains environmental exposure metrics from metal levels obtained using sources of drinking water and human urine samples and associated demographic data.

Training Module’s Environmental Health Question

This training module was specifically developed to answer the following environmental health question using data manipulation and reshaping approaches:

What is the average urinary chromium concentration across different maternal education levels?

We’ll use base R and Tidydverse to answer this question, but let’s start with Base R.

Workspace Preparation and Data Import

Set your working directory

In preparation, first let’s set our working directory to the folder path that contains our input files:

setwd("/file path to where your input files are")

Note that macOS file paths use / as folder separators, and PC file paths use \.

Importing example datasets

Next, let’s read in our example data sets:

demographic_data <- read.csv("Module2_3_Input/Module2_3_InputData1.csv")
chemical_data <- read.csv("Module2_3_Input/Module2_3_InputData2.csv")

Viewing example datasets

Let’s see what these datasets look like:

dim(demographic_data)

## [1] 200   6

dim(chemical_data)

## [1] 200   7

The demographic data set includes 200 rows x 7 columns, while the chemical measurement data set includes 200 rows x 7 columns.

We can preview the demographic data frame by using the head() function, which displays all the columns and the first 6 rows of a data frame:

head(demographic_data)

##   ID  BMI     MAge MEdu       BW GA
## 1  1 27.7 22.99928    3 3180.058 34
## 2  2 26.8 30.05142    3 3210.823 43
## 3  3 33.2 28.04660    3 3311.551 40
## 4  4 30.1 34.81796    3 3266.844 32
## 5  5 37.4 42.68440    3 3664.088 35
## 6  6 33.3 24.94960    3 3328.988 40

These demographic data are organized according to subject ID (first column) followed by the following subject information:

ID: subject number
BMI: body mass index
MAge: maternal age in years
MEdu: maternal education level; 1 = “less than high school”, 2 = “high school or some college”, 3 = “college or greater”
BW: body weight in grams
GA: gestational age in weeks

We can also preview the chemical dataframe:

head(chemical_data)

##   ID     DWAs     DWCd     DWCr       UAs       UCd      UCr
## 1  1 6.426464 1.292941 51.67987 10.192695 0.7537104 42.60187
## 2  2 7.832384 1.798535 50.10409 11.815088 0.9789506 41.30757
## 3  3 7.516569 1.288461 48.74001 10.079057 0.1903262 36.47716
## 4  4 5.906656 2.075259 50.92745  8.719123 0.9364825 42.47987
## 5  5 7.181873 2.762643 55.16882  9.436559 1.4977829 47.78528
## 6  6 9.723429 3.054057 51.14812 11.589403 1.6645837 38.26386

These chemical data are organized according to subject ID (first column), followed by measures of:

DWAs: drinking water arsenic levels in µg/L
DWCd: drinking water cadmium levels in µg/L
DWCr: drinking water chromium levels in µg/L
UAs: urinary arsenic levels in µg/L
UCd: urinary cadmium levels in µg/L
UCr: urinary chromium levels in µg/L

Data Manipulation Using Base R

Merging Data Using Base R Syntax

Merging datasets represents the joining together of two or more datasets, using a common identifier (generally some sort of ID) to connect the rows. This is useful if you have multiple datasets describing different aspects of the study, different variables, or different measures across the same samples. Samples could correspond to the same study participants, animals, cell culture samples, environmental media samples, etc, depending on the study design. In the current example, we will be joining human demographic data and environmental metals exposure data collected from drinking water and human urine samples.

Let’s start by merging the example demographic data with the chemical measurement data using the base R function merge(). To learn more about this function, you can type ?merge, which brings up helpful information in the R console. To merge these datasets with the merge function, use the following code. The by = argument specifies the column used to match the rows of data.

full.data <- merge(demographic_data, chemical_data, by = "ID") 
dim(full.data)

## [1] 200  12

This merged dataframe contains 200 rows x 12 columns. Viewing this merged dataframe, we can see that the merge() function retained the first column in each original dataframe (ID), though did not replicate it since it was used as the identifier for merging. All other columns include their original data, just merged together by the IDs in the first column.

head(full.data)

##   ID  BMI     MAge MEdu       BW GA     DWAs     DWCd     DWCr       UAs
## 1  1 27.7 22.99928    3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2  2 26.8 30.05142    3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3  3 33.2 28.04660    3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4  4 30.1 34.81796    3 3266.844 32 5.906656 2.075259 50.92745  8.719123
## 5  5 37.4 42.68440    3 3664.088 35 7.181873 2.762643 55.16882  9.436559
## 6  6 33.3 24.94960    3 3328.988 40 9.723429 3.054057 51.14812 11.589403
##         UCd      UCr
## 1 0.7537104 42.60187
## 2 0.9789506 41.30757
## 3 0.1903262 36.47716
## 4 0.9364825 42.47987
## 5 1.4977829 47.78528
## 6 1.6645837 38.26386

These datasets were actually quite easy to merge, since they had the same exact column identifier and number of rows. You can edit your script to include more specifics in instances when these may differ across datasets that you would like to merge. This option allows you to edit the name of the column that is used in each dataframe. Here, these are still the same “ID”, but you can see that adding the by.x and by.y arguments allows you to specify instances when different column names are used in the two datasets.

full.data <- merge(demographic_data, chemical_data, by.x = "ID", by.y = "ID") 

# Viewing data
head(full.data)

##   ID  BMI     MAge MEdu       BW GA     DWAs     DWCd     DWCr       UAs
## 1  1 27.7 22.99928    3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2  2 26.8 30.05142    3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3  3 33.2 28.04660    3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4  4 30.1 34.81796    3 3266.844 32 5.906656 2.075259 50.92745  8.719123
## 5  5 37.4 42.68440    3 3664.088 35 7.181873 2.762643 55.16882  9.436559
## 6  6 33.3 24.94960    3 3328.988 40 9.723429 3.054057 51.14812 11.589403
##         UCd      UCr
## 1 0.7537104 42.60187
## 2 0.9789506 41.30757
## 3 0.1903262 36.47716
## 4 0.9364825 42.47987
## 5 1.4977829 47.78528
## 6 1.6645837 38.26386

Note that after merging datasets, it is always helpful to check that the merging was done properly before proceeding with your data analysis. Helpful checks could include viewing the merged dataset, checking the numbers of rows and columns to make sure chunks of data are not missing, and searching for values (or strings) that exist in one dataset but not the other, among other mechanisms of QA/QC.

Filtering and Subsetting Data Using Base R Syntax

Filtering and subsetting data are useful tools when you need to focus on specific parts of your dataset for downstream analyses. These could represent, for example, specific samples or participants that meet certain criteria that you are interested in evaluating. It is also useful for removing unneeded variables or samples from dataframes as you are working through your script.

Note that in the examples that follow, we will create new dataframes that are distinguished from our original dataframe by adding sequential numbers to the end of the dataframe name (e.g., subset.data1, subset.data2, subset.data3). This style of dataframe naming is useful for the simple examples we are demonstrating, but in a full scripted analysis, we encourage the use of more descriptive dataframe names. For example, if you are subsetting your data to include only the first 100 rows, you could name that dataframe “data.first100.”

For this example, let’s first define a vector of columns that we want to keep in our analysis, then subset the data by keeping only the columns specified in our vector:

# Defining a vector of columns to keep in the analysis
subset.columns <- c("BMI", "MAge", "MEdu")

# Subsetting the data by selecting the columns represented in the defined 'subset.columns' vector
subset.data1 <- full.data[,subset.columns]

# Viewing the top of this subsetted dataframe
head(subset.data1)

##    BMI     MAge MEdu
## 1 27.7 22.99928    3
## 2 26.8 30.05142    3
## 3 33.2 28.04660    3
## 4 30.1 34.81796    3
## 5 37.4 42.68440    3
## 6 33.3 24.94960    3

We can also easily subset data based on row numbers. For example, to keep only the first 100 rows:

subset.data2 <- full.data[1:100,]

# Viewing the dimensions of this new dataframe
dim(subset.data2)

## [1] 100  12

To remove the first 100 rows, we use the same code as above, but include a - sign before our vector to indicate that these rows should be removed:

subset.data3 <- full.data[-c(1:100),]

# Viewing the dimensions of this new dataframe
dim(subset.data3)

## [1] 100  12

Conditional statements are also written to filter and subset data. A conditional statement is written to execute one block of code if the statement is true and a different block of code if the statement is false.

A conditional statement requires a Boolean or true/false statement that will be either TRUE or FALSE. A couple of the more commonly used functions used to create conditional statements include…

if(){} or an if statement means “execute R code when the condition is met”.
if(){} else{} or an if/else statement means “execute R code when condition 1 is met, if not execute R code for condition 2”.
ifelse() is a function that executes the same logic as an if/else statement. The first argument specifies a condition to be met. If that condition is met, R code in the second argument is executed, and if that condition is not met, R code in the third argument is executed.

There are six comparison operators that are used to created these Boolean values:

== means “equals”.
!= means “not equal”.
< means “less than”.
> means “greater than”.
<= means “less than or equal to”.
>= mean “greater than or equal to”.

There are also three logical operators that are used to create these Boolean values:

& means “and”.
| means “or”.
! means “not”.

We can filter data based on conditions using the subset() function. For example, the following code filters for subjects whose BMI is greater than 25 and who have a college education:

subset.data4 <- subset(full.data, BMI > 25 & MEdu == 3)

Additionally, we can subset and select specific columns we would like to keep, using the select argument within the subset() function:

# Filtering for subjects whose BMI is less than 22 or greater than 27
# Also selecting the BMI, maternal age, and maternal education columns
subset.data5 <- subset(full.data, BMI < 22 | BMI > 27, select = subset.columns)

For more information on the subset() function, see its associated documentation.

Melting and Casting Data using Base R Syntax

Melting and casting refers to the conversion of data to “long” or “wide” form as discussed previously in TAME 2.0 Module 1.4 Data Wrangling in Excel. You will often see data within the environmental health field in wide format, though long format is necessary for some procedures, such as plotting with ggplot2 and performing certain analyses.

Here, we’ll illustrate some example script to melt and cast data using the reshape2 package. Let’s first install and load the reshape2 package:

if (!requireNamespace("reshape2"))
  install.packages("reshape2");

library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

Using the fully merged dataframe, let’s remind ourselves what these data look like in the current dataframe format:

head(full.data)

##   ID  BMI     MAge MEdu       BW GA     DWAs     DWCd     DWCr       UAs
## 1  1 27.7 22.99928    3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2  2 26.8 30.05142    3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3  3 33.2 28.04660    3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4  4 30.1 34.81796    3 3266.844 32 5.906656 2.075259 50.92745  8.719123
## 5  5 37.4 42.68440    3 3664.088 35 7.181873 2.762643 55.16882  9.436559
## 6  6 33.3 24.94960    3 3328.988 40 9.723429 3.054057 51.14812 11.589403
##         UCd      UCr
## 1 0.7537104 42.60187
## 2 0.9789506 41.30757
## 3 0.1903262 36.47716
## 4 0.9364825 42.47987
## 5 1.4977829 47.78528
## 6 1.6645837 38.26386

These data are represented by single subject identifiers listed as unique IDs per row, with associated environmental measures and demographic data organized across the columns. Thus, this dataframe is currently in wide (also known as casted) format.

Let’s convert this dataframe to long (also known as melted) format. Here, will will specify that we want a row for each unique sample ID + variable measure pair by using id = "ID":

full.melted <- melt(full.data, id = "ID") 

# Viewing this new dataframe
head(full.melted)

##   ID variable value
## 1  1      BMI  27.7
## 2  2      BMI  26.8
## 3  3      BMI  33.2
## 4  4      BMI  30.1
## 5  5      BMI  37.4
## 6  6      BMI  33.3

You can see here that each measure that was originally contained as a unique column has been reoriented, such that the original column header is now listed throughout the second column labeled variable. Then, the third column contains the value of this variable.

Let’s see an example view of the middle of this new dataframe:

full.melted[1100:1110,1:3]

##       ID variable    value
## 1100 100     DWAs 7.928885
## 1101 101     DWAs 8.677403
## 1102 102     DWAs 8.115183
## 1103 103     DWAs 7.134189
## 1104 104     DWAs 8.816142
## 1105 105     DWAs 7.487227
## 1106 106     DWAs 7.541973
## 1107 107     DWAs 6.313516
## 1108 108     DWAs 6.654474
## 1109 109     DWAs 7.564429
## 1110 110     DWAs 7.357122

Here, we can see a different variable (DWAs) now being listed. This continues throughout the entire dataframe, which has the following dimensions:

dim(full.melted)

## [1] 2200    3

Let’s now re-cast this dataframe back into wide format using the dcast() function. Here, we are telling the dcast() function to give us a sample (ID) for every variable in the column labeled variable. The column names from the variable column and corresponding values from the value column are then used to fill in the dataset:

full.cast <- dcast(full.melted, ID ~ variable) 
head(full.cast)

##   ID  BMI     MAge MEdu       BW GA     DWAs     DWCd     DWCr       UAs
## 1  1 27.7 22.99928    3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2  2 26.8 30.05142    3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3  3 33.2 28.04660    3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4  4 30.1 34.81796    3 3266.844 32 5.906656 2.075259 50.92745  8.719123
## 5  5 37.4 42.68440    3 3664.088 35 7.181873 2.762643 55.16882  9.436559
## 6  6 33.3 24.94960    3 3328.988 40 9.723429 3.054057 51.14812 11.589403
##         UCd      UCr
## 1 0.7537104 42.60187
## 2 0.9789506 41.30757
## 3 0.1903262 36.47716
## 4 0.9364825 42.47987
## 5 1.4977829 47.78528
## 6 1.6645837 38.26386

Here, we can see that this dataframe is back in its original casted (or wide) format. Now that we’re familiar with some base R functions to reshape our data, let’s answer our original question: What is the average urinary chromium concentration for each maternal education level?

Although it is not necessary to calculate the average, we could first subset our data frame to only include the two columns we are interested in (MEdu and UCr):

subset.data6 <- full.data[,c("MEdu", "UCr")]

head(subset.data6)

##   MEdu      UCr
## 1    3 42.60187
## 2    3 41.30757
## 3    3 36.47716
## 4    3 42.47987
## 5    3 47.78528
## 6    3 38.26386

Next, we will make a new data frame for each maternal education level:

# Creating new data frames based on maternal education category
data.matedu.1 <- subset(subset.data6, MEdu == 1)
data.matedu.2 <- subset(subset.data6, MEdu == 2)
data.matedu.3 <- subset(subset.data6, MEdu == 3)

# Previewing the first data frame to make sure our function is working as specified
head(data.matedu.1)

##    MEdu      UCr
## 14    1 38.59349
## 18    1 47.77878
## 37    1 35.33980
## 63    1 34.72255
## 66    1 34.13982
## 76    1 31.38145

Last, we can calculate the average urinary chromium concentration using each of our data frames:

mean(data.matedu.1$UCr)

## [1] 39.88055

mean(data.matedu.2$UCr)

## [1] 40.61807

mean(data.matedu.3$UCr)

## [1] 40.41556

With this, we can answer our Environmental Health Question:

What is the average urinary chromium concentration across different maternal education levels?

Answer: The average urinary Chromium concentrations are 39.9 µg/L for participants with less than high school education, 40.6 µg/L for participants with high school or some college education, and 40.4 µg/L for participants with college education or greater.

Introduction to Tidyverse

Tidyverse is a collection of packages that are commonly used to more efficiently organize and manipulate datasets in R. This collection of packages has its own specific type of syntax and formatting that differ slightly from base R functions. There are eight core tidyverse packages:

For data visualization and exploration:
- ggplot2
For data wrangling and transformation:
- dplyr
- tidyr
- stringr
- forcats
For data import and management:
- tibble
- readr
For functional programming:
- purr

Here, we will carry out all the of the same data organization exercises demonstrated above using packages that are part of The Tidyverse, specifically using functions that are part of the dplyr and tidyr packages.

Downloading and Loading the Tidyverse Package

If you don’t have tidyverse already installed, you will need to install it using:

if(!require(tidyverse)) 
    install.packages("tidyverse")

And then load the tidyverse package using:

library(tidyverse)

Note that by loading the tidyverse package, you are also loading all of the packages included within The Tidyverse and do not need to separately load these packages.

Merging Data Using Tidyverse Syntax

To merge the same example dataframes using tidyverse, you can run the following script:

full.data.tidy <- inner_join(demographic_data, chemical_data, by = "ID")

head(full.data.tidy)

##   ID  BMI     MAge MEdu       BW GA     DWAs     DWCd     DWCr       UAs
## 1  1 27.7 22.99928    3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2  2 26.8 30.05142    3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3  3 33.2 28.04660    3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4  4 30.1 34.81796    3 3266.844 32 5.906656 2.075259 50.92745  8.719123
## 5  5 37.4 42.68440    3 3664.088 35 7.181873 2.762643 55.16882  9.436559
## 6  6 33.3 24.94960    3 3328.988 40 9.723429 3.054057 51.14812 11.589403
##         UCd      UCr
## 1 0.7537104 42.60187
## 2 0.9789506 41.30757
## 3 0.1903262 36.47716
## 4 0.9364825 42.47987
## 5 1.4977829 47.78528
## 6 1.6645837 38.26386

Note that you can still merge dataframes that have different ID column names with the argument by = c("ID.x", "ID.y"). tidyverse also has other join, functions, shown in the graphic below (source):

inner_join keeps only rows that have matching ID variables in both datasets
full_join keeps the rows in both datasets
left_join matches rows based on the ID variables in the first dataset (and omits any rows from the second dataset that do not have matching ID variables in the first dataset)
right_join matches rows based on ID variables in the second dataset (and omits any rows from the first dataset that do not have matching ID variables in the second dataset)
anti_join(x,y) keeps the rows that are unique to the first dataset
anti_join(y,x) keeps the rows that are unique to the second dataset

The Pipe Operator

One of the most important elements of Tidyverse syntax is use of the pipe operator (%>%). The pipe operator can be used to chain multiple functions together. It takes the object (typically a dataframe) to the left of the pipe operator and passes it to the function to the right of the pipe operator. Multiple pipes can be used in chain to execute multiple data cleaning steps without the need for intermediate dataframes. The pipe operator can be used to pass data to functions within all of the Tidyverse universe packages, not just the functions demonstrated here.

Below, we can see the same code executed above, but this time with the pipe operator. The demographic_data dataframe is passed to inner_join() as the first argument to that function, with the following arguments remaining the same.

full.data.tidy2 <- demographic_data %>% 
  inner_join(chemical_data, by = "ID")

head(full.data.tidy2)

##   ID  BMI     MAge MEdu       BW GA     DWAs     DWCd     DWCr       UAs
## 1  1 27.7 22.99928    3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2  2 26.8 30.05142    3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3  3 33.2 28.04660    3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4  4 30.1 34.81796    3 3266.844 32 5.906656 2.075259 50.92745  8.719123
## 5  5 37.4 42.68440    3 3664.088 35 7.181873 2.762643 55.16882  9.436559
## 6  6 33.3 24.94960    3 3328.988 40 9.723429 3.054057 51.14812 11.589403
##         UCd      UCr
## 1 0.7537104 42.60187
## 2 0.9789506 41.30757
## 3 0.1903262 36.47716
## 4 0.9364825 42.47987
## 5 1.4977829 47.78528
## 6 1.6645837 38.26386

Because the pipe operator is often used in a chain, it is best practice is to start a new line after each pipe operator, with the new lines of code indented. This makes code with multiple piped steps easier to follow. However, if just one function is being executed, the pipe operator can be used on the same line as the input and function or omitted altogether (as shown in the previous two code chunks). Here is an example of placing the function to the right of the pipe operator on a new line, with placeholder functions shown as additional steps:

full.data.tidy3 <- demographic_data %>% 
  inner_join(chemical_data, by = "ID") %>%
  additional_function_1() %>%
  additional_function_2()

Filtering and Subsetting Data Using Tidyverse Syntax

Column-wise functions

The select() function is used to subset columns in Tidyverse. Here, we can use our previously defined vector subset.columns in the select() function to keep only the columns in our subset.columns vector. The all_of() function tells the select() to keep all of the columns that match elements of the subset.columns vector.

subset.tidy1 <- full.data.tidy %>% 
    select(all_of(subset.columns))

head(subset.tidy1)

##    BMI     MAge MEdu
## 1 27.7 22.99928    3
## 2 26.8 30.05142    3
## 3 33.2 28.04660    3
## 4 30.1 34.81796    3
## 5 37.4 42.68440    3
## 6 33.3 24.94960    3

There are many different ways that select() can be used. See below for some examples using dummy variable names:

# Select specific ranges in the dataframe
data <- data %>%
  select(start_column_1:end_column_1)

data <- data %>%
  select(c(start_column_1:end_column_1, start_column_2:end_column_2))

# Select columns that match the elements in a character vector an an additional range of columns
data <- data %>%
  select(c(all_of(character_vector), start_column_1:end_column_1))

To select columns that have names that contain specific strings, you can use functions such as starts_with(), ends_with(), and contains(). These functions allow you to ignore the case of the strings with ignore.case = TRUE. These arguments can be combined with specific column names and other selection ranges.

data <- data %>%
  select(starts_with("starting_string"))

data <- data %>%
  select(other_column_to_keep, starts_with("starting_string"))

To remove columns using tidyverse, you can use similar code, but include a - sign before the argument defining the columns.

# Removing columns
subset.tidy2 <- full.data.tidy %>% 
    select(-all_of(subset.columns))

# Viewing this new dataframe
head(subset.tidy2)

##   ID       BW GA     DWAs     DWCd     DWCr       UAs       UCd      UCr
## 1  1 3180.058 34 6.426464 1.292941 51.67987 10.192695 0.7537104 42.60187
## 2  2 3210.823 43 7.832384 1.798535 50.10409 11.815088 0.9789506 41.30757
## 3  3 3311.551 40 7.516569 1.288461 48.74001 10.079057 0.1903262 36.47716
## 4  4 3266.844 32 5.906656 2.075259 50.92745  8.719123 0.9364825 42.47987
## 5  5 3664.088 35 7.181873 2.762643 55.16882  9.436559 1.4977829 47.78528
## 6  6 3328.988 40 9.723429 3.054057 51.14812 11.589403 1.6645837 38.26386

Row-wise functions

The slice() function can be used to keep or remove a certain number of rows based on their position within the dataframe. For example, we can retain only the first 100 rows using the following code:

subset.tidy3 <- full.data.tidy %>% 
    slice(1:100) 

dim(subset.tidy3)

## [1] 100  12

Or, we can remove the first 100 rows:

subset.tidy4 <- full.data.tidy %>% 
    slice(-c(1:100))

dim(subset.tidy4)

## [1] 100  12

The related functions slice_min() and slice_max() can be used to select rows with the smallest or largest values of a variable.

The filter() function can be used to keep or remove specific rows based on conditional statements. For example, we can keep only rows where BMI is greater than 25 and age is greater than 31:

subset.tidy5 <- full.data.tidy %>% 
    filter(BMI > 25 & MAge > 31)

dim(subset.tidy5)

## [1] 49 12

Combining column and row-wise functions

Now, we can see how Tidyverse makes it easy to chain together multiple data manipulation steps. Here, we first filter rows based on values for BMI and age, then we select our columns of interest:

subset.tidy6 <- full.data.tidy %>% 
    filter(BMI > 25 & MAge > 31) %>% 
    select(BMI, MAge, MEdu)

head(subset.tidy6)

##    BMI     MAge MEdu
## 1 30.1 34.81796    3
## 2 37.4 42.68440    3
## 3 36.9 33.58589    3
## 4 33.7 33.82961    3
## 5 25.7 37.08028    3
## 6 28.4 47.85761    3

Melting and Casting Data Using Tidyverse Syntax

To melt and cast data in Tidyverse, you can use the pivot functions (i.e., pivot_longer() or pivot_wider()).

The first argument in the pivot_longer() function specifies which columns should be pivoted. This can be specified with either positive or negative selection - i.e., naming columns to pivot with a vector or range or naming columns not to pivot with a - sign. Here, we are telling the function to pivot all of the columns except the ID column, which we need to keep to be able to trace back which values came from which subject. The names_to = argument allows you to set what you want to name the column that stores the variable names (the column names in wide format). The values_to = argument allows you to set what you want to name the column that stores the values. We almost always call these columns “var” and “value”, respectively, but you can name them anything that makes sense for your dataset.

full.pivotlong <- full.data.tidy %>% 
    pivot_longer(-ID, names_to = "var", values_to = "value")

head(full.pivotlong, 15)

## # A tibble: 15 × 3
##       ID var      value
##    <int> <chr>    <dbl>
##  1     1 BMI     27.7  
##  2     1 MAge    23.0  
##  3     1 MEdu     3    
##  4     1 BW    3180.   
##  5     1 GA      34    
##  6     1 DWAs     6.43 
##  7     1 DWCd     1.29 
##  8     1 DWCr    51.7  
##  9     1 UAs     10.2  
## 10     1 UCd      0.754
## 11     1 UCr     42.6  
## 12     2 BMI     26.8  
## 13     2 MAge    30.1  
## 14     2 MEdu     3    
## 15     2 BW    3211.

To pivot our data back to wide format, we can use pivot_wider(), which will pull the column names from the column specified in the names_from = argument and the corresponding values from the column specified in the values_from = argument.

full.pivotwide <- full.pivotlong %>% 
    pivot_wider(names_from = "var", values_from = "value")

head(full.pivotwide)

## # A tibble: 6 × 12
##      ID   BMI  MAge  MEdu    BW    GA  DWAs  DWCd  DWCr   UAs   UCd   UCr
##   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1  27.7  23.0     3 3180.    34  6.43  1.29  51.7 10.2  0.754  42.6
## 2     2  26.8  30.1     3 3211.    43  7.83  1.80  50.1 11.8  0.979  41.3
## 3     3  33.2  28.0     3 3312.    40  7.52  1.29  48.7 10.1  0.190  36.5
## 4     4  30.1  34.8     3 3267.    32  5.91  2.08  50.9  8.72 0.936  42.5
## 5     5  37.4  42.7     3 3664.    35  7.18  2.76  55.2  9.44 1.50   47.8
## 6     6  33.3  24.9     3 3329.    40  9.72  3.05  51.1 11.6  1.66   38.3

Now that we’re familiar with some tidyverse functions to reshape our data, let’s answer our original question: What is the average urinary Chromium concentration for each maternal education level?

We can use the group_by() function to group our dataset by education class, then the summarize function to calculate the mean of our variable of interest within each class. Note how much shorter and more efficient this code is than the code we used to calculate the same values using base R!

full.data %>%
  group_by(MEdu) %>%
  summarize(Avg_UCr = mean(UCr))

## # A tibble: 3 × 2
##    MEdu Avg_UCr
##   <int>   <dbl>
## 1     1    39.9
## 2     2    40.6
## 3     3    40.4

For more detailed and advanced examples of pivoting in Tidyverse, see the Tidyverse Pivoting Vignette.

Concluding Remarks

This training module provides an introductory level overview of data organization and manipulation basics in base R and Tidyverse, including merging, filtering, subsetting, melting, and casting, and demonstrates these methods with an environmentally relevant dataset. These methods are used regularly in scripted analyses and are important preparation steps for almost all downstream analyses and visualizations.

Test Your Knowledge

What subjects, arranged from highest to lowest drinking water cadmium levels, had babies at at least 35 weeks and had urinary cadmium levels of at least 1.5 µg/L?

Hint: Try using the arrange() function from the tidyverse package.