1.2 Data Organization Basics

This training module was developed by Dr. Kyle R. Roell and Dr. Julia E. Rager

Fall 2021

Introduction to Training Module

Data within the fields of exposure science, toxicology, and public health are very rarely prepared and ready for all statistical analyses/visualization code. The beginning of almost any scripted analysis includes important formatting steps. These steps largely encompass data organization, manipulation, and other steps in preparation for actual statistical analyses/visualizations. Data organization and manipulation generally refers to organizing and formatting data in a way that makes it easier to read and work with. This can be done through several approaches, including:

  • Base R operations and functions, or
  • A collection of packages (and philosophy) known as Tidyverse.

In this training tutorial we will go over some of the most common ways you can organize and manipulate data, including:

  • Merging data
  • Filtering and subsetting data
  • Melting and casting data

These approaches will first be taught using the basic operations and functions in R. Then, the exact same approaches will be taught using the Tidyverse package and associated functions and syntax.

These data manipulation and organization methods are demonstrated using an example environmentally relevant human cohort dataset. This cohort was generated by creating data distributions randomly pulled from our previously published cohorts, resulting in a bespoke dataset for these training purposes with associated demographic data and variable environmental exposure metrics from metal levels obtained using sources of drinking water and human urine samples.

Set your working directory

In preparation, first let’s set our working directory to the folderpath that contains our input files

setwd("/filepath to where your input files are")

Note that in macOS, filepaths use “/” as folder separaters; whereas in PCs, filepaths use “".

Importing example datasets

Then let’s read in our example datasets

demo.data <- read.csv("Module1_2/Module1_2_DemographicData.csv")
chem.data <- read.csv("Module1_2/Module1_2_ChemicalData.csv")

Viewing example datasets

Let’s see what these datasets look like, starting with the chemical measures:

dim(chem.data)
## [1] 200   7

The chemical measurement dataset includes 200 rows x 7 columns

chem.data[1:10,1:7]
##    ID     DWAs     DWCd     DWCr       UAs       UCd      UCr
## 1   1 6.426464 1.292941 51.67987 10.192695 0.7537104 42.60187
## 2   2 7.832384 1.798535 50.10409 11.815088 0.9789506 41.30757
## 3   3 7.516569 1.288461 48.74001 10.079057 0.1903262 36.47716
## 4   4 5.906656 2.075259 50.92745  8.719123 0.9364825 42.47987
## 5   5 7.181873 2.762643 55.16882  9.436559 1.4977829 47.78528
## 6   6 9.723429 3.054057 51.14812 11.589403 1.6645837 38.26386
## 7   7 6.268547 1.218410 52.08578  8.887948 0.6347667 39.45535
## 8   8 6.718448 1.414975 54.96740  9.304968 0.6658849 45.09987
## 9   9 9.074928 2.727755 55.72826 10.818153 1.6585757 42.58577
## 10 10 5.771691 2.410993 47.06552  8.747217 1.7354305 34.80661

These data are organized according to subject ID (first column), followed by measures of:

  • DWAs (drinking water arsenic levels in µg/L)
  • DWCd (drinking water cadmium levels in µg/L)
  • DWCr (drinking water chromium levels in µg/L)
  • UAs (urinary arsenic levels in µg/L)
  • UCd (urinary cadmium levels in µg/L)
  • UCr (urinary chromium levels in µg/L)

Now let’s view the demographic data:

dim(demo.data)
## [1] 200   6

The subject demographic dataset includes 200 rows x 6 columns

demo.data[1:10,1:6]
##    ID  BMI     MAge MEdu       BW GA
## 1   1 27.7 22.99928    3 3180.058 34
## 2   2 26.8 30.05142    3 3210.823 43
## 3   3 33.2 28.04660    3 3311.551 40
## 4   4 30.1 34.81796    3 3266.844 32
## 5   5 37.4 42.68440    3 3664.088 35
## 6   6 33.3 24.94960    3 3328.988 40
## 7   7 24.8 29.54798    3 3061.949 30
## 8   8 16.9 24.94954    3 3332.539 38
## 9   9 36.9 33.58589    3 3260.482 39
## 10 10 21.7 39.29018    3 3141.723 35

These data are organized according to subject ID (first column) followed by the following subject information:

  • BMI (body mass index)
  • MAge (maternal age, years)
  • MEdu (maternal education, 1= “less than high school”; 2= “high school or some college”; 3= “college or greater”)
  • BW (body weight, grams)
  • GA (gestational age, week)

Data Manipulation using Base R

Merging Data using Base R Syntax

Merging datasets represents the joining together of two or more datasets, while connecting the datasets using a common identifier (generally some sort of ID). This is useful if you have multiple datasets describing different aspects of the study, different variables, or different measures across the same samples. Samples could correspond to the same study participants, animals, cell culture samples, environmental media samples, etc, depending on the study design. In the current example, we will be joining human demographic data and environmental metals exposure data collected from drinking water and human urine samples.

Let’s start by merging the example demographic data with the chemical measurement data using the base R function of “merge”. To learn more about this function, you can type the following:

?merge

which brings up helpful information in the R console

To merge these datasets using the merge function, use the following code:

# Note that we specify to merge these datasets by their shared ID column
full.data <- merge(demo.data, chem.data, by="ID") 
dim(full.data) 
## [1] 200  12

This merged dataframe contains 200 rows x 12 columns

Viewing this merged dataframe

full.data[1:10, 1:12]
##    ID  BMI     MAge MEdu       BW GA     DWAs     DWCd     DWCr       UAs
## 1   1 27.7 22.99928    3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2   2 26.8 30.05142    3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3   3 33.2 28.04660    3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4   4 30.1 34.81796    3 3266.844 32 5.906656 2.075259 50.92745  8.719123
## 5   5 37.4 42.68440    3 3664.088 35 7.181873 2.762643 55.16882  9.436559
## 6   6 33.3 24.94960    3 3328.988 40 9.723429 3.054057 51.14812 11.589403
## 7   7 24.8 29.54798    3 3061.949 30 6.268547 1.218410 52.08578  8.887948
## 8   8 16.9 24.94954    3 3332.539 38 6.718448 1.414975 54.96740  9.304968
## 9   9 36.9 33.58589    3 3260.482 39 9.074928 2.727755 55.72826 10.818153
## 10 10 21.7 39.29018    3 3141.723 35 5.771691 2.410993 47.06552  8.747217
##          UCd      UCr
## 1  0.7537104 42.60187
## 2  0.9789506 41.30757
## 3  0.1903262 36.47716
## 4  0.9364825 42.47987
## 5  1.4977829 47.78528
## 6  1.6645837 38.26386
## 7  0.6347667 39.45535
## 8  0.6658849 45.09987
## 9  1.6585757 42.58577
## 10 1.7354305 34.80661

We can see that the merge function retained the first column in each original dataframe (ID), though did not replicate it since it was used as the identifier to merge off of. All other columns include their original data, just merged together by the IDs in the first column.

These datasets were actually quite easy to merge, since they had the same exact column identifier and number of rows. You can edit your script to include more specifics in instances when these may differ across datasets that you would like to merge. For example:

full.data <- merge(demo.data, chem.data, by.x="ID", by.y="ID") 
# This option allows you to edit the column header text that is used in each 
# dataframe. Here, these are still the same "ID", but you can see that adding 
# this script allows you to specify instances when differ header text is used.

Filtering and Subsetting Data using Base R Syntax

Filtering and subsetting data are useful tools when you need to focus your dataset to highlight data you are interested in analyzing downstream. These could represent, for example, specific samples or participants that meet certain criteria that you are interested in evaluating. It is also useful for simply removing particular variables or samples from dataframes as you are working through your script. These methods are illustrated here.

For this example, let’s first define a vector of columns that we want to keep in our analysis

subset.columns <- c("BMI", "MAge", "MEdu")
subset.columns
## [1] "BMI"  "MAge" "MEdu"

Now we can simply subset our data using those columns

# Subsetting the data by selecting the columns represented in the defined 
# 'subset.columns' vector
subset.data1 <- full.data[,subset.columns] 

# Viewing the top of this subsetted dataframe
head(subset.data1) 
##    BMI     MAge MEdu
## 1 27.7 22.99928    3
## 2 26.8 30.05142    3
## 3 33.2 28.04660    3
## 4 30.1 34.81796    3
## 5 37.4 42.68440    3
## 6 33.3 24.94960    3

Conversely, if we want to remove all columns except those that we are interested in within the ‘subset.columns’ vector, we can write the code as follows (to achieve the same results).

Note that we have to first create a vector of TRUE/FALSE’s here to execute the removal script written below:

# First specify which columns we would like to remove
remove.columns <- colnames(full.data) %in% subset.columns 

# Viewing this new vector
remove.columns 
##  [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

This creates a vector of TRUE/FALSE’s denoting whether or not each column is included in the ‘subset.columns’ vector

Now we can subset our dataset. Here, we decide to keep those that are labeled ‘FALSE’ in the remove.columns vector. This will remove the columns that are NOT contained in the subset.columns vector

subset.data2 <- full.data[,!remove.columns]

# Viewing the top of this dataframe
head(subset.data2)
##   ID       BW GA     DWAs     DWCd     DWCr       UAs       UCd      UCr
## 1  1 3180.058 34 6.426464 1.292941 51.67987 10.192695 0.7537104 42.60187
## 2  2 3210.823 43 7.832384 1.798535 50.10409 11.815088 0.9789506 41.30757
## 3  3 3311.551 40 7.516569 1.288461 48.74001 10.079057 0.1903262 36.47716
## 4  4 3266.844 32 5.906656 2.075259 50.92745  8.719123 0.9364825 42.47987
## 5  5 3664.088 35 7.181873 2.762643 55.16882  9.436559 1.4977829 47.78528
## 6  6 3328.988 40 9.723429 3.054057 51.14812 11.589403 1.6645837 38.26386

We can also easily subset data based on row numbers. For example, to keep only the first 100 rows:

subset.data3 <- full.data[1:100,]

# Viewing the dimensions of this new dataframe
dim(subset.data3)
## [1] 100  12

To remove the first 100 rows:

subset.data4 <- full.data[-c(1:100),]

# Viewing the dimensions of this new dataframe
dim(subset.data4)
## [1] 100  12

To filter data using conditional statements:

subset.data5 <- full.data[which(full.data$BMI > 25 & full.data$MAge > 31),]

# Viewing the top of this new dataframe
head(subset.data5)
##    ID  BMI     MAge MEdu       BW GA     DWAs      DWCd     DWCr       UAs
## 4   4 30.1 34.81796    3 3266.844 32 5.906656 2.0752589 50.92745  8.719123
## 5   5 37.4 42.68440    3 3664.088 35 7.181873 2.7626433 55.16882  9.436559
## 9   9 36.9 33.58589    3 3260.482 39 9.074928 2.7277549 55.72826 10.818153
## 13 13 33.7 33.82961    3 3481.293 36 7.101634 0.8443918 47.11677  9.967185
## 22 22 25.7 37.08028    3 3387.046 43 7.207447 2.8088453 48.08648  9.446643
## 31 31 28.4 47.85761    3 3173.033 30 6.032807 2.1929549 45.71856  9.917588
##           UCd      UCr
## 4   0.9364825 42.47987
## 5   1.4977829 47.78528
## 9   1.6585757 42.58577
## 13 -0.3466431 36.74220
## 22  1.9891049 34.16921
## 31  1.1194851 37.82297

Filtering data based on conditions can also be done using the subset function:

subset.data6 <- subset(full.data, BMI > 25 & MAge > 31)

Additionally, we can subset and select specific columns we would like to keep, using ‘select’ within the subset function:

subset.data7 <- subset(full.data, BMI < 22 | BMI > 27, 
                       select=c("BMI", "MAge", "MEdu"))

For more information on the subset function, see its associated RDocumentation website.

Melting and Casting Data using Base R Syntax

Melting and casting refers to the conversion of data to “long” or “wide” form. You will often see data within the environmental health field in wide format; though long format is necessary for some procedures, such as plotting with ggplot2.

Here, we’ll illustrate some example script to melt and cast data using the reshape2 package. Let’s first load the reshape2 library:

library(reshape2)

Using the fully merged dataframe, let’s remind ourselves what these data look like in the current dataframe format:

head(full.data)
##   ID  BMI     MAge MEdu       BW GA     DWAs     DWCd     DWCr       UAs
## 1  1 27.7 22.99928    3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2  2 26.8 30.05142    3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3  3 33.2 28.04660    3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4  4 30.1 34.81796    3 3266.844 32 5.906656 2.075259 50.92745  8.719123
## 5  5 37.4 42.68440    3 3664.088 35 7.181873 2.762643 55.16882  9.436559
## 6  6 33.3 24.94960    3 3328.988 40 9.723429 3.054057 51.14812 11.589403
##         UCd      UCr
## 1 0.7537104 42.60187
## 2 0.9789506 41.30757
## 3 0.1903262 36.47716
## 4 0.9364825 42.47987
## 5 1.4977829 47.78528
## 6 1.6645837 38.26386

These data are represented by single subject identifiers listed as unique IDs per row, with associated environmental measures and demographic data organized across the columns. Thus, this dataframe is currently in wide (also known as casted) format.

Let’s convert this dataframe to long (also known as melted) format:

# Here, we are saying that we want a row for each unique 
# sample ID - variable measure pair
full.melted <- melt(full.data, id="ID") 

# Viewing this new dataframe
head(full.melted) 
##   ID variable value
## 1  1      BMI  27.7
## 2  2      BMI  26.8
## 3  3      BMI  33.2
## 4  4      BMI  30.1
## 5  5      BMI  37.4
## 6  6      BMI  33.3

You can see here that each measure that was originally contained as a unique column has been reoriented, such that the original column header is now listed throughout the second column labeled “variable”. Then, the third column contains the value of this variable.

Let’s see an example view of the middle of this new dataframe

full.melted[1100:1110,1:3]
##       ID variable    value
## 1100 100     DWAs 7.928885
## 1101 101     DWAs 8.677403
## 1102 102     DWAs 8.115183
## 1103 103     DWAs 7.134189
## 1104 104     DWAs 8.816142
## 1105 105     DWAs 7.487227
## 1106 106     DWAs 7.541973
## 1107 107     DWAs 6.313516
## 1108 108     DWAs 6.654474
## 1109 109     DWAs 7.564429
## 1110 110     DWAs 7.357122

Here, we can see a different variable (DWAs) now being listed. This continues throughout the entire dataframe, which has the following dimensions:

dim(full.melted)
## [1] 2200    3

Thus, this dataframe is clearly melted, in long format.

Let’s now re-cast this dataframe back into wide format using the ‘dcast’ function

# Here, we are telling the dcast 
# function to give us a sample (ID) for every variable in the column labeled 'variable'. 
# Then it automatically fills the dataframe with values from the 'value' column
full.cast <- dcast(full.melted, ID ~ variable) 
head(full.cast)
##   ID  BMI     MAge MEdu       BW GA     DWAs     DWCd     DWCr       UAs
## 1  1 27.7 22.99928    3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2  2 26.8 30.05142    3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3  3 33.2 28.04660    3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4  4 30.1 34.81796    3 3266.844 32 5.906656 2.075259 50.92745  8.719123
## 5  5 37.4 42.68440    3 3664.088 35 7.181873 2.762643 55.16882  9.436559
## 6  6 33.3 24.94960    3 3328.988 40 9.723429 3.054057 51.14812 11.589403
##         UCd      UCr
## 1 0.7537104 42.60187
## 2 0.9789506 41.30757
## 3 0.1903262 36.47716
## 4 0.9364825 42.47987
## 5 1.4977829 47.78528
## 6 1.6645837 38.26386

Here, we can see that this dataframe is back in its original casted (or wide) format.

Introduction to Tidyverse

Tidyverse is a collection of packages that are commonly used to more efficiently organize and manipulate datasets in R. This collection of packages has its own specific type of syntax, dataset and formatting protocols that slightly differ from the Base R functions. Here, we will carry out all the of the same data organization exercises described above using Tidyverse.

Downloading and Loading the Tidyverse Package

If you don’t have tidyverse already installed, you will need to install it using:

if(!require(tidyverse)) install.packages("tidyverse")

And then load the tidyverse package using:

library(tidyverse)

Merging Data using Tidyverse Syntax

To merge the same example dataframes using tidyverse, you can run the following script:

full.data.tidy <- inner_join(demo.data, chem.data, by="ID")
# Note, for future scripting purposes, we can still merge with different IDs 
# using: by = c("ID.Demo"="ID.Chem")

head(full.data.tidy)
##   ID  BMI     MAge MEdu       BW GA     DWAs     DWCd     DWCr       UAs
## 1  1 27.7 22.99928    3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2  2 26.8 30.05142    3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3  3 33.2 28.04660    3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4  4 30.1 34.81796    3 3266.844 32 5.906656 2.075259 50.92745  8.719123
## 5  5 37.4 42.68440    3 3664.088 35 7.181873 2.762643 55.16882  9.436559
## 6  6 33.3 24.94960    3 3328.988 40 9.723429 3.054057 51.14812 11.589403
##         UCd      UCr
## 1 0.7537104 42.60187
## 2 0.9789506 41.30757
## 3 0.1903262 36.47716
## 4 0.9364825 42.47987
## 5 1.4977829 47.78528
## 6 1.6645837 38.26386

Filtering and Subsetting Data using Tidyverse Syntax

To subset columns in tidyverse, run the following:

subset.tidy1 = full.data.tidy %>% select(all_of(subset.columns))
head(subset.tidy1)
##    BMI     MAge MEdu
## 1 27.7 22.99928    3
## 2 26.8 30.05142    3
## 3 33.2 28.04660    3
## 4 30.1 34.81796    3
## 5 37.4 42.68440    3
## 6 33.3 24.94960    3

Note that you can also include column identifiers that may get dropped in the subsetting vector here:

# Note that we're including a 'fake' column here 'NotAColName' to illustrate 
# how to incorporate additional columns; though this column gets dropped in 
# the next line of code
subset.columns2 <- c(subset.columns, "NotAColName")

# Viewing this new vector
subset.columns2
## [1] "BMI"         "MAge"        "MEdu"        "NotAColName"
subset.tidy2 <- full.data.tidy %>% select(any_of(subset.columns2))

# Viewing the top of this new dataframe
head(subset.tidy2) 
##    BMI     MAge MEdu
## 1 27.7 22.99928    3
## 2 26.8 30.05142    3
## 3 33.2 28.04660    3
## 4 30.1 34.81796    3
## 5 37.4 42.68440    3
## 6 33.3 24.94960    3

Note that the ‘fake’ column ‘NotAColName’ gets automatically dropped here

To remove columns using tidyverse, you can run the following:

# Removing columns
subset.tidy3 <- full.data.tidy %>% select(-subset.columns)

# Viewing this new dataframe
head(subset.tidy3) 
##   ID       BW GA     DWAs     DWCd     DWCr       UAs       UCd      UCr
## 1  1 3180.058 34 6.426464 1.292941 51.67987 10.192695 0.7537104 42.60187
## 2  2 3210.823 43 7.832384 1.798535 50.10409 11.815088 0.9789506 41.30757
## 3  3 3311.551 40 7.516569 1.288461 48.74001 10.079057 0.1903262 36.47716
## 4  4 3266.844 32 5.906656 2.075259 50.92745  8.719123 0.9364825 42.47987
## 5  5 3664.088 35 7.181873 2.762643 55.16882  9.436559 1.4977829 47.78528
## 6  6 3328.988 40 9.723429 3.054057 51.14812 11.589403 1.6645837 38.26386

Subsetting rows using tidyverse:

# Selecting to retain the first 100 rows
subset.tidy4 <- full.data.tidy %>% slice(1:100) 
dim(subset.tidy4)
## [1] 100  12
# Selecting to remove the first 100 rows
subset.tidy5 <- full.data.tidy %>% slice(-c(1:100))
dim(subset.tidy5)
## [1] 100  12

Filtering data based on conditional statements using tidyverse:

subset.tidy6 <- full.data.tidy %>% filter(BMI > 25 & MAge > 31)
dim(subset.tidy6)
## [1] 49 12

Another example of a conditional statement that can be used to filter data:

subset.tidy7 <- full.data.tidy %>% filter(BMI > 25 & MAge > 31) %>% select(BMI, MAge, MEdu)

Melting and Casting Data using Tidyverse Syntax

To melt and cast data in tidyverse, you can use the ‘pivot’ functions (i.e., ‘pivot_longer’ or ‘pivot_wider’). These are exemplified below.

Melting to long format using tidyverse:

full.pivotlong <- full.data.tidy %>% pivot_longer(-ID, names_to = "var", values_to = "value")
head(full.pivotlong, 15)
## # A tibble: 15 × 3
##       ID var      value
##    <int> <chr>    <dbl>
##  1     1 BMI     27.7  
##  2     1 MAge    23.0  
##  3     1 MEdu     3    
##  4     1 BW    3180.   
##  5     1 GA      34    
##  6     1 DWAs     6.43 
##  7     1 DWCd     1.29 
##  8     1 DWCr    51.7  
##  9     1 UAs     10.2  
## 10     1 UCd      0.754
## 11     1 UCr     42.6  
## 12     2 BMI     26.8  
## 13     2 MAge    30.1  
## 14     2 MEdu     3    
## 15     2 BW    3211.

Casting to wide format using tidyverse:

full.pivotwide <- full.pivotlong %>% pivot_wider(names_from = "var", values_from="value")
head(full.pivotwide)
## # A tibble: 6 × 12
##      ID   BMI  MAge  MEdu    BW    GA  DWAs  DWCd  DWCr   UAs   UCd   UCr
##   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1  27.7  23.0     3 3180.    34  6.43  1.29  51.7 10.2  0.754  42.6
## 2     2  26.8  30.1     3 3211.    43  7.83  1.80  50.1 11.8  0.979  41.3
## 3     3  33.2  28.0     3 3312.    40  7.52  1.29  48.7 10.1  0.190  36.5
## 4     4  30.1  34.8     3 3267.    32  5.91  2.08  50.9  8.72 0.936  42.5
## 5     5  37.4  42.7     3 3664.    35  7.18  2.76  55.2  9.44 1.50   47.8
## 6     6  33.3  24.9     3 3329.    40  9.72  3.05  51.1 11.6  1.66   38.3

Concluding Remarks

Together, this training module provides introductory level information on the basics of data organization in R. The important data organization / manipulation methods of merging, filtering, subsetting, melting, and casted are presented on an environmentally relevant dataset.