1.2 Data Organization Basics
This training module was developed by Dr. Kyle R. Roell and Dr. Julia E. Rager
Fall 2021
Introduction to Training Module
Data within the fields of exposure science, toxicology, and public health are very rarely prepared and ready for all statistical analyses/visualization code. The beginning of almost any scripted analysis includes important formatting steps. These steps largely encompass data organization, manipulation, and other steps in preparation for actual statistical analyses/visualizations. Data organization and manipulation generally refers to organizing and formatting data in a way that makes it easier to read and work with. This can be done through several approaches, including:
- Base R operations and functions, or
- A collection of packages (and philosophy) known as Tidyverse.
In this training tutorial we will go over some of the most common ways you can organize and manipulate data, including:
- Merging data
- Filtering and subsetting data
- Melting and casting data
These approaches will first be taught using the basic operations and functions in R. Then, the exact same approaches will be taught using the Tidyverse package and associated functions and syntax.
These data manipulation and organization methods are demonstrated using an example environmentally relevant human cohort dataset. This cohort was generated by creating data distributions randomly pulled from our previously published cohorts, resulting in a bespoke dataset for these training purposes with associated demographic data and variable environmental exposure metrics from metal levels obtained using sources of drinking water and human urine samples.
Set your working directory
In preparation, first let’s set our working directory to the folderpath that contains our input files
setwd("/filepath to where your input files are")
Note that in macOS, filepaths use “/” as folder separaters; whereas in PCs, filepaths use “".
Importing example datasets
Then let’s read in our example datasets
<- read.csv("Module1_2/Module1_2_DemographicData.csv")
demo.data <- read.csv("Module1_2/Module1_2_ChemicalData.csv") chem.data
Viewing example datasets
Let’s see what these datasets look like, starting with the chemical measures:
dim(chem.data)
## [1] 200 7
The chemical measurement dataset includes 200 rows x 7 columns
1:10,1:7] chem.data[
## ID DWAs DWCd DWCr UAs UCd UCr
## 1 1 6.426464 1.292941 51.67987 10.192695 0.7537104 42.60187
## 2 2 7.832384 1.798535 50.10409 11.815088 0.9789506 41.30757
## 3 3 7.516569 1.288461 48.74001 10.079057 0.1903262 36.47716
## 4 4 5.906656 2.075259 50.92745 8.719123 0.9364825 42.47987
## 5 5 7.181873 2.762643 55.16882 9.436559 1.4977829 47.78528
## 6 6 9.723429 3.054057 51.14812 11.589403 1.6645837 38.26386
## 7 7 6.268547 1.218410 52.08578 8.887948 0.6347667 39.45535
## 8 8 6.718448 1.414975 54.96740 9.304968 0.6658849 45.09987
## 9 9 9.074928 2.727755 55.72826 10.818153 1.6585757 42.58577
## 10 10 5.771691 2.410993 47.06552 8.747217 1.7354305 34.80661
These data are organized according to subject ID (first column), followed by measures of:
- DWAs (drinking water arsenic levels in µg/L)
- DWCd (drinking water cadmium levels in µg/L)
- DWCr (drinking water chromium levels in µg/L)
- UAs (urinary arsenic levels in µg/L)
- UCd (urinary cadmium levels in µg/L)
- UCr (urinary chromium levels in µg/L)
Now let’s view the demographic data:
dim(demo.data)
## [1] 200 6
The subject demographic dataset includes 200 rows x 6 columns
1:10,1:6] demo.data[
## ID BMI MAge MEdu BW GA
## 1 1 27.7 22.99928 3 3180.058 34
## 2 2 26.8 30.05142 3 3210.823 43
## 3 3 33.2 28.04660 3 3311.551 40
## 4 4 30.1 34.81796 3 3266.844 32
## 5 5 37.4 42.68440 3 3664.088 35
## 6 6 33.3 24.94960 3 3328.988 40
## 7 7 24.8 29.54798 3 3061.949 30
## 8 8 16.9 24.94954 3 3332.539 38
## 9 9 36.9 33.58589 3 3260.482 39
## 10 10 21.7 39.29018 3 3141.723 35
These data are organized according to subject ID (first column) followed by the following subject information:
- BMI (body mass index)
- MAge (maternal age, years)
- MEdu (maternal education, 1= “less than high school”; 2= “high school or some college”; 3= “college or greater”)
- BW (body weight, grams)
- GA (gestational age, week)
Data Manipulation using Base R
Merging Data using Base R Syntax
Merging datasets represents the joining together of two or more datasets, while connecting the datasets using a common identifier (generally some sort of ID). This is useful if you have multiple datasets describing different aspects of the study, different variables, or different measures across the same samples. Samples could correspond to the same study participants, animals, cell culture samples, environmental media samples, etc, depending on the study design. In the current example, we will be joining human demographic data and environmental metals exposure data collected from drinking water and human urine samples.
Let’s start by merging the example demographic data with the chemical measurement data using the base R function of “merge”. To learn more about this function, you can type the following:
?merge
which brings up helpful information in the R console
To merge these datasets using the merge function, use the following code:
# Note that we specify to merge these datasets by their shared ID column
<- merge(demo.data, chem.data, by="ID")
full.data dim(full.data)
## [1] 200 12
This merged dataframe contains 200 rows x 12 columns
Viewing this merged dataframe
1:10, 1:12] full.data[
## ID BMI MAge MEdu BW GA DWAs DWCd DWCr UAs
## 1 1 27.7 22.99928 3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2 2 26.8 30.05142 3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3 3 33.2 28.04660 3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4 4 30.1 34.81796 3 3266.844 32 5.906656 2.075259 50.92745 8.719123
## 5 5 37.4 42.68440 3 3664.088 35 7.181873 2.762643 55.16882 9.436559
## 6 6 33.3 24.94960 3 3328.988 40 9.723429 3.054057 51.14812 11.589403
## 7 7 24.8 29.54798 3 3061.949 30 6.268547 1.218410 52.08578 8.887948
## 8 8 16.9 24.94954 3 3332.539 38 6.718448 1.414975 54.96740 9.304968
## 9 9 36.9 33.58589 3 3260.482 39 9.074928 2.727755 55.72826 10.818153
## 10 10 21.7 39.29018 3 3141.723 35 5.771691 2.410993 47.06552 8.747217
## UCd UCr
## 1 0.7537104 42.60187
## 2 0.9789506 41.30757
## 3 0.1903262 36.47716
## 4 0.9364825 42.47987
## 5 1.4977829 47.78528
## 6 1.6645837 38.26386
## 7 0.6347667 39.45535
## 8 0.6658849 45.09987
## 9 1.6585757 42.58577
## 10 1.7354305 34.80661
We can see that the merge function retained the first column in each original dataframe (ID), though did not replicate it since it was used as the identifier to merge off of. All other columns include their original data, just merged together by the IDs in the first column.
These datasets were actually quite easy to merge, since they had the same exact column identifier and number of rows. You can edit your script to include more specifics in instances when these may differ across datasets that you would like to merge. For example:
<- merge(demo.data, chem.data, by.x="ID", by.y="ID")
full.data # This option allows you to edit the column header text that is used in each
# dataframe. Here, these are still the same "ID", but you can see that adding
# this script allows you to specify instances when differ header text is used.
Filtering and Subsetting Data using Base R Syntax
Filtering and subsetting data are useful tools when you need to focus your dataset to highlight data you are interested in analyzing downstream. These could represent, for example, specific samples or participants that meet certain criteria that you are interested in evaluating. It is also useful for simply removing particular variables or samples from dataframes as you are working through your script. These methods are illustrated here.
For this example, let’s first define a vector of columns that we want to keep in our analysis
<- c("BMI", "MAge", "MEdu")
subset.columns subset.columns
## [1] "BMI" "MAge" "MEdu"
Now we can simply subset our data using those columns
# Subsetting the data by selecting the columns represented in the defined
# 'subset.columns' vector
<- full.data[,subset.columns]
subset.data1
# Viewing the top of this subsetted dataframe
head(subset.data1)
## BMI MAge MEdu
## 1 27.7 22.99928 3
## 2 26.8 30.05142 3
## 3 33.2 28.04660 3
## 4 30.1 34.81796 3
## 5 37.4 42.68440 3
## 6 33.3 24.94960 3
Conversely, if we want to remove all columns except those that we are interested in within the ‘subset.columns’ vector, we can write the code as follows (to achieve the same results).
Note that we have to first create a vector of TRUE/FALSE’s here to execute the removal script written below:
# First specify which columns we would like to remove
<- colnames(full.data) %in% subset.columns
remove.columns
# Viewing this new vector
remove.columns
## [1] FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
This creates a vector of TRUE/FALSE’s denoting whether or not each column is included in the ‘subset.columns’ vector
Now we can subset our dataset. Here, we decide to keep those that are labeled ‘FALSE’ in the remove.columns vector. This will remove the columns that are NOT contained in the subset.columns vector
<- full.data[,!remove.columns]
subset.data2
# Viewing the top of this dataframe
head(subset.data2)
## ID BW GA DWAs DWCd DWCr UAs UCd UCr
## 1 1 3180.058 34 6.426464 1.292941 51.67987 10.192695 0.7537104 42.60187
## 2 2 3210.823 43 7.832384 1.798535 50.10409 11.815088 0.9789506 41.30757
## 3 3 3311.551 40 7.516569 1.288461 48.74001 10.079057 0.1903262 36.47716
## 4 4 3266.844 32 5.906656 2.075259 50.92745 8.719123 0.9364825 42.47987
## 5 5 3664.088 35 7.181873 2.762643 55.16882 9.436559 1.4977829 47.78528
## 6 6 3328.988 40 9.723429 3.054057 51.14812 11.589403 1.6645837 38.26386
We can also easily subset data based on row numbers. For example, to keep only the first 100 rows:
<- full.data[1:100,]
subset.data3
# Viewing the dimensions of this new dataframe
dim(subset.data3)
## [1] 100 12
To remove the first 100 rows:
<- full.data[-c(1:100),]
subset.data4
# Viewing the dimensions of this new dataframe
dim(subset.data4)
## [1] 100 12
To filter data using conditional statements:
<- full.data[which(full.data$BMI > 25 & full.data$MAge > 31),]
subset.data5
# Viewing the top of this new dataframe
head(subset.data5)
## ID BMI MAge MEdu BW GA DWAs DWCd DWCr UAs
## 4 4 30.1 34.81796 3 3266.844 32 5.906656 2.0752589 50.92745 8.719123
## 5 5 37.4 42.68440 3 3664.088 35 7.181873 2.7626433 55.16882 9.436559
## 9 9 36.9 33.58589 3 3260.482 39 9.074928 2.7277549 55.72826 10.818153
## 13 13 33.7 33.82961 3 3481.293 36 7.101634 0.8443918 47.11677 9.967185
## 22 22 25.7 37.08028 3 3387.046 43 7.207447 2.8088453 48.08648 9.446643
## 31 31 28.4 47.85761 3 3173.033 30 6.032807 2.1929549 45.71856 9.917588
## UCd UCr
## 4 0.9364825 42.47987
## 5 1.4977829 47.78528
## 9 1.6585757 42.58577
## 13 -0.3466431 36.74220
## 22 1.9891049 34.16921
## 31 1.1194851 37.82297
Filtering data based on conditions can also be done using the subset function:
<- subset(full.data, BMI > 25 & MAge > 31) subset.data6
Additionally, we can subset and select specific columns we would like to keep, using ‘select’ within the subset function:
<- subset(full.data, BMI < 22 | BMI > 27,
subset.data7 select=c("BMI", "MAge", "MEdu"))
For more information on the subset function, see its associated RDocumentation website.
Melting and Casting Data using Base R Syntax
Melting and casting refers to the conversion of data to “long” or “wide” form. You will often see data within the environmental health field in wide format; though long format is necessary for some procedures, such as plotting with ggplot2.
Here, we’ll illustrate some example script to melt and cast data using the reshape2 package. Let’s first load the reshape2 library:
library(reshape2)
Using the fully merged dataframe, let’s remind ourselves what these data look like in the current dataframe format:
head(full.data)
## ID BMI MAge MEdu BW GA DWAs DWCd DWCr UAs
## 1 1 27.7 22.99928 3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2 2 26.8 30.05142 3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3 3 33.2 28.04660 3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4 4 30.1 34.81796 3 3266.844 32 5.906656 2.075259 50.92745 8.719123
## 5 5 37.4 42.68440 3 3664.088 35 7.181873 2.762643 55.16882 9.436559
## 6 6 33.3 24.94960 3 3328.988 40 9.723429 3.054057 51.14812 11.589403
## UCd UCr
## 1 0.7537104 42.60187
## 2 0.9789506 41.30757
## 3 0.1903262 36.47716
## 4 0.9364825 42.47987
## 5 1.4977829 47.78528
## 6 1.6645837 38.26386
These data are represented by single subject identifiers listed as unique IDs per row, with associated environmental measures and demographic data organized across the columns. Thus, this dataframe is currently in wide (also known as casted) format.
Let’s convert this dataframe to long (also known as melted) format:
# Here, we are saying that we want a row for each unique
# sample ID - variable measure pair
<- melt(full.data, id="ID")
full.melted
# Viewing this new dataframe
head(full.melted)
## ID variable value
## 1 1 BMI 27.7
## 2 2 BMI 26.8
## 3 3 BMI 33.2
## 4 4 BMI 30.1
## 5 5 BMI 37.4
## 6 6 BMI 33.3
You can see here that each measure that was originally contained as a unique column has been reoriented, such that the original column header is now listed throughout the second column labeled “variable”. Then, the third column contains the value of this variable.
Let’s see an example view of the middle of this new dataframe
1100:1110,1:3] full.melted[
## ID variable value
## 1100 100 DWAs 7.928885
## 1101 101 DWAs 8.677403
## 1102 102 DWAs 8.115183
## 1103 103 DWAs 7.134189
## 1104 104 DWAs 8.816142
## 1105 105 DWAs 7.487227
## 1106 106 DWAs 7.541973
## 1107 107 DWAs 6.313516
## 1108 108 DWAs 6.654474
## 1109 109 DWAs 7.564429
## 1110 110 DWAs 7.357122
Here, we can see a different variable (DWAs) now being listed. This continues throughout the entire dataframe, which has the following dimensions:
dim(full.melted)
## [1] 2200 3
Thus, this dataframe is clearly melted, in long format.
Let’s now re-cast this dataframe back into wide format using the ‘dcast’ function
# Here, we are telling the dcast
# function to give us a sample (ID) for every variable in the column labeled 'variable'.
# Then it automatically fills the dataframe with values from the 'value' column
<- dcast(full.melted, ID ~ variable)
full.cast head(full.cast)
## ID BMI MAge MEdu BW GA DWAs DWCd DWCr UAs
## 1 1 27.7 22.99928 3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2 2 26.8 30.05142 3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3 3 33.2 28.04660 3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4 4 30.1 34.81796 3 3266.844 32 5.906656 2.075259 50.92745 8.719123
## 5 5 37.4 42.68440 3 3664.088 35 7.181873 2.762643 55.16882 9.436559
## 6 6 33.3 24.94960 3 3328.988 40 9.723429 3.054057 51.14812 11.589403
## UCd UCr
## 1 0.7537104 42.60187
## 2 0.9789506 41.30757
## 3 0.1903262 36.47716
## 4 0.9364825 42.47987
## 5 1.4977829 47.78528
## 6 1.6645837 38.26386
Here, we can see that this dataframe is back in its original casted (or wide) format.
Introduction to Tidyverse
Tidyverse is a collection of packages that are commonly used to more efficiently organize and manipulate datasets in R. This collection of packages has its own specific type of syntax, dataset and formatting protocols that slightly differ from the Base R functions. Here, we will carry out all the of the same data organization exercises described above using Tidyverse.
Downloading and Loading the Tidyverse Package
If you don’t have tidyverse already installed, you will need to install it using:
if(!require(tidyverse)) install.packages("tidyverse")
And then load the tidyverse package using:
library(tidyverse)
Merging Data using Tidyverse Syntax
To merge the same example dataframes using tidyverse, you can run the following script:
<- inner_join(demo.data, chem.data, by="ID")
full.data.tidy # Note, for future scripting purposes, we can still merge with different IDs
# using: by = c("ID.Demo"="ID.Chem")
head(full.data.tidy)
## ID BMI MAge MEdu BW GA DWAs DWCd DWCr UAs
## 1 1 27.7 22.99928 3 3180.058 34 6.426464 1.292941 51.67987 10.192695
## 2 2 26.8 30.05142 3 3210.823 43 7.832384 1.798535 50.10409 11.815088
## 3 3 33.2 28.04660 3 3311.551 40 7.516569 1.288461 48.74001 10.079057
## 4 4 30.1 34.81796 3 3266.844 32 5.906656 2.075259 50.92745 8.719123
## 5 5 37.4 42.68440 3 3664.088 35 7.181873 2.762643 55.16882 9.436559
## 6 6 33.3 24.94960 3 3328.988 40 9.723429 3.054057 51.14812 11.589403
## UCd UCr
## 1 0.7537104 42.60187
## 2 0.9789506 41.30757
## 3 0.1903262 36.47716
## 4 0.9364825 42.47987
## 5 1.4977829 47.78528
## 6 1.6645837 38.26386
Filtering and Subsetting Data using Tidyverse Syntax
To subset columns in tidyverse, run the following:
= full.data.tidy %>% select(all_of(subset.columns))
subset.tidy1 head(subset.tidy1)
## BMI MAge MEdu
## 1 27.7 22.99928 3
## 2 26.8 30.05142 3
## 3 33.2 28.04660 3
## 4 30.1 34.81796 3
## 5 37.4 42.68440 3
## 6 33.3 24.94960 3
Note that you can also include column identifiers that may get dropped in the subsetting vector here:
# Note that we're including a 'fake' column here 'NotAColName' to illustrate
# how to incorporate additional columns; though this column gets dropped in
# the next line of code
<- c(subset.columns, "NotAColName")
subset.columns2
# Viewing this new vector
subset.columns2
## [1] "BMI" "MAge" "MEdu" "NotAColName"
<- full.data.tidy %>% select(any_of(subset.columns2))
subset.tidy2
# Viewing the top of this new dataframe
head(subset.tidy2)
## BMI MAge MEdu
## 1 27.7 22.99928 3
## 2 26.8 30.05142 3
## 3 33.2 28.04660 3
## 4 30.1 34.81796 3
## 5 37.4 42.68440 3
## 6 33.3 24.94960 3
Note that the ‘fake’ column ‘NotAColName’ gets automatically dropped here
To remove columns using tidyverse, you can run the following:
# Removing columns
<- full.data.tidy %>% select(-subset.columns)
subset.tidy3
# Viewing this new dataframe
head(subset.tidy3)
## ID BW GA DWAs DWCd DWCr UAs UCd UCr
## 1 1 3180.058 34 6.426464 1.292941 51.67987 10.192695 0.7537104 42.60187
## 2 2 3210.823 43 7.832384 1.798535 50.10409 11.815088 0.9789506 41.30757
## 3 3 3311.551 40 7.516569 1.288461 48.74001 10.079057 0.1903262 36.47716
## 4 4 3266.844 32 5.906656 2.075259 50.92745 8.719123 0.9364825 42.47987
## 5 5 3664.088 35 7.181873 2.762643 55.16882 9.436559 1.4977829 47.78528
## 6 6 3328.988 40 9.723429 3.054057 51.14812 11.589403 1.6645837 38.26386
Subsetting rows using tidyverse:
# Selecting to retain the first 100 rows
<- full.data.tidy %>% slice(1:100)
subset.tidy4 dim(subset.tidy4)
## [1] 100 12
# Selecting to remove the first 100 rows
<- full.data.tidy %>% slice(-c(1:100))
subset.tidy5 dim(subset.tidy5)
## [1] 100 12
Filtering data based on conditional statements using tidyverse:
<- full.data.tidy %>% filter(BMI > 25 & MAge > 31)
subset.tidy6 dim(subset.tidy6)
## [1] 49 12
Another example of a conditional statement that can be used to filter data:
<- full.data.tidy %>% filter(BMI > 25 & MAge > 31) %>% select(BMI, MAge, MEdu) subset.tidy7
Melting and Casting Data using Tidyverse Syntax
To melt and cast data in tidyverse, you can use the ‘pivot’ functions (i.e., ‘pivot_longer’ or ‘pivot_wider’). These are exemplified below.
Melting to long format using tidyverse:
<- full.data.tidy %>% pivot_longer(-ID, names_to = "var", values_to = "value")
full.pivotlong head(full.pivotlong, 15)
## # A tibble: 15 × 3
## ID var value
## <int> <chr> <dbl>
## 1 1 BMI 27.7
## 2 1 MAge 23.0
## 3 1 MEdu 3
## 4 1 BW 3180.
## 5 1 GA 34
## 6 1 DWAs 6.43
## 7 1 DWCd 1.29
## 8 1 DWCr 51.7
## 9 1 UAs 10.2
## 10 1 UCd 0.754
## 11 1 UCr 42.6
## 12 2 BMI 26.8
## 13 2 MAge 30.1
## 14 2 MEdu 3
## 15 2 BW 3211.
Casting to wide format using tidyverse:
<- full.pivotlong %>% pivot_wider(names_from = "var", values_from="value")
full.pivotwide head(full.pivotwide)
## # A tibble: 6 × 12
## ID BMI MAge MEdu BW GA DWAs DWCd DWCr UAs UCd UCr
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 27.7 23.0 3 3180. 34 6.43 1.29 51.7 10.2 0.754 42.6
## 2 2 26.8 30.1 3 3211. 43 7.83 1.80 50.1 11.8 0.979 41.3
## 3 3 33.2 28.0 3 3312. 40 7.52 1.29 48.7 10.1 0.190 36.5
## 4 4 30.1 34.8 3 3267. 32 5.91 2.08 50.9 8.72 0.936 42.5
## 5 5 37.4 42.7 3 3664. 35 7.18 2.76 55.2 9.44 1.50 47.8
## 6 6 33.3 24.9 3 3329. 40 9.72 3.05 51.1 11.6 1.66 38.3