6.2 -Omics and System Biology: Transcriptomic Applications

This training module was developed by Lauren E. Koval, Dr. Kyle Roell, and Dr. Julia E. Rager.

All input files (script, data, and figures) can be downloaded from the UNC-SRP TAME2 GitHub website.

Introduction to Training Module

This training module incorporates the highly relevant example of RNA sequencing to evaluate the impacts of environmental exposures on cellular responses and general human health. RNA sequencing is the most common method that is currently implemented to measure the transcriptome. Results from an RNA sequencing platform are often summarized as count data, representing the number of relative times a gene (or other annotated portion of the genome) was ‘read’ in a given sample. For more details surrounding the methodological underpinnings of RNA sequencing, see the following recent review:

Stark R, Grzelak M, Hadfield J. RNA sequencing: the teenage years. Nat Rev Genet. 2019 Nov;20(11):631-656. doi: 10.1038/s41576-019-0150-2. Epub 2019 Jul 24. PMID: 31341269.

In this training module, we guide participants through an example RNA sequencing analysis. Here, we analyze RNA sequencing data collected in a toxicology study evaluating the effects of biomass smoke exposure, representing wildfire-relevant exposure conditions. This study has been previously been described in the following publications:

Rager JE, Clark J, Eaves LA, Avula V, Niehoff NM, Kim YH, Jaspers I, Gilmour MI. Mixtures modeling identifies chemical inducers versus repressors of toxicity associated with wildfire smoke. Sci Total Environ. 2021 Jun 25;775:145759. doi: 10.1016/j.scitotenv.2021.145759. Epub 2021 Feb 10. PMID: 33611182.
Kim YH, Warren SH, Krantz QT, King C, Jaskot R, Preston WT, George BJ, Hays MD, Landis MS, Higuchi M, DeMarini DM, Gilmour MI. Mutagenicity and Lung Toxicity of Smoldering vs. Flaming Emissions from Various Biomass Fuels: Implications for Health Effects from Wildland Fires. Environ Health Perspect. 2018 Jan 24;126(1):017011. doi: 10.1289/EHP2200. PMID: 29373863.

Here, we specifically analyze mRNA sequencing profiles collected in mouse lung tissues. These mice were exposed to two different biomass burn scenarios: smoldering pine needles and flaming pine needles, representing certain wildfire smoke exposure scenarios that can occur. The goal of these analyses is to identify which genes demonstrate altered expression in response to these wildfire-relevant exposures, and identify which biological pathways these genes influence to evaluate findings at the systems biology level.

This training module begins by guiding users through the loading, viewing, and formatting of the example transcriptomics datasets and associated metadata. Methods to carry out quality assurance (QA) / quality control (QC) of the transcriptomics data are then described, which are advantageous to ensure high quality data are included in the final statistical analysis. Because these transcriptomic data were derived from bulk lung tissue samples, consisting of mixed cell populations that could have shifted in response to exposures, data are then adjusted for potential sources of heterogeneity using the R package RUVseq.

Statistical models are then implemented to identify genes that were significantly differentially expressed between exposed vs unexposed samples. Models are implemented using algorithms within the commonly implemented R package DESeq2. This package is very convenient, well written, and widely used. The main advantage of this package is that is allows you to perform differential expression analyses and easily obtain various statistics and results with minimal script development on the user-end.

After obtaining results from differential gene expression analyses, we visualize these results using both MA and volcano plots. Finally, we carry out a systems level analysis through pathway enrichment using the R package PIANO to identify which biological pathways were altered in response to these wildfire-relevant exposure scenarios.

Introduction to the Field of “-Omics”

The field of “-omics” has rapidly evolved since its inception in the mid-1990’s, initiated from information obtained through sequencing of the human genome (see the Human Genome Project) as well as the advent of high-content technologies. High-content technologies have allowed the rapid and economical assessment of genome-wide, or ‘omics’-based, endpoints.

Traditional molecular biology techniques typically evaluate the function(s) of individual genes and gene products. Omics-based methods, on the other hand, utilize non-targeted methods to identify many to all genes or gene products in a given environmental/biological sample. These non-targeted approaches allow for the unbiased investigation of potentially unknown or understudied molecular mediators involved in regulating cell health and disease. These molecular profiles have the potential of being altered in response to toxicant exposures and/or during disease initiation/progression.

To further understand the molecular consequences of -omics-based alterations, molecules can be overlaid onto molecular networks to uncover biological pathways and molecular functions that are perturbed at the systems biology level. An overview of these generally methods, starting with high-content technologies and ending of systems biology, is provided in the below figure (created with BioRender.com).

A helpful introduction to the field of -omics in relation to environmental health, as well as methods used to relate -omic-level alterations to systems biology, is provided in the following book chapter:

Rager JE, Fry RC. Systems Biology and Environmental Exposures. Chpt 4 of ‘Network Biology’ edited by WenJun Zhang. 2013. ISBN: 978-1-62618-941-3. Nova Science Publishers, Inc. Available at: https://www.novapublishers.com/wp-content/uploads/2019/07/978-1-62618-942-3_ch4.pdf.

An additional helpful resource describing computational methods that can be used in systems level analyses is the following book chapter:

Meisner M, Reif DM. Computational Methods Used in Systems Biology. Chpt 5 of ‘Systems Biology in Toxicology and Environmental Health’ edited by Fry RC. 2015: 85-115. ISBN 9780128015643. Academic Press. Available at: https://www.sciencedirect.com/science/article/pii/B9780128015643000055.

Parallel to human genomics/epigenomics-based research is the newer “-omics” topic of the exposome. The exposome was originally conceptualized as ’all life-course environmental exposures (including lifestyle factors), from the prenatal period onwards (Wild et al. 2005). Since then, this concept has received much attention and additional associated definitions. We like to think of the exposome as including anything in ones environment that may impact the overall health of an individual, excluding the individual’s genome/epigenome. Common elements evaluated as part of the exposome include environmental exposures, such as chemicals and other substances that may impart toxicity. Additional potential stressors include lifestyle factors, socioeconomic factors, infectious agents, therapeutics, and other stressors that may be altered internally (e.g., microbiome). A helpful review of this research field is provided as the following publication:

Wild CP. The exposome: from concept to utility. Int J Epidemiol. 2012 Feb;41(1):24-32. doi: 10.1093/ije/dyr236. Epub 2012 Jan 31. PMID: 22296988.

Introduction to Transcriptomics

One of the most widely evaluated -omics endpoints is messenger RNA (mRNA) expression (also termed gene expression). As a reminder, mRNA molecules are a major type of RNA produced as the “middle step” in the Central Dogma Theory, which describes how genetic DNA is first transcribed into RNA and then translated into protein. Protein molecules are ultimately the major regulators of cellular processes and overall health. Therefore, any perturbations to this process (including changes to mRNA expression levels) can have tremendous consequences on overall cell function and health. A visualization of these steps in the Central Dogma theory are included below.

mRNA expression can be evaluated in a high-throughout/high-content manner, across the genome, and is referred to as the transcriptome when doing so. Transcriptomics can be measured using a variety of technologies, including high-density nucleic acid arrays (e.g., DNA microarrays or GeneChip arrays), high-throughput PCR technologies, or RNA sequencing technologies. These methods are used to obtain relative measures of genes that are being expressed or transcribed from DNA by measuring the abundance of mRNA molecules. Results of these methods are often termed as providing gene expression signatures or ‘transcriptomes’ of a sample under evaluation.

Training Module’s Environmental Health Questions

This training module was specifically developed to answer the following environmental health questions:

What two types of data are commonly needed in the analysis of transcriptomics data?
When preparing transcriptomics data for statistical analyses, what are three common data filtering steps that are completed during the data QA/QC process?
When identifying potential sample outliers in a typical transcriptomics dataset, what two types of approaches are commonly employed to identify samples with outlying data distributions?
What is an approach that analysts can use when evaluating transcriptomic data from tissues of mixed cellular composition to aid in controlling for sources of sample heterogeneity?
How many genes showed significant differential expression associated with flaming pine needles exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?
How many genes showed significant differential expression associated with smoldering pine needles exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?
How many genes showed significant differential expression associated with lipopolysaccharide (LPS) exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?
What biological pathways are disrupted in association with flaming pine needles exposure in the lung, identified through systems level analyses?

Workspace Preparation and Data Import

Installing required R packages

If you already have these packages installed, you can skip this step, or you can run the below code which checks installation status for you

if (!requireNamespace("tidyverse"))
   install.packages("tidyverse");
if (!requireNamespace("BiocManager"))
   BiocManager::install("BiocManager");
if (!requireNamespace("DESeq2"))
   BiocManager::install("DESeq2");
if (!requireNamespace("edgeR"))
   BiocManager::install("edgeR");
if (!requireNamespace("RUVSeq"))
   BiocManager::install("RUVSeq");
if (!requireNamespace("janitor"))
    install.packages("janitor");
if (!requireNamespace("pheatmap"))
   install.packages("pheatmap");
if (!requireNamespace("factoextra"))
    install.packages("factoextra");
if (!requireNamespace("RColorBrewer"))
     install.packages("RColorBrewer");
if (!requireNamespace("data.table"))
    install.packages("data.table");
if (!requireNamespace("EnhancedVolcano"))
   BiocManager::install("EnhancedVolcano");
if (!requireNamespace("piano"))
   BiocManager::install("piano");

Loading R packages required for this session

library(tidyverse)
library(DESeq2)
library(edgeR)
library(RUVSeq)
library(janitor)
library(factoextra)
library(pheatmap)
library(data.table)
library(RColorBrewer)
library(EnhancedVolcano)
library(piano)

Set your working directory

setwd("/filepath to where your input files are")

Loading the Example Transcriptomic Dataset and Metadata

First, let’s read in the transcriptional signature data, previously summarized as number of sequence reads per gene (also simply referred to as ‘count data’) and its associated metadata file:

# Read in the count data
countdata <- read.csv(file = 'Chapter_6/Module6_2_Input/Module6_2_InputData1_GeneCounts.csv', check.names = FALSE)

# Read in the metadata (describing information on each sample)
sampleinfo <- read.csv(file = "Chapter_6/Module6_2_Input/Module6_2_InputData2_SampleInfo.csv", check.names = FALSE)

Data Viewing

Let’s see how many rows and columns of data are present in the countdata dataframe

dim(countdata)

## [1] 30146    23

Let’s also view the column headers

colnames(countdata)

##  [1] "Gene"           "Plate1-m13-RNA" "Plate2-m14-RNA" "Plate2-m15-RNA"
##  [5] "Plate2-m16-RNA" "Plate2-m17-RNA" "Plate2-m18-RNA" "Plate1-m49-RNA"
##  [9] "Plate1-m50-RNA" "Plate1-m51-RNA" "Plate1-m52-RNA" "Plate1-m53-RNA"
## [13] "Plate1-m54-RNA" "Plate1-m31-RNA" "Plate1-m32-RNA" "Plate1-m33-RNA"
## [17] "Plate1-m34-RNA" "Plate1-m35-RNA" "Plate1-m36-RNA" "Plate1-m67-RNA"
## [21] "Plate1-m68-RNA" "Plate1-m69-RNA" "Plate1-m70-RNA"

And finally let’s view the top few rows of data

head(countdata)

##           Gene Plate1-m13-RNA Plate2-m14-RNA Plate2-m15-RNA Plate2-m16-RNA
## 1  Bcat1_29039              0              0              0              0
## 2 Laptm5_29040              0              2              0              1
## 3   Skp2_29041              0              0              0              0
## 4   Nme1_29042            765           1041           1028            856
## 5   Ctsd_29043             25             30             38             24
## 6  Camk1_29044            443            445            433            628
##   Plate2-m17-RNA Plate2-m18-RNA Plate1-m49-RNA Plate1-m50-RNA Plate1-m51-RNA
## 1              0              0              0              0              0
## 2              2              1              0              1              2
## 3              0              0              0              0              0
## 4            888            967            816            569            854
## 5             17             19             38             38             34
## 6            497            684            574             71            356
##   Plate1-m52-RNA Plate1-m53-RNA Plate1-m54-RNA Plate1-m31-RNA Plate1-m32-RNA
## 1              0              2              0              0              3
## 2              2              7              1              2              1
## 3              0              0              0              0              0
## 4            736            887            930            563            776
## 5             30             24             23             29             21
## 6            403            179            302            172            487
##   Plate1-m33-RNA Plate1-m34-RNA Plate1-m35-RNA Plate1-m36-RNA Plate1-m67-RNA
## 1              0              0              0              0              1
## 2              5              0              0              0              1
## 3              0              0              0              0              0
## 4            604           1008            724            632            950
## 5             27             26             18             22             33
## 6            487            429            599            683            147
##   Plate1-m68-RNA Plate1-m69-RNA Plate1-m70-RNA
## 1              0              0              0
## 2              3              2              2
## 3              0              0              0
## 4            782            879            860
## 5             19             12             14
## 6            265            335            454

Together, this dataframe contains information across 30146 mRNA identifiers, that are labeled according to “Gene name” followed by an underscore and probe number assigned by the platform used in this analysis, BioSpyder TempoSeq Technologies.

A total of 23 columns are included in this dataframe, the first of which represents the gene identifier, followed by gene count data across 22 samples.

Let’s also see what the metadata dataframe looks like

dim(sampleinfo)

## [1] 22  9

Let’s also view the column headers

colnames(sampleinfo)

## [1] "SampleID_BioSpyderCountFile" "PlateBatch"                 
## [3] "MouseID"                     "NumericID"                  
## [5] "Treatment"                   "ID"                         
## [7] "Timepoint"                   "Tissue"                     
## [9] "Group"

And finally let’s view the top few rows of data

head(sampleinfo)

##   SampleID_BioSpyderCountFile PlateBatch MouseID NumericID          Treatment
## 1              Plate1-m13-RNA     Plate1     M13        13 PineNeedlesSmolder
## 2              Plate2-m14-RNA     Plate2     M14        14 PineNeedlesSmolder
## 3              Plate2-m15-RNA     Plate2     M15        15 PineNeedlesSmolder
## 4              Plate2-m16-RNA     Plate2     M16        16 PineNeedlesSmolder
## 5              Plate2-m17-RNA     Plate2     M17        17 PineNeedlesSmolder
## 6              Plate2-m18-RNA     Plate2     M18        18 PineNeedlesSmolder
##                       ID Timepoint Tissue                      Group
## 1 M13_PineNeedlesSmolder        4h   Lung PineNeedlesSmolder_4h_Lung
## 2 M14_PineNeedlesSmolder        4h   Lung PineNeedlesSmolder_4h_Lung
## 3 M15_PineNeedlesSmolder        4h   Lung PineNeedlesSmolder_4h_Lung
## 4 M16_PineNeedlesSmolder        4h   Lung PineNeedlesSmolder_4h_Lung
## 5 M17_PineNeedlesSmolder        4h   Lung PineNeedlesSmolder_4h_Lung
## 6 M18_PineNeedlesSmolder        4h   Lung PineNeedlesSmolder_4h_Lung

Together, this dataframe contains information across the 22 total samples, that are labeled according to “SampleID_BioSpyderCountFile” header. These identifiers match those used as column headers in the countdata dataframe.

A total of 9 columns are included in this dataframe, including the following:

SampleID_BioSpyderCountFile: The unique sample identifers (total n=22)
PlateBatch: The plate number that was used in the generation of these data.
‘MouseID’: The unique identifier, that starts with “M” followed by a number, for each mouse used in this study
NumericID: The unique numeric identifier for each mouse.
Treatment: The type of exposure condition that each mouse was administered. These include smoldering pine needles, flaming pine needles, vehicle control (saline), and positive inflammation control (LPS, or lipopolysaccharide)
ID: Another form of identifier that combines the mouse identifier with the exposure condition
Timepoint: The timepoint at which samples were collected (here, all 4h post-exposure)
Tissue: The type of tissue that was collected and analyzed (here, all lung tissue)
Group: The higher level identifier that groups samples together based on exposure condition, timepoint, and tissue

Checking for Duplicate mRNA IDs

One common QC/preparation step that is helpful when organizing transcriptomics data is to check for potential duplicate mRNA IDs in the countdata.

# Visualize this data quickly by viewing top left corner, to check where ID column is located:
countdata[1:3,1:5]

##           Gene Plate1-m13-RNA Plate2-m14-RNA Plate2-m15-RNA Plate2-m16-RNA
## 1  Bcat1_29039              0              0              0              0
## 2 Laptm5_29040              0              2              0              1
## 3   Skp2_29041              0              0              0              0

# Then check for duplicates within column 1 (where the ID column is located):
Dups <- duplicated(countdata[,1])
summary(Dups)

##    Mode   FALSE 
## logical   30146

In this case, because all potential duplicate checks turn up “FALSE”, these data do not contain duplicate mRNA identifiers in its current organized format.

Answer to Environmental Health Question 1

With this, we can now answer Environmental Health Question #1: What two types of data are commonly needed in the analysis of transcriptomics data?

Answer: A file containing the raw -omics signatures are needed (in this case, the count data summarized per gene acquired from RNA sequencing technologies), and a file containing the associated metadata describing the actual samples, where they were derived from, what they represent, etc, is needed.

Formatting Data for Downstream Statistics

Most of the statistical analyses included in this training module will be carried out using the DESeq2 pipeline. This package requires that the count data and sample information data be formatted in a certain manner, which will expedite the downstream coding needed to carry out the statistics. Here, we will walk users through these initial formatting steps.

DESeq2 first requires a coldata dataframe, which includes the sample information (i.e., metadata). Let’s create this new dataframe based on the original sampleinfo dataframe:

coldata <- sampleinfo

DESeq2 also requires a countdata dataframe, which we’ve previously created; however, this dataframe requires some minor formatting before it can be used as input for downstream script.

First, the gene identifiers need to be converted into row names:

countdata <- countdata %>% column_to_rownames("Gene")

Then, the column names need to be edited. Let’s remind ourselves what the column names are currently:

colnames(countdata)

##  [1] "Plate1-m13-RNA" "Plate2-m14-RNA" "Plate2-m15-RNA" "Plate2-m16-RNA"
##  [5] "Plate2-m17-RNA" "Plate2-m18-RNA" "Plate1-m49-RNA" "Plate1-m50-RNA"
##  [9] "Plate1-m51-RNA" "Plate1-m52-RNA" "Plate1-m53-RNA" "Plate1-m54-RNA"
## [13] "Plate1-m31-RNA" "Plate1-m32-RNA" "Plate1-m33-RNA" "Plate1-m34-RNA"
## [17] "Plate1-m35-RNA" "Plate1-m36-RNA" "Plate1-m67-RNA" "Plate1-m68-RNA"
## [21] "Plate1-m69-RNA" "Plate1-m70-RNA"

These column identifiers need to be converted into more intuitive sample IDs, that also indicate treatment. This information can be found in the coldata dataframe. Specifically, information in the column labeled SampleID_BioSpyderCountFile will be helpful for these purposes.

To replace these original column identifiers with these more helpful sample identifiers, let’s first make sure the order of the countdata columns are in the same order as the coldata column of SampleID_BioSpyderCountFile:

countdata <- setcolorder(countdata, as.character(coldata$SampleID_BioSpyderCountFile))

Now, we can rename the column names within the countdata dataframe with these more helpful identifiers, since both dataframes are now arranged in the same order:

colnames(countdata) <- coldata$ID # Rename the countdata column names with the treatment IDs. 
colnames(countdata) # Viewing these new column names

##  [1] "M13_PineNeedlesSmolder" "M14_PineNeedlesSmolder" "M15_PineNeedlesSmolder"
##  [4] "M16_PineNeedlesSmolder" "M17_PineNeedlesSmolder" "M18_PineNeedlesSmolder"
##  [7] "M49_PineNeedlesFlame"   "M50_PineNeedlesFlame"   "M51_PineNeedlesFlame"  
## [10] "M52_PineNeedlesFlame"   "M53_PineNeedlesFlame"   "M54_PineNeedlesFlame"  
## [13] "M31_Saline"             "M32_Saline"             "M33_Saline"            
## [16] "M34_Saline"             "M35_Saline"             "M36_Saline"            
## [19] "M67_LPS"                "M68_LPS"                "M69_LPS"               
## [22] "M70_LPS"

These new column identifiers look much better, and can better inform downstream statistical analysis script. Remember that these identifiers indicate that these are mouse samples (“M”), with unique numbers, followed by an underscore and the exposure condition.

When relabeling dataframes, it’s always important to triple check any of these major edits. For example, here, let’s double check that the same samples appear in the same order between the two working dataframes required for dowstream DESeq2 code:

setequal(as.character(coldata$ID), colnames(countdata))

## [1] TRUE

identical(as.character(coldata$ID), colnames(countdata))

## [1] TRUE

Transcriptomics Data QA/QC

After preparing your transcriptomic data and sample information dataframes for statistical analyses, it is very important to carry out QA/QC on your organized datasets, prior to including all samples and all genes in the actual statistical model. It is critical to only include high quality data that inform underlying biology of exposure responses/disease etiology, rather than data that may contribute noise to the overall data distributions. Some common QA/QC steps and associated data pre-filters carried out in transcriptomics analyses are detailed below.

Background Filter

It is very common to perform a background filter step when preparing transcriptomic data for statistical analyses. The goal of this step is to remove genes that are very lowly expressed across the majority of samples, and thus are referred to as universally lowly expressed. Signals from these genes can mute the overall signals that may be identified in -omics analyses. The specific threshold that you may want to apply as the background filter to your dataset will depend on the distribution of your dataset and analysis goal(s).

For this example, we apply a background threshold, to remove genes that are lowly expressed across the majority of samples, specifically defined as genes that have expression levels across at least 20% of the samples that are less than (or equal to) the median expression of all genes across all samples. This will result in including only genes that are expressed above background, that have expression levels in at least 20% of samples that are greater than the overall median expression. Script to apply this filter is detailed below:

# First count the total number of samples, and save it as a value in the global environment
nsamp <- ncol(countdata)

# Then, calculate the median expression level across all genes and all samples, and save it as a value
total_median <- median(as.matrix(countdata))


# We need to temporarily add back in the Gene column to the countdata so we can filter for genes that pass the background filter
countdata <- countdata %>% rownames_to_column("Gene")

# Then we can apply a set of filters and organization steps (using the tidyverse) to result in a list of genes that have an expression greater than the total median in at least 20% of the samples
genes_above_background <- countdata %>% # Start from the 'countdata' dataframe
  # Melt the data so that we have three columns: gene, exposure condition, and expression counts
  pivot_longer(cols=!Gene, names_to = "sampleID", values_to="expression") %>% 
  # Add a column that indicates whether the expression of a gene for the corresponding exposure condition is above (1) or not above (0) the median of all count data
  mutate(above_median=ifelse(expression>total_median,1,0)) %>% 
  group_by(Gene) %>% # Group the dataframe by the gene
  # For each gene, count the number of exposure conditions where the expression was greater than the median of all count data
  summarize(total_above_median=sum(above_median)) %>% 
  # Filter for genes that have expression above the median in at least 20% of the samples
  filter(total_above_median>=.2*nsamp) %>% 
  # Select just the genes that pass the filter
  select(Gene) 

# Then filter the original 'countdata' dataframe for only the genes above background. 
countdata <- left_join(genes_above_background, countdata, by="Gene")

Here, the countdata dataframe went from having 30,146 rows of data (representing genes) to 16,664 rows of data (representing genes with expression levels that passed this background filter)

Sample Filtering

Another common QA/QC check is to evaluate whether there are any samples that did not produce adequate RNA material to be measured using the technology employed. Thus, a sample filter can be applied to remove samples that have inadequate data. Here, we demonstrate this filter by checking to see whether there were any samples that resulted in mRNA expression values of zero across all genes. If any sample demonstrates this issue, it should be removed prior to any statistical analysis. Note, there are other filter cut-offs you can use depending on your specific study.

Below is example script that checks for the presence of samples that meet the above criteria:

# Transpose filtered 'countdata', while keeping data in dataframe format, to allow for script that easily sums the total expression levels per sample
countdata_T <- countdata %>% 
  pivot_longer(cols=!Gene, names_to="sampleID",values_to="expression") %>% 
  pivot_wider(names_from=Gene, values_from=expression)

# Then add in a column to the transposed countdata dataframe that sums expression across all genes for each exposure condition
countdata_T$rowsum <- rowSums(countdata_T[2:ncol(countdata_T)])

# Remove samples that have no expression. All samples have some expression in this example, so all samples are retained.
countdata_T <- countdata_T %>% filter(rowsum!=0)

# Take the count data filtered for correct samples, remove the 'rowsums' column
countdata_T <- countdata_T %>% select(!rowsum) 

# Then, transpose it back to the correct format for analysis
countdata <- countdata_T %>%
  pivot_longer(cols=!sampleID, names_to = "Gene",values_to="expression") %>% 
  pivot_wider(names_from = sampleID, values_from = "expression")

Identifying & Removing Sample Outliers

Prior to final statistical analysis, raw transcriptomic data are commonly evaluated for the presence of potential sample outliers. Outliers can result from experimental error, technical error/measurement error, and/or huge sources of variation in biology. For many analyses, it is beneficial to remove such outliers to enhance computational abilities to identify biologically meaningful signals across data. Here, we present two methods to check for the presence of sample outliers:

1. Principal component analysis (PCA) can be used to identify potential outliers in a dataset through visualization of summary-level values illustrating reduced representations of the entire dataset. Note that a more detailed description of PCA is provided in TAME 2.0 Module 5.4 Unsupervised Machine Learning Part 1: K-Means & PCA. Here, PCA is run on the raw count data and further analyzed using scree plots, assessing principal components (PCs), and visualized using biplots displaying the first two principal components as a scatter plot.

2. Hierarchical clustering is another approach that can be used to identify potential outliers. Hierarchical clustering aims to cluster data based on a similarity measure, defined by the function and/or specified by the user. There are several R packages and functions that will run hierarchical clustering, but it is often helpful visually to do this in conjuction with a heatmap. Here, we use the package pheatmap (introduced in TAME 2.0 Module 5.5 Unsupervised Machine Learning Part 2: Additional Clustering Applications) with hierarchical clustering across samples to identify potential outliers.

Let’s start by using PCA to identify potential outliers, while providing a visualization of potential sources of variation across the dataset.

First we need to move the Gene column back to the rownames so our dataframe is numeric and we can run the PCA script

countdata <- countdata %>% column_to_rownames("Gene")

# Let's remind ourselves what these data look like
countdata[1:10,1:5] #viewing first 10 rows and 5 columns

##                     M13_PineNeedlesSmolder M14_PineNeedlesSmolder
## 0610009B22Rik_56046                    123                    118
## 0610010F05Rik_69119                   1378                   1511
## 0610010F05Rik_74637                     83                     55
## 0610010K14Rik_31619                    462                    580
## 0610010K14Rik_31873                    560                    814
## 0610010K14Rik_68949                    207                    182
## 0610012G03Rik_58446                    243                    299
## 0610030E20Rik_65200                    152                    105
## 0610040J01Rik_55628                    125                    106
## 1110004F10Rik_77756                    324                    377
##                     M15_PineNeedlesSmolder M16_PineNeedlesSmolder
## 0610009B22Rik_56046                    110                    167
## 0610010F05Rik_69119                   1534                    936
## 0610010F05Rik_74637                     57                     13
## 0610010K14Rik_31619                    625                    560
## 0610010K14Rik_31873                    680                    799
## 0610010K14Rik_68949                    104                    202
## 0610012G03Rik_58446                    294                    327
## 0610030E20Rik_65200                     92                     87
## 0610040J01Rik_55628                    127                    107
## 1110004F10Rik_77756                    356                    233
##                     M17_PineNeedlesSmolder
## 0610009B22Rik_56046                    147
## 0610010F05Rik_69119                   1301
## 0610010F05Rik_74637                     54
## 0610010K14Rik_31619                    604
## 0610010K14Rik_31873                    707
## 0610010K14Rik_68949                    166
## 0610012G03Rik_58446                    262
## 0610030E20Rik_65200                    150
## 0610040J01Rik_55628                     93
## 1110004F10Rik_77756                    276

Then we can calculate principal components using transposed count data

pca <- prcomp(t(countdata))

And visualize the percent variation captured by each principal component (PC) with a scree plot

# We can generate a scree plot that shows the eigenvalues of each component, indicating how much of the total variation is captured by each component
fviz_eig(pca, addlabels = TRUE)

## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.

This scree plot indicates that nearly all variation is explained in PC1 and PC2, so we are comfortable with viewing these first two PCs when evaluating whether or not potential outliers exist in this dataset.

Visualization of Transcriptomic Data using PCA

Further visualization of how these transcriptomic data appear through PCA can be produced through a scatter plot showing the data reduced values per sample:

# Calculate the percent variation captured by each PC
pca_percent <- round(100*pca$sdev^2/sum(pca$sdev^2),1)

# Make dataframe for PCA plot generation using first two components and the sample name
pca_df <- data.frame(PC1 = pca$x[,1], PC2 = pca$x[,2], Sample=colnames(countdata))

# Organize dataframe so we can color our points by the exposure condition
pca_df <- pca_df %>% separate(Sample, into = c("mouse_num", "expo_cond"), sep="_")

# Plot PC1 and PC2 for each sample and color the point by the exposure condition
ggplot(pca_df, aes(PC1,PC2, color = expo_cond))+
  geom_hline(yintercept = 0, size=0.3)+
  geom_vline(xintercept = 0, size=0.3)+
  geom_point(size=3) +
  geom_text(aes(label=mouse_num), vjust =-1, size=4)+
  labs(x=paste0("PC1 (",pca_percent[1],"%)"), y=paste0("PC2 (",pca_percent[2],"%)"))+
  ggtitle("PCA for 4h Lung Pine Needles & Control Exposure Conditions")

With this plot, we can see that samples do not demonstrate obvious groupings, where certain samples group far apart from others. Therefore, our PCA analysis indicates that there are unlikely any sample outliers in this dataset.

Now lets implement hierarchical clustering to identify potential outliers

First we need to create a dataframe of our transposed countdata such that samples are rows and genes are columns to input into the clustering algorithm.

countdata_for_clustering <- t(countdata)
countdata_for_clustering[1:5,1:10] # Viewing what this transposed dataframe looks like

##                        0610009B22Rik_56046 0610010F05Rik_69119
## M13_PineNeedlesSmolder                 123                1378
## M14_PineNeedlesSmolder                 118                1511
## M15_PineNeedlesSmolder                 110                1534
## M16_PineNeedlesSmolder                 167                 936
## M17_PineNeedlesSmolder                 147                1301
##                        0610010F05Rik_74637 0610010K14Rik_31619
## M13_PineNeedlesSmolder                  83                 462
## M14_PineNeedlesSmolder                  55                 580
## M15_PineNeedlesSmolder                  57                 625
## M16_PineNeedlesSmolder                  13                 560
## M17_PineNeedlesSmolder                  54                 604
##                        0610010K14Rik_31873 0610010K14Rik_68949
## M13_PineNeedlesSmolder                 560                 207
## M14_PineNeedlesSmolder                 814                 182
## M15_PineNeedlesSmolder                 680                 104
## M16_PineNeedlesSmolder                 799                 202
## M17_PineNeedlesSmolder                 707                 166
##                        0610012G03Rik_58446 0610030E20Rik_65200
## M13_PineNeedlesSmolder                 243                 152
## M14_PineNeedlesSmolder                 299                 105
## M15_PineNeedlesSmolder                 294                  92
## M16_PineNeedlesSmolder                 327                  87
## M17_PineNeedlesSmolder                 262                 150
##                        0610040J01Rik_55628 1110004F10Rik_77756
## M13_PineNeedlesSmolder                 125                 324
## M14_PineNeedlesSmolder                 106                 377
## M15_PineNeedlesSmolder                 127                 356
## M16_PineNeedlesSmolder                 107                 233
## M17_PineNeedlesSmolder                  93                 276

Next we can run hierarchical clustering in conjunction with the generation of a heatmap. Note that we scale these data for improved visualization.

pheatmap(scale(countdata_for_clustering), main="Hierarchical Clustering",
         cluster_rows=TRUE, cluster_cols = FALSE,
         fontsize_col = 7, treeheight_row = 60, show_colnames = FALSE)

Like the PCA findings, heirarchical clustering demonstrated an overall lack of potential sample outliers because there were no obvious sample(s) that grouped separately from the rest along the clustering dendograms.
Therefore, neither approach points to outliers that should be removed in this analysis.

Answer to Environmental Health Question 2

With this, we can now answer Environmental Health Question #2: When preparing transcriptomics data for statistical analyses, what are three common data filtering steps that are completed during the data QA/QC process?

Answer: (1) Background filter to remove genes that are universally lowly expressed; (2) Sample filter to remove samples that may be not have any detectable mRNA; (3) Sample outlier filter to remove samples with underlying data distributions outside of the overall, collective dataset.*

Answer to Environmental Health Question 3

With this, we can now also answer Environmental Health Question #3: When identifying potential sample outliers in a typical transcriptomics dataset, what two types of approaches are commonly employed to identify samples with outlying data distributions?

Answer: Principal component analysis (PCA) and hierarchical clustering.

Controlling for Sources of Sample Heterogeneity

Because these transcriptomic data were generated from mouse lung tissues, there is potential for these samples to show heterogeneity based on underlying shifts in cell populations (e.g., neutrophil influx) or other aspects of sample heterogeneity (e.g., batch effects from plating, among other sources of heterogeneity that we may want to control for). For these kinds of complex samples, there are data processing methods that can be leveraged to minimize the influence of these sources of heterogeneity. Example methods include Remove Unwanted Variable (RUV), which is discussed here, as well as others (e.g., Surrogate Variable Analysis (SVA)).

Here, we leverage the package called RUVseq to employ RUV on this sequencing dataset. Script was developed based off Bioconductor website, vignette, and original publication.

Steps in carrying out RUV using RUVseq on this example dataset:

# First we store the treatment IDs and exposure conditions as a separate vector
ID <- coldata$ID

# And differentiate our treatments and control conditions, first by grabbing the groups associated with each sample
groups <- as.factor(coldata$Group)

# Let's view all the groups
groups

##  [1] PineNeedlesSmolder_4h_Lung PineNeedlesSmolder_4h_Lung
##  [3] PineNeedlesSmolder_4h_Lung PineNeedlesSmolder_4h_Lung
##  [5] PineNeedlesSmolder_4h_Lung PineNeedlesSmolder_4h_Lung
##  [7] PineNeedlesFlame_4h_Lung   PineNeedlesFlame_4h_Lung  
##  [9] PineNeedlesFlame_4h_Lung   PineNeedlesFlame_4h_Lung  
## [11] PineNeedlesFlame_4h_Lung   PineNeedlesFlame_4h_Lung  
## [13] Saline_4h_Lung             Saline_4h_Lung            
## [15] Saline_4h_Lung             Saline_4h_Lung            
## [17] Saline_4h_Lung             Saline_4h_Lung            
## [19] LPS_4h_Lung                LPS_4h_Lung               
## [21] LPS_4h_Lung                LPS_4h_Lung               
## 4 Levels: LPS_4h_Lung PineNeedlesFlame_4h_Lung ... Saline_4h_Lung

# then setting a control label
ctrl <- "Saline_4h_Lung"

# and extracting a vector of just our treatment groups
trt_groups <- setdiff(groups,ctrl)

# let's view this vector
trt_groups

## [1] "PineNeedlesSmolder_4h_Lung" "PineNeedlesFlame_4h_Lung"  
## [3] "LPS_4h_Lung"

RUVseq contains its own set of plotting and normalization functions, though requires input of what’s called an object of S4 class SeqExpressionSet. Let’s go ahead and make this object, using the RUVseq function newSeqExpressionSet():

exprSet <- newSeqExpressionSet(as.matrix(countdata),phenoData = data.frame(groups,row.names=colnames(countdata)))

And then use this object to generate some exploratory plots using built-in tools within RUVseq. First starting with some bar charts summarizing overall data distributions per sample:

colors <- brewer.pal(4, "Set2")
plotRLE(exprSet, outline=FALSE, ylim=c(-4, 4), col=colors[groups])

We can see from this plot that some of the samples show distributions that may vary from the overall - for instance, one of the flaming pine needles-exposed samples (in orange) is far lower than the rest.

Then viewing a PCA plot of these samples:

colors <- brewer.pal(4, "Set2")
plotPCA(exprSet, col=colors[groups], cex=1.2)

This PCA plot shows pretty good data distributions, with samples mainly showing groupings based upon exposure condition (e.g., LPS), which is to be expected. With this, we can conclude that there may be some sources of unwanted variation, but not a huge amount. Let’s see what the data look like after running RUV.

Now to actually run the RUVseq algorithm, to control for potential sources of sample heterogeneity, we need to first construct a matrix specifying the replicates (samples of the same exposure condition):

# Construct a matrix specifying the replicates (samples of the same exposure condition) for running RUV
differences <- makeGroups(groups)

# Viewing this new matrix
head(differences)

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   19   20   21   22   -1   -1
## [2,]    7    8    9   10   11   12
## [3,]    1    2    3    4    5    6
## [4,]   13   14   15   16   17   18

This matrix groups the samples by exposure condition. Here, each of the four rows represents one of the four exposure conditions, and each of the six columns represents a possible sample. Since the LPS exposure condition only had four samples, instead of six like the rest of the exposure conditions, a value of -1 is automatically used as a place holder to fill out the matrix. The samples in the matrix are identified by the index of the sample in the previously defined ‘groups’ factor that was used to generate the matrix. For example, the PineNeedlesSmolder_4h_Lung samples are the the first six samples contained in the ‘groups’ factor, so in the matrix, samples of this exposure condition are identified as ‘1’,‘2’,‘3’,‘4’,‘5’, and ‘6’.

Let’s now implement the RUVseq algorithm and, for this example, capture one factor (k=1) of unwanted variation. Note that the k parameter can be modified to capture additional factors, if necessary.

# Now capture 1 factor (k=1) of unwanted variation
ruv_set <- RUVs(exprSet, rownames(countdata), k=1, differences)

This results in a list of objects within ruv_set, which include the following important pieces of information:

Estimated factors of unwanted variation are provided in the phenoData object, as viewed using the following:

# viewing the estimated factors of unwanted variation in the column W_1
pData(ruv_set)

##                                            groups       W_1
## M13_PineNeedlesSmolder PineNeedlesSmolder_4h_Lung 1.3777900
## M14_PineNeedlesSmolder PineNeedlesSmolder_4h_Lung 1.2392662
## M15_PineNeedlesSmolder PineNeedlesSmolder_4h_Lung 1.0687702
## M16_PineNeedlesSmolder PineNeedlesSmolder_4h_Lung 1.2448747
## M17_PineNeedlesSmolder PineNeedlesSmolder_4h_Lung 1.3156003
## M18_PineNeedlesSmolder PineNeedlesSmolder_4h_Lung 1.4136600
## M49_PineNeedlesFlame     PineNeedlesFlame_4h_Lung 1.3343115
## M50_PineNeedlesFlame     PineNeedlesFlame_4h_Lung 0.4446084
## M51_PineNeedlesFlame     PineNeedlesFlame_4h_Lung 1.0172990
## M52_PineNeedlesFlame     PineNeedlesFlame_4h_Lung 1.0332373
## M53_PineNeedlesFlame     PineNeedlesFlame_4h_Lung 0.7746752
## M54_PineNeedlesFlame     PineNeedlesFlame_4h_Lung 1.0011086
## M31_Saline                         Saline_4h_Lung 0.6582117
## M32_Saline                         Saline_4h_Lung 1.2518829
## M33_Saline                         Saline_4h_Lung 1.1892456
## M34_Saline                         Saline_4h_Lung 1.2607852
## M35_Saline                         Saline_4h_Lung 1.3141520
## M36_Saline                         Saline_4h_Lung 1.4335200
## M67_LPS                               LPS_4h_Lung 1.0341947
## M68_LPS                               LPS_4h_Lung 1.2823240
## M69_LPS                               LPS_4h_Lung 1.4460347
## M70_LPS                               LPS_4h_Lung 1.4376930

Normalized counts obtained by regressing the original counts on the unwanted factors (normalizedCounts object within ruv_set). Note that the normalized counts should only used for exploratory purposes and not subsequent differential expression analyses. For additional information on this topic, please refer official RUVSeq documentation. The normalized counts can be viewed using the following:

# Viewing the head of the normalized count data, accounting for unwanted variation
head(normCounts(ruv_set))

##                     M13_PineNeedlesSmolder M14_PineNeedlesSmolder
## 0610009B22Rik_56046                      4                      5
## 0610010F05Rik_69119                   2552                   2630
## 0610010F05Rik_74637                     17                     13
## 0610010K14Rik_31619                   1602                   1774
## 0610010K14Rik_31873                    711                   1009
## 0610010K14Rik_68949                     23                     25
##                     M15_PineNeedlesSmolder M16_PineNeedlesSmolder
## 0610009B22Rik_56046                      8                      8
## 0610010F05Rik_69119                   2474                   1634
## 0610010F05Rik_74637                     17                      3
## 0610010K14Rik_31619                   1639                   1722
## 0610010K14Rik_31873                    819                    992
## 0610010K14Rik_68949                     19                     28
##                     M17_PineNeedlesSmolder M18_PineNeedlesSmolder
## 0610009B22Rik_56046                      6                      4
## 0610010F05Rik_69119                   2343                   2352
## 0610010F05Rik_74637                     12                      7
## 0610010K14Rik_31619                   1979                   1726
## 0610010K14Rik_31873                    889                    888
## 0610010K14Rik_68949                     20                     30
##                     M49_PineNeedlesFlame M50_PineNeedlesFlame
## 0610009B22Rik_56046                    4                    5
## 0610010F05Rik_69119                 1767                 1563
## 0610010F05Rik_74637                   12                    6
## 0610010K14Rik_31619                 1480                 1134
## 0610010K14Rik_31873                  586                  599
## 0610010K14Rik_68949                   23                   27
##                     M51_PineNeedlesFlame M52_PineNeedlesFlame
## 0610009B22Rik_56046                    4                    4
## 0610010F05Rik_69119                 1918                 1711
## 0610010F05Rik_74637                    4                    9
## 0610010K14Rik_31619                 1390                 1763
## 0610010K14Rik_31873                  683                  797
## 0610010K14Rik_68949                   26                   27
##                     M53_PineNeedlesFlame M54_PineNeedlesFlame M31_Saline
## 0610009B22Rik_56046                    5                    5          3
## 0610010F05Rik_69119                 1751                 1582       1891
## 0610010F05Rik_74637                    6                    3         10
## 0610010K14Rik_31619                 1370                 1419       1340
## 0610010K14Rik_31873                  593                  647        561
## 0610010K14Rik_68949                   20                   24         19
##                     M32_Saline M33_Saline M34_Saline M35_Saline M36_Saline
## 0610009B22Rik_56046          2          4          3          4          4
## 0610010F05Rik_69119       2507       2518       2524       1831       1878
## 0610010F05Rik_74637          7          7          5          7          8
## 0610010K14Rik_31619       1563       1632       1438       1280        967
## 0610010K14Rik_31873        685        815        703        646        526
## 0610010K14Rik_68949         16         21         19         22         20
##                     M67_LPS M68_LPS M69_LPS M70_LPS
## 0610009B22Rik_56046       3       2       4       5
## 0610010F05Rik_69119    1906    1597    1461    1201
## 0610010F05Rik_74637       6       6       7       5
## 0610010K14Rik_31619     964    1012     776     638
## 0610010K14Rik_31873     622     702     522     496
## 0610010K14Rik_68949      21      21      18      15

Let’s again generate an exploratory plot using this updated dataset, focusing on the bar chart view since that was the most informative pre-RUV. Here are the updated bar charts summarizing overall data distributions per sample:

colors <- brewer.pal(4, "Set2")
plotRLE(ruv_set, outline=FALSE, ylim=c(-4, 4), col=colors[groups])

This plot shows overall tighter data that are more similarly distributed across samples. Therefore, it is looking like this RUV addition improved the overall distribution of this dataset. It is important not to over-correct/over-smooth your datasets, so implement these types of pre-processing steps with caution. One strategy that we commonly employ to gage whether data smoothing is needed/applied correctly is to run the statistical models with and without correction of potential sources of heterogeneity, and critically evaluate similarities vs differences produced in the results.

Answer to Environmental Health Question 4

With this, we can now answer Environmental Health Question #4: What is an approach that analysts can use when evaluating transcriptomic data from tissues of mixed cellular composition to aid in controlling for sources of sample heterogeneity?

Answer: Remove unwanted variation (RUV), among other approaches, including surrogate variable analysis (SVA).

Identifying Genes that are Significantly Differentially Expressed by Environmental Exposure Conditions (e.g., Biomass Smoke Exposure)

At this point, we have completed several data pre-processing, QA/QC, and additional steps to prepare our example transcriptomics data for statistical analysis. And finally, we are ready to run the overall statistical model to identify genes that are altered in expression in association with different biomass burn conditions.

Here we leverage the DESeq2 package to carry out these statistical comparisons. This package is now the most commonly implemented analysis pipeline used for transcriptomic data, including sequencing data as well as transcriptomic data produced via other technologies (e.g., Nanostring, Fluidigm, and other gene expression technologies). This package is extremely well-documented and we encourage trainees to leverage these resources in parallel with the current training module when carrying out their own transcriptomics analyses in R:

Bioconductor website
Vignette
Manual
Primary citation: Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8. PMID: 25516281

In brief, the basic calculations employed within the DESeq2 underlying algorithms include the following:

1. Estimate size factors. In the first step, size factors are estimated to help account for potential differences in the sequencing depth across samples. It is similar to a normalization parameter in the model.

2. Normalize count data. DESeq2 employs different normalization algorithms depending on the parameters selected / stage of analysis. The most commonly employed method is called the median of ratios, which takes into account sequencing depth and RNA composition, as described here. Specifically, these normalized values are calculated as counts divided by sample-specific size factors determined by median ratio of gene counts relative to geometric mean per gene. DESeq2 then transforms these data using variance stabilization within the final statistical model. Because of these two steps, we prefer to export both the median of ratios normalized data as well as the variance stabilization transformed data, to save in our records and use when generating plots of expression levels for specific genes we are interested in. These steps are detailed below.

3. Estimate dispersion. The dispersion estimate takes into account the relationship between the variance of an observed count and its mean value. It is similar to a variance parameter. In DESeq2, dispersion is estimated using a maximum likelihood and empirical bayes approach.

4. Fit negative binomial generalized linear model (GLM). Finally, a negative binomial model is fit for each gene using the design formula that will be described within the proceeding code. The Wald test is performed to test if log fold changes in expression (typically calculated as log(average exposed / average unexposed)) significantly differ from zero. Statistical p-values are reported from this test and also adjusted for multiple testing using the Benjamini and Hochberg procedure.

Note that these calculations, among others, are embedded within the DESeq2 functions, so we do not need to code for them ourselves. Instead, we just need to make sure that we set-up the DESeq2 functions correctly, such that these calculations are carried out appropriately in our final transcriptomics analyses.

Setting up the DESeq2 experiment

Here we provide example script that is used to identify which genes are significantly differentially expressed in association with the example biomass smoke exposures, smoldering pine needles and flaming pine needles, as well as a positive inflammation control, LPS.

First, we need to set-up the DESeq2 experiment:

# Set up our experiment using our RUV adjusted count and phenotype data.
# Our design indicates that our count data is dependent on the exposure condition (groups variable) and our factor of unwanted variation, and we have specified that there not be an intercept term through the use of '~0'
dds <- DESeqDataSetFromMatrix(countData = counts(ruv_set), # Grabbing count data from the 'ruv_set' object
                                 colData = pData(ruv_set), # Grabbing the phenotype data and corresponding factor of unwanted variation from the 'ruv_set' object
                                 design = ~0+groups+W_1) # Setting up the statistical formula (see below)

For the formula design, we use a ‘~0’ at the front to not include an intercept term, and then also account for the exposure condition (groups) and the previously calculated factors of unwanted variation (W_1) of the samples. Formula design is an important step and should be carefully considered for each individual analysis. Other resources, including official DESeq2 documentation, are available for consultation regarding formula design, as the specifics of formula design are beyond the scope of this training module.

It is worth noting that, by default, DESeq2 will use the last variable in the design formula (W_1) in this case, as the default variable to be output from the “results” function. Additionally, if the variable is categorical, it will display results comparing the reference level to the last level of that variable. To get results for other variables or to see other comparisons within a categorical variable, we can use the contrast parameter, which will be demonstrated below.

Estimating size factors

# Estimate size factors from the dds object that was just created as the experiment above
dds <- estimateSizeFactors(dds)
sizeFactors(dds)  # viewing the size factors

## M13_PineNeedlesSmolder M14_PineNeedlesSmolder M15_PineNeedlesSmolder 
##              1.2668895              1.3040762              1.1782071 
## M16_PineNeedlesSmolder M17_PineNeedlesSmolder M18_PineNeedlesSmolder 
##              1.1723749              1.2904950              1.3405500 
##   M49_PineNeedlesFlame   M50_PineNeedlesFlame   M51_PineNeedlesFlame 
##              1.1313237              0.6313099              0.9328970 
##   M52_PineNeedlesFlame   M53_PineNeedlesFlame   M54_PineNeedlesFlame 
##              1.0050414              0.7649632              0.8747854 
##             M31_Saline             M32_Saline             M33_Saline 
##              0.8050127              1.0480725              1.0389623 
##             M34_Saline             M35_Saline             M36_Saline 
##              1.0282261              0.9519266              1.0409103 
##                M67_LPS                M68_LPS                M69_LPS 
##              0.9308588              1.1002076              0.9793051 
##                M70_LPS 
##              0.8575718

Calculating and exporting normalized counts

Here, we extract normalized counts and variance stabilized counts.

# Extract normalized count data
normcounts <- as.data.frame(counts(dds, normalized=TRUE))

# Transforming normalized counts through variance stabilization
vsd <- varianceStabilizingTransformation(dds, blind=FALSE)
vsd_matrix <- as.matrix(assay(vsd))

We could also export them using code such as:

# Export data
write.csv(normcounts, "Chapter_6/Module6_2_Input/Module6_2_Output_NormalizedCounts.csv")
write.csv(vsd_matrix, "Chapter_6/Module6_2_Input/Module6_2_Output_VSDCounts.csv", row.names=TRUE)

Running the final DESeq2 experiment

Here, we are finally ready to run the actual statistical comparisons (exposed vs control samples) to calculate fold changes and p-values that describe the degree to which each gene may or may not be altered at the expression level in association with treatment.

For this example, we would like to run three different comparisons:
(1) Smoldering Pine Needles vs. Control
(2) Flaming Pine Needles vs. Control
(3) LPS vs. Control
which we can easily code for using a loop function, as detailed below.

Note that we have commented out the line of code for writing out the CSV because we do not need it for the rest of the module, but this could be used if you need to write out and view results in an external application such as Excel for supplementary materials.

# Run experiment
dds_run <- DESeq(dds, betaPrior=FALSE) 

# Loop through and extract and export results for all contrasts (treatments vs. control)
for (trt in trt_groups){ # Iterate for each of the treatments listed in 'trt_groups'
  cat(trt) # Print which treatment group we are on in the loop 
  res <- results(dds_run, pAdjustMethod = "BH", contrast = c("groups",trt,ctrl)) # Extract the results of the DESeq2 analysis specifically for the comparison of the treatment group for the current iteration of the loop with the control group
  summary(res) # Print out a high-level summary of the results
  ordered <- as.data.frame(res[order(res$padj),]) # Make a dataframe of the results and order them by adjusted p-value from lowest to highest
  top10 <- head(ordered, n=10) # Make dataframe of the first ten rows of the ordered results
  cat("\nThe 10 most significantly differentially expressed genes by adjusted p-value:\n\n")
  print(top10) # View the first ten rows of the ordered results
  pfilt.05 <- nrow(ordered %>% filter(padj<0.05)) # Get the number of genes that are significantly differentially expressed where padj < 0.05
  cat("\nThe number of genes showing significant differential expression where padj < 0.05 is ", pfilt.05)
  pfilt.10 <- nrow(ordered %>% filter(padj<0.1)) # Get the number of genes that are significantly differentially expressed where padj < 0.10
  cat("\nThe number of genes showing significant differential expression where padj < 0.10 is ", pfilt.10,"\n\n")
  # write.csv(ordered, paste0("Module6_2_Output_StatisticalResults_",trt ,".csv")) ## Export the full dataframe of ordered results as a csv
}

## PineNeedlesSmolder_4h_Lung
## out of 16664 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 457, 2.7%
## LFC < 0 (down)     : 697, 4.2%
## outliers [1]       : 0, 0%
## low counts [2]     : 1293, 7.8%
## (mean count < 21)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## 
## 
## The 10 most significantly differentially expressed genes by adjusted p-value:
## 
##                baseMean log2FoldChange      lfcSE      stat       pvalue
## Hbs1l_72436   367.64772     -1.1465230 0.13614949 -8.421060 3.730933e-17
## Cetn3_30290  2350.31984     -0.7197604 0.09379276 -7.673944 1.667868e-14
## Grasp_56900   277.43936      0.9759274 0.14177298  6.883734 5.830390e-12
## Pcsk7_57804   180.24040      0.8261941 0.13529825  6.106465 1.018617e-09
## Plscr4_55764  178.95216      1.3193654 0.21751536  6.065620 1.314459e-09
## Socs3_79479   183.28293      1.2535533 0.20740514  6.043983 1.503550e-09
## Kat2b_64780   444.97603     -0.7197105 0.12034633 -5.980327 2.226896e-09
## Ptp4a2_62762 1840.75247     -0.7802857 0.13076704 -5.966990 2.416698e-09
## Rmi1_56327    377.57936     -0.8548903 0.14288443 -5.983089 2.189449e-09
## Abcd2_62271    57.46455     -1.2504459 0.21464501 -5.825646 5.689193e-09
##                      padj
## Hbs1l_72436  5.734817e-13
## Cetn3_30290  1.281840e-10
## Grasp_56900  2.987297e-08
## Pcsk7_57804  3.851843e-06
## Plscr4_55764 3.851843e-06
## Socs3_79479  3.851843e-06
## Kat2b_64780  4.127451e-06
## Ptp4a2_62762 4.127451e-06
## Rmi1_56327   4.127451e-06
## Abcd2_62271  8.744859e-06
## 
## The number of genes showing significant differential expression where padj < 0.05 is  679
## The number of genes showing significant differential expression where padj < 0.10 is  1154 
## 
## PineNeedlesFlame_4h_Lung
## out of 16664 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 411, 2.5%
## LFC < 0 (down)     : 575, 3.5%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 10)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## 
## 
## The 10 most significantly differentially expressed genes by adjusted p-value:
## 
##                 baseMean log2FoldChange     lfcSE      stat       pvalue
## Tmem109_70229  351.49267      1.2681226 0.1637785  7.742913 9.716419e-15
## Rhof_64396      58.57258      1.4767798 0.2210640  6.680328 2.384081e-11
## Plscr4_55764   178.95216      1.5349460 0.2393976  6.411701 1.439049e-10
## Pip4k2a_67992  796.64597      0.8262520 0.1299839  6.356571 2.063073e-10
## Zranb2_68671   739.91379      0.8131628 0.1330277  6.112735 9.793822e-10
## Cul1_31957     896.52286     -0.7061150 0.1203618 -5.866602 4.448172e-09
## Ppfibp1_75630  655.07197      1.4148222 0.2481843  5.700692 1.193223e-08
## Mbnl1_30670    851.18586     -1.0159639 0.1811247 -5.609197 2.032670e-08
## Atp5k_68198   1096.49479      1.1003470 0.1983469  5.547588 2.896380e-08
## Gclm_29258     138.54250      1.0993904 0.1985396  5.537387 3.070173e-08
##                       padj
## Tmem109_70229 1.619144e-10
## Rhof_64396    1.986416e-07
## Plscr4_55764  7.993439e-07
## Pip4k2a_67992 8.594761e-07
## Zranb2_68671  3.264085e-06
## Cul1_31957    1.235406e-05
## Ppfibp1_75630 2.840553e-05
## Mbnl1_30670   4.234052e-05
## Atp5k_68198   5.116136e-05
## Gclm_29258    5.116136e-05
## 
## The number of genes showing significant differential expression where padj < 0.05 is  515
## The number of genes showing significant differential expression where padj < 0.10 is  986 
## 
## LPS_4h_Lung
## out of 16664 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 2960, 18%
## LFC < 0 (down)     : 2947, 18%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 10)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## 
## 
## The 10 most significantly differentially expressed genes by adjusted p-value:
## 
##                baseMean log2FoldChange     lfcSE     stat        pvalue
## Usp18_32270   1319.2259       3.829526 0.1459140 26.24508 8.133515e-152
## Usp18_67659    414.5959       3.739864 0.1679851 22.26307 8.428171e-110
## Ifit1_29859    862.3286       5.000365 0.2259553 22.12989 1.629877e-108
## Oasl2_67419   1437.6182       3.710164 0.1904583 19.48019  1.616902e-84
## Gbp5_77887     600.7219       5.272287 0.2769982 19.03365  8.977554e-81
## Oasl2_67043    292.4482       3.898484 0.2096435 18.59578  3.476310e-77
## Rnf213_57840  1052.0454       3.168052 0.1731212 18.29962  8.332621e-75
## Cmpk2_70664    429.4232       4.592488 0.2518916 18.23200  2.875888e-74
## Cmpk2_29189    695.1997       4.264136 0.2386279 17.86939  2.042134e-71
## Eif2ak2_30238 1585.6071       2.671974 0.1495526 17.86645  2.152596e-71
##                        padj
## Usp18_32270   1.355369e-147
## Usp18_67659   7.022352e-106
## Ifit1_29859   9.053424e-105
## Oasl2_67419    6.736013e-81
## Gbp5_77887     2.992039e-77
## Oasl2_67043    9.654872e-74
## Rnf213_57840   1.983640e-71
## Cmpk2_70664    5.990475e-71
## Cmpk2_29189    3.587085e-68
## Eif2ak2_30238  3.587085e-68
## 
## The number of genes showing significant differential expression where padj < 0.05 is  4813
## The number of genes showing significant differential expression where padj < 0.10 is  5907

Answer to Environmental Health Question 5

With this, we can now answer Environmental Health Question #5: How many genes showed significant differential expression associated with flaming pine needles exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?

Answer: 515 genes

Answer to Environmental Health Question 6

With this, we can also now answer Environmental Health Question #6: How many genes showed significant differential expression associated with smoldering pine needles exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?

Answer: 679 genes

Answer to Environmental Health Question 7

And, we can answer Environmental Health Question #7: How many genes showed significant differential expression associated with lipopolysaccharide (LPS) exposure in the mouse lung, based on a statistical filter of a multiple test corrected p-value (padj) < 0.05?

Answer: 4,813 genes

Together, we find that exposure to both flaming and smoldering of pine needles caused substantial disruptions in gene expression profiles. LPS serves as a positive control for inflammation and produced the greatest transcriptomic response.

Visualizing Statistical Results using MA Plots

MA plots represent a common method of visualization that illustrates differences between measurements taken in two samples, by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values.

Here, we leverage MA plots to show how log fold changes relate to expression levels. In these plots, the log fold change is plotted on the y-axis and expression values are plotted along the x-axis, and dots are colored according to statistical significance (using padj<0.05 as the statistical filter). Here we will generate an MA plot for Flaming Pine Needles.

res <- results(dds_run, pAdjustMethod = "BH", contrast = c("groups","PineNeedlesFlame_4h_Lung",ctrl)) # Re-extract the DESeq2 results for the flaming pine needles 
MA <- data.frame(res) # Make a preliminary dataframe of the flaming pine needle results
MA_ns <- MA[ which(MA$padj>=0.05),]  # Non-significant genes to plot
MA_up <- MA[ which(MA$padj<0.05 & MA$log2FoldChange > 0),]   # Significant up-regulated genes to plot
MA_down <- MA[ which(MA$padj<0.05 & MA$log2FoldChange < 0),]  #Significant down-regulated genes to plot

ggplot(MA_ns, aes(x = baseMean, y = log2FoldChange)) + # Plot data with counts on x-axis and log2 fold change on y-axis
  geom_point(color="gray75", size = .5) + # Set point size and color
  
  geom_point(data = MA_up, color="firebrick", size=1, show.legend = TRUE) + # Plot the up-regulated significant genes
  geom_point(data = MA_down, color="dodgerblue2", size=1, show.legend = TRUE) + # Plot down-regulated significant genes
  
  theme_bw() + # Change theme of plot from gray to black and white
  
  # We want to log10 transform x-axis for better visualizations
  scale_x_continuous(trans = "log10", breaks=c(1,10,100, 1000, 10000, 100000, 1000000), labels=c("1","10","100", "1000", "10000", "100000", "1000000")) + 
  # We will bound y axis as well to better fit data while not leaving out too many points
  scale_y_continuous(limits=c(-2, 2)) +
  
  xlab("Expression (Normalized Count)") + ylab(expression(Log[2]*" Fold Change")) + # Add labels for axes
  geom_hline(yintercept=0) # Add horizontal line at 0

An appropriate title for this figure could be:

“Figure X. MA plot of fold change in expression as function of gene expression resulting from 4 hours of exposure to flaming pine needles in mice lung tissues. Significantly upregulated genes (log₂FC > 0 and p adjust < 0.05) are shown in red and significantly downregulated genes (log₂FC < 0 and p adjust < 0.05) are shown in blue. Genes significantly associated are displayed in gray.”

Visualizing Statistical Results using Volcano Plots

Similar to MA plots, volcano plots provide visualizations of fold changes in expression from transcriptomic data. However, instead of plotting these values against expression, log fold change is plotted against (adjusted) p-values in volcano plots. Here, we use functions within the EnhancedVolcano package to generate a volcano plot for Flaming Pine Needles.

Running the EnhancedVolcano() function to generate an example volcano plot:

Vol <- data.frame(res) # Dataset to use for plotting

EnhancedVolcano(Vol, 
                lab = rownames(res), # Label significant genes from dataset (can be a column name)
                x = 'log2FoldChange', # Column name in dataset with l2fc information
                y = 'padj', # Column name in dataset with adjusted p-value information
                ylab = "-Log(FDR-adjusted p value)", # Y-axis label
                pCutoff= 0.05, # Set p-value cutoff
                ylim=c(0,5), # Limit y-axis for better plot visuals
                xlim=c(-2,2), # Limit x-axis (similar to in MA plot y-axis)
                title= NULL, # Removing title
                subtitle = NULL, # Removing subtitle
                legendPosition = 'bottom') # Put legend on bottom

An appropriate title for this figure could be:

“Figure X. Volcano plot of lung genes resulting from 4 hours of exposure to flaming pine needles. Genes are colored according to level of significant differential loading in exposed vs unexposed (vehicle control) samples, using the following statistical cut-offs: P adjust (multiple test corrected p-value) <0.05 and fold change(FC) ±1.3 (log2FC ≥±0.3785).”

Interpretting Findings at the Systems Level through Pathway Enrichment Analysis

Pathway enrichment analysis is a very helpful tool that can be applied to interpret transcriptomic changes of interest in terms of systems biology. In these types of analyses, gene lists of interest are used to identify biological pathways that include genes present in your dataset more often than expected by chance alone. There are many tools that can be used to carry out pathway enrichment analyses. Here, we are using the R package, PIANO, to carry out the statistical enrichment analysis based on the lists of genes we previously identified with differential expression associated with flaming pine needles exposure.

To detail, the following input data are required to run PIANO:
(1) Your background gene sets, which represent all genes queried from your experiment (aka your ‘gene universe’)

The list of genes you are interested in evaluating pathway enrichment of; here, this represents the genes identified with significant differential expression associated with flaming pine needles
A underlying pathway dataset; here, we’re using the KEGG PATHWAY Database (KEGG), summarized through the Molecular Signature Database (MSigDB) into pre-formatted input files (.gmt) ready for PIANO.

Let’s organize these three required data inputs.

Background gene set:

# First grab the rownames of the 'res' object, which was redefined as the DESeq2 results for flaming pine needles prior to MA plot generation, and remove the BioSpyder numeric identifier using a sub function, while maintaining the gene symbol and place these IDs into a new list within the 'res' object (saved as 'id')
res$id <- gsub("_.*", "", rownames(res));

# Because these IDs now contain duplicate gene symbols, we need to remove duplicates
# One way to do this is to preferentially retain rows of data with the largest fold change (it doesn't really matter here, because we're just identifying unique genes within the background set)
res.ordered <- res[order(res$id, -abs(res$log2FoldChange) ), ] # sort by id and reverse of abs(log2foldchange)
res.ordered <- res.ordered[ !duplicated(res.ordered$id), ]     # removing gene duplicates

# Setting this as the background list
Background <- toupper(as.character(res.ordered$id))
Background[1:200] # viewing the first 200 genes in this background list

##   [1] "0610009B22RIK"  "0610010F05RIK"  "0610010K14RIK"  "0610012G03RIK" 
##   [5] "0610030E20RIK"  "0610040J01RIK"  "1110004F10RIK"  "1110008P14RIK" 
##   [9] "1110012L19RIK"  "1110017D15RIK"  "1110032A03RIK"  "1110038F14RIK" 
##  [13] "1110059E24RIK"  "1110059G10RIK"  "1110065P20RIK"  "1190002N15RIK" 
##  [17] "1190007I07RIK"  "1500011B03RIK"  "1500011K16RIK"  "1600002K03RIK" 
##  [21] "1600012H06RIK"  "1600014C10RIK"  "1700001L19RIK"  "1700003E16RIK" 
##  [25] "1700007K13RIK"  "1700012B09RIK"  "1700013F07RIK"  "1700016K19RIK" 
##  [29] "1700017B05RIK"  "1700020D05RIK"  "1700024G13RIK"  "1700025G04RIK" 
##  [33] "1700028P14RIK"  "1700029I15RIK"  "1700029J07RIK"  "1700030K09RIK" 
##  [37] "1700037C18RIK"  "1700037H04RIK"  "1700086D15RIK"  "1700088E04RIK" 
##  [41] "1700102P08RIK"  "1700109H08RIK"  "1700123O20RIK"  "1810009A15RIK" 
##  [45] "1810010H24RIK"  "1810013L24RIK"  "1810022K09RIK"  "1810030O07RIK" 
##  [49] "1810037I17RIK"  "1810043G02RIK"  "1810055G02RIK"  "2010005H15RIK" 
##  [53] "2010111I01RIK"  "2010300C02RIK"  "2010309G21RIK"  "2200002D01RIK" 
##  [57] "2210011C24RIK"  "2210016H18RIK"  "2210016L21RIK"  "2210408I21RIK" 
##  [61] "2300009A05RIK"  "2310007B03RIK"  "2310009B15RIK"  "2310011J03RIK" 
##  [65] "2310022A10RIK"  "2310022B05RIK"  "2310030G06RIK"  "2310033P09RIK" 
##  [69] "2310057M21RIK"  "2310061I04RIK"  "2410002F23RIK"  "2410004B18RIK" 
##  [73] "2410015M20RIK"  "2410131K14RIK"  "2510002D24RIK"  "2510009E07RIK" 
##  [77] "2600001M11RIK"  "2610001J05RIK"  "2610002M06RIK"  "2610008E11RIK" 
##  [81] "2610028H24RIK"  "2610301B20RIK"  "2610507B11RIK"  "2700049A03RIK" 
##  [85] "2700062C07RIK"  "2700081O15RIK"  "2700097O09RIK"  "2810004N23RIK" 
##  [89] "2810021J22RIK"  "2810408A11RIK"  "2900026A02RIK"  "3110001I22RIK" 
##  [93] "3110040N11RIK"  "3110082I17RIK"  "3830406C13RIK"  "4430402I18RIK" 
##  [97] "4732423E21RIK"  "4833414E09RIK"  "4833420G17RIK"  "4833427G06RIK" 
## [101] "4833439L19RIK"  "4921524J17RIK"  "4930402H24RIK"  "4930430F08RIK" 
## [105] "4930451G09RIK"  "4930453N24RIK"  "4930486L24RIK"  "4930503L19RIK" 
## [109] "4930523C07RIK"  "4930550C14RIK"  "4930562C15RIK"  "4931406C07RIK" 
## [113] "4931406P16RIK"  "4931414P19RIK"  "4932438A13RIK"  "4933408B17RIK" 
## [117] "4933415F23RIK"  "4933427D14RIK"  "4933434E20RIK"  "5330417C22RIK" 
## [121] "5430427O19RIK"  "5730480H06RIK"  "5830417I10RIK"  "6030458C11RIK" 
## [125] "6030468B19RIK"  "6330403K07RIK"  "6330417A16RIK"  "6430531B16RIK" 
## [129] "6430548M08RIK"  "6430550D23RIK"  "6720489N17RIK"  "8030462N17RIK" 
## [133] "9030624G23RIK"  "9130008F23RIK"  "9130019O22RIK"  "9130023H24RIK" 
## [137] "9130230L23RIK"  "9230104L09RIK"  "9330182L06RIK"  "9530068E07RIK" 
## [141] "9930012K11RIK"  "9930021J03RIK"  "9930111J21RIK1" "9930111J21RIK2"
## [145] "A130071D04RIK"  "A230050P20RIK"  "A2ML1"          "A430005L14RIK" 
## [149] "A430033K04RIK"  "A430078G23RIK"  "A530032D15RIK"  "A630001G21RIK" 
## [153] "A730034C02"     "A730049H05RIK"  "A830018L16RIK"  "A930002H24RIK" 
## [157] "A930004D18RIK"  "AA415038"       "AA986860"       "AACS"          
## [161] "AAED1"          "AAGAB"          "AAK1"           "AAMDC"         
## [165] "AAMP"           "AAR2"           "AARS"           "AARS2"         
## [169] "AARSD1"         "AASDH"          "AASS"           "AATF"          
## [173] "AATK"           "AB124611"       "ABCA1"          "ABCA17"        
## [177] "ABCA2"          "ABCA3"          "ABCA5"          "ABCA7"         
## [181] "ABCA8A"         "ABCA8B"         "ABCA9"          "ABCB10"        
## [185] "ABCB1A"         "ABCB1B"         "ABCB6"          "ABCB7"         
## [189] "ABCB8"          "ABCC1"          "ABCC3"          "ABCC4"         
## [193] "ABCC5"          "ABCD1"          "ABCD2"          "ABCD3"         
## [197] "ABCD4"          "ABCE1"          "ABCF1"          "ABCF3"

The list of genes identified with significant differential expression associated with flaming pine needles:

# Similar to the above script, but starting with the res$id object
# and filtering for genes with padj < 0.05

res.ordered <- res[order(res$id, -abs(res$log2FoldChange) ), ] #sort by id and reverse of abs(log2FC)
SigGenes <- toupper(as.character(res.ordered[which(res.ordered$padj<.05),"id"])) # pulling the genes with padj < 0.05
SigGenes <- SigGenes[ !duplicated(SigGenes)] # removing gene duplicates

length(SigGenes) # viewing the length of this significant gene list

## [1] 488

Therefore, this gene set includes 488 unique genes significantly associated with the Flaming Pine Needles condition, based on padj<0.05.

The underlying KEGG pathway dataset. Note that this file was simply downloaded from MSigDB, ready for upload as a .gmt file. Here, we use the loadGSC() function enabled through the PIANO package to upload and organize these pathways.

KEGG_Pathways <- loadGSC(file="Chapter_6/Module6_2_Input/Module6_2_InputData3_KEGGv7.gmt", type="gmt")

length(KEGG_Pathways$gsc) # viewing the number of biological pathways contained in the database

## [1] 186

This KEGG pathway database therefore includes 186 biological pathways available to query

With these data inputs ready, we can now run the pathway enrichment analysis. The enrichment statistic that is commonly employed through the PIANO package is based of a hypergeometric test, run through the runGSAhyper() function. This returns a p-value for each gene set from which you can determine enrichment status.

# Running the piano function based on the hypergeometric statistic
Results_GSA <- piano::runGSAhyper(genes=SigGenes, universe=Background,gsc=KEGG_Pathways, gsSizeLim=c(1,Inf), adjMethod = "fdr")

# Pulling the pathway enrichment results into a separate dataframe
PathwayResults <- as.data.frame(Results_GSA$resTab)

# Viewing the top of these pathway enrichment results (which are not ordered at the moment)
head(PathwayResults)

##                                                        p-value Adjusted p-value
## KEGG_N_GLYCAN_BIOSYNTHESIS                          0.77021314        1.0000000
## KEGG_OTHER_GLYCAN_DEGRADATION                       1.00000000        1.0000000
## KEGG_O_GLYCAN_BIOSYNTHESIS                          0.03553158        0.6656139
## KEGG_GLYCOSAMINOGLYCAN_DEGRADATION                  0.42056787        1.0000000
## KEGG_GLYCOSAMINOGLYCAN_BIOSYNTHESIS_KERATAN_SULFATE 1.00000000        1.0000000
## KEGG_GLYCEROLIPID_METABOLISM                        0.16241736        1.0000000
##                                                     Significant (in gene set)
## KEGG_N_GLYCAN_BIOSYNTHESIS                                                  1
## KEGG_OTHER_GLYCAN_DEGRADATION                                               0
## KEGG_O_GLYCAN_BIOSYNTHESIS                                                  3
## KEGG_GLYCOSAMINOGLYCAN_DEGRADATION                                          1
## KEGG_GLYCOSAMINOGLYCAN_BIOSYNTHESIS_KERATAN_SULFATE                         0
## KEGG_GLYCEROLIPID_METABOLISM                                                3
##                                                     Non-significant (in gene set)
## KEGG_N_GLYCAN_BIOSYNTHESIS                                                     34
## KEGG_OTHER_GLYCAN_DEGRADATION                                                  13
## KEGG_O_GLYCAN_BIOSYNTHESIS                                                     15
## KEGG_GLYCOSAMINOGLYCAN_DEGRADATION                                             12
## KEGG_GLYCOSAMINOGLYCAN_BIOSYNTHESIS_KERATAN_SULFATE                            12
## KEGG_GLYCEROLIPID_METABOLISM                                                   31
##                                                     Significant (not in gene set)
## KEGG_N_GLYCAN_BIOSYNTHESIS                                                    487
## KEGG_OTHER_GLYCAN_DEGRADATION                                                 488
## KEGG_O_GLYCAN_BIOSYNTHESIS                                                    485
## KEGG_GLYCOSAMINOGLYCAN_DEGRADATION                                            487
## KEGG_GLYCOSAMINOGLYCAN_BIOSYNTHESIS_KERATAN_SULFATE                           488
## KEGG_GLYCEROLIPID_METABOLISM                                                  485
##                                                     Non-significant (not in gene set)
## KEGG_N_GLYCAN_BIOSYNTHESIS                                                      11355
## KEGG_OTHER_GLYCAN_DEGRADATION                                                   11376
## KEGG_O_GLYCAN_BIOSYNTHESIS                                                      11374
## KEGG_GLYCOSAMINOGLYCAN_DEGRADATION                                              11377
## KEGG_GLYCOSAMINOGLYCAN_BIOSYNTHESIS_KERATAN_SULFATE                             11377
## KEGG_GLYCEROLIPID_METABOLISM                                                    11358

This dataframe therefore summarizes the enrichment p-value for each pathway, FDR adjusted p-value, number of significant genes in the gene set that intersect with genes in the pathway, etc.

With these results, let’s identify which pathways meet a statistical enrichment p-value filter of 0.05:

SigPathways <- PathwayResults[which(PathwayResults$`p-value` < 0.05),]
rownames(SigPathways)

##  [1] "KEGG_O_GLYCAN_BIOSYNTHESIS"                               
##  [2] "KEGG_HYPERTROPHIC_CARDIOMYOPATHY_HCM"                     
##  [3] "KEGG_ARRHYTHMOGENIC_RIGHT_VENTRICULAR_CARDIOMYOPATHY_ARVC"
##  [4] "KEGG_PROTEASOME"                                          
##  [5] "KEGG_OOCYTE_MEIOSIS"                                      
##  [6] "KEGG_VASCULAR_SMOOTH_MUSCLE_CONTRACTION"                  
##  [7] "KEGG_WNT_SIGNALING_PATHWAY"                               
##  [8] "KEGG_HEDGEHOG_SIGNALING_PATHWAY"                          
##  [9] "KEGG_FOCAL_ADHESION"                                      
## [10] "KEGG_ECM_RECEPTOR_INTERACTION"                            
## [11] "KEGG_COMPLEMENT_AND_COAGULATION_CASCADES"                 
## [12] "KEGG_GNRH_SIGNALING_PATHWAY"

Answer to Environmental Health Question 8

With this, we can now answer Environmental Health Question #8: What biological pathways are disrupted in association with flaming pine needles exposure in the lung, identified through systems level analyses?

Answer: Biological pathways involved in cardiopulmonary function (e.g., arrhythmogenic right ventricular cardiomyopathy, hypertrophic cardiomyopathy, vascular smooth muscle contraction), carcinogenesis signaling (e.g., Wnt signaling pathway, hedgehog signaling pathway), and hormone signaling (e.g., Gnrh signaling pathway), among others.

Concluding Remarks

In this module, users are guided through the uploading, organization, QA/QC, statistical analysis, and systems level analysis of an example -omics dataset based on transcriptomic responses to biomass burn scenarios, representing environmental exposure scenarios of growing concern worldwide. It is worth noting that the methods described herein represent a fraction of the approaches and tools that can be leveraged in the analysis of -omics datasets, and methods should be tailored to the purposes of each individual analysis’ goal. For additional example research projects that have leveraged -omics and systems biology to address environmental health questions, see the following select relevant publications:

Genomic publications evaluating gene-environment interactions and relations to disease etiology:

Balik-Meisner M, Truong L, Scholl EH, La Du JK, Tanguay RL, Reif DM. Elucidating Gene-by-Environment Interactions Associated with Differential Susceptibility to Chemical Exposure. Environ Health Perspect. 2018 Jun 28;126(6):067010. PMID: 29968567.
Ward-Caviness CK, Neas LM, Blach C, Haynes CS, LaRocque-Abramson K, Grass E, Dowdy ZE, Devlin RB, Diaz-Sanchez D, Cascio WE, Miranda ML, Gregory SG, Shah SH, Kraus WE, Hauser ER. A genome-wide trans-ethnic interaction study links the PIGR-FCAMR locus to coronary atherosclerosis via interactions between genetic variants and residential exposure to traffic. PLoS One. 2017 Mar 29;12(3):e0173880. PMID: 28355232.

Transcriptomic publications evaluating gene expression responses to environmental exposures and relations to disease etiology:

Chang Y, Rager JE, Tilton SC. Linking Coregulated Gene Modules with Polycyclic Aromatic Hydrocarbon-Related Cancer Risk in the 3D Human Bronchial Epithelium. Chem Res Toxicol. 2021 Jun 21;34(6):1445-1455. PMID: 34048650.
Chappell GA, Rager JE, Wolf J, Babic M, LeBlanc KJ, Ring CL, Harris MA, Thompson CM. Comparison of Gene Expression Responses in the Small Intestine of Mice Following Exposure to 3 Carcinogens Using the S1500+ Gene Set Informs a Potential Common Adverse Outcome Pathway. Toxicol Pathol. 2019 Oct;47(7):851-864. PMID: 31558096.
Manuck TA, Eaves LA, Rager JE, Fry RC. Mid-pregnancy maternal blood nitric oxide-related gene and miRNA expression are associated with preterm birth. Epigenomics. 2021 May;13(9):667-682. PMID: 33890487.

Epigenomic publications evaluating microRNA, CpG methylation, and/or histone methylation responses to environmental exposures and relations to disease etiology:

Chappell GA, Rager JE. Epigenetics in chemical-induced genotoxic carcinogenesis. Curr Opinion Toxicol. 2017 Oct; 6:10-17.
Rager JE, Bailey KA, Smeester L, Miller SK, Parker JS, Laine JE, Drobná Z, Currier J, Douillet C, Olshan AF, Rubio-Andrade M, Stýblo M, García-Vargas G, Fry RC. Prenatal arsenic exposure and the epigenome: altered microRNAs associated with innate and adaptive immune signaling in newborn cord blood. Environ Mol Mutagen. 2014 Apr;55(3):196-208. PMID: 24327377.
Rager JE, Bauer RN, Müller LL, Smeester L, Carson JL, Brighton LE, Fry RC, Jaspers I. DNA methylation in nasal epithelial cells from smokers: identification of ULBP3-related effects. Am J Physiol Lung Cell Mol Physiol. 2013 Sep 15;305(6):L432-8. PMID: 23831618.
Smeester L, Rager JE, Bailey KA, Guan X, Smith N, García-Vargas G, Del Razo LM, Drobná Z, Kelkar H, Stýblo M, Fry RC. Epigenetic changes in individuals with arsenicosis. Chem Res Toxicol. 2011 Feb 18;24(2):165-7. PMID: 21291286.

Metabolomic publications evaluating changes in the metabolome in response to environmental exposures and involved in disease etiology:

Lu K, Abo RP, Schlieper KA, Graffam ME, Levine S, Wishnok JS, Swenberg JA, Tannenbaum SR, Fox JG. Arsenic exposure perturbs the gut microbiome and its metabolic profile in mice: an integrated metagenomics and metabolomics analysis. Environ Health Perspect. 2014 Mar;122(3):284-91. PMID: 24413286; PMCID: PMC3948040.
Manuck TA, Lai Y, Ru H, Glover AV, Rager JE, Fry RC, Lu K. Metabolites from midtrimester plasma of pregnant patients at high risk for preterm birth. Am J Obstet Gynecol MFM. 2021 Jul;3(4):100393. PMID: 33991707.

Microbiome publications evaluating changes in microbiome profiles in relation to the environment and human disease:

Chi L, Bian X, Gao B, Ru H, Tu P, Lu K. Sex-Specific Effects of Arsenic Exposure on the Trajectory and Function of the Gut Microbiome. Chem Res Toxicol. 2016 Jun 20;29(6):949-51.PMID: 27268458.
Cho I, Blaser MJ. The human microbiome: at the interface of health and disease. Nat Rev Genet. 2012 Mar 13;13(4):260-70. PMID: 22411464.
Lu K, Abo RP, Schlieper KA, Graffam ME, Levine S, Wishnok JS, Swenberg JA, Tannenbaum SR, Fox JG. Arsenic exposure perturbs the gut microbiome and its metabolic profile in mice: an integrated metagenomics and metabolomics analysis. Environ Health Perspect. 2014 Mar;122(3):284-91. PMID: 24413286.

Exposome publications evaluating changes in chemical signatures in relation to the environment and human disease:

Rager JE, Strynar MJ, Liang S, McMahen RL, Richard AM, Grulke CM, Wambaugh JF, Isaacs KK, Judson R, Williams AJ, Sobus JR. Linking high resolution mass spectrometry data with exposure and toxicity forecasts to advance high-throughput environmental monitoring. Environ Int. 2016 Mar;88:269-280. PMID: 26812473.
Rappaport SM, Barupal DK, Wishart D, Vineis P, Scalbert A. The blood exposome and its role in discovering causes of disease. Environ Health Perspect. 2014 Aug;122(8):769-74. PMID: 24659601.
Viet SM, Falman JC, Merrill LS, Faustman EM, Savitz DA, Mervish N, Barr DB, Peterson LA, Wright R, Balshaw D, O’Brien B. Human Health Exposure Analysis Resource (HHEAR): A model for incorporating the exposome into health studies. Int J Hyg Environ Health. 2021 Jun;235:113768. PMID: 34034040.

Test Your Knowledge

Using “Module6_2_TYKInput1.csv” (gene counts) and “Module6_2_TYKInput2.csv” (sample info) datasets, which have already been run through the QC process described in this module and are ready for analysis:

Conduct a differential expression analysis associated with “Season” using DESeq2. (Don’t worry about including any covariates or using RUV).
Find the number of significant differentially expressed genes associated with “Season”, at the .05 level.