3.2 Gene Expression Omnibus
This training module was developed by Dr. Kyle R. Roell and Dr. Julia E. Rager
Fall 2021
Introduction to the Environmental Health Database, Gene Expression Omnibus (GEO)
GEO is a publicly available database repository of high-throughput gene expression data and hybridization arrays, chips, and microarrays that span genome-wide endpoints of genomics, transcriptomics, and epigenomics. The repository is organized and managed by the The National Center for Biotechnology Information (NCBI), which seeks to advance science and health by providing access to biomedical and genomic information. The three overall goals of GEO are to: (1) Provide a robust, versatile database in which to efficiently store high-throughput functional genomic data, (2) Offer simple submission procedures and formats that support complete and well-annotated data deposits from the research community, and (3) Provide user-friendly mechanisms that allow users to query, locate, review and download studies and gene expression profiles of interest.
Of high relevance to environmental health, data organized within GEO can be pulled and analyzed to address new environmental health questions, leveraging previously generated data. For example, we have pulled gene expression data from acute myeloid leukemia patients and re-analyzed these data to elucidate new mechanisms of epigenetically-regulated networks involved in cancer, that in turn, may be modified by environmental insults, as previously published in Rager et al. 2012. We have also pulled and analyzed gene expression data from published studies evaluating toxicity resulting from hexavalent chromium exposure, to further substantiate the role of epigenetic mediators in hexavelent chromium-induced carcinogenesis (see Rager et al. 2019). This training exercise leverages an additional dataset that we published and deposited through GEO to evaluate the effects of formaldehyde inhalation exposure, as detailed below.
Introduction to Training Module
This training module provides an overview on pulling and analyzing data deposited in GEO. As an example, data are pulled from the published GEO dataset recorded through the online series GSE42394. This series represents Affymetrix rat genome-wide microarray data generated from our previous study, aimed at evaluating the transcriptomic effects of formaldehyde across three tissues: the nose, blood, and bone marrow. For the purposes of this training module, we will focus on evaluating gene expression profiles from nasal samples after 7 days of exposure, collected from rats exposed to 2 ppm formaldehyde via inhalation. These findings, in addition to other epigenomic endpoint measures, have been previously published (see Rager et al. 2014).
This training module specifically guides trainees through the loading of required packages and data, including the manual upload of GEO data as well as the upload/organization of data leveraging the GEOquery package. Data are then further organized and combined with gene annotation information through the merging of platform annotation files. Example visualizations are then produced, including boxplots to evaluate the overall distribution of expression data across samples, as well as heat map visualizations that compare unscaled versus scaled gene expression values. Statistical analyses are then included to identify which genes are significantly altered in expression upon exposure to formaldehyde. Together, this training module serves as a simple example showing methods to access and download GEO data and to perform data organization, analysis, and visualization tasks through applications-based questions.
Training Module’s Environmental Health Questions
This training module was specifically developed to answer the following environmental health questions:
- What kind of molecular identifiers are commonly used in microarray-based -omics technologies?
- How can we convert platform-specific molecular identifiers used in -omics study designs to gene-level information?
- Why do we often scale gene expression signatures prior to heat map visualizations?
- What genes are altered in expression by formaldehyde inhalation exposure?
- What are the potential biological consequences of these gene-level perturbations?
Installing required R packages
If you already have these packages installed, you can skip this step, or you can run the below code which checks installation status for you
if (!requireNamespace("tidyverse"))
install.packages("tidyverse")
if (!requireNamespace("reshape2"))
install.packages("reshape2")
# GEOquery, this will install BiocManager if you don't have it installed
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
::install("GEOquery") BiocManager
## Warning: package(s) not installed when version(s) same as current; use `force = TRUE` to
## re-install: 'GEOquery'
Loading R packages required for this session
library(tidyverse)
library(reshape2)
library(GEOquery)
For more information on the tidyverse package, see its associated CRAN webpage, primary webpage, and peer-reviewed article released in 2018.
For more information on the reshape2 package, see its associated CRAN webpage, R Documentation, and helpful website providing an introduction to the reshape2 package.
For more information on the GEOquery package, see its associated Bioconductor website and R Documentation file.
Loading and Organizing the Example Dataset
Let’s start by loading the GEO dataset needed for this training module. As explained in the introduction, this module walks through two methods of uploading GEO data: manual option vs automatic option using the GEOquery package. These two methods are detailed below.
GEO Data in R
1. Manually Downloading and Uploading GEO Files
In this first method, we will navigate to the datasets within the GEO website, manually download its associated text data file, save it in our working directory, and then upload it into our global environment in R.
For the purposes of this training exercise, we manually downloaded the GEO series matrix file from the GEO series webpage, located at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42394. The specific file that was downloaded was noted as GSE42394_series_matrix.txt, pulled by clicking on the link indicated by the red arrow from the GEO series webpage:
For simplicity, we also have already pre-filtered this file for the samples we are interested in, focusing on the rat nasal gene expression data after 7 days of exposure to gaseous formaldehyde. This filtered file was saved as: GSE42394_series_matrix_filtered.txt
At this point, we can simply read in this pre-filtered text file for the purposes of this training module
= read.table(file="Module3_2/Module3_2_GSE42394_series_matrix_filtered.txt",
geodata_manual header=T)
Because this is a manual approach, we have to also manually define the treated and untreated samples (based on manually opening the surrounding metadata from the GEO webpage)
Manually defining treated and untreated for these samples of interest:
= c("GSM1150940", "GSM1150941", "GSM1150942")
exposed_manual = c("GSM1150937", "GSM1150938", "GSM1150939") unexposed_manual
2. Uploading and Organizing GEO Files through the GEOquery Package
In this second method, we will leverage the GEOquery package, which allows for easier downloading and reading in of data from GEO without having to manually download raw text files, and manually assign sample attributes (e.g., exposed vs unexposed). This package is set-up to automatically merge sample information from GEO metadata files with raw genome-wide datasets.
Let’s first use the getGEO function (from the GEOquery package) to load data from our series matrix. Note that this line of code may take a couple of minutes
= getGEO(filename='Module3_2/Module3_2_GSE42394_series_matrix.txt') geo.getGEO.data
One of the reasons the getGEO package is so helpful is that we can automatically link a dataset with nicely organized sample information using the pData() function.
= pData(geo.getGEO.data) sampleInfo
Let’s view this sample information / metadata file, first by viewing what the column headers are.
colnames(sampleInfo)
## [1] "title" "geo_accession"
## [3] "status" "submission_date"
## [5] "last_update_date" "type"
## [7] "channel_count" "source_name_ch1"
## [9] "organism_ch1" "characteristics_ch1"
## [11] "characteristics_ch1.1" "characteristics_ch1.2"
## [13] "characteristics_ch1.3" "characteristics_ch1.4"
## [15] "characteristics_ch1.5" "treatment_protocol_ch1"
## [17] "growth_protocol_ch1" "molecule_ch1"
## [19] "extract_protocol_ch1" "label_ch1"
## [21] "label_protocol_ch1" "taxid_ch1"
## [23] "hyb_protocol" "scan_protocol"
## [25] "description" "data_processing"
## [27] "platform_id" "contact_name"
## [29] "contact_email" "contact_department"
## [31] "contact_institute" "contact_address"
## [33] "contact_city" "contact_zip/postal_code"
## [35] "contact_country" "supplementary_file"
## [37] "data_row_count" "age:ch1"
## [39] "cell type:ch1" "gender:ch1"
## [41] "strain:ch1" "time:ch1"
## [43] "treatment:ch1"
Then viewing the first five columns.
1:10,1:5] sampleInfo[
## title geo_accession
## GSM1150937 Nose_7DayControl_Rep1 [Affymetrix] GSM1150937
## GSM1150938 Nose_7DayControl_Rep2 [Affymetrix] GSM1150938
## GSM1150939 Nose_7DayControl_Rep3 [Affymetrix] GSM1150939
## GSM1150940 Nose_7DayExposed_Rep1 [Affymetrix] GSM1150940
## GSM1150941 Nose_7DayExposed_Rep2 [Affymetrix] GSM1150941
## GSM1150942 Nose_7DayExposed_Rep3 [Affymetrix] GSM1150942
## GSM1150943 WhiteBloodCells_7DayControl_Rep1 [Affymetrix] GSM1150943
## GSM1150944 WhiteBloodCells_7DayControl_Rep2 [Affymetrix] GSM1150944
## GSM1150945 WhiteBloodCells_7DayControl_Rep3 [Affymetrix] GSM1150945
## GSM1150946 WhiteBloodCells_7DayExposed_Rep1 [Affymetrix] GSM1150946
## status submission_date last_update_date
## GSM1150937 Public on Jan 07 2014 May 29 2013 Jan 07 2014
## GSM1150938 Public on Jan 07 2014 May 29 2013 Jan 07 2014
## GSM1150939 Public on Jan 07 2014 May 29 2013 Jan 07 2014
## GSM1150940 Public on Jan 07 2014 May 29 2013 Jan 07 2014
## GSM1150941 Public on Jan 07 2014 May 29 2013 Jan 07 2014
## GSM1150942 Public on Jan 07 2014 May 29 2013 Jan 07 2014
## GSM1150943 Public on Jan 07 2014 May 29 2013 Jan 07 2014
## GSM1150944 Public on Jan 07 2014 May 29 2013 Jan 07 2014
## GSM1150945 Public on Jan 07 2014 May 29 2013 Jan 07 2014
## GSM1150946 Public on Jan 07 2014 May 29 2013 Jan 07 2014
This shows that each sample is provided with a unique number starting with “GSM”, and these are described by information summarized in the “title” column. We can also see that these data were made public on Jan 7, 2014.
Let’s view the next five columns.
1:10,6:10] sampleInfo[
## type channel_count source_name_ch1
## GSM1150937 RNA 1 Nasal epithelial cells, 7 day, unexposed
## GSM1150938 RNA 1 Nasal epithelial cells, 7 day, unexposed
## GSM1150939 RNA 1 Nasal epithelial cells, 7 day, unexposed
## GSM1150940 RNA 1 Nasal epithelial cells, 7 day, exposed
## GSM1150941 RNA 1 Nasal epithelial cells, 7 day, exposed
## GSM1150942 RNA 1 Nasal epithelial cells, 7 day, exposed
## GSM1150943 RNA 1 Circulating white blood cells, 7 day, unexposed
## GSM1150944 RNA 1 Circulating white blood cells, 7 day, unexposed
## GSM1150945 RNA 1 Circulating white blood cells, 7 day, unexposed
## GSM1150946 RNA 1 Circulating white blood cells, 7 day, exposed
## organism_ch1 characteristics_ch1
## GSM1150937 Rattus norvegicus gender: male
## GSM1150938 Rattus norvegicus gender: male
## GSM1150939 Rattus norvegicus gender: male
## GSM1150940 Rattus norvegicus gender: male
## GSM1150941 Rattus norvegicus gender: male
## GSM1150942 Rattus norvegicus gender: male
## GSM1150943 Rattus norvegicus gender: male
## GSM1150944 Rattus norvegicus gender: male
## GSM1150945 Rattus norvegicus gender: male
## GSM1150946 Rattus norvegicus gender: male
We can see that information is provided here surrounding the type of sample that was analyzed (i.e., RNA), more information on the collected samples within the column ‘source_name_ch1’, and the organism (rat) is provided in the ‘organism_ch1’ column.
More detailed metadata information is provided throughout this file, as seen when viewing the column headers above.
Defining Samples
Now, we can use this information to define the samples we want to analyze. Note that this is the same step we did manually above.
In this training exercise, we are focusing on responses in the nose, so we can easily filter for cell type = Nasal epithelial cells (specifically in the “cell type:ch1” variable). We are also focusing on responses collected after 7 days of exposure, which we can filter for using time = 7 day (specifically in the “time:ch1” variable). We will also define exposed and unexposed samples using the variable “treatment:ch1”.
First, let’s subset the sampleInfo dataframe to just keep the samples we’re interested in
# Define a vector variable (here we call it 'keep') that will store rows we want to keep
= rownames(sampleInfo[which(sampleInfo$`cell type:ch1`=="Nasal epithelial cells"
keep & sampleInfo$`time:ch1`=="7 day"),])
# Then subset the sample info for just those samples we defined in keep variable
= sampleInfo[keep,] sampleInfo
Next, we can pull the exposed and unexposed animal IDs. Let’s first see how these are labeled within the “treatment:ch1” variable.
unique(sampleInfo$`treatment:ch1`)
## [1] "unexposed" "2 ppm formaldehyde"
And then search for the rows of data, pulling the sample animal IDs (which are in the variable ‘geo_accession’).
= sampleInfo[which(sampleInfo$`treatment:ch1`=="2 ppm formaldehyde"),
exposedIDs "geo_accession"]
= sampleInfo[which(sampleInfo$`treatment:ch1`=="unexposed"),
unexposedIDs "geo_accession"]
The next step is to pull the expression data we want to use in our analyses. The GEOquery function, exprs(), allows us to easily pull these data. Here, we can pull the data we’re interested in using the exprs() function, while defining the data we want to pull based off our previously generated ‘keep’ vector.
# As a reminder, this is what the 'keep' vector includes
# (i.e., animal IDs that we're interested in)
keep
## [1] "GSM1150937" "GSM1150938" "GSM1150939" "GSM1150940" "GSM1150941"
## [6] "GSM1150942"
# Using the exprs() function
= exprs(geo.getGEO.data[,keep]) geodata
Let’s view the full dataset as is now:
head(geodata)
## GSM1150937 GSM1150938 GSM1150939 GSM1150940 GSM1150941 GSM1150942
## 10700001 5786.60 5830.08 5637.34 5313.33 5557.04 5469.90
## 10700002 192.92 206.86 220.83 183.12 177.16 198.64
## 10700003 1820.98 1795.79 1735.70 1578.02 1681.58 1632.20
## 10700004 66.95 65.61 64.41 60.19 60.41 60.67
## 10700005 770.07 753.41 731.20 684.53 657.25 667.66
## 10700006 5.80 5.35 5.48 5.58 5.39 5.35
This now represents a matrix of data, with animal IDs as column headers and expression levels within the matrix.
Simplifying column names
These column names are not the easiest to interpret, so let’s rename these columns to indicate which animals were from the exposed vs. unexposed groups.
We need to first convert our expression dataset to a dataframe so we can edit columns names, and continue with downstream data manipulations that require dataframe formats.
= data.frame(geodata) geodata
Let’s remind ourselves what the column names are:
colnames(geodata)
## [1] "GSM1150937" "GSM1150938" "GSM1150939" "GSM1150940" "GSM1150941"
## [6] "GSM1150942"
Which ones of these are exposed vs unexposed animals can be determined by viewing our previously defined vectors.
exposedIDs
## [1] "GSM1150940" "GSM1150941" "GSM1150942"
unexposedIDs
## [1] "GSM1150937" "GSM1150938" "GSM1150939"
With this we can tell that the first three listed IDs are from unexposed animals, and the last three IDs are from exposed animals.
Let’s simplify the names of these columns to indicate exposure status and replicate number.
colnames(geodata) = c("Control_1", "Control_2", "Control_3", "Exposed_1",
"Exposed_2", "Exposed_3")
And we’ll now need to re-define our ‘exposed’ vs ‘unexposed’ vectors for downstream script.
= c("Exposed_1", "Exposed_2", "Exposed_3")
exposedIDs = c("Control_1", "Control_2", "Control_3") unexposedIDs
Viewing the data again:
head(geodata)
## Control_1 Control_2 Control_3 Exposed_1 Exposed_2 Exposed_3
## 10700001 5786.60 5830.08 5637.34 5313.33 5557.04 5469.90
## 10700002 192.92 206.86 220.83 183.12 177.16 198.64
## 10700003 1820.98 1795.79 1735.70 1578.02 1681.58 1632.20
## 10700004 66.95 65.61 64.41 60.19 60.41 60.67
## 10700005 770.07 753.41 731.20 684.53 657.25 667.66
## 10700006 5.80 5.35 5.48 5.58 5.39 5.35
These data are now looking easier to interpret/analyze. Still, the row identifiers include 8 digit numbers starting with “107…”. We know that this dataset is a gene expression dataset, but these identifiers, in themselves, don’t tell us much about what genes these are referring to. These numeric IDs specifically represent microarray probesetIDs, that were produced by the Affymetrix platform used in the original study.
But how can we tell which genes are represented by these data?!
Adding Gene Symbol Information
Each -omics dataset contained within GEO points to a specific platform that was used to obtain measurements. In instances where we want more information surrounding the molecular identifiers, we can merge the platform-specific annotation file with the molecular IDs given in the full dataset.
For example, let’s pull the platform-specific annotation file for this experiment.
- Let’s revisit the website that contained the original dataset on GEO
- Scroll down to where it lists “Platforms”, and there is a hyperlinked platform number “GPL6247” (see arrow below)
- Click on this, and you will be navigated to a different GEO website describing the Affymetrix rat array platform that was used in this analysis
- Note that this website also includes information on when this array became available, links to other experiments that have used this platform within GEO, and much more
- Here, we’re interested in pulling the corresponding gene symbol information for the probeset IDs
- To do so, scroll to the bottom, and click “Annotation SOFT table…” and download the corresponding .gz file within your working directory
- Unzip this, and you will find the master annotation file: “GPL6247.annot”
In this exercise, we’ve already done these steps and unzipped the file in our working directory. So at this point, we can simply read in this annotation dataset, still using the GEOquery function to help automate.
= GEOquery::getGEO(filename="Module3_2/Module3_2_GPL6247.annot") geo.annot
Now we can use the Table function from GEOquery to pull data from the annotation dataset.
= GEOquery::Table(geo.annot)[,c("ID", "Gene symbol")]
id.gene.table 1:10,1:2] id.gene.table[
## ID Gene symbol
## 1 10701620 Vom2r67///Vom2r5///Vom2r6///Vom2r4
## 2 10701630
## 3 10701632
## 4 10701636
## 5 10701643
## 6 10701648 Vom2r5
## 7 10701654 Vom2r6
## 8 10701663
## 9 10701666
## 10 10701668 Vom2r65///Vom2r1
With these two columns of data, we now have the needed IDs and gene symbols to match with our dataset.
Within the full dataset, we need to add a new column for the probeset ID, taken from the rownames, in preparation for the merging step.
$ID = rownames(geodata) geodata
We can now merge the gene symbol information by ID with our expression data.
= merge(geodata, id.gene.table, by="ID")
geodata_genes head(geodata_genes)
## ID Control_1 Control_2 Control_3 Exposed_1 Exposed_2 Exposed_3
## 1 10700001 5786.60 5830.08 5637.34 5313.33 5557.04 5469.90
## 2 10700002 192.92 206.86 220.83 183.12 177.16 198.64
## 3 10700003 1820.98 1795.79 1735.70 1578.02 1681.58 1632.20
## 4 10700004 66.95 65.61 64.41 60.19 60.41 60.67
## 5 10700005 770.07 753.41 731.20 684.53 657.25 667.66
## 6 10700006 5.80 5.35 5.48 5.58 5.39 5.35
## Gene symbol
## 1
## 2
## 3
## 4
## 5
## 6
Note that many of the probeset IDs do not map to full gene symbols, which is shown here by viewing the top few rows - this is expected in genome-wide analyses based on microarray platforms.
Let’s look at the first 25 unique genes in these data:
= unique(geodata_genes$`Gene symbol`)
UniqueGenes 1:25] UniqueGenes[
## [1] ""
## [2] "Vom2r67///Vom2r5///Vom2r6///Vom2r4"
## [3] "Vom2r5"
## [4] "Vom2r6"
## [5] "Vom2r65///Vom2r1"
## [6] "Vom2r65///Vom2r5///Vom2r1///Vom2r6///Vom2r4"
## [7] "Vom2r65///Vom2r5///Vom2r6///Vom2r4"
## [8] "Raet1e"
## [9] "Lrp11"
## [10] "Katna1"
## [11] "Ppil4"
## [12] "Zc3h12d"
## [13] "Shprh"
## [14] "Fbxo30"
## [15] "Epm2a"
## [16] "Sf3b5"
## [17] "Plagl1"
## [18] "Fuca2"
## [19] "Adat2"
## [20] "Hivep2"
## [21] "Nmbr"
## [22] "Cited2"
## [23] "Txlnb"
## [24] "Reps1"
## [25] "Perp"
Again, you can see that the first value listed is blank, representing probesetIDs that do not match to fully annotated gene symbols. Though the rest pertain for gene symbols annotated to the rat genome.
You can also see that some gene symbols have multiple entries, separated by “///”
To simplify identifiers, we can pull just the first gene symbol, and remove the rest by using gsub().
$`Gene symbol` = gsub("///.*", "", geodata_genes$`Gene symbol`) geodata_genes
Let’s alphabetize by main expression dataframe by gene symbol.
= geodata_genes[order(geodata_genes$`Gene symbol`),] geodata_genes
And then re-view these data:
1:5,] geodata_genes[
## ID Control_1 Control_2 Control_3 Exposed_1 Exposed_2 Exposed_3
## 1 10700001 5786.60 5830.08 5637.34 5313.33 5557.04 5469.90
## 2 10700002 192.92 206.86 220.83 183.12 177.16 198.64
## 3 10700003 1820.98 1795.79 1735.70 1578.02 1681.58 1632.20
## 4 10700004 66.95 65.61 64.41 60.19 60.41 60.67
## 5 10700005 770.07 753.41 731.20 684.53 657.25 667.66
## Gene symbol
## 1
## 2
## 3
## 4
## 5
In preparation for the visualization steps below, let’s reset the probeset IDs to rownames.
rownames(geodata_genes) = geodata_genes$ID
# Can then remove this column within the dataframe
$ID = NULL geodata_genes
Finally let’s rearrange this dataset to include gene symbols as the first column, right after rownames (probeset IDs).
= geodata_genes[,c(ncol(geodata_genes),1:(ncol(geodata_genes)-1))]
geodata_genes 1:5,] geodata_genes[
## Gene symbol Control_1 Control_2 Control_3 Exposed_1 Exposed_2
## 10700001 5786.60 5830.08 5637.34 5313.33 5557.04
## 10700002 192.92 206.86 220.83 183.12 177.16
## 10700003 1820.98 1795.79 1735.70 1578.02 1681.58
## 10700004 66.95 65.61 64.41 60.19 60.41
## 10700005 770.07 753.41 731.20 684.53 657.25
## Exposed_3
## 10700001 5469.90
## 10700002 198.64
## 10700003 1632.20
## 10700004 60.67
## 10700005 667.66
dim(geodata_genes)
## [1] 29214 7
Note that this dataset includes expression measures across 29,214 probes, representing 14,019 unique genes. For simplicity in the final exercises, let’s just filter for rows representing mapped genes.
= geodata_genes[!(geodata_genes$`Gene symbol` == ""), ]
geodata_genes dim(geodata_genes)
## [1] 16024 7
Note that this dataset now includes 16,024 rows with mapped gene symbol identifiers.
With this, we can now answer Environmental Health Question 1: What kind of molecular identifiers are commonly used in microarray-based -omics technologies?
Answer: Platform-specific probeset IDs.
We can also answer Environmental Health Question 2: How can we convert platform-specific molecular identifiers used in -omics study designs to gene-level information?
Answer: We can merge platform-specific IDs with gene-level information using annotation files.
Visualizing Data
Visualizing Gene Expression Data using Boxplots and Heat Maps
- To visualize the -omics data, we can generate boxplots, heat maps, any many other types of visualizations
- Here, we provide an example to plot a boxplot, which can be used to visualize the variability amongst samples
- We also provide an example to plot a heat map, comparing unscaled vs scaled gene expression profiles
- These visualizations can be useful to both simply visualize the data as well as identify patterns across samples or genes
Boxplot Visualizations
For this example, let’s simply use R’s built in boxplot() function.
We only want to use columns with our expression data (2 to 7), so let’s pull those columns when running the boxplot function.
boxplot(geodata_genes[,2:7])
There seem to be a lot of variability within each sample’s range of expression levels, with many outliers. This makes sense given that we are analyzing the expression levels across the rat’s entire genome, where some genes won’t be expressed at all while others will be highly expressed due to biological and/or potential technical variability.
To show plots without outliers, we can simply use outline=F.
boxplot(geodata_genes[,2:7], outline=F)
Heat Map Visualizations
Heat maps are also useful when evaluating large datasets.
There are many different packages you can use to generate heat maps. Here, we use the superheat package.
It also takes awhile to plot all genes across the genome, so to save time for this training module, let’s randomly select 100 rows to plot.
# To ensure that the same subset of genes are selected each time
= 101
set.seed
# Random selection of 100 rows
= sample(1:nrow(geodata_genes),100)
row.sample
# Heat map code
::superheat(geodata_genes[row.sample,2:7], # Only want to plot non-id/gene symbol columns (2 to 7)
superheatpretty.order.rows = TRUE,
pretty.order.cols = TRUE,
col.dendrogram = T,
row.dendrogram = T)
This produces a heat map with sample IDs along the x-axis and probeset IDs along the y-axis. Here, the values being displayed represent normalized expression values.
One way to improve our ability to distinguish differences between samples is to scale expression values across probes.
Scaling data
Z-score is a very common method of scaling that transforms data points to reflect the number of standard deviations they are from the overall mean. Z-score scaling data results in the overall transformation of a dataset to have an overall mean = 0 and standard deviation = 1.
Let’s see what happens when we scale this gene expression dataset by z-score across each probe. This can be easily done using the scale function.
This specific scale function works by centering and scaling across columns, but since we want to use it across probesets (organized as rows), we need to first transpose our dataset, then run the scale function.
= scale(t(geodata_genes[,2:7]), center=T, scale=T) geodata_genes_scaled
Now we can transpose it back to the original format (i.e., before it was transposed).
= t(geodata_genes_scaled) geodata_genes_scaled
And then view what the normalized and now scaled expression data look like for now a random subset of 100 probesets (representing genes).
With these data now scaled, we can more easily visualize patterns between samples.
We can also answer Environmental Health Question 3: Why do we often scale gene expression signatures prior to heat map visualizations?
Answer: To better visualize patterns in expression signatures between samples.
Now, with these data nicely organized, we can see how statistics can help find which genes show trends in expression associated with formaldehyde exposure.
Statistical Analyses
Statistical Analyses to Identify Genes altered by Formaldehyde
A simple way to identify differences between formaldehyde-exposed and unexposed samples is to use a t-test. Because there are so many tests being performed, one for each gene, it is also important to carry out multiple test corrections through a p-value adjustment method.
We need to run a t-test for each row of our dataset. This exercise demonstrates two different methods to run a t-test:
- Method 1: using a ‘for loop’
- Method 2: using the apply function (more computationally efficient)
Method 1 (m1): ‘For Loop’
Let’s first re-save the molecular probe IDs to a column within the dataframe, since we need those values in the loop function.
$ID = rownames(geodata_genes) geodata_genes
We also need to initially create an empty dataframe to eventually store p-values.
= matrix(0, nrow=nrow(geodata_genes), ncol=3)
pValue_m1 colnames(pValue_m1) = c("ID", "pval", "padj")
head(pValue_m1)
## ID pval padj
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0
## [4,] 0 0 0
## [5,] 0 0 0
## [6,] 0 0 0
You can see the empty dataframe that was generated through this code.
Then we can loop through the entire dataset to acquire p-values from t-test statistics, comparing n=3 exposed vs n=3 unexposed samples.
for (i in 1:nrow(geodata_genes)) {
#Get the ID
= geodata_genes[i, "ID"];
ID.i
#Run the t-test and get the p-value
= t.test(geodata_genes[i,exposedIDs], geodata_genes[i,unexposedIDs])$p.value;
pval.i
#Store the data in the empty dataframe
"ID"] = ID.i;
pValue_m1[i,"pval"] = pval.i
pValue_m1[i,
}
View the results:
# Note that we're not pulling the last column (padj) since we haven't calculated these yet
1:5,1:2] pValue_m1[
## ID pval
## [1,] "10903987" "0.0812802229304083"
## [2,] "10714794" "0.757311314118124"
## [3,] "10858408" "0.390952310869689"
## [4,] "10872252" "0.0548937136005506"
## [5,] "10905819" "0.173539535577791"
Method 2 (m2): Apply Function
For the second method, we can use the apply() function to calculate resulting t-test p-values more efficiently labeled.
= apply(geodata_genes[,2:7], 1, function(x) t.test(x[unexposedIDs],
pValue_m2 $p.value)
x[exposedIDs])names(pValue_m2) = geodata_genes[,"ID"]
We can convert the results into a dataframe to make it similar to m1 matrix we created above.
= data.frame(pValue_m2)
pValue_m2
# Now create an ID column
$ID = rownames(pValue_m2) pValue_m2
Then we can view at the two datasets to see they result in the same pvalues.
head(pValue_m1)
## ID pval padj
## [1,] "10903987" "0.0812802229304083" "0"
## [2,] "10714794" "0.757311314118124" "0"
## [3,] "10858408" "0.390952310869689" "0"
## [4,] "10872252" "0.0548937136005506" "0"
## [5,] "10905819" "0.173539535577791" "0"
## [6,] "10907585" "0.215200167867295" "0"
head(pValue_m2)
## pValue_m2 ID
## 10903987 0.08128022 10903987
## 10714794 0.75731131 10714794
## 10858408 0.39095231 10858408
## 10872252 0.05489371 10872252
## 10905819 0.17353954 10905819
## 10907585 0.21520017 10907585
We can see from these results that both methods (m1 and m2) generate the same statistical p-values.
Interpreting Results
Let’s again merge these data with the gene symbols to tell which genes are significant.
First, let’s convert to a dataframe and then merge as before, for one of the above methods as an example (m1).
= data.frame(pValue_m1)
pValue_m1 = merge(pValue_m1, id.gene.table, by="ID") pValue_m1
We can also add a multiple test correction by applying a false discovery rate-adjusted p-value; here, using the Benjamini Hochberg (BH) method.
# Here fdr is an alias for B-H method
"padj"] = p.adjust(pValue_m1[,"pval"], method=c("fdr")) pValue_m1[,
Now, we can sort these statistical results by adjusted p-values.
= pValue_m1[order(pValue_m1[,'padj']),]
pValue_m1.sorted head(pValue_m1.sorted)
## ID pval padj Gene symbol
## 9143 10837582 4.57288413593085e-07 0.00732759 Olr633
## 5640 10783648 1.93688668590855e-06 0.01551834 Slc7a8
## 8 10701699 0.0166773380386967 0.13089115 Lrp11
## 17 10701817 0.0131845685452954 0.13089115 Fuca2
## 19 10701830 0.00586885826460337 0.13089115 Hivep2
## 23 10701880 0.00749149990409956 0.13089115 Reps1
Pulling just the significant genes using an adjusted p-value threshold of 0.05.
= pValue_m1[which(pValue_m1[,'padj'] < .05),]
adj.pval.sig
# Viewing these genes
adj.pval.sig
## ID pval padj Gene symbol
## 5640 10783648 1.93688668590855e-06 0.01551834 Slc7a8
## 9143 10837582 4.57288413593085e-07 0.00732759 Olr633
With this, we can answer Environmental Health Question 4: What genes are altered in expression by formaldehyde inhalation exposure?
Answer: Olr633 and Slc7a8.
Finally, let’s plot these using a mini heat map. Note that we can use probesetIDs, then gene symbols, in rownames to have them show in heat map labels.
Note that this statistical filter is pretty strict when comparing only n=3 vs n=3 biological replicates. If we loosen the statistical criteria to p-value < 0.05, this is what we can find:
= pValue_m1[which(pValue_m1[,'pval'] < .05),]
pval.sig nrow(pval.sig)
## [1] 5327
5327 genes with significantly altered expression!
Note that other filters are commonly applied to further focus these lists (e.g., background and fold change filters) prior to statistical evaluation, which can impact the final results. See Rager et al. 2013 for further statistical approaches and visualizations.
With this, we can answer Environmental Health Question 5: What are the potential biological consequences of these gene-level perturbations?
Answer: Olr633 stands for ‘olfactory receptor 633’. Olr633 is up-regulated in expression, meaning that formaldehyde inhalation exposure has a smell that resulted in ‘activated’ olfactory receptors in the nose of these exposed rats. Slc7a8 stands for ‘solute carrier family 7 member 8’. Slc7a8 is down-regulated in expression, and it plays a role in many biological processes, that when altered, can lead to changes in cellular homeostasis and disease.
Concluding Remarks
In conclusion, this training module provides an overview of pulling, organizing, visualizing, and analyzing -omics data from the online repository, Gene Expression Omnibus (GEO). Trainees are guided through the overall organization of an example high dimensional dataset, focusing on transcriptomic responses in the nasal epithelium of rats exposed to formaldehyde. Data are visualized and then analyzed using standard two-group comparisons. Findings are interpreted for biological relevance, yielding insight into the effects resulting from formaldehyde exposure.
For additional case studies that leverage GEO, see the following publications that also address environmental health questions from our research group:
Rager JE, Fry RC. The aryl hydrocarbon receptor pathway: a key component of the microRNA-mediated AML signalisome. Int J Environ Res Public Health. 2012 May;9(5):1939-53. doi: 10.3390/ijerph9051939. Epub 2012 May 18. PMID: 22754483; PMCID: PMC3386597.
Rager JE, Suh M, Chappell GA, Thompson CM, Proctor DM. Review of transcriptomic responses to hexavalent chromium exposure in lung cells supports a role of epigenetic mediators in carcinogenesis. Toxicol Lett. 2019 May 1;305:40-50. PMID: 30690063.