7.4 Database Integration: Air Quality, Mortality, and Environmental Justice Data

The development of this training module was led by Cavin Ward-Caviness with contributions from Alexis Payton.

Disclaimer: The views expressed in this document are those of the author and do not necessarily reflect the views or policies of the U.S. EPA.

Introduction to Training Module

This training module provides an example analysis based on the integration of data across multiple environmental health databases. This module specifically guides trainees through an explanation of how the data were downloaded and organized, and then details the loading of required packages and datasets. Then, code is provided for visualizing county-level air pollution measures, including PM2.5, NO2, and SO2. These measures were obtained through U.S. EPA monitoring stations are visualized here as the yearly average. Air pollution concentrations are then evaluated for potential relationship to the health outcome, mortality. Specifically, age-adjusted mortality rates are organized and statistically related to PM2.5 concentrations through linear regression modeling. Crude statistical models are first provided that do not take into account the influence of potential confounders. Then, statistical models are used that adjust for potential confounders, including adult smoking rates, obesity, food environment indicators, physical activity, employment status, rural vs urban living percentages, sex, ethnicity, and race. Results from these models point to the finding that areas with higher percentages of African-Americans may be experiencing higher impacts from PM2.5 on mortality. This relationship is of high interest, as it represents a potential environmental justice issue.

Introduction to Exposure and Health Databases

In this training module, we will use publicly available exposure and health databases to examine associations between air quality and mortality across the entire U.S. Specific databases that we will query include the following:

Training Module’s Environmental Health Questions

This training module was specifically developed to answer the following environmental health questions:

  1. What areas of the U.S. are most heavily monitored for air quality?
  2. Is there an association between long-term, ambient PM2.5 concentrations and mortality at the county level? Stated another way we are asking: Do counties with higher annual average PM2.5 concentrations also have higher all-cause mortality rates?
  3. What is the difference when running crude statistical models vs. statistical models that adjust for potential confounding, when evaluating the relationship between PM2.5 and mortality?
  4. Do observed associations differ when comparing between counties with a higher vs. lower percentage of African-Americans which can indicate environmental justice concerns?

Script Preparations

Cleaning the global environment

rm(list=ls())

Installing required R packages

If you already have these packages installed, you can skip this step, or you can run the below code which checks installation status for you

if (!requireNamespace("sf"))
  install.packages("sf");
if (!requireNamespace("tidyverse"))
  install.packages("tidyverse");
if (!requireNamespace("ggthemes"))
  install.packages("ggthemes");

Loading R packages required for this session

library(sf)
library(tidyverse)
library(ggthemes)
library(DT)

Set your working directory

setwd("/filepath to where your input files are")

Let’s start by loading the datasets needed for this training module. As detailed in the introduction, these data were previously downloaded and organized, and specifically made available for this training exercise as a compiled RDataset, containing organized dataframes ready to analyze.

We can now read in these organized data using the load() function.

load("Module7_4_Input/Module7_4_InputData.RData")

First let’s take a look at the geographic data, starting with the county-level shapefile (counties_shapefile). This dataframe contains the location information for the counties which we will use for visualizations.

head(counties_shapefile)
#> Simple feature collection with 6 features and 9 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -102.042 ymin: 37.38839 xmax: -84.79633 ymax: 43.49961
#> Geodetic CRS:  NAD83
#>   STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID      NAME LSAD      ALAND   AWATER
#> 1      19      107 00465242 0500000US19107 19107    Keokuk   06 1500067253  1929323
#> 2      19      189 00465283 0500000US19189 19189 Winnebago   06 1037261946  3182052
#> 3      20      093 00485011 0500000US20093 20093    Kearny   06 2254696689  1133601
#> 4      20      123 00485026 0500000US20123 20123  Mitchell   06 1817632928 44979981
#> 5      20      187 00485055 0500000US20187 20187   Stanton   06 1762104518   178555
#> 6      21      005 00516849 0500000US21005 21005  Anderson   06  522745702  6311537
#>                         geometry
#> 1 MULTIPOLYGON (((-92.41199 4...
#> 2 MULTIPOLYGON (((-93.97076 4...
#> 3 MULTIPOLYGON (((-101.5419 3...
#> 4 MULTIPOLYGON (((-98.49007 3...
#> 5 MULTIPOLYGON (((-102.0419 3...
#> 6 MULTIPOLYGON (((-85.16919 3...

These geographic data are represented by the following columns. (Some columns are not described below, since we don’t need them for these analyses.):

  • STATEFP: State FIPS code (1 or 2 digits)
  • COUNTYFP: County FIPS Code (3 digits)
  • GEOID: Geographic identifier that combines the state and county FIPS codes
  • NAME: County name
  • geometry: Latitude and longitude coordinates

Now let’s view the EPA Air Quality Survey (AQS) data (epa_ap_county) collected from 2016. This dataframe represents county-level air quality measures, as detailed above. This dataframe is in melted (or long) format, meaning that different air quality measures are organized across rows, with variable measure indicators in the Parameter.Name, Units.of.Measure, and County_Avg columns.

head(epa_ap_county)
#>   State.Code State.Name County.Code County.Name State_County_Code Parameter.Name
#> 1          1    Alabama           3     Baldwin              1003           PM25
#> 2          1    Alabama          27        Clay              1027           PM25
#> 3          1    Alabama          33     Colbert              1033           PM25
#> 4          1    Alabama          49      DeKalb              1049           PM25
#> 5          1    Alabama          55      Etowah              1055           PM25
#> 6          1    Alabama          69     Houston              1069           PM25
#>              Units.of.Measure County_Avg
#> 1 Micrograms/cubic meter (LC)   7.226446
#> 2 Micrograms/cubic meter (LC)   7.363793
#> 3 Micrograms/cubic meter (LC)   7.492241
#> 4 Micrograms/cubic meter (LC)   7.695690
#> 5 Micrograms/cubic meter (LC)   8.195575
#> 6 Micrograms/cubic meter (LC)   7.061538

These air quality data are represented by the following columns:

  • State.Code: State FIPS code (1 or 2 digits)
  • State.Name: State name
  • County.Code: County FIPS code (1-3 digits)
  • County.Name: County name
  • State_County_Code: Combined state and county code (separated by a 0)
  • Parameter.Name: Name of the air pollutant
  • Units.of.Measure: Units of measurement
  • County_Avg: County average

These data can be restructured to view air quality measures as separate variables labeled across columns using:

# transform from the "long" to "wide" format for the pollutants
epa_ap_county <- epa_ap_county %>% 
  select(-Units.of.Measure) %>% 
  unique() %>% 
  tidyr::spread(Parameter.Name, County_Avg)

head(epa_ap_county)
#>   State.Code State.Name County.Code County.Name State_County_Code NO2     PM25 SO2
#> 1          1    Alabama           3     Baldwin              1003  NA 7.226446  NA
#> 2          1    Alabama          27        Clay              1027  NA 7.363793  NA
#> 3          1    Alabama          33     Colbert              1033  NA 7.492241  NA
#> 4          1    Alabama          49      DeKalb              1049  NA 7.695690  NA
#> 5          1    Alabama          55      Etowah              1055  NA 8.195575  NA
#> 6          1    Alabama          69     Houston              1069  NA 7.061538  NA

Note that we can now see the specific pollutant variables NO2, PM25, and SO2 on the far right.

Population-Weighted vs. Unweighted Exposures

Here we pause briefly to speak on population-weighted vs unweighted exposures. The analysis we will be undertaking is known as an “ecological” analysis where we are looking at associations by area, e.g. county. When studying environmental exposures by area a common practice is to try to weight the exposures by the population so that exposures better represent the “burden” faced by the population. Ideally for this you would want a systematic model or assessment of the exposure that corresponded with a fine-scale population estimate so that for each county you could weight exposures within different areas of the county by the population exposed. This sparse monitor data (we will examine the population covered later in the tutorial) is not population weighted, but should you see similar analyses with population weighting of exposures you should simply be aware that this better captures the “burden” of exposure experienced by the population within the area estimated, typically zip code or county.

Now let’s view the CDC’s mortality dataset collected from 2016 (cdc_mortality):

head(cdc_mortality)
#>   Notes             County County.Code Deaths Population Crude.Rate Age.Adjusted.Rate
#> 1    NA Autauga County, AL        1001    520      55416     938.36            884.40
#> 2    NA Baldwin County, AL        1003   1974     208563     946.48            716.92
#> 3    NA Barbour County, AL        1005    256      25965     985.94            800.68
#> 4    NA    Bibb County, AL        1007    239      22643    1055.51            927.67
#> 5    NA  Blount County, AL        1009    697      57704    1207.89            989.37
#> 6    NA Bullock County, AL        1011    133      10362    1283.54           1063.00
#>   Age.Adjusted.Rate.Standard.Error
#> 1                            39.46
#> 2                            16.58
#> 3                            51.09
#> 4                            61.03
#> 5                            38.35
#> 6                            93.69

These mortality data are represented by the following columns. We’ll just ignore the Notes column for our purposes:

  • County: County name with the state abbreviation
  • County.Code: Combined state and county code (separated by a 0)
  • Deaths: Number of deaths in 2016
  • Population: County population in 2016
  • Crude.Rate: Death rate (\(\frac{Number~of~Deaths}{Population}* 100,000\))
  • Age.Adjusted.Rate: age-adjusted death rate (\(\sum{(Age~Specific~Death~Rate * Standard~Population~Weight)} * 100,000\))
  • Age.Adjusted.Rate.Standard.Error: Standard error of the age-adjusted rate

We can create a visualization of the age-adjusted death rate and air pollutants throughout the U.S. to further inform what these data look like:

# Can merge them by the FIPS county code which we need to create for the counties_shapefile
counties_shapefile$State_County_Code <- as.character(as.numeric(paste0(counties_shapefile$STATEFP, counties_shapefile$COUNTYFP)))

# Let's merge in the air pollution and mortality data and plot it
counties_shapefile <- merge(counties_shapefile, epa_ap_county, by.x = "State_County_Code", by.y = "State_County_Code", all.x=TRUE)
counties_shapefile <- merge(counties_shapefile, cdc_mortality, by.x = "State_County_Code", by.y = "County.Code")

# Will remove alaska and hawaii just so we can look at the continental USA
county_plot <- subset(counties_shapefile, !STATEFP %in% c("02","15"))

# We can start with a simple plot of age-adjusted mortality rate, PM2.5, NO2, and SO2 levels across the U.S.
plot(county_plot[,c("Age.Adjusted.Rate","PM25","NO2","SO2")])

You can see that these result in the generation of four different nationwide plots, showing the distributions of age-adjusted mortality rates, PM2.5 concentrations, NO2 concentrations, and SO2 concentrations, averaged per county.

Let’s make a nicer looking plot with ggplot(), looking just at PM2.5 levels:

ggplot(data = county_plot) + 
  geom_sf(aes(fill = PM25)) +
  
  # Changing colors
  scale_fill_viridis_c(option ="plasma", name ="PM2.5", 
                       guide = guide_colorbar(
                         direction = "horizontal",
                         barheight = unit(2, units = "mm"),
                         barwidth = unit(50, units = "mm"),
                         draw.ulim = F,
                         title.position = 'top',
                         # some shifting around
                         title.hjust = 0.5,
                         label.hjust = 0.5)) +
  ggtitle(expression(2016~Annual~PM[2.5]~EPA~Monitors)) + 
  theme_map() +
  theme(plot.title = element_text(hjust = 0.5, size = 22)) 

Answer to Environmental Health Question 1

With this, we can answer Environmental Health Question #1: What areas of the U.S. are most heavily monitored for air quality?

Answer: We can tell from the PM2.5 specific plot that air monitors are densely located in California, and other areas with high populations (including the East Coast), while large sections of central U.S. lack air monitoring data.


Analyzing Relationships between PM2.5 and Mortality

Crude Model

Now the primary question is whether counties with higher PM2.5 also have higher mortality rates. To answer this question, first we need to perform some data merging in preparation for this analysis.

# Merging mortality and air pollution data
model_data <- merge(epa_ap_county, cdc_mortality, by.x = "State_County_Code", by.y = "County.Code")

As we saw in the above plot, only a portion of the USA is covered by PM2.5 monitors. Let’s see what our population coverage is

sum(model_data$Population, na.rm = TRUE)
#> [1] 232169063
sum(cdc_mortality$Population, na.rm = TRUE)
#> [1] 323127513
sum(model_data$Population, na.rm = TRUE)/sum(cdc_mortality$Population, na.rm = TRUE)*100
#> [1] 71.8506

We can do a quick visual inspection of this using a scatter plot which will also let us check for unexpected distributions of the data (always a good idea)

ggplot(model_data) + 
  geom_point(aes(x = Age.Adjusted.Rate, y = PM25)) +
  ggtitle(expression(PM[2.5]~by~Mortality~Rate)) + 
  xlab('Age-adjusted Mortality Rate') + ylab(expression(PM[2.5]))  # changing axis labels 

The univariate correlation is a simple way of quantifying this potential relationship, though it does not nearly tell the complete story. Just as a starting point, let’s run this simple univariate correlation calculation using the cor() function.

cor(model_data$Age.Adjusted.Rate, model_data$PM25, use = "complete.obs")
#> [1] 0.178549

Now, let’s obtain a more complete estimate of the data through regression modeling. As an initial starting point, let’s run this model without any confounders (also known as a ‘crude’ model).

A simple linear regression model in R can be carried out using the lm() function. Here, we are evaluating age-adjusted mortality rate (age.adjusted.rate) as the dependent variable, and PM2.5 as the independent variable. Values used in evaluating this model were weighted to adjust for the fact that some counties have higher precision in their age-adjusted mortality rate (represented by a smaller age-adjusted rate standard error).

# running the linear regression model
m <- lm(Age.Adjusted.Rate ~ PM25,
        data = model_data, weights = 1/model_data$Age.Adjusted.Rate.Standard.Error)   
# viewing the model results through the summary function
summary(m)   
#> 
#> Call:
#> lm(formula = Age.Adjusted.Rate ~ PM25, data = model_data, weights = 1/model_data$Age.Adjusted.Rate.Standard.Error)
#> 
#> Weighted Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -110.096   -9.792    8.363   25.090   84.316 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  661.038     22.528  29.343  < 2e-16 ***
#> PM25           8.956      2.878   3.112  0.00195 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 29.93 on 608 degrees of freedom
#>   (88 observations deleted due to missingness)
#> Multiple R-squared:  0.01568,    Adjusted R-squared:  0.01406 
#> F-statistic: 9.685 on 1 and 608 DF,  p-value: 0.001945

Shown here are summary level statistics summarizing the results of the linear regression model.

In the model summary we see several features:

  • Estimate: the regression coefficient which tells us the relationship between a 1 ug/m3 change (elevation) in PM2.5 and the age-adjusted all-cause mortality rate
  • Std. Error: the standard error of the estimate
  • t value: represents the T-statistic which is the test statistic for linear regression models and is simply the Estimate divided by Std. Error. This t value is compared with the Student’s T distribution in order to determine the p-value (Pr(>|t|)).

The residuals are the difference between the predicted outcome (age-adjusted mortality rate) and known outcome from the data. For linear regression to be valid this should be normally distributed. The residuals from a linear model can be extracted using the residuals() function and plotted to see their distribution.

Answer to Envrionmental Health Question 2

With this, we can answer Environmental Health Question #2: Is there an association between long-term, ambient PM2.5 concentrations and mortality at the county level?

Answer: Based on these model results, there may indeed be an association between PM2.5 concentrations and mortality (p=0.0019)


Adjusting for Covariates

To more thoroughly examine the potential relationship between PM2.5 concentrations and mortality it is absolutely essential to adjust for confounders. Let’s start by viewing the human covariate data that contains some confounders of interest.

datatable(county_health)

These covariate data are represented by the following columns:

  • State.FIPS.Code: Single digit code assigned to each state
  • County.FIPS.Code: FIPS code unique to counties within a particular state (1 digit)
  • County.5.digit.FIPS.Code: Unique county FIPS code (5 digits)
  • State.Abbreviation: State abbreviation
  • Name: County Name (or state name that contains the state average)
  • Release.Year: Year the data was publicly released
  • Various confounders indicative of race, ethnicity, socioeconomic status, population density, education, health-related variables in columns 7-36
# Merging the covariate data in with the AQS data
model_data <- merge(model_data, county_health, by.x = "State_County_Code", 
                    by.y = "County.5.digit.FIPS.Code", all.x = TRUE)

# Now we add some relevant confounders to the linear regression model
m <- lm(Age.Adjusted.Rate ~ PM25 + Adult.smoking + Adult.obesity + Food.environment.index + 
          Physical.inactivity + High.school.graduation + Some.college + Unemployment + 
          Violent.crime + Percent.Rural + Percent.Females + Percent.Asian + 
          Percent.Non.Hispanic.African.American + Percent.American.Indian.and.Alaskan.Native + 
          Percent.NonHispanic.white, data = model_data, weights = 1/model_data$Age.Adjusted.Rate.Standard.Error)

# And finally we check to see if the statistical association persists
summary(m)
#> 
#> Call:
#> lm(formula = Age.Adjusted.Rate ~ PM25 + Adult.smoking + Adult.obesity + 
#>     Food.environment.index + Physical.inactivity + High.school.graduation + 
#>     Some.college + Unemployment + Violent.crime + Percent.Rural + 
#>     Percent.Females + Percent.Asian + Percent.Non.Hispanic.African.American + 
#>     Percent.American.Indian.and.Alaskan.Native + Percent.NonHispanic.white, 
#>     data = model_data, weights = 1/model_data$Age.Adjusted.Rate.Standard.Error)
#> 
#> Weighted Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -70.059  -7.055   0.960   8.119  64.052 
#> 
#> Coefficients:
#>                                              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                                 616.47976  157.20842   3.921 1.00e-04 ***
#> PM25                                          3.84851    1.69321   2.273 0.023444 *  
#> Adult.smoking                               859.27337  137.81312   6.235 9.43e-10 ***
#> Adult.obesity                               605.84315   97.71952   6.200 1.16e-09 ***
#> Food.environment.index                      -28.65536    3.93359  -7.285 1.22e-12 ***
#> Physical.inactivity                         117.60916   91.41519   1.287 0.198834    
#> High.school.graduation                       55.14446   40.78490   1.352 0.176944    
#> Some.college                               -244.41255   49.78764  -4.909 1.23e-06 ***
#> Unemployment                                 97.98159  161.89025   0.605 0.545290    
#> Violent.crime                                 0.07551    0.01516   4.982 8.62e-07 ***
#> Percent.Rural                               -12.55308   20.84834  -0.602 0.547364    
#> Percent.Females                            -271.45504  299.16925  -0.907 0.364640    
#> Percent.Asian                                74.54327   55.89262   1.334 0.182897    
#> Percent.Non.Hispanic.African.American       153.39910   39.49623   3.884 0.000116 ***
#> Percent.American.Indian.and.Alaskan.Native  200.70010  102.75028   1.953 0.051329 .  
#> Percent.NonHispanic.white                   240.17976   26.30538   9.130  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 14.22 on 514 degrees of freedom
#>   (168 observations deleted due to missingness)
#> Multiple R-squared:  0.7841, Adjusted R-squared:  0.7778 
#> F-statistic: 124.4 on 15 and 514 DF,  p-value: < 2.2e-16

Answer to Envrionmental Health Question 3

With this, we can answer Environmental Health Question #3: What is the difference when running crude statistical models vs. statistical models that adjust for potential confounding, when evaluating the relationship between PM2.5 and mortality?

Answer: The relationship between PM2.5 and mortality remains statistically significant when confounders are considered (p=0.023), though is not as significant as when running the crude model (p=0.0019).


Environmental Justice Considerations

Environmental justice is the study of how societal inequities manifest in differences in environmental health risks either due to greater exposures or a worse health response to exposures. Racism and racial discrimination are major factors in both how much pollution people are exposed to as well what their responses might be due to other co-existing inequities (e.g. poverty, access to healthcare, food deserts). Race is a commonly used proxy for experiences of racism and racial discrimination.

Here we will consider the race category of Non-Hispanic African-American to investigate if pollution levels differ by percent African-Americans in a county and if associations between PM2.5 and mortality also differ by this variable, which could indicate environmental justice-relevant issues revealed by this data. We will specifically evaluate data distributions across counties with the highest percentage of African-Americans (top 25%) vs. lowest percentage of African-Americans (bottom 25%).

First let’s visualize the distribution of African-American percentage in these data

ggplot(model_data) + 
  geom_histogram(aes(x = Percent.Non.Hispanic.African.American*100)) + 
  ggtitle("Percent African-American by County") + 
  xlab('Percent')

Let’s look at a summary of the data

summary(model_data$Percent.Non.Hispanic.African.American)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#> 0.00056 0.01220 0.03995 0.09415 0.12282 0.70458      67

We can compute quartiles of the data

model_data$AA_quartile <- with(model_data, cut(Percent.Non.Hispanic.African.American, 
                                breaks = quantile(Percent.Non.Hispanic.African.American, probs = seq(0,1, by = 0.25), na.rm = TRUE), 
                                include.lowest = TRUE, ordered_result = TRUE, labels = FALSE))

Then we can use these quartiles to see that as the Percent African-American increases so does the PM2.5 exposure by county

AA_summary <- model_data %>% 
  filter(!is.na(Percent.Non.Hispanic.African.American)) %>% 
  group_by(AA_quartile) %>% 
  summarise(Percent_AA = mean(Percent.Non.Hispanic.African.American, na.rm = TRUE), Mean_PM25 = mean(PM25, na.rm = TRUE))

AA_summary
#> # A tibble: 4 × 3
#>   AA_quartile Percent_AA Mean_PM25
#>         <int>      <dbl>     <dbl>
#> 1           1    0.00671      5.97
#> 2           2    0.0238       7.03
#> 3           3    0.0766       7.93
#> 4           4    0.269        8.05

Now that we can see this trend, let’s add some statistics. Let’s specifically compare the relationships between PM2.5 and mortality within the bottom 25% AA counties (quartile 1); and also the highest 25% AA counties (quartile 4)

# first need to subset the data by these quartiles of interest
low_AA <- subset(model_data, AA_quartile == 1)
high_AA <- subset(model_data, AA_quartile == 4)

# then we can run the relevant statistical models
m.low <- lm(Age.Adjusted.Rate ~ PM25 + Adult.smoking + Adult.obesity + Food.environment.index + Physical.inactivity +
          High.school.graduation + Some.college + Unemployment + Violent.crime + Percent.Rural + Percent.Females +
          Percent.Asian + Percent.American.Indian.and.Alaskan.Native + Percent.NonHispanic.white,
        data = low_AA, weights = 1/low_AA$Age.Adjusted.Rate.Standard.Error)

m.high <- lm(Age.Adjusted.Rate ~ PM25 + Adult.smoking + Adult.obesity + Food.environment.index + Physical.inactivity +
          High.school.graduation + Some.college + Unemployment + Violent.crime + Percent.Rural + Percent.Females +
          Percent.Asian + Percent.American.Indian.and.Alaskan.Native + Percent.NonHispanic.white,
        data = high_AA, weights = 1/high_AA$Age.Adjusted.Rate.Standard.Error)


# We see a striking difference in the associations
rbind(c("Bottom 25% AA Counties",round(summary(m.low)$coefficients["PM25",c(1,2,4)],3)),
      c("Top 25% AA Counties",round(summary(m.high)$coefficients["PM25",c(1,2,4)],3)))
#>                               Estimate Std. Error Pr(>|t|)
#> [1,] "Bottom 25% AA Counties" "4.782"  "3.895"    "0.222" 
#> [2,] "Top 25% AA Counties"    "14.552" "4.13"     "0.001"

Answer to Envrionmental Health Question 4

With this, we can answer Environmental Health Question #4: Do observed associations differ when comparing between counties with a higher vs. lower percentage of African-Americans which can indicate environmental justice concerns?

Answer: Yes. Counties with the highest percentage of African-Americans (top 25%) demonstrated a highly significant association between PM2.5 and age-adjusted mortality, even when adjusting for confounders (p=0.001), meaning that the association between PM2.5 and mortality within these counties may be exacerbated by factors relevant to race. Conversely, counties with the lowest percentages of African-Americans (bottom 25%) did not demonstrate a significant association between PM2.5 and age-adjusted mortality, indicating that these counties may have lower environmental health risks due to factors correlated with race.


Concluding Remarks

In conclusion, this training module serves as a novel example data integration effort of high relevance to environmental health issues. Databases that were evaluated here span exposure data (i.e., Air Quality System data), health outcome data (i.e., mortality data), and county-level characteristics on healthcare, food environment, and other potentially relevant confounders (i.e., county-level variables that may impact observed relationships), and environmental justice data (e.g., race). Many different visualization and statistical approaches were used, largely based on linear regression modeling and county-level characteristic stratifications. These example statistics clearly support the now-established relationship between PM2.5 concentrations in the air and mortality. Importantly, these related methods can be tailored to address new questions to increase our understanding between exposures to chemicals in the environment and adverse health outcomes, as well as the impact of different individual or area characteristics on these relationships - particularly those that might relate to environmental justice concerns.


This training module provided some examples looking at PM2.5 concentration data. Using the same “Module7_4_InputData.data” file (also saved as “Module7_4_TYKInput.data”), let’s ask similar questions but now looking at NO2 concentration data.

  1. Is there an association between long-term, ambient NO2 concentrations and age-adjusted mortality at the county level (i.e., the crude model results)?
  2. After adjusting for covariates, is there an association between long-term, ambient NO2 concentrations and age-adjusted mortality at the county level (i.e., the adjusted model results)?