7.3 CompTox Dashboard Data through APIs

This training module was developed by Paul Kruse and Caroline Ring, with contributions from Julia E. Rager.

All input files (script, data, and figures) can be downloaded from the UNC-SRP TAME2 GitHub website.

Disclaimer: The views expressed in this document are those of the authors and do not necessarily reflect the views or policies of the U.S. EPA.

Introduction to Training Module

Environmental health research related to chemical exposures often requires accessing and wrangling chemical-specific data. The CompTox Chemicals Dashboard (CCD), developed by the United States Environmental Protection Agency, is a publicly-accessible database that integrates chemical data from multiple domains. Chemical data available on the CCD include physicochemical, environmental fate and transport, exposure, toxicokinetics, functional use, in vivo toxicity, in vitro bioassay, and mass spectra data. The CCD was first described in Williams et al. (2017), and has been continuously expanded since. The CCD is heavily used by researchers who do cheminformatics work of various kinds – computational toxicology, computational exposure science, analytical chemistry, chemical safety assessment, etc. The CCD is used by cheminformaticians not only at EPA, but across governmental agencies both within the U.S. and worldwide; in private industry; in non-governmental organizations; in academia; and others. It has become an indispensable tool for many researchers.

This training module provides an overview of the physico-chemical, hazard, and bioactivity data available through the CCD; different ways to access these data; and some examples of how these data may be used. We will first introduce the CCD and how to access it. Then we will focus on an automated, programmatic method for retrieving data from the CCD using the ctxR R package. Through some basic data visualization and analysis using the R programming language, we will explore some data retrieved from the CCD, and gain insights both in how to wrangle the data and combine different methods of accessing the data to build automated pipelines for use in more complex settings.

Note, as the ctxR package accesses data that is periodically updated, some code chunks will produce numbers that may change slightly with data updates. Keep this in mind when running these code chunks in the future.

Training Module’s Environmental Health Questions

This training module was specifically developed to answer the following questions:

  1. After automatically pulling the fourth Drinking Water Contaminant Candidate List from the CompTox Chemicals Dashboard, list the properties and property types present in the data. What are the mean values for a specific property when grouped by property type and when ungrouped?

  2. The physico-chemical property data are reported with both experimental and predicted values present for many chemicals. Are there differences between the mean predicted and experimental results for a variety of physico-chemical properties?

  3. After pulling the genotoxicity data for the different environmental contaminant data sets, list the assays associated with the chemicals in each data set. How many unique assays are there in each data set? What are the different assay categories and how many unique assays for each assay category are there?

  4. The genotoxicity data contains information on which assays have been conducted for different chemicals and the results of those assays. How many chemicals in each data set have a ‘positive’, ‘negative’, and ‘equivocal’ value for the assay result?

  5. Based on the genotoxicity data reported for the chemical with DTXSID identifier DTXSID0020153, how many assays resulted in a positive/equivocal/negative value? Which of the assays were positive and how many of each were there for the most reported assays?

  6. After pulling the hazard data for the different data sets, list the different exposure routes for which there is data. What are the unique risk assessment classes for hazard values for the oral route and for the inhalation exposure route? For each such exposure route, which risk assessment class is most represented by the data sets?

  7. There are several types of toxicity values for each exposure route. List the unique toxicity values for the oral and inhalation routes. What are the unique types of toxicity values for the oral route and for the inhalation route? How many of these are common to both the oral and inhalation routes for each data set?

  8. When examining different toxicity values, the data may be reported in multiple units. To assess the relative hazard from this data, it is important to take into account the different units and adjust accordingly. List the units reported for the cancer slope factor, reference dose, and reference concentration values associated with the oral and inhalation exposure routes for human hazard. Which chemicals in each data set have the highest cancer slope factor, lowest reference dose, and lowest reference concentration values?

Script Preparations

Cleaning the Global Environment

rm(list=ls())

Installing Required R Packages

if (!requireNamespace('ctxR'))
  install.packages('ctxR')

if (!requireNamespace('ggplot2'))
  install.packages('ggplot2')

Loading R Packages

# Used to interface with CompTox Chemicals Dashboard
library(ctxR)

# Used to visualize data in a variety of plot designs
library(ggplot2)

Introduction to CompTox Chemicals Dashboard

Accessing chemical data and wrangling it is a vital step in many types of workflows related to chemical, biological, and environmental modeling. While there are many resources available from which one can pull data, the CompTox Chemicals Dashboard built and maintained by the United States Environmental Protection Agency is particularly well-designed and suitable for these purposes. Originally introduced in The CompTox Chemistry Dashboard: a community data resource for environmental chemistry, the CCD contains information on over 1.2 million chemicals as of December 2023.

The CCD includes chemical information from many different domains, including physicochemical, environmental fate and transport, exposure, usage, in vivo toxicity, and in vitro bioassay data (Williams et al., 2017).

The CCD can be searched either one chemical at a time, or using a batch search.

CCTE’s CTX Application Programming Interfaces (APIs) for Automated Batch Search of the CCD

Recently, the Center for Computational Toxicology and Exposure (CCTE) developed a set of Application Programming Interfaces (APIs) that allows programmatic access to the CCD, bypassing the manual steps of the web-based batch search workflow. The Computational Toxicology and Exposure (CTX) APIs effectively automate the process of accessing and downloading data from the web pages that make up the CCD.

The CTX APIs are publicly available at no cost to the user. However, in order to use the CTX APIs, you must have an API key. The API key uniquely identifies you to the CTX servers and verifies that you have permission to access the database. Getting an API key is free, but requires contacting the API support team at ccte_api@epa.gov.

For more information on the data accessible through the CTX APIs and related tools, please visit the US EPA page on Computational Toxicology and Exposure Online Resources. The CTX APIs are one of many resources developed within this research realm and make available many of the data resources beyond the CCD.

The APIs are organized into four sets of “endpoints” (chemical data domains): Chemical, Hazard, Bioactivity, and Exposure. Pictured below is what the Chemical section looks like and can be found at CTX API Chemical Endpoints.

The APIs can be explored through the pictured web interface at https://api-ccte.epa.gov/docs/chemical.html .

CTX API Authentication

Authentication is the first tab on the left. Authentication is required to use the APIs. To authenticate yourself in the API web interface, input your unique API key.

CTX API Endpoints

On the left of the API web interface, there are several different tabs, one for each endpoint in the Chemical domain. The endpoints are organized by the type of information provided. For instance, the Chemical Details Resource endpoint provides basic chemical information; the Chemical Property Resource endpoint provides more comprehensive physico-chemical property information; the Chemical Fate Resource endpoint provides chemical fate and transport information; and so on.

Constructing CTX API Requests

As mentioned above, APIs effectively automate the process of accessing and downloading data from the web pages that make up the CCD. APIs do this by automatically constructing requests using the Hypertext Transfer Protocol (HTTP) that enables communication between clients (e.g. your computer) and servers (e.g. the CCD).

In the CTX API web interface, the colored boxes next to each endpoint indicate the type of the associated HTTP method: either a GET request (“GET”, blue) or a a POST request (“POS”, green). GET is used to request data from a specific web resource (e.g. a specific URL); POST is used to send data to a server to create or update a web resource. For the CTX APIs, POST requests are used to perform multiple (batch) searches in a single API call; GET requests are used for non-batch searches. You do not need to understand the details of POST and GET requests in order to use the API.

Click on the second item under Chemical Details Resource, the tab labeled Get data by dtxsid. The following page will appear.

This page has two subheadings: “Path Parameters” and “Query-String Parameters”. “Path Parameters” contains user-specified parameters that are required in order to tell the API what URL (web address) to access. In this case, the required parameter is a string for the DTXSID identifying the chemical to be searched.

“Query-String Parameters” contain user-specific parameters (usually optional) that tell the API what specific type(s) of information to download from the specified URL. In this case, the optional parameter is a projection parameter, a string that can take one of five values (chemicaldetailall, chemicaldetailstandard, chemicalidentifier, chemicalstructure, ntatoolkit). Depending on the value of this string, the API can return different sets of information about the chemical. If the projection parameter is left blank, then a default set of chemical information is returned.

The default return format is displayed below and includes a variety of fields with data types represented.

We show what returned data from searching Bisphenol A looks like using this endpoint with the chemicaldetailstandard value for projection selected.

Formatting an http request is not necessarily intuitive nor worth the time for someone not already familiar with the process, so these endpoints may provide a resource that for many would require a significant investment in time and energy to learn how to use. However, there is a solution to this in the form of the R package ctxR.

ctxR was developed to streamline the process of accessing the information available through the CTX APIs without requiring prior knowledge of how to use APIs. The ctxR package is available in stable form on CRAN and a development version may be found at the USEPA ctxR GitHub repository. As an example, we demonstrate the ease with which one may retrieve the information given by this endpoint for Bisphenol A using the ctxR approach and contrast it with the approach using the CCD website or CTX Chemical API Endpoint website.

Setting, using, and storing the API key

We store the API key required to access the APIs. To do this for the current session, run the first command. If you want to store your key across multiple sessions, run the second command.

# This stores the key in the current session
register_ctx_api_key(key = '<YOUR API KEY>')

# This stores the key across multiple sessions and only needs to be run once. 
# If the key changes, rerun this with the new key.
register_ctx_api_key(key = '<YOUR API KEY>', write = TRUE)

To check that your key has successfully been stored for the session, run the following command.

ctx_key()

Retrieving chemical details

Now, we demonstrate how to retrieve the information for BPA given by the Chemical Detail Resource endpoint under the chemicaldetailstandard value for projection. Note, this projection value is the default value for the function get_chemical_details().

BPA_chemical_detail <- get_chemical_details(DTXSID = 'DTXSID7020182')
dim(BPA_chemical_detail)
#> [1]  1 37
class(BPA_chemical_detail)
#> [1] "data.table" "data.frame"
names(BPA_chemical_detail)
#>  [1] "id"                    "cpdataCount"           "inchikey"             
#>  [4] "wikipediaArticle"      "monoisotopicMass"      "percentAssays"        
#>  [7] "pubchemCount"          "pubmedCount"           "sourcesCount"         
#> [10] "qcLevel"               "qcLevelDesc"           "isotope"              
#> [13] "multicomponent"        "totalAssays"           "pubchemCid"           
#> [16] "relatedSubstanceCount" "relatedStructureCount" "casrn"                
#> [19] "compoundId"            "genericSubstanceId"    "preferredName"        
#> [22] "activeAssays"          "molFormula"            "hasStructureImage"    
#> [25] "iupacName"             "smiles"                "inchiString"          
#> [28] "qcNotes"               "qsarReadySmiles"       "msReadySmiles"        
#> [31] "irisLink"              "pprtvLink"             "descriptorStringTsv"  
#> [34] "isMarkush"             "dtxsid"                "dtxcid"               
#> [37] "toxcastSelect"

Comparing Physico-chemical Properties between Two Important Environmental Contaminant Lists

We study two different data sets contained in the CCD and observe how they relate and how they differ. The two data sets that we will explore are a water contaminant priority list and an air toxics list.

The fourth Drinking Water Contaminant Candidate List (CCL4) is a set of chemicals that “…are not subject to any proposed or promulgated national primary drinking water regulations, but are known or anticipated to occur in public water systems….” Moreover, this list “…was announced on November 17, 2016. The CCL 4 includes 97 chemicals or chemical groups and 12 microbial contaminants….” The National-Scale Air Toxics Assessments (NATA) is “… EPA’s ongoing comprehensive evaluation of air toxics in the United States… a state-of-the-science screening tool for State/Local/Tribal agencies to prioritize pollutants, emission sources and locations of interest for further study in order to gain a better understanding of risks… use general information about sources to develop estimates of risks which are more likely to overestimate impacts than underestimate them….”

These lists can be found in the CCD at CCL4 with additional information at CCL4 information and NATADB with additional information at NATA information. The quotes from the previous paragraph were excerpted from list detail descriptions found using the CCD links.

We explore details about these two lists of chemicals before diving into analyzing the data contained in each list.

options(width = 100)
ccl4_information <- get_public_chemical_list_by_name('CCL4')
print(ccl4_information, trunc.cols = TRUE)
#>    id    type                                    label visibility
#> 1 443 federal WATER|EPA: Chemical Contaminants - CCL 4     PUBLIC
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             longDescription
#> 1 The Contaminant Candidate List (CCL) is a list of contaminants that, at the time of publication, are not subject to any proposed or promulgated national primary drinking water regulations, but are known or anticipated to occur in public water systems. Contaminants listed on the CCL may require future regulation under the Safe Drinking Water Act (SDWA). EPA announced the <a href='https://www.epa.gov/ccl/contaminant-candidate-list-4-ccl-4-0' target='_blank'>fourth Drinking Water Contaminant Candidate List (CCL 4)</a> on November 17, 2016. The CCL 4 includes 97 chemicals or chemical groups and 12 microbial contaminants. The group of cyanotoxins on CCL 4 includes, but is not limited to: anatoxin-a, cylindrospermopsin, microcystins, and saxitoxin. The CCL Chemical Candidate Lists are versioned iteratively and this description navigates between the various versions of the lists. The list of substances displayed below represents only the chemical CCL 4 contaminants. For the versioned lists, please use the hyperlinked lists below.<br/><br/> \r\n\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL5' target='_blank'>CCL5 - November 2022</a> <br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL4' target='_blank'>CCL4 - November 2016</a> \r\n This list<br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL3' target='_blank'>CCL3 - October 2009</a> <br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL2' target='_blank'>CCL2 - February 2005</a><br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL1' target='_blank'>CCL1 - March 1998</a><br/><br/> 
#>   chemicalCount            createdAt            updatedAt listName
#> 1           100 2017-12-28T17:58:36Z 2022-10-26T21:14:27Z     CCL4
#>                                                                                                                                              shortDescription
#> 1 The Contaminant Candidate List (CCL) is a list of contaminants that are known or anticipated to occur in public water systems. Version 4 is known as CCL 4.

natadb_information <- get_public_chemical_list_by_name('NATADB')
print(natadb_information, trunc.cols = TRUE)
#>    id    type                                            label visibility
#> 1 454 federal EPA: National-Scale Air Toxics Assessment (NATA)     PUBLIC
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              longDescription
#> 1 The National-Scale Air Toxics Assessment (NATA) is EPA's ongoing comprehensive evaluation of air toxics in the United States. EPA developed the NATA as a state-of-the-science screening tool for State/Local/Tribal Agencies to prioritize pollutants, emission sources and locations of interest for further study in order to gain a better understanding of risks.  NATA assessments do not incorporate refined information about emission sources but, rather, use general information about sources to develop estimates of risks which are more likely to overestimate impacts than underestimate them.\r\n\r\nNATA provides estimates of the risk of cancer and other serious health effects from breathing (inhaling) air toxics in order to inform both national and more localized efforts to identify and prioritize air toxics, emission source types and locations which are of greatest potential concern in terms of contributing to population risk.  This in turn helps air pollution experts focus limited analytical resources on areas and or populations where the potential for health risks are highest.  Assessments include estimates of cancer and non-cancer health effects based on chronic exposure from outdoor sources, including assessments of non-cancer health effects for Diesel Particulate Matter (PM). Assessments provide a snapshot of the outdoor air quality and the risks to human health that would result if air toxic emissions levels remained unchanged.
#>   listName chemicalCount            createdAt            updatedAt
#> 1   NATADB           163 2018-02-21T12:04:16Z 2018-11-16T21:42:01Z
#>                                                                                                                shortDescription
#> 1 The National-Scale Air Toxics Assessment (NATA) is EPA's ongoing comprehensive evaluation of air toxics in the United States.

Now we pull the actual chemicals contained in the lists using the APIs.

ccl4 <- get_chemicals_in_list('ccl4')
ccl4 <- data.table::as.data.table(ccl4)

natadb <- get_chemicals_in_list('NATADB')
natadb <- data.table::as.data.table(natadb)

We examine the dimensions of the data, the column names, and display a single row for illustrative purposes.

dim(ccl4)
#> [1] 100  37
dim(natadb)
#> [1] 163  37

colnames(ccl4)
#>  [1] "id"                    "cpdataCount"           "inchikey"              "wikipediaArticle"     
#>  [5] "monoisotopicMass"      "percentAssays"         "pubchemCount"          "pubmedCount"          
#>  [9] "sourcesCount"          "qcLevel"               "qcLevelDesc"           "isotope"              
#> [13] "multicomponent"        "totalAssays"           "pubchemCid"            "relatedSubstanceCount"
#> [17] "relatedStructureCount" "casrn"                 "compoundId"            "genericSubstanceId"   
#> [21] "preferredName"         "activeAssays"          "molFormula"            "hasStructureImage"    
#> [25] "iupacName"             "smiles"                "inchiString"           "qcNotes"              
#> [29] "qsarReadySmiles"       "msReadySmiles"         "irisLink"              "pprtvLink"            
#> [33] "descriptorStringTsv"   "isMarkush"             "dtxsid"                "dtxcid"               
#> [37] "toxcastSelect"
head(ccl4, 1)
#>        id cpdataCount inchikey wikipediaArticle monoisotopicMass percentAssays pubchemCount
#>    <char>       <int>   <char>           <char>            <num>         <num>        <int>
#> 1: 627129          NA     <NA>             <NA>               NA            NA           NA
#> 30 variable(s) not shown: [pubmedCount <num>, sourcesCount <int>, qcLevel <int>, qcLevelDesc <char>, isotope <int>, multicomponent <int>, totalAssays <int>, pubchemCid <int>, relatedSubstanceCount <int>, relatedStructureCount <int>, ...]

Accessing the Physico-chemical Property Data

Once we have the chemicals in each list, we access their physico-chemical properties. We will use the batch search forms of the function get_chem_info(), to which we supply a list of DTXSIDs.

ccl4$dtxsid
#>   [1] "DTXSID001024118" "DTXSID0020153"   "DTXSID0020446"   "DTXSID0020573"   "DTXSID0020600"  
#>   [6] "DTXSID0020814"   "DTXSID0021464"   "DTXSID0021541"   "DTXSID0021917"   "DTXSID0024052"  
#>  [11] "DTXSID0024341"   "DTXSID0032578"   "DTXSID1020437"   "DTXSID1021407"   "DTXSID1021409"  
#>  [16] "DTXSID1021740"   "DTXSID1021798"   "DTXSID1024174"   "DTXSID1024207"   "DTXSID1024338"  
#>  [21] "DTXSID1026164"   "DTXSID1031040"   "DTXSID1037484"   "DTXSID1037486"   "DTXSID1037567"  
#>  [26] "DTXSID2020684"   "DTXSID2021028"   "DTXSID2021317"   "DTXSID2021731"   "DTXSID2022333"  
#>  [31] "DTXSID2024169"   "DTXSID2031083"   "DTXSID2037506"   "DTXSID2040282"   "DTXSID2052156"  
#>  [36] "DTXSID3020203"   "DTXSID3020702"   "DTXSID3020833"   "DTXSID3020964"   "DTXSID3021857"  
#>  [41] "DTXSID3024366"   "DTXSID3024869"   "DTXSID3031864"   "DTXSID3032464"   "DTXSID3034458"  
#>  [46] "DTXSID3042219"   "DTXSID3073137"   "DTXSID3074313"   "DTXSID4020533"   "DTXSID4021503"  
#>  [51] "DTXSID4022361"   "DTXSID4022367"   "DTXSID4022448"   "DTXSID4022991"   "DTXSID4032611"  
#>  [56] "DTXSID4034948"   "DTXSID5020023"   "DTXSID5020576"   "DTXSID5020601"   "DTXSID5021207"  
#>  [61] "DTXSID5024182"   "DTXSID5039224"   "DTXSID50867064"  "DTXSID6020301"   "DTXSID6020856"  
#>  [66] "DTXSID6021030"   "DTXSID6021032"   "DTXSID6022422"   "DTXSID6024177"   "DTXSID6037483"  
#>  [71] "DTXSID6037485"   "DTXSID6037568"   "DTXSID7020005"   "DTXSID7020215"   "DTXSID7020637"  
#>  [76] "DTXSID7021029"   "DTXSID7024241"   "DTXSID7047433"   "DTXSID8020044"   "DTXSID8020090"  
#>  [81] "DTXSID8020597"   "DTXSID8020832"   "DTXSID8021062"   "DTXSID8022292"   "DTXSID8022377"  
#>  [86] "DTXSID8023846"   "DTXSID8023848"   "DTXSID8025541"   "DTXSID8031865"   "DTXSID8052483"  
#>  [91] "DTXSID9020243"   "DTXSID9021390"   "DTXSID9021427"   "DTXSID9022366"   "DTXSID9023380"  
#>  [96] "DTXSID9023914"   "DTXSID9024142"   "DTXSID9032113"   "DTXSID9032119"   "DTXSID9032329"
natadb$dtxsid
#>   [1] "DTXSID0020153"  "DTXSID0020448"  "DTXSID0020523"  "DTXSID0020529"  "DTXSID0020600" 
#>   [6] "DTXSID0020868"  "DTXSID0021381"  "DTXSID0021383"  "DTXSID0021541"  "DTXSID0021834" 
#>  [11] "DTXSID0021917"  "DTXSID0021965"  "DTXSID0024187"  "DTXSID0024260"  "DTXSID0039227" 
#>  [16] "DTXSID0039229"  "DTXSID00872421" "DTXSID1020148"  "DTXSID1020273"  "DTXSID1020302" 
#>  [21] "DTXSID1020306"  "DTXSID1020431"  "DTXSID1020437"  "DTXSID1020512"  "DTXSID1020516" 
#>  [26] "DTXSID1020566"  "DTXSID1021374"  "DTXSID1021798"  "DTXSID1021827"  "DTXSID1022057" 
#>  [31] "DTXSID1023786"  "DTXSID1024045"  "DTXSID1024382"  "DTXSID1026164"  "DTXSID1049641" 
#>  [36] "DTXSID10872417" "DTXSID2020137"  "DTXSID2020262"  "DTXSID2020507"  "DTXSID2020682" 
#>  [41] "DTXSID2020688"  "DTXSID2020711"  "DTXSID2020844"  "DTXSID2021105"  "DTXSID2021157" 
#>  [46] "DTXSID2021159"  "DTXSID2021284"  "DTXSID2021286"  "DTXSID2021319"  "DTXSID2021446" 
#>  [51] "DTXSID2021658"  "DTXSID2021731"  "DTXSID2021781"  "DTXSID2021993"  "DTXSID3020203" 
#>  [56] "DTXSID3020257"  "DTXSID3020413"  "DTXSID3020415"  "DTXSID3020596"  "DTXSID3020679" 
#>  [61] "DTXSID3020702"  "DTXSID3020833"  "DTXSID3020964"  "DTXSID3021431"  "DTXSID3021932" 
#>  [66] "DTXSID3022455"  "DTXSID3024366"  "DTXSID3025091"  "DTXSID3039242"  "DTXSID30872414"
#>  [71] "DTXSID30872419" "DTXSID4020161"  "DTXSID4020298"  "DTXSID4020402"  "DTXSID4020533" 
#>  [76] "DTXSID4020583"  "DTXSID4020874"  "DTXSID4020901"  "DTXSID4021006"  "DTXSID4021056" 
#>  [81] "DTXSID4021395"  "DTXSID4024143"  "DTXSID4024359"  "DTXSID4039231"  "DTXSID40872425"
#>  [86] "DTXSID5020023"  "DTXSID5020027"  "DTXSID5020029"  "DTXSID5020071"  "DTXSID5020316" 
#>  [91] "DTXSID5020449"  "DTXSID5020491"  "DTXSID5020601"  "DTXSID5020607"  "DTXSID5020865" 
#>  [96] "DTXSID5021124"  "DTXSID5021207"  "DTXSID5021380"  "DTXSID5021386"  "DTXSID5021889" 
#> [101] "DTXSID5024055"  "DTXSID5024059"  "DTXSID5024267"  "DTXSID5039224"  "DTXSID6020145" 
#> [106] "DTXSID6020307"  "DTXSID6020353"  "DTXSID6020432"  "DTXSID6020438"  "DTXSID6020515" 
#> [111] "DTXSID6020569"  "DTXSID6020981"  "DTXSID6021828"  "DTXSID6022422"  "DTXSID6023947" 
#> [116] "DTXSID6023949"  "DTXSID7020005"  "DTXSID7020009"  "DTXSID7020267"  "DTXSID7020637" 
#> [121] "DTXSID7020683"  "DTXSID7020687"  "DTXSID7020689"  "DTXSID7020710"  "DTXSID7020716" 
#> [126] "DTXSID7021029"  "DTXSID7021100"  "DTXSID7021106"  "DTXSID7021318"  "DTXSID7021360" 
#> [131] "DTXSID7021368"  "DTXSID7021948"  "DTXSID7023984"  "DTXSID7024166"  "DTXSID7024370" 
#> [136] "DTXSID7024532"  "DTXSID7025180"  "DTXSID7026156"  "DTXSID8020090"  "DTXSID8020173" 
#> [141] "DTXSID8020250"  "DTXSID8020597"  "DTXSID8020599"  "DTXSID8020759"  "DTXSID8020832" 
#> [146] "DTXSID8020913"  "DTXSID8021195"  "DTXSID8021197"  "DTXSID8021432"  "DTXSID8021434" 
#> [151] "DTXSID8021438"  "DTXSID8024286"  "DTXSID8042476"  "DTXSID9020168"  "DTXSID9020243" 
#> [156] "DTXSID9020247"  "DTXSID9020293"  "DTXSID9020299"  "DTXSID9020827"  "DTXSID9021138" 
#> [161] "DTXSID9021261"  "DTXSID9041522"  "DTXSID90872415"

ccl4_phys_chem <- get_chem_info_batch(ccl4$dtxsid)
#> Warning in get_chem_info_batch(ccl4$dtxsid): Setting type to ''!
natadb_phys_chem <- get_chem_info_batch(natadb$dtxsid)
#> Warning in get_chem_info_batch(natadb$dtxsid): Setting type to ''!

Observe that this returns a single data.table for each query, and the data.table contains the physico-chemical properties available from the CompTox Chemicals Dashboard for each chemical in the query. Note, a warning message was triggered, Warning: Setting type to ''!, which indicates the the parameter type was not given a value. A default value is set within the function and more information can be found in the associated documentation. We examine the set of physico-chemical properties for the first chemical in CCL4.

Before any deeper analysis, let’s take a look at the dimensions of the data and the column names.

dim(ccl4_phys_chem)
#> [1] 3641   10
colnames(ccl4_phys_chem)
#>  [1] "name"        "value"       "id"          "source"      "description" "propType"   
#>  [7] "unit"        "propertyId"  "dtxsid"      "dtxcid"

Next, we display the unique values for the columns propertyID and propType.

ccl4_phys_chem[, unique(propertyId)]
#>  [1] "boiling-point"        "logkow-octanol-water" "melting-point"        "vapor-pressure"      
#>  [5] "water-solubility"     "density"              "flash-point"          "henrys-law"          
#>  [9] "index-of-refraction"  "logkoa-octanol-air"   "molar-refractivity"   "molar-volume"        
#> [13] "polarizability"       "surface-tension"      "thermal-conductivity" "viscosity"           
#> [17] "pka-acidic-apparent"  "pka-basic-apparent"
ccl4_phys_chem[, unique(propType)]
#> [1] "experimental" "predicted"

Let’s explore this further by examining the mean of the “boiling-point” and “melting-point” data.

ccl4_phys_chem[propertyId == 'boiling-point', .(Mean = mean(value))]
#>        Mean
#>       <num>
#> 1: 253.7974
ccl4_phys_chem[propertyId == 'boiling-point', .(Mean = mean(value)),
               by = .(propType)]
#>        propType     Mean
#>          <char>    <num>
#> 1: experimental 250.5943
#> 2:    predicted 255.5472

ccl4_phys_chem[propertyId == 'melting-point', .(Mean = mean(value))]
#>        Mean
#>       <num>
#> 1: 36.14033
ccl4_phys_chem[propertyId == 'melting-point', .(Mean = mean(value)),
               by = .(propType)]
#>        propType     Mean
#>          <char>    <num>
#> 1: experimental 23.64972
#> 2:    predicted 51.93584

These results tell us about some of the reported physico-chemical properties of the data sets.

Answer to Environmental Health Question 1

With this, we can answer Environmental Health Question 1: After automatically pulling the fourth Drinking Water Contaminant Candidate List from the CompTox Chemicals Dashboard, list the properties and property types present in the data. What are the mean values for a specific property when grouped by property type and when ungrouped?

Answer: The mean “boiling-point” is 253.7974 degrees Celsius for CCL4, with mean values of 250.5943 and 255.5472 for experimental and predicted, respectively. The mean “melting-point” is 36.14033 degrees Celsius for CCL4, with mean values of 23.64972 and 51.93584 for experimental and predicted, respectively.

To explore all the values of the physico-chemical properties and calculate their means, we can do the following procedure. First we look at all the physico-chemical properties individually, then group them by each property (“boiling-point”, “melting-point”, etc…), and then additionally group those by property type (“experimental” vs “predicted”). In the grouping, we look at the columns value, unit, propertyID and propType. We also demonstrate how take the mean of the values for each grouping. We examine the chemical with DTXSID “DTXSID0020153” from CCL4.

head(ccl4_phys_chem[dtxsid == 'DTXSID0020153', ])
#>                     name   value       id                    source               description
#>                   <char>   <num>    <int>                    <char>                    <char>
#> 1:         Boiling Point 179.000 21943219 SynQuest Labs (Chemical c SynQuest has a focus on f
#> 2:         Boiling Point 178.889 23836860                     NIOSH The NIOSH Pocket Guide to
#> 3:         Boiling Point 179.000 11959465              PhysPropNCCT The PHYSPROP data sets ar
#> 4:         Boiling Point 179.000 15585758 Biosynth (Chemical compan Biosynth  produces reagen
#> 5:         Boiling Point 179.000  8401215 Alfa Aesar (Chemical comp Alfa Aesar is a leading i
#> 6: LogKow: Octanol-Water   2.300 17007695              PhysPropNCCT The PHYSPROP data sets ar
#> 5 variable(s) not shown: [propType <char>, unit <char>, propertyId <char>, dtxsid <char>, dtxcid <char>]
ccl4_phys_chem[dtxsid == 'DTXSID0020153', .(propType, value, unit),
               by = .(propertyId)]
#>               propertyId     propType        value        unit
#>                   <char>       <char>        <num>      <char>
#>  1:        boiling-point experimental 179.00000000          °C
#>  2:        boiling-point experimental 178.88900000          °C
#>  3:        boiling-point experimental 179.00000000          °C
#>  4:        boiling-point experimental 179.00000000          °C
#>  5:        boiling-point experimental 179.00000000          °C
#>  6:        boiling-point    predicted 178.47300000          °C
#>  7:        boiling-point    predicted 179.39900000          °C
#>  8:        boiling-point    predicted 178.70000000          °C
#>  9:        boiling-point    predicted 184.18000000          °C
#> 10: logkow-octanol-water experimental   2.30000000        <NA>
#> 11: logkow-octanol-water    predicted   2.63800000        <NA>
#> 12: logkow-octanol-water    predicted   2.30213000        <NA>
#> 13: logkow-octanol-water    predicted   2.48700000        <NA>
#> 14: logkow-octanol-water    predicted   2.79000000        <NA>
#> 15:        melting-point experimental -39.20000000          °C
#> 16:        melting-point experimental -43.00000000          °C
#> 17:        melting-point experimental -45.00000000          °C
#> 18:        melting-point experimental -38.88890000          °C
#> 19:        melting-point experimental -45.00000000          °C
#> 20:        melting-point experimental -43.00000000          °C
#> 21:        melting-point experimental -43.00000000          °C
#> 22:        melting-point    predicted -26.57100000          °C
#> 23:        melting-point    predicted -44.01140000          °C
#> 24:        melting-point    predicted -27.15000000          °C
#> 25:       vapor-pressure experimental   1.22999000        mmHg
#> 26:       vapor-pressure    predicted   1.23752000        mmHg
#> 27:       vapor-pressure    predicted   1.99986000        mmHg
#> 28:       vapor-pressure    predicted   1.27700000        mmHg
#> 29:     water-solubility experimental   0.00407380       mol/L
#> 30:     water-solubility experimental   0.00955000       mol/L
#> 31:     water-solubility experimental   0.00417000       mol/L
#> 32:     water-solubility experimental   0.00407380       mol/L
#> 33:     water-solubility experimental   0.00414763       mol/L
#> 34:     water-solubility    predicted   0.00300608       mol/L
#> 35:     water-solubility    predicted   0.00813715       mol/L
#> 36:     water-solubility    predicted   0.00484108       mol/L
#> 37:     water-solubility    predicted   0.00100000       mol/L
#> 38:              density    predicted   1.08100000      g/cm^3
#> 39:              density    predicted   1.10000000      g/cm^3
#> 40:          flash-point    predicted  73.88900000          °C
#> 41:          flash-point    predicted  73.39500000          °C
#> 42:           henrys-law    predicted   0.00291872 atm-m3/mole
#> 43:  index-of-refraction    predicted   1.52700000        <NA>
#> 44:   logkoa-octanol-air    predicted   4.15646000        <NA>
#> 45:   molar-refractivity    predicted  36.01800000        cm^3
#> 46:         molar-volume    predicted 117.12900000        cm^3
#> 47:       polarizability    predicted  14.27900000         Å^3
#> 48:      surface-tension    predicted  34.66800000      dyn/cm
#> 49:      surface-tension    predicted  33.85300000      dyn/cm
#> 50: thermal-conductivity    predicted 131.28800000    mW/(m*K)
#> 51:            viscosity    predicted   1.40281000          cP
#>               propertyId     propType        value        unit
ccl4_phys_chem[dtxsid == 'DTXSID0020153', .(value, unit), 
               by = .(propertyId, propType)]
#>               propertyId     propType        value        unit
#>                   <char>       <char>        <num>      <char>
#>  1:        boiling-point experimental 179.00000000          °C
#>  2:        boiling-point experimental 178.88900000          °C
#>  3:        boiling-point experimental 179.00000000          °C
#>  4:        boiling-point experimental 179.00000000          °C
#>  5:        boiling-point experimental 179.00000000          °C
#>  6: logkow-octanol-water experimental   2.30000000        <NA>
#>  7:        melting-point experimental -39.20000000          °C
#>  8:        melting-point experimental -43.00000000          °C
#>  9:        melting-point experimental -45.00000000          °C
#> 10:        melting-point experimental -38.88890000          °C
#> 11:        melting-point experimental -45.00000000          °C
#> 12:        melting-point experimental -43.00000000          °C
#> 13:        melting-point experimental -43.00000000          °C
#> 14:       vapor-pressure experimental   1.22999000        mmHg
#> 15:     water-solubility experimental   0.00407380       mol/L
#> 16:     water-solubility experimental   0.00955000       mol/L
#> 17:     water-solubility experimental   0.00417000       mol/L
#> 18:     water-solubility experimental   0.00407380       mol/L
#> 19:     water-solubility experimental   0.00414763       mol/L
#> 20:        boiling-point    predicted 178.47300000          °C
#> 21:        boiling-point    predicted 179.39900000          °C
#> 22:        boiling-point    predicted 178.70000000          °C
#> 23:        boiling-point    predicted 184.18000000          °C
#> 24:              density    predicted   1.08100000      g/cm^3
#> 25:              density    predicted   1.10000000      g/cm^3
#> 26:          flash-point    predicted  73.88900000          °C
#> 27:          flash-point    predicted  73.39500000          °C
#> 28:           henrys-law    predicted   0.00291872 atm-m3/mole
#> 29:  index-of-refraction    predicted   1.52700000        <NA>
#> 30:   logkoa-octanol-air    predicted   4.15646000        <NA>
#> 31: logkow-octanol-water    predicted   2.63800000        <NA>
#> 32: logkow-octanol-water    predicted   2.30213000        <NA>
#> 33: logkow-octanol-water    predicted   2.48700000        <NA>
#> 34: logkow-octanol-water    predicted   2.79000000        <NA>
#> 35:        melting-point    predicted -26.57100000          °C
#> 36:        melting-point    predicted -44.01140000          °C
#> 37:        melting-point    predicted -27.15000000          °C
#> 38:   molar-refractivity    predicted  36.01800000        cm^3
#> 39:         molar-volume    predicted 117.12900000        cm^3
#> 40:       polarizability    predicted  14.27900000         Å^3
#> 41:      surface-tension    predicted  34.66800000      dyn/cm
#> 42:      surface-tension    predicted  33.85300000      dyn/cm
#> 43: thermal-conductivity    predicted 131.28800000    mW/(m*K)
#> 44:       vapor-pressure    predicted   1.23752000        mmHg
#> 45:       vapor-pressure    predicted   1.99986000        mmHg
#> 46:       vapor-pressure    predicted   1.27700000        mmHg
#> 47:            viscosity    predicted   1.40281000          cP
#> 48:     water-solubility    predicted   0.00300608       mol/L
#> 49:     water-solubility    predicted   0.00813715       mol/L
#> 50:     water-solubility    predicted   0.00484108       mol/L
#> 51:     water-solubility    predicted   0.00100000       mol/L
#>               propertyId     propType        value        unit

ccl4_phys_chem[dtxsid == 'DTXSID0020153', .(Mean_value = sapply(.SD, mean)),
               by = .(propertyId, unit), .SDcols = c("value")]
#>               propertyId        unit    Mean_value
#>                   <char>      <char>         <num>
#>  1:        boiling-point          °C 179.515666667
#>  2: logkow-octanol-water        <NA>   2.503426000
#>  3:        melting-point          °C -39.482130000
#>  4:       vapor-pressure        mmHg   1.436092500
#>  5:     water-solubility       mol/L   0.004777727
#>  6:              density      g/cm^3   1.090500000
#>  7:          flash-point          °C  73.642000000
#>  8:           henrys-law atm-m3/mole   0.002918720
#>  9:  index-of-refraction        <NA>   1.527000000
#> 10:   logkoa-octanol-air        <NA>   4.156460000
#> 11:   molar-refractivity        cm^3  36.018000000
#> 12:         molar-volume        cm^3 117.129000000
#> 13:       polarizability         Å^3  14.279000000
#> 14:      surface-tension      dyn/cm  34.260500000
#> 15: thermal-conductivity    mW/(m*K) 131.288000000
#> 16:            viscosity          cP   1.402810000
ccl4_phys_chem[dtxsid == 'DTXSID0020153', .(Mean_value = sapply(.SD, mean)), 
               by = .(propertyId, unit, propType), 
               .SDcols = c("value")][order(propertyId)]
#>               propertyId        unit     propType    Mean_value
#>                   <char>      <char>       <char>         <num>
#>  1:        boiling-point          °C experimental 178.977800000
#>  2:        boiling-point          °C    predicted 180.188000000
#>  3:              density      g/cm^3    predicted   1.090500000
#>  4:          flash-point          °C    predicted  73.642000000
#>  5:           henrys-law atm-m3/mole    predicted   0.002918720
#>  6:  index-of-refraction        <NA>    predicted   1.527000000
#>  7:   logkoa-octanol-air        <NA>    predicted   4.156460000
#>  8: logkow-octanol-water        <NA> experimental   2.300000000
#>  9: logkow-octanol-water        <NA>    predicted   2.554282500
#> 10:        melting-point          °C experimental -42.441271429
#> 11:        melting-point          °C    predicted -32.577466667
#> 12:   molar-refractivity        cm^3    predicted  36.018000000
#> 13:         molar-volume        cm^3    predicted 117.129000000
#> 14:       polarizability         Å^3    predicted  14.279000000
#> 15:      surface-tension      dyn/cm    predicted  34.260500000
#> 16: thermal-conductivity    mW/(m*K)    predicted 131.288000000
#> 17:       vapor-pressure        mmHg experimental   1.229990000
#> 18:       vapor-pressure        mmHg    predicted   1.504793333
#> 19:            viscosity          cP    predicted   1.402810000
#> 20:     water-solubility       mol/L experimental   0.005203046
#> 21:     water-solubility       mol/L    predicted   0.004246077
#>               propertyId        unit     propType    Mean_value

Analyzing and Visualizing Physico-chemical Properties from Two Environmental Contaminant Lists

We consider exploring the differences in mean predicted and experimental values for a variety of physico-chemical properties in an effort to understand better the CCL4 and NATADB lists. In particular, we examine “vapor-pressure”, “henrys-law”, and “boiling-point” and plot the means by chemical for these using boxplots. We then compare the values by grouping by both data set and propType value.

We first examine the vapor pressures for all the chemicals in each list. We then graph these, grouped by propType and pooled together in separate plots. For this we will use boxplots.

Group first by DTXSID.

ccl4_vapor_all <- ccl4_phys_chem[propertyId %in% 'vapor-pressure', 
                                 .(mean_vapor_pressure = sapply(.SD, mean)), 
                                 .SDcols = c('value'), by = .(dtxsid)]
natadb_vapor_all <- natadb_phys_chem[propertyId %in% 'vapor-pressure', 
                                     .(mean_vapor_pressure = sapply(.SD, mean)),
                                     .SDcols = c('value'), by = .(dtxsid)]

Then group by DTXSID and then by property type.

ccl4_vapor_grouped <- ccl4_phys_chem[propertyId %in% 'vapor-pressure', 
                                     .(mean_vapor_pressure = sapply(.SD, mean)),
                                     .SDcols = c('value'), 
                                     by = .(dtxsid, propType)]
natadb_vapor_grouped <- natadb_phys_chem[propertyId %in% 'vapor-pressure', 
                                         .(mean_vapor_pressure = 
                                             sapply(.SD, mean)), 
                                         .SDcols = c('value'), 
                                         by = .(dtxsid, propType)]

Then examine the summary statistics of the data.

summary(ccl4_vapor_all)
#>     dtxsid          mean_vapor_pressure
#>  Length:88          Min.   :   0.000   
#>  Class :character   1st Qu.:   0.000   
#>  Mode  :character   Median :   0.001   
#>                     Mean   : 211.735   
#>                     3rd Qu.:   5.009   
#>                     Max.   :6007.244
summary(ccl4_vapor_grouped)
#>     dtxsid            propType         mean_vapor_pressure
#>  Length:147         Length:147         Min.   :   0.000   
#>  Class :character   Class :character   1st Qu.:   0.000   
#>  Mode  :character   Mode  :character   Median :   0.060   
#>                                        Mean   : 252.599   
#>                                        3rd Qu.:   9.851   
#>                                        Max.   :7252.700
summary(natadb_vapor_all)
#>     dtxsid          mean_vapor_pressure
#>  Length:151         Min.   :    0.00   
#>  Class :character   1st Qu.:    0.01   
#>  Mode  :character   Median :    1.44   
#>                     Mean   :  851.70   
#>                     3rd Qu.:  116.65   
#>                     Max.   :33917.88
summary(natadb_vapor_grouped)
#>     dtxsid            propType         mean_vapor_pressure
#>  Length:274         Length:274         Min.   :    0.00   
#>  Class :character   Class :character   1st Qu.:    0.01   
#>  Mode  :character   Mode  :character   Median :    1.68   
#>                                        Mean   :  587.02   
#>                                        3rd Qu.:  112.59   
#>                                        Max.   :33917.88

With such a large range of values covering several orders of magnitude, we log transform the data. Since these value are positive, we do not have to worry about illegal transformations.

ccl4_vapor_all[, log_transform_mean_vapor_pressure := log(mean_vapor_pressure)]
#>             dtxsid mean_vapor_pressure log_transform_mean_vapor_pressure
#>             <char>               <num>                             <num>
#>  1:  DTXSID0020153        1.436092e+00                         0.3619259
#>  2:  DTXSID0020446        1.211530e-06                       -13.6236263
#>  3:  DTXSID0020573        1.016181e-08                       -18.4046290
#>  4:  DTXSID0020600        1.060440e+03                         6.9664394
#>  5:  DTXSID0020814        1.713402e-08                       -17.8821999
#>  6:  DTXSID0021464        1.020300e+01                         2.3226818
#>  7:  DTXSID0021541        3.623345e+03                         8.1951530
#>  8:  DTXSID0021917        1.447043e+02                         4.9746920
#>  9:  DTXSID0024052        2.579787e-07                       -15.1703889
#> 10:  DTXSID0032578        3.546976e-05                       -10.2468302
#> 11:  DTXSID1020437        2.153143e+02                         5.3720986
#> 12:  DTXSID1021407        9.093477e-04                        -7.0027830
#> 13:  DTXSID1021409        4.170000e-08                       -16.9927647
#> 14:  DTXSID1021740        6.808950e+00                         1.9182379
#> 15:  DTXSID1021798        5.779040e-02                        -2.8509326
#> 16:  DTXSID1024174        5.616588e-06                       -12.0897863
#> 17:  DTXSID1024338        7.125250e-08                       -16.4570359
#> 18:  DTXSID1026164        2.444570e-01                        -1.4087159
#> 19:  DTXSID1037484        4.124746e-07                       -14.7010911
#> 20:  DTXSID1037486        4.206599e-07                       -14.6814412
#> 21:  DTXSID1037567        4.633640e-08                       -16.8873380
#> 22:  DTXSID2020684        3.733735e-03                        -5.5903463
#> 23:  DTXSID2021028        1.127656e+00                         0.1201408
#> 24:  DTXSID2021317        1.563302e+01                         2.7493857
#> 25:  DTXSID2021731        2.360838e+02                         5.4641866
#> 26:  DTXSID2022333        1.685188e+00                         0.5218768
#> 27:  DTXSID2037506        8.393291e-06                       -11.6880778
#> 28:  DTXSID2052156        3.878108e-09                       -19.3679185
#> 29:  DTXSID3020203        1.876293e+03                         7.5370534
#> 30:  DTXSID3020702        2.074800e+01                         3.0324499
#> 31:  DTXSID3020833        2.420272e+02                         5.4890503
#> 32:  DTXSID3020964        2.336795e-01                        -1.4538048
#> 33:  DTXSID3024366        5.501190e+01                         4.0075495
#> 34:  DTXSID3024869        1.885188e-02                        -3.9711425
#> 35:  DTXSID3031864        2.479965e-06                       -12.9072661
#> 36:  DTXSID3032464        1.727766e-06                       -13.2686813
#> 37:  DTXSID3034458        6.440000e-11                       -23.4659075
#> 38:  DTXSID3042219        3.106110e+00                         1.1333711
#> 39:  DTXSID3074313        1.481081e-11                       -24.9356641
#> 40:  DTXSID4020533        4.036368e+01                         3.6979302
#> 41:  DTXSID4021503        1.722775e+02                         5.1491065
#> 42:  DTXSID4022361        3.226160e-06                       -12.6442179
#> 43:  DTXSID4022367        1.643378e-08                       -17.9239267
#> 44:  DTXSID4022448        1.773129e-05                       -10.9401797
#> 45:  DTXSID4022991        1.440972e-10                       -22.6605333
#> 46:  DTXSID4032611        5.263403e-04                        -7.5495627
#> 47:  DTXSID4034948        3.744327e-08                       -17.1004389
#> 48:  DTXSID5020023        2.545125e+02                         5.5393500
#> 49:  DTXSID5020576        5.737673e-09                       -18.9762122
#> 50:  DTXSID5020601        1.428200e+00                         0.3564147
#> 51:  DTXSID5021207        4.463567e+02                         6.1011185
#> 52:  DTXSID5024182        7.171160e+00                         1.9700674
#> 53:  DTXSID5039224        7.963770e+02                         6.6800727
#> 54: DTXSID50867064        1.522060e-03                        -6.4876906
#> 55:  DTXSID6020301        6.007244e+03                         8.7007213
#> 56:  DTXSID6020856        2.909907e-01                        -1.2344638
#> 57:  DTXSID6021030        3.562765e-05                       -10.2423885
#> 58:  DTXSID6021032        8.306000e-01                        -0.1856069
#> 59:  DTXSID6022422        1.921524e-06                       -13.1623921
#> 60:  DTXSID6024177        7.035691e-02                        -2.6541743
#> 61:  DTXSID6037483        4.847560e-08                       -16.8422053
#> 62:  DTXSID6037485        4.976030e-08                       -16.8160484
#> 63:  DTXSID6037568        2.875770e-07                       -15.0617752
#> 64:  DTXSID7020005        6.776980e-02                        -2.6916386
#> 65:  DTXSID7020637        1.836866e+03                         7.5158163
#> 66:  DTXSID7021029        3.323853e+00                         1.2011248
#> 67:  DTXSID7024241        2.031631e-06                       -13.1066715
#> 68:  DTXSID7047433        1.073489e-08                       -18.3497663
#> 69:  DTXSID8020044        2.447645e+01                         3.1977114
#> 70:  DTXSID8020090        5.073765e-01                        -0.6785019
#> 71:  DTXSID8020597        2.037895e-01                        -1.5906676
#> 72:  DTXSID8020832        1.520093e+03                         7.3265271
#> 73:  DTXSID8021062        1.180510e-01                        -2.1366388
#> 74:  DTXSID8022292        1.528695e-08                       -17.9962663
#> 75:  DTXSID8022377        1.016181e-08                       -18.4046290
#> 76:  DTXSID8023846        1.702425e-06                       -13.2834569
#> 77:  DTXSID8023848        8.277807e-06                       -11.7019325
#> 78:  DTXSID8025541        1.949228e-05                       -10.8454918
#> 79:  DTXSID8031865        8.386721e-01                        -0.1759355
#> 80:  DTXSID9020243        1.186626e-04                        -9.0392263
#> 81:  DTXSID9021390        4.408660e+00                         1.4835708
#> 82:  DTXSID9021427        3.637282e-01                        -1.0113483
#> 83:  DTXSID9022366        1.506794e-09                       -20.3132814
#> 84:  DTXSID9023380        1.149256e-08                       -18.2815660
#> 85:  DTXSID9023914        3.131080e-04                        -8.0689624
#> 86:  DTXSID9024142        2.739923e-09                       -19.7153359
#> 87:  DTXSID9032113        2.981382e-08                       -17.3282936
#> 88:  DTXSID9032329        7.041032e-07                       -14.1663408
#>             dtxsid mean_vapor_pressure log_transform_mean_vapor_pressure
ccl4_vapor_grouped[, log_transform_mean_vapor_pressure := 
                     log(mean_vapor_pressure)]
#>             dtxsid     propType mean_vapor_pressure log_transform_mean_vapor_pressure
#>             <char>       <char>               <num>                             <num>
#>   1: DTXSID0020153 experimental        1.229990e+00                         0.2070060
#>   2: DTXSID0020153    predicted        1.504793e+00                         0.4086556
#>   3: DTXSID0020446 experimental        6.899220e-08                       -16.4892724
#>   4: DTXSID0020446    predicted        1.592376e-06                       -13.3502831
#>   5: DTXSID0020573    predicted        1.016181e-08                       -18.4046290
#>  ---                                                                                 
#> 143: DTXSID9024142    predicted        2.059810e-09                       -20.0006521
#> 144: DTXSID9032113 experimental        1.279970e-08                       -18.1738441
#> 145: DTXSID9032113    predicted        3.548520e-08                       -17.1541501
#> 146: DTXSID9032329 experimental        8.000180e-07                       -14.0386316
#> 147: DTXSID9032329    predicted        6.721317e-07                       -14.2128116

natadb_vapor_all[, log_transform_mean_vapor_pressure := 
                   log(mean_vapor_pressure)]
#>             dtxsid mean_vapor_pressure log_transform_mean_vapor_pressure
#>             <char>               <num>                             <num>
#>   1: DTXSID0020153        1.436092e+00                         0.3619259
#>   2: DTXSID0020448        4.899910e+01                         3.8918019
#>   3: DTXSID0020523        2.824828e-04                        -8.1718931
#>   4: DTXSID0020529        7.603507e-04                        -7.1817307
#>   5: DTXSID0020600        1.060440e+03                         6.9664394
#>  ---                                                                    
#> 147: DTXSID9020293        3.124725e-02                        -3.4658239
#> 148: DTXSID9020299        1.177018e-06                       -13.6525264
#> 149: DTXSID9020827        2.930390e-06                       -12.7403750
#> 150: DTXSID9021138        3.619153e-03                        -5.6215152
#> 151: DTXSID9041522        6.433997e-05                        -9.6513295
natadb_vapor_grouped[, log_transform_mean_vapor_pressure := 
                       log(mean_vapor_pressure)]
#>             dtxsid     propType mean_vapor_pressure log_transform_mean_vapor_pressure
#>             <char>       <char>               <num>                             <num>
#>   1: DTXSID0020153 experimental        1.229990e+00                         0.2070060
#>   2: DTXSID0020153    predicted        1.504793e+00                         0.4086556
#>   3: DTXSID0020448 experimental        5.329670e+01                         3.9758744
#>   4: DTXSID0020448    predicted        4.756657e+01                         3.8621301
#>   5: DTXSID0020523 experimental        3.900320e-04                        -7.8492818
#>  ---                                                                                 
#> 270: DTXSID9020299 experimental        2.199890e-06                       -13.0271032
#> 271: DTXSID9020299    predicted        8.360607e-07                       -13.9945647
#> 272: DTXSID9020827    predicted        2.930390e-06                       -12.7403750
#> 273: DTXSID9021138    predicted        3.619153e-03                        -5.6215152
#> 274: DTXSID9041522    predicted        6.433997e-05                        -9.6513295

Now we plot the log transformed data.

First plot the CCL4 data.

ggplot(ccl4_vapor_all, aes(log_transform_mean_vapor_pressure)) +
  geom_boxplot() +
  coord_flip()

ggplot(ccl4_vapor_grouped, aes(propType, log_transform_mean_vapor_pressure)) +
  geom_boxplot()

Then plot the NATA data.

ggplot(natadb_vapor_all, aes(log_transform_mean_vapor_pressure)) +
  geom_boxplot() + coord_flip()

ggplot(natadb_vapor_grouped, aes(propType, log_transform_mean_vapor_pressure)) +
  geom_boxplot()

Finally, we compare both sets simultaneously. We add in a column to each data.table denoting to which data set the rows correspond and then combine the rows from both data sets together using the function rbind().

ccl4_vapor_grouped[, set := 'CCL4']
#>             dtxsid     propType mean_vapor_pressure log_transform_mean_vapor_pressure    set
#>             <char>       <char>               <num>                             <num> <char>
#>   1: DTXSID0020153 experimental        1.229990e+00                         0.2070060   CCL4
#>   2: DTXSID0020153    predicted        1.504793e+00                         0.4086556   CCL4
#>   3: DTXSID0020446 experimental        6.899220e-08                       -16.4892724   CCL4
#>   4: DTXSID0020446    predicted        1.592376e-06                       -13.3502831   CCL4
#>   5: DTXSID0020573    predicted        1.016181e-08                       -18.4046290   CCL4
#>  ---                                                                                        
#> 143: DTXSID9024142    predicted        2.059810e-09                       -20.0006521   CCL4
#> 144: DTXSID9032113 experimental        1.279970e-08                       -18.1738441   CCL4
#> 145: DTXSID9032113    predicted        3.548520e-08                       -17.1541501   CCL4
#> 146: DTXSID9032329 experimental        8.000180e-07                       -14.0386316   CCL4
#> 147: DTXSID9032329    predicted        6.721317e-07                       -14.2128116   CCL4
natadb_vapor_grouped[, set := 'NATADB']
#>             dtxsid     propType mean_vapor_pressure log_transform_mean_vapor_pressure    set
#>             <char>       <char>               <num>                             <num> <char>
#>   1: DTXSID0020153 experimental        1.229990e+00                         0.2070060 NATADB
#>   2: DTXSID0020153    predicted        1.504793e+00                         0.4086556 NATADB
#>   3: DTXSID0020448 experimental        5.329670e+01                         3.9758744 NATADB
#>   4: DTXSID0020448    predicted        4.756657e+01                         3.8621301 NATADB
#>   5: DTXSID0020523 experimental        3.900320e-04                        -7.8492818 NATADB
#>  ---                                                                                        
#> 270: DTXSID9020299 experimental        2.199890e-06                       -13.0271032 NATADB
#> 271: DTXSID9020299    predicted        8.360607e-07                       -13.9945647 NATADB
#> 272: DTXSID9020827    predicted        2.930390e-06                       -12.7403750 NATADB
#> 273: DTXSID9021138    predicted        3.619153e-03                        -5.6215152 NATADB
#> 274: DTXSID9041522    predicted        6.433997e-05                        -9.6513295 NATADB

all_vapor_grouped <- rbind(ccl4_vapor_grouped, natadb_vapor_grouped)

Now we plot the combined data. First we color the boxplots based on the property type, with mean log transformed vapor pressure plotted for each data set and property type.

vapor_box <- ggplot(all_vapor_grouped, 
                    aes(set, log_transform_mean_vapor_pressure)) + 
  geom_boxplot(aes(color = propType))
vapor_box

Next we color the boxplots based on the data set.

vapor <- ggplot(all_vapor_grouped, aes(log_transform_mean_vapor_pressure)) +
  geom_boxplot((aes(color = set))) + 
  coord_flip()
vapor

In the plots above, when we graph the data separated both by data set and property type as well as just by data set, we observe the general trend that the NATADB chemicals have a higher mean vapor pressure than the CCL4 chemicals.

We also explore Henry’s Law constant and boiling point in a similar fashion.

Group by DTXSID.

ccl4_hlc_all <- ccl4_phys_chem[propertyId %in% 'henrys-law', 
                               .(mean_hlc = sapply(.SD, mean)), 
                               .SDcols = c('value'), by = .(dtxsid)]
natadb_hlc_all <- natadb_phys_chem[propertyId %in% 'henrys-law', 
                                   .(mean_hlc = sapply(.SD, mean)), 
                                   .SDcols = c('value'), by = .(dtxsid)]

Group by DTXSID and property type.

ccl4_hlc_grouped <- ccl4_phys_chem[propertyId %in% 'henrys-law', 
                                   .(mean_hlc = sapply(.SD, mean)), 
                                   .SDcols = c('value'), 
                                   by = .(dtxsid, propType)]
natadb_hlc_grouped <- natadb_phys_chem[propertyId %in% 'henrys-law', 
                                       .(mean_hlc = sapply(.SD, mean)), 
                                       .SDcols = c('value'), 
                                       by = .(dtxsid, propType)]

Examine summary statistics.

summary(ccl4_hlc_all)
#>     dtxsid             mean_hlc        
#>  Length:84          Min.   :0.0000000  
#>  Class :character   1st Qu.:0.0000000  
#>  Mode  :character   Median :0.0000007  
#>                     Mean   :0.0074491  
#>                     3rd Qu.:0.0000171  
#>                     Max.   :0.4922550
summary(ccl4_hlc_grouped)
#>     dtxsid            propType            mean_hlc        
#>  Length:112         Length:112         Min.   :0.0000000  
#>  Class :character   Class :character   1st Qu.:0.0000000  
#>  Mode  :character   Mode  :character   Median :0.0000020  
#>                                        Mean   :0.0062734  
#>                                        3rd Qu.:0.0001211  
#>                                        Max.   :0.4922550
summary(natadb_hlc_all)
#>     dtxsid             mean_hlc        
#>  Length:145         Min.   :0.0000000  
#>  Class :character   1st Qu.:0.0000002  
#>  Mode  :character   Median :0.0000310  
#>                     Mean   :0.0085716  
#>                     3rd Qu.:0.0011833  
#>                     Max.   :0.4922550
summary(natadb_hlc_grouped)
#>     dtxsid            propType            mean_hlc        
#>  Length:211         Length:211         Min.   :0.0000000  
#>  Class :character   Class :character   1st Qu.:0.0000003  
#>  Mode  :character   Mode  :character   Median :0.0001027  
#>                                        Mean   :0.0073001  
#>                                        3rd Qu.:0.0028065  
#>                                        Max.   :0.4922550

Again, we log transform the data as it is positive and covers several orders of magnitude.

ccl4_hlc_all[, log_transform_mean_hlc := log(mean_hlc)]
#>             dtxsid     mean_hlc log_transform_mean_hlc
#>             <char>        <num>                  <num>
#>  1:  DTXSID0020153 2.918720e-03             -5.8366101
#>  2:  DTXSID0020446 1.527830e-09            -20.2994174
#>  3:  DTXSID0020573 3.748870e-06            -12.4940561
#>  4:  DTXSID0020600 1.479610e-04             -8.8185618
#>  5:  DTXSID0020814 2.049460e-07            -15.4005193
#>  6:  DTXSID0021541 8.825620e-03             -4.7300964
#>  7:  DTXSID0021917 4.922550e-01             -0.7087584
#>  8:  DTXSID0024052 2.338400e-10            -22.1763840
#>  9:  DTXSID0032578 2.362830e-09            -19.8634058
#> 10:  DTXSID1020437 5.639950e-03             -5.1778801
#> 11:  DTXSID1021407 1.768600e-06            -13.2453223
#> 12:  DTXSID1021740 8.824295e-06            -11.6380018
#> 13:  DTXSID1021798 8.664450e-06            -11.6562821
#> 14:  DTXSID1024174 2.964310e-07            -15.0314514
#> 15:  DTXSID1024338 1.646800e-10            -22.5270169
#> 16:  DTXSID1026164 1.977445e-06            -13.1337050
#> 17:  DTXSID1037484 1.196270e-09            -20.5440575
#> 18:  DTXSID1037486 1.185530e-09            -20.5530759
#> 19:  DTXSID1037567 4.339640e-10            -21.5580595
#> 20:  DTXSID2020684 3.194620e-06            -12.6540424
#> 21:  DTXSID2021028 3.662790e-06            -12.5172854
#> 22:  DTXSID2021317 2.493330e-03             -5.9941361
#> 23:  DTXSID2021731 4.553975e-06            -12.2995101
#> 24:  DTXSID2022333 8.026420e-03             -4.8250167
#> 25:  DTXSID2037506 1.313660e-09            -20.4504487
#> 26:  DTXSID2052156 3.785200e-10            -21.6947522
#> 27:  DTXSID3020203 4.129470e-02             -3.1870211
#> 28:  DTXSID3020833 5.902805e-04             -7.4349127
#> 29:  DTXSID3020964 2.389895e-05            -10.6416760
#> 30:  DTXSID3024366 1.485460e-04             -8.8146159
#> 31:  DTXSID3024869 1.776260e-06            -13.2410005
#> 32:  DTXSID3031864 1.803350e-11            -24.7387900
#> 33:  DTXSID3032464 8.835860e-06            -11.6366921
#> 34:  DTXSID3042219 1.047975e-02             -4.5583105
#> 35:  DTXSID3074313 2.046920e-11            -24.6120998
#> 36:  DTXSID4020533 4.847905e-06            -12.2369639
#> 37:  DTXSID4021503 3.771570e-03             -5.5802639
#> 38:  DTXSID4022361 2.458220e-08            -17.5212432
#> 39:  DTXSID4022367 1.047830e-09            -20.6765445
#> 40:  DTXSID4022448 9.004310e-09            -18.5255625
#> 41:  DTXSID4022991 1.238900e-11            -25.1142121
#> 42:  DTXSID4032611 1.477090e-05            -11.1228515
#> 43:  DTXSID4034948 1.537700e-09            -20.2929780
#> 44:  DTXSID5020023 1.214215e-04             -9.0162426
#> 45:  DTXSID5020576 9.440170e-08            -16.1757068
#> 46:  DTXSID5020601 3.864950e-09            -19.3713171
#> 47:  DTXSID5021207 1.374810e-04             -8.8920248
#> 48:  DTXSID5024182 3.307185e-07            -14.9219983
#> 49:  DTXSID5039224 6.639505e-05             -9.6198881
#> 50: DTXSID50867064 1.183860e-08            -18.2519005
#> 51:  DTXSID6020301 4.061025e-02             -3.2037348
#> 52:  DTXSID6020856 3.217130e-09            -19.5547762
#> 53:  DTXSID6021030 9.187830e-07            -13.9002159
#> 54:  DTXSID6021032 3.249700e-04             -8.0317777
#> 55:  DTXSID6022422 1.067460e-07            -16.0528137
#> 56:  DTXSID6024177 2.553800e-07            -15.1805132
#> 57:  DTXSID6037483 5.542140e-10            -21.3134702
#> 58:  DTXSID6037485 5.623550e-10            -21.2988878
#> 59:  DTXSID6037568 8.289390e-09            -18.6082895
#> 60:  DTXSID7020005 8.830950e-08            -16.2424181
#> 61:  DTXSID7020637 3.432355e-07            -14.8848490
#> 62:  DTXSID7021029 3.650960e-05            -10.2179353
#> 63:  DTXSID7024241 3.283070e-06            -12.6267316
#> 64:  DTXSID7047433 6.785440e-08            -16.5059016
#> 65:  DTXSID8020044 4.999450e-06            -12.2061827
#> 66:  DTXSID8020090 2.019270e-06            -13.1127745
#> 67:  DTXSID8020597 5.999740e-08            -16.6289646
#> 68:  DTXSID8020832 7.367270e-03             -4.9107081
#> 69:  DTXSID8021062 1.159700e-05            -11.3647641
#> 70:  DTXSID8022292 2.387120e-08            -17.5505931
#> 71:  DTXSID8022377 3.748870e-06            -12.4940561
#> 72:  DTXSID8023846 4.938440e-09            -19.1262163
#> 73:  DTXSID8023848 9.954840e-09            -18.4252070
#> 74:  DTXSID8025541 4.150640e-07            -14.6948331
#> 75:  DTXSID8031865 1.916920e-10            -22.3751312
#> 76:  DTXSID9020243 3.705160e-06            -12.5057841
#> 77:  DTXSID9021390 3.432295e-04             -7.9771112
#> 78:  DTXSID9021427 5.556870e-07            -14.4030607
#> 79:  DTXSID9022366 5.167830e-09            -19.0808130
#> 80:  DTXSID9023380 4.156230e-09            -19.2986574
#> 81:  DTXSID9023914 5.035540e-11            -23.7119153
#> 82:  DTXSID9024142 2.692760e-06            -12.8249439
#> 83:  DTXSID9032113 3.098320e-07            -14.9872356
#> 84:  DTXSID9032329 1.566850e-06            -13.3664433
#>             dtxsid     mean_hlc log_transform_mean_hlc
ccl4_hlc_grouped[, log_transform_mean_hlc := log(mean_hlc)]
#>             dtxsid     propType    mean_hlc log_transform_mean_hlc
#>             <char>       <char>       <num>                  <num>
#>   1: DTXSID0020153    predicted 2.91872e-03              -5.836610
#>   2: DTXSID0020446    predicted 1.52783e-09             -20.299417
#>   3: DTXSID0020573    predicted 3.74887e-06             -12.494056
#>   4: DTXSID0020600 experimental 1.48000e-04              -8.818298
#>   5: DTXSID0020600    predicted 1.47922e-04              -8.818825
#>  ---                                                              
#> 108: DTXSID9023914 experimental 5.03002e-11             -23.713012
#> 109: DTXSID9023914    predicted 5.04106e-11             -23.710820
#> 110: DTXSID9024142    predicted 2.69276e-06             -12.824944
#> 111: DTXSID9032113    predicted 3.09832e-07             -14.987236
#> 112: DTXSID9032329    predicted 1.56685e-06             -13.366443

natadb_hlc_all[, log_transform_mean_hlc := log(mean_hlc)]
#>             dtxsid     mean_hlc log_transform_mean_hlc
#>             <char>        <num>                  <num>
#>   1: DTXSID0020153 2.918720e-03              -5.836610
#>   2: DTXSID0020448 2.806465e-03              -5.875830
#>   3: DTXSID0020523 8.611785e-08             -16.267549
#>   4: DTXSID0020529 5.417975e-08             -16.730959
#>   5: DTXSID0020600 1.479610e-04              -8.818562
#>  ---                                                  
#> 141: DTXSID9020293 2.943670e-06             -12.735853
#> 142: DTXSID9020299 4.846860e-10             -21.447520
#> 143: DTXSID9020827 2.040880e-07             -15.404715
#> 144: DTXSID9021138 5.672420e-08             -16.685065
#> 145: DTXSID9041522 8.981400e-06             -11.620355
natadb_hlc_grouped[, log_transform_mean_hlc := log(mean_hlc)]
#>             dtxsid     propType    mean_hlc log_transform_mean_hlc
#>             <char>       <char>       <num>                  <num>
#>   1: DTXSID0020153    predicted 2.91872e-03              -5.836610
#>   2: DTXSID0020448 experimental 2.82000e-03              -5.871018
#>   3: DTXSID0020448    predicted 2.79293e-03              -5.880664
#>   4: DTXSID0020523 experimental 8.59999e-08             -16.268920
#>   5: DTXSID0020523    predicted 8.62358e-08             -16.266180
#>  ---                                                              
#> 207: DTXSID9020299    predicted 4.84686e-10             -21.447520
#> 208: DTXSID9020827 experimental 2.03000e-07             -15.410060
#> 209: DTXSID9020827    predicted 2.05176e-07             -15.399398
#> 210: DTXSID9021138    predicted 5.67242e-08             -16.685065
#> 211: DTXSID9041522    predicted 8.98140e-06             -11.620355

We compare both sets simultaneously. We add in a column to each data.table denoting to which set the rows correspond and then rbind() the rows together.

Label and combine data.

ccl4_hlc_grouped[, set := 'CCL4']
#>             dtxsid     propType    mean_hlc log_transform_mean_hlc    set
#>             <char>       <char>       <num>                  <num> <char>
#>   1: DTXSID0020153    predicted 2.91872e-03              -5.836610   CCL4
#>   2: DTXSID0020446    predicted 1.52783e-09             -20.299417   CCL4
#>   3: DTXSID0020573    predicted 3.74887e-06             -12.494056   CCL4
#>   4: DTXSID0020600 experimental 1.48000e-04              -8.818298   CCL4
#>   5: DTXSID0020600    predicted 1.47922e-04              -8.818825   CCL4
#>  ---                                                                     
#> 108: DTXSID9023914 experimental 5.03002e-11             -23.713012   CCL4
#> 109: DTXSID9023914    predicted 5.04106e-11             -23.710820   CCL4
#> 110: DTXSID9024142    predicted 2.69276e-06             -12.824944   CCL4
#> 111: DTXSID9032113    predicted 3.09832e-07             -14.987236   CCL4
#> 112: DTXSID9032329    predicted 1.56685e-06             -13.366443   CCL4
natadb_hlc_grouped[, set := 'NATADB']
#>             dtxsid     propType    mean_hlc log_transform_mean_hlc    set
#>             <char>       <char>       <num>                  <num> <char>
#>   1: DTXSID0020153    predicted 2.91872e-03              -5.836610 NATADB
#>   2: DTXSID0020448 experimental 2.82000e-03              -5.871018 NATADB
#>   3: DTXSID0020448    predicted 2.79293e-03              -5.880664 NATADB
#>   4: DTXSID0020523 experimental 8.59999e-08             -16.268920 NATADB
#>   5: DTXSID0020523    predicted 8.62358e-08             -16.266180 NATADB
#>  ---                                                                     
#> 207: DTXSID9020299    predicted 4.84686e-10             -21.447520 NATADB
#> 208: DTXSID9020827 experimental 2.03000e-07             -15.410060 NATADB
#> 209: DTXSID9020827    predicted 2.05176e-07             -15.399398 NATADB
#> 210: DTXSID9021138    predicted 5.67242e-08             -16.685065 NATADB
#> 211: DTXSID9041522    predicted 8.98140e-06             -11.620355 NATADB

all_hlc_grouped <- rbind(ccl4_hlc_grouped, natadb_hlc_grouped)

Plot data.

hlc_box <- ggplot(all_hlc_grouped, aes(set, log_transform_mean_hlc)) + 
  geom_boxplot(aes(color = propType))
hlc_box


hlc <- ggplot(all_hlc_grouped, aes(log_transform_mean_hlc)) +
  geom_boxplot(aes(color = set)) +
  coord_flip()
hlc

Again, we observe that in both grouping by propType and aggregating all results together by data set, that the chemicals in NATADB have a generally higher mean Henry’s Law Constant value than those in CCL4.

Finally, we consider boiling point.

Group by DTXSID.

ccl4_boiling_all <- ccl4_phys_chem[propertyId %in% 'boiling-point', 
                                   .(mean_boiling_point = sapply(.SD, mean)), 
                                   .SDcols = c('value'), by = .(dtxsid)]
natadb_boiling_all <- natadb_phys_chem[propertyId %in% 'boiling-point', 
                                       .(mean_boiling_point = 
                                           sapply(.SD, mean)), 
                                       .SDcols = c('value'), by = .(dtxsid)]

Group by DTXSID and property type.

ccl4_boiling_grouped <- ccl4_phys_chem[propertyId %in% 'boiling-point', 
                                       .(mean_boiling_point = 
                                           sapply(.SD, mean)), 
                                       .SDcols = c('value'), 
                                       by = .(dtxsid, propType)]
natadb_boiling_grouped <- natadb_phys_chem[propertyId %in% 'boiling-point', 
                                           .(mean_boiling_point = 
                                               sapply(.SD, mean)), 
                                           .SDcols = c('value'), 
                                           by = .(dtxsid, propType)]

Calculate summary statistics.

summary(ccl4_boiling_all)
#>     dtxsid          mean_boiling_point
#>  Length:95          Min.   : -34.92   
#>  Class :character   1st Qu.: 167.40   
#>  Mode  :character   Median : 306.38   
#>                     Mean   : 348.58   
#>                     3rd Qu.: 390.63   
#>                     Max.   :3377.66
summary(ccl4_boiling_grouped)
#>     dtxsid            propType         mean_boiling_point
#>  Length:147         Length:147         Min.   : -40.78   
#>  Class :character   Class :character   1st Qu.: 117.29   
#>  Mode  :character   Mode  :character   Median : 210.08   
#>                                        Mean   : 317.60   
#>                                        3rd Qu.: 382.80   
#>                                        Max.   :4825.00
summary(natadb_boiling_all)
#>     dtxsid          mean_boiling_point
#>  Length:153         Min.   :-38.45    
#>  Class :character   1st Qu.: 85.98    
#>  Mode  :character   Median :185.60    
#>                     Mean   :183.61    
#>                     3rd Qu.:265.50    
#>                     Max.   :584.48
summary(natadb_boiling_grouped)
#>     dtxsid            propType         mean_boiling_point
#>  Length:296         Length:296         Min.   :-87.78    
#>  Class :character   Class :character   1st Qu.: 82.73    
#>  Mode  :character   Mode  :character   Median :179.83    
#>                                        Mean   :177.10    
#>                                        3rd Qu.:254.77    
#>                                        Max.   :685.00

Since some of the boiling point values have negative values, we cannot log transform these values. If we try, as you will see below, there will be warnings of NaNs produced.

ccl4_boiling_all[, log_transform := log(mean_boiling_point)]
#> Warning in log(mean_boiling_point): NaNs produced
#>             dtxsid mean_boiling_point log_transform
#>             <char>              <num>         <num>
#>  1:  DTXSID0020153         179.515667     5.1902625
#>  2:  DTXSID0020446         287.518667     5.6612878
#>  3:  DTXSID0020573         398.205250     5.9869676
#>  4:  DTXSID0020600          14.330786     2.6624101
#>  5:  DTXSID0020814         395.859500     5.9810594
#>  6:  DTXSID0021464         129.352000     4.8625374
#>  7:  DTXSID0021541         -17.681056           NaN
#>  8:  DTXSID0021917          69.413390     4.2400798
#>  9:  DTXSID0024052         371.968000     5.9188078
#> 10:  DTXSID0032578         382.199000     5.9459414
#> 11:  DTXSID1020437          58.122722     4.0625567
#> 12:  DTXSID1021407         226.088833     5.4209280
#> 13:  DTXSID1021409         418.990000     6.0378471
#> 14:  DTXSID1021740         117.328000     4.7649734
#> 15:  DTXSID1021798         237.391429     5.4697104
#> 16:  DTXSID1024174         381.107333     5.9430811
#> 17:  DTXSID1024207        3377.660000     8.1249384
#> 18:  DTXSID1024338         360.932000     5.8886896
#> 19:  DTXSID1026164         203.990250     5.3180722
#> 20:  DTXSID1031040        2160.993333     7.6783233
#> 21:  DTXSID1037484         374.193000     5.9247717
#> 22:  DTXSID1037486         374.766500     5.9263032
#> 23:  DTXSID1037567         401.491667     5.9951868
#> 24:  DTXSID2020684         276.241250     5.6212746
#> 25:  DTXSID2021028         181.991333     5.2039591
#> 26:  DTXSID2021317         130.530000     4.8716031
#> 27:  DTXSID2021731          53.497800     3.9796405
#> 28:  DTXSID2022333         174.655667     5.1628164
#> 29:  DTXSID2024169        1514.066667     7.3225545
#> 30:  DTXSID2037506         329.383000     5.7972212
#> 31:  DTXSID2040282        1741.490000     7.4624963
#> 32:  DTXSID2052156         414.215250     6.0263858
#> 33:  DTXSID3020203           1.270948     0.2397629
#> 34:  DTXSID3020702         242.027667     5.4890520
#> 35:  DTXSID3020833          55.992829     4.0252236
#> 36:  DTXSID3020964         210.665625     5.3502722
#> 37:  DTXSID3024366          89.453218     4.4937158
#> 38:  DTXSID3024869         200.921000     5.3029118
#> 39:  DTXSID3031864         195.536286     5.2757460
#> 40:  DTXSID3032464         389.168333     5.9640120
#> 41:  DTXSID3034458         472.612000     6.1582748
#> 42:  DTXSID3042219         162.024857     5.0877498
#> 43:  DTXSID3073137         497.040000     6.2086705
#> 44:  DTXSID3074313         453.378333     6.1167269
#> 45:  DTXSID4020533         102.426556     4.6291460
#> 46:  DTXSID4021503          70.806838     4.2599556
#> 47:  DTXSID4022361         383.492250     5.9493194
#> 48:  DTXSID4022367         386.137750     5.9561942
#> 49:  DTXSID4022448         353.556250     5.8680426
#> 50:  DTXSID4022991         654.551500     6.4839503
#> 51:  DTXSID4032611         306.376333     5.7248142
#> 52:  DTXSID4034948         427.915500     6.0589257
#> 53:  DTXSID5020023          53.273463     3.9754383
#> 54:  DTXSID5020576         410.481250     6.0173303
#> 55:  DTXSID5020601         284.046000     5.6491362
#> 56:  DTXSID5021207          37.409986     3.6219377
#> 57:  DTXSID5024182         122.936286     4.8116662
#> 58:  DTXSID5039224          22.668140     3.1209604
#> 59: DTXSID50867064         270.425500     5.5999966
#> 60:  DTXSID6020301         -34.922650           NaN
#> 61:  DTXSID6020856         201.126625     5.3039347
#> 62:  DTXSID6021030         327.147500     5.7904111
#> 63:  DTXSID6021032         213.844200     5.3652477
#> 64:  DTXSID6022422         352.294375     5.8644671
#> 65:  DTXSID6024177         223.431333     5.4091041
#> 66:  DTXSID6037483         392.672667     5.9729764
#> 67:  DTXSID6037485         392.092000     5.9714965
#> 68:  DTXSID6037568         377.399500     5.9333043
#> 69:  DTXSID7020005         202.910286     5.3127639
#> 70:  DTXSID7020637         -14.211583           NaN
#> 71:  DTXSID7021029         154.793429     5.0420915
#> 72:  DTXSID7024241         373.244000     5.9222324
#> 73:  DTXSID7047433         392.270750     5.9719523
#> 74:  DTXSID8020044          94.263900     4.5460983
#> 75:  DTXSID8020090         185.053250     5.2206436
#> 76:  DTXSID8020597         185.598000     5.2235830
#> 77:  DTXSID8020832           7.726314     2.0446319
#> 78:  DTXSID8021062         210.866600     5.3512257
#> 79:  DTXSID8022292         383.472400     5.9492676
#> 80:  DTXSID8022377         398.205250     5.9869676
#> 81:  DTXSID8023846         313.631500     5.7482187
#> 82:  DTXSID8023848         351.873750     5.8632724
#> 83:  DTXSID8025541         344.445667     5.8419364
#> 84:  DTXSID8031865         190.962750     5.2520784
#> 85:  DTXSID8052483         482.980000     6.1799752
#> 86:  DTXSID9020243         351.292333     5.8616187
#> 87:  DTXSID9021390         155.947500     5.0495194
#> 88:  DTXSID9021427         172.769429     5.1519579
#> 89:  DTXSID9022366         418.984000     6.0378327
#> 90:  DTXSID9023380         384.037500     5.9507402
#> 91:  DTXSID9023914         330.833000     5.8016137
#> 92:  DTXSID9024142         452.011333     6.1137073
#> 93:  DTXSID9032113         384.574500     5.9521375
#> 94:  DTXSID9032119         736.490000     6.6018957
#> 95:  DTXSID9032329         448.429667     6.1057519
#>             dtxsid mean_boiling_point log_transform
ccl4_boiling_grouped[, log_transform := log(mean_boiling_point)]
#> Warning in log(mean_boiling_point): NaNs produced
#>             dtxsid     propType mean_boiling_point log_transform
#>             <char>       <char>              <num>         <num>
#>   1: DTXSID0020153 experimental           178.9778      5.187262
#>   2: DTXSID0020153    predicted           180.1880      5.194001
#>   3: DTXSID0020446 experimental           182.5000      5.206750
#>   4: DTXSID0020446    predicted           340.0280      5.829028
#>   5: DTXSID0020573    predicted           398.2052      5.986968
#>  ---                                                            
#> 143: DTXSID9024142    predicted           452.0113      6.113707
#> 144: DTXSID9032113    predicted           384.5745      5.952138
#> 145: DTXSID9032119 experimental           990.0000      6.897705
#> 146: DTXSID9032119    predicted           482.9800      6.179975
#> 147: DTXSID9032329    predicted           448.4297      6.105752

natadb_boiling_all[, log_transform := log(mean_boiling_point)]
#> Warning in log(mean_boiling_point): NaNs produced
#>             dtxsid mean_boiling_point log_transform
#>             <char>              <num>         <num>
#>   1: DTXSID0020153          179.51567      5.190262
#>   2: DTXSID0020448           96.89345      4.573612
#>   3: DTXSID0020523          286.48440      5.657684
#>   4: DTXSID0020529          301.62250      5.709176
#>   5: DTXSID0020600           14.33079      2.662410
#>  ---                                               
#> 149: DTXSID9020299          347.78160      5.851575
#> 150: DTXSID9020827          372.92120      5.921367
#> 151: DTXSID9021138          265.49863      5.581610
#> 152: DTXSID9021261          584.48000      6.370723
#> 153: DTXSID9041522          336.67514      5.819118
natadb_boiling_grouped[, log_transform := log(mean_boiling_point)]
#> Warning in log(mean_boiling_point): NaNs produced
#>             dtxsid     propType mean_boiling_point log_transform
#>             <char>       <char>              <num>         <num>
#>   1: DTXSID0020153 experimental          178.97780      5.187262
#>   2: DTXSID0020153    predicted          180.18800      5.194001
#>   3: DTXSID0020448 experimental           96.14168      4.565823
#>   4: DTXSID0020448    predicted           97.64523      4.581341
#>   5: DTXSID0020523 experimental          113.00000      4.727388
#>  ---                                                            
#> 292: DTXSID9021138    predicted          263.44175      5.573832
#> 293: DTXSID9021261 experimental          685.00000      6.529419
#> 294: DTXSID9021261    predicted          483.96000      6.182002
#> 295: DTXSID9041522 experimental          340.00000      5.828946
#> 296: DTXSID9041522    predicted          334.18150      5.811684

We compare both sets simultaneously. We add in a column to each data.table denoting to which set the rows correspond and then rbind() the rows together. We use the values as is rather than transforming them.

Label and combine data.

ccl4_boiling_grouped[, set := 'CCL4']
#>             dtxsid     propType mean_boiling_point log_transform    set
#>             <char>       <char>              <num>         <num> <char>
#>   1: DTXSID0020153 experimental           178.9778      5.187262   CCL4
#>   2: DTXSID0020153    predicted           180.1880      5.194001   CCL4
#>   3: DTXSID0020446 experimental           182.5000      5.206750   CCL4
#>   4: DTXSID0020446    predicted           340.0280      5.829028   CCL4
#>   5: DTXSID0020573    predicted           398.2052      5.986968   CCL4
#>  ---                                                                   
#> 143: DTXSID9024142    predicted           452.0113      6.113707   CCL4
#> 144: DTXSID9032113    predicted           384.5745      5.952138   CCL4
#> 145: DTXSID9032119 experimental           990.0000      6.897705   CCL4
#> 146: DTXSID9032119    predicted           482.9800      6.179975   CCL4
#> 147: DTXSID9032329    predicted           448.4297      6.105752   CCL4
natadb_boiling_grouped[, set := 'NATADB']
#>             dtxsid     propType mean_boiling_point log_transform    set
#>             <char>       <char>              <num>         <num> <char>
#>   1: DTXSID0020153 experimental          178.97780      5.187262 NATADB
#>   2: DTXSID0020153    predicted          180.18800      5.194001 NATADB
#>   3: DTXSID0020448 experimental           96.14168      4.565823 NATADB
#>   4: DTXSID0020448    predicted           97.64523      4.581341 NATADB
#>   5: DTXSID0020523 experimental          113.00000      4.727388 NATADB
#>  ---                                                                   
#> 292: DTXSID9021138    predicted          263.44175      5.573832 NATADB
#> 293: DTXSID9021261 experimental          685.00000      6.529419 NATADB
#> 294: DTXSID9021261    predicted          483.96000      6.182002 NATADB
#> 295: DTXSID9041522 experimental          340.00000      5.828946 NATADB
#> 296: DTXSID9041522    predicted          334.18150      5.811684 NATADB

all_boiling_grouped <- rbind(ccl4_boiling_grouped, natadb_boiling_grouped)

Plot the data.

boiling_box <- ggplot(all_boiling_grouped, aes(set, mean_boiling_point)) + 
  geom_boxplot(aes(color = propType))
boiling_box


boiling <- ggplot(all_boiling_grouped, aes(mean_boiling_point)) +
  geom_boxplot(aes(color = set)) + 
  coord_flip()
boiling

A visual inspection of this set of graphs is not as clear as in the previous cases. Note that the predicted values for each data set tend to be higher than the experimental. The mean of CCL4, by predicted and experimental appears to be greater than the corresponding means for NATADB, as does the overall mean, but the interquartile ranges of these different groupings yield slightly different results. This gives us a sense that the picture for boiling point is not as clear cut between experimental and predicted for these two data sets as it was in the previous cases of physico-chemical properties we investigated.

Answer to Environmental Health Question 2

Through inspecting the last several plots, we can answer Environmental Health Question 2: The physico-chemical property data are reported with both experimental and predicted values present for many chemicals. Are there differences between the mean predicted and experimental results for a variety of physico-chemical properties?

Answer: There are indeed differences between the mean values of various physico-chemical properties when grouped by predicted or experimental. In the case of “vapor-pressure”, the means of predicted values tend to be a little lower than experimental, though they are much closer in the case of NATADB than CCL4. The trend of lower predicted means compared to experimental means is more clearly demonstrated for “henrys-law” values in both data sets. In the case of “boiling-point”, the predicted values are greater than the experimental values, though this is much more pronounced in CCL4 while the set of means for NATADB are again fairly close.

Hazard Data: Genotoxicity

Now, having examined some of the distributions of the physico-chemical properties of the two lists, aggregated between predicted and experimental, we move towards learning more about these chemicals beyond physico-chemical properties. Specifically, we will examine their genotoxicity.

Using the standard CompTox Chemicals Dashboard approach to access genotoxicity, one would again navigate to the individual chemical page

Once one navigates to the genotoxicity tab highlighted in the previous page, the following is displayed as seen here:

This page includes two sets of information, the first of which provides a summary of available genotoxicity data while the second provides the individual reports and samples of such data.

We again use the CTX APIs to streamline the process of retrieving this information in a programmatic fashion. To this end, we will use the genotoxicity endpoints found within the Hazard endpoints of the CTX APIs. Pictured below is the particular set of genotoxicity resources available in the Hazard endpoints of the CTX APIs.

There are both summary and detail resources, reflecting the information one can find on the CompTox Chemicals Dashboard Genotoxicity page for a given chemical.

To access the genetox endpoint, we will use the function get_genetox_summary(). Since we have a list of chemicals, rather than searching individually for each chemical, we use the batch search version of the function, named get_genetox_summary_batch(). We will examine this and then access the details.

Grab the data using the APIs.

ccl4_genotox <- get_genetox_summary_batch(DTXSID = ccl4$dtxsid)
natadb_genetox <- get_genetox_summary_batch(DTXSID = natadb$dtxsid)

Examine the dimensions.

dim(ccl4_genotox)
#> [1] 71  7
dim(natadb_genetox)
#> [1] 153   7

Examine the column names and data from the first six chemicals with genetox data from CCL4.

colnames(ccl4_genotox)
#> [1] "id"              "dtxsid"          "reportsPositive" "reportsNegative" "reportsOther"   
#> [6] "ames"            "micronucleus"
head(ccl4_genotox)
#>        id        dtxsid reportsPositive reportsNegative reportsOther   ames micronucleus
#>     <int>        <char>           <int>           <int>        <int> <char>       <char>
#> 1: 518256 DTXSID0020153              26               5            1   <NA>     positive
#> 2: 522227 DTXSID0020446               0              12            0   <NA>     negative
#> 3: 523058 DTXSID0020573               3              14            0   <NA>     negative
#> 4: 525830 DTXSID0020600              20               0            1   <NA>     positive
#> 5: 525597 DTXSID0020814               1               0            0   <NA>         <NA>
#> 6: 520045 DTXSID0021464               9              11            0   <NA>     positive

The information returned is of the first variety highlighted in the image above, that is, the summary data on the available genotoxicity data for each chemical.

Observe that we have information on 71 chemicals from the CCL4 data and 153 from the NATA data. We note the chemicals not included in the results and then dig into the returned results.

ccl4[!(dtxsid %in% ccl4_genotox$dtxsid), 
     .(dtxsid, casrn, preferredName, molFormula)]
#>              dtxsid       casrn             preferredName    molFormula
#>              <char>      <char>                    <char>        <char>
#>  1: DTXSID001024118  77238-39-2               Microcystin          <NA>
#>  2:   DTXSID0024052  55290-64-7                Dimethipin     C6H10O4S2
#>  3:   DTXSID0032578  59669-26-0                Thiodicarb  C10H18N4O4S3
#>  4:   DTXSID1037484 194992-44-4             Acetochlor OA     C14H19NO4
#>  5:   DTXSID1037486 171262-17-2 2-[(2,6-Diethylphenyl)(me     C14H19NO4
#>  6:   DTXSID1037567 171118-09-5           Metolachlor ESA    C15H23NO5S
#>  7:   DTXSID2022333    135-98-8          sec-Butylbenzene        C10H14
#>  8:   DTXSID2031083 143545-90-8        Cylindrospermopsin   C15H21N5O7S
#>  9:   DTXSID2037506  16655-82-6       3-Hydroxycarbofuran     C12H15NO4
#> 10:   DTXSID2052156    517-09-9                 Equilenin      C18H18O2
#> 11:   DTXSID3021857  25154-52-3             n-Nonylphenol       C15H24O
#> 12:   DTXSID3034458  99129-21-2                 Clethodim  C17H26ClNO3S
#> 13:   DTXSID3042219    103-65-1             Propylbenzene         C9H12
#> 14:   DTXSID3073137  14866-68-3                  Chlorate          ClO3
#> 15:   DTXSID3074313  35523-89-8                 Saxitoxin    C10H17N7O4
#> 16:   DTXSID4022448  51218-45-2               Metolachlor   C15H22ClNO2
#> 17:   DTXSID4032611  13194-48-4                  Ethoprop    C8H19O2PS2
#> 18:   DTXSID4034948 112410-23-8              Tebufenozide    C22H28N2O2
#> 19:  DTXSID50867064  64285-06-9                Anatoxin a      C10H15NO
#> 20:   DTXSID6024177  10265-92-6             Methamidophos     C2H8NO2PS
#> 21:   DTXSID6037483 187022-11-3            Acetochlor ESA    C14H21NO5S
#> 22:   DTXSID6037485 142363-53-9              Alachlor ESA    C14H21NO5S
#> 23:   DTXSID6037568 152019-73-3            Metolachlor OA     C15H21NO4
#> 24:   DTXSID7024241  42874-03-3               Oxyfluorfen C15H11ClF3NO4
#> 25:   DTXSID7047433    474-86-2                   Equilin      C18H20O2
#> 26:   DTXSID8022377     57-91-0         17alpha-Estradiol      C18H24O2
#> 27:   DTXSID8052483   7440-56-4                 Germanium            Ge
#> 28:   DTXSID9032113 107534-96-3              Tebuconazole   C16H22ClN3O
#> 29:   DTXSID9032329    741-58-2                 Bensulide  C14H24NO4PS3
#>              dtxsid       casrn             preferredName    molFormula
natadb[!(dtxsid %in% natadb_genetox$dtxsid), 
       .(dtxsid, casrn, preferredName, molFormula)]
#>             dtxsid        casrn             preferredName molFormula
#>             <char>       <char>                    <char>     <char>
#>  1: DTXSID00872421 NOCAS_872421     Lead & Lead Compounds       <NA>
#>  2:  DTXSID1020273    7782-50-5                  Chlorine        Cl2
#>  3: DTXSID10872417 NOCAS_872417 Cadmium & Cadmium Compoun       <NA>
#>  4: DTXSID30872414 NOCAS_872414 Antimony & Antimony Compo       <NA>
#>  5: DTXSID30872419 NOCAS_872419 Cobalt & Cobalt Compounds       <NA>
#>  6: DTXSID40872425 NOCAS_872425 Nickel & Nickel Compounds       <NA>
#>  7:  DTXSID5024267    1336-36-3 Polychlorinated biphenyls       <NA>
#>  8:  DTXSID7020687     608-73-1 1,2,3,4,5,6-Hexachlorocyc    C6H6Cl6
#>  9:  DTXSID7023984  NOCAS_23984       Coke oven emissions       <NA>
#> 10: DTXSID90872415 NOCAS_872415 Arsenic & Arsenic Compoun       <NA>

Now, we access the genotoxicity details of the chemicals in each data set using the function get_genetox_details_batch(). We explore the dimensions of the returned queries, the column names, and the first few lines of the data.

Grab the data from the CTX APIs.

ccl4_genetox_details <- get_genetox_details_batch(DTXSID = ccl4$dtxsid)
natadb_genetox_details <- get_genetox_details_batch(DTXSID = natadb$dtxsid)

Examine the dimensions.

dim(ccl4_genetox_details)
#> [1] 1041   10
dim(natadb_genetox_details)
#> [1] 2647   10

Look at the column names and the first six rows of the data from the CCL4 chemicals.

colnames(ccl4_genetox_details)
#>  [1] "id"                  "source"              "year"                "dtxsid"             
#>  [5] "strain"              "species"             "assayCategory"       "assayType"          
#>  [9] "metabolicActivation" "assayResult"
head(ccl4_genetox_details)
#>         id      source  year        dtxsid                   strain                   species
#>      <int>      <char> <int>        <char>                   <char>                    <char>
#> 1: 1182642   CSCL_ISHL    NA DTXSID0020153                     <NA>                      <NA>
#> 2: 1182633   CSCL_ISHL    NA DTXSID0020153                     <NA>                      <NA>
#> 3: 1182649 eChemPortal  1976 DTXSID0020153                     <NA>  Saccharomyces cerevisiae
#> 4: 1182644 eChemPortal  1976 DTXSID0020153                     <NA> S. typhimurium TA 1535, T
#> 5: 1182645 eChemPortal  1976 DTXSID0020153                     <NA>    S. typhimurium TA 1538
#> 6: 1182650 eChemPortal  1982 DTXSID0020153 other: Tuck To (outbred)                     mouse
#> 4 variable(s) not shown: [assayCategory <char>, assayType <char>, metabolicActivation <char>, assayResult <char>]

We examine the information returned for the first chemical in each set of results, which is DTXSID0020153. Notice that the information is identical in each case as this information is chemical specific and not data set specific.

Look at the dimensions first.

dim(ccl4_genetox_details[dtxsid %in% 'DTXSID0020153', ])
#> [1] 28 10
dim(natadb_genetox_details[dtxsid %in% 'DTXSID0020153', ])
#> [1] 28 10

Now examine the first few rows.

head(ccl4_genetox_details[dtxsid %in% 'DTXSID0020153', ])
#>         id      source  year        dtxsid                   strain                   species
#>      <int>      <char> <int>        <char>                   <char>                    <char>
#> 1: 1182642   CSCL_ISHL    NA DTXSID0020153                     <NA>                      <NA>
#> 2: 1182633   CSCL_ISHL    NA DTXSID0020153                     <NA>                      <NA>
#> 3: 1182649 eChemPortal  1976 DTXSID0020153                     <NA>  Saccharomyces cerevisiae
#> 4: 1182644 eChemPortal  1976 DTXSID0020153                     <NA> S. typhimurium TA 1535, T
#> 5: 1182645 eChemPortal  1976 DTXSID0020153                     <NA>    S. typhimurium TA 1538
#> 6: 1182650 eChemPortal  1982 DTXSID0020153 other: Tuck To (outbred)                     mouse
#> 4 variable(s) not shown: [assayCategory <char>, assayType <char>, metabolicActivation <char>, assayResult <char>]

Observe that the data is the same for each data set when restricting to the same chemical. This is because the information we are retrieving is specific to the chemical and not dependent on the chemical lists to which the chemical may belong.

identical(ccl4_genetox_details[dtxsid %in% 'DTXSID0020153', ], 
          natadb_genetox_details[dtxsid %in% 'DTXSID0020153', ])
#> [1] TRUE

We now explore the assays present for chemicals in each data set. We first determine the unique values of the assayCategory column and then group by these values and determine the number of unique assays for each assayCategory value.

Determine the unique assay categories.

ccl4_genetox_details[, unique(assayCategory)]
#> [1] "in vitro" "ND"       "in vivo"
natadb_genetox_details[, unique(assayCategory)]
#> [1] "in vitro" "ND"       "in vivo"

Determine the unique assays for each data set and list them.

ccl4_genetox_details[, unique(assayType)]
#>  [1] "InVivoMN"                                                                                                                              
#>  [2] "Overall"                                                                                                                               
#>  [3] "bacterial reverse mutation assay"                                                                                                      
#>  [4] "micronucleus assay"                                                                                                                    
#>  [5] "Ames"                                                                                                                                  
#>  [6] "InVitroCA"                                                                                                                             
#>  [7] "InVitroMLA"                                                                                                                            
#>  [8] "InVitroMN"                                                                                                                             
#>  [9] "Cell transformation, clonal assay"                                                                                                     
#> [10] "Forward and reverse gene mutation, host-mediated assay"                                                                                
#> [11] "Histidine reverse gene mutation, Ames assay"                                                                                           
#> [12] "Micronucleus test, chromosome aberrations"                                                                                             
#> [13] "Mitotic recombination or gene conversion"                                                                                              
#> [14] "Rec-assay, DNA effects (bacterial DNA repair)"                                                                                         
#> [15] "Rec-assay, spot test, DNA effects (bacterial DNA repair)"                                                                              
#> [16] "Sister-chromatid exchange (SCE) in vitro"                                                                                              
#> [17] "Unscheduled DNA synthesis (UDS) in vitro, DNA effects"                                                                                 
#> [18] "In vivo carcinogenicity studies"                                                                                                       
#> [19] "in vitro mammalian chromosome aberration test"                                                                                         
#> [20] "mammalian cell gene mutation assay"                                                                                                    
#> [21] "DNA damage and repair assay, unscheduled DNA synthesis in mammalian cells in vitro"                                                    
#> [22] "in vivo micronucleus (mouse)"                                                                                                          
#> [23] "in vivo micronucleus (rat)"                                                                                                            
#> [24] "Sperm morphology"                                                                                                                      
#> [25] "InVivoCA"                                                                                                                              
#> [26] "InVivoUDS"                                                                                                                             
#> [27] "transgenic"                                                                                                                            
#> [28] "Chromosome aberrations"                                                                                                                
#> [29] "Forward gene mutation at the HPRT locus"                                                                                               
#> [30] "Heritable translocation test, chromosome aberrations"                                                                                  
#> [31] "Reverse gene mutation"                                                                                                                 
#> [32] "Sex-linked recessive lethal gene mutation"                                                                                             
#> [33] "Sister-chromatid exchange (SCE) in vivo"                                                                                               
#> [34] "Dominant lethal test"                                                                                                                  
#> [35] "Unscheduled DNA synthesis (UDS) in vivo; DNA effects"                                                                                  
#> [36] "chromosome aberration assay"                                                                                                           
#> [37] "mammalian germ cell cytogenetic assay"                                                                                                 
#> [38] "bacterial forward mutation assay"                                                                                                      
#> [39] "sister chromatid exchange assay in mammalian cells"                                                                                    
#> [40] "DNA Binding"                                                                                                                           
#> [41] "rodent dominant lethal assay"                                                                                                          
#> [42] "unscheduled DNA synthesis"                                                                                                             
#> [43] "Bacterial Mutagenesis"                                                                                                                 
#> [44] "Cytogenetics Other"                                                                                                                    
#> [45] "Cytotoxicity"                                                                                                                          
#> [46] "In Vitro Micronucleus"                                                                                                                 
#> [47] "bacterial gene mutation assay"                                                                                                         
#> [48] "in vitro mammalian cell micronucleus test"                                                                                             
#> [49] "Aneuploidy, chromosome aberrations"                                                                                                    
#> [50] "Chromosome aberrations in vivo"                                                                                                        
#> [51] "sister chromatid exchange assay"                                                                                                       
#> [52] "InVivoDNADamage"                                                                                                                       
#> [53] "Cell transformation, viral enhanced"                                                                                                   
#> [54] "combined chromosome aberration and micronucleus assay"                                                                                 
#> [55] "Chromosome aberrations in vitro"                                                                                                       
#> [56] "Forward gene mutation"                                                                                                                 
#> [57] "Forward gene mutation at the HPRT or ouabain locus"                                                                                    
#> [58] "Forward gene mutation at the thymidine kinase (TK) locus; chromosome aberrations"                                                      
#> [59] "Specific locus test, gene mutation"                                                                                                    
#> [60] "Spot test, gene mutation"                                                                                                              
#> [61] "In Vivo Non-mammalian Mutagenesis"                                                                                                     
#> [62] "In Vivo Micronucleus"                                                                                                                  
#> [63] "mouse spot test"                                                                                                                       
#> [64] "transgenic rodent mutagenicity assay"                                                                                                  
#> [65] "yeast cytogenetic assay"                                                                                                               
#> [66] "Micronucleus and sister chromatid exchange"                                                                                            
#> [67] "in vivo comet (mouse)"                                                                                                                 
#> [68] "in vivo comet (rat)"                                                                                                                   
#> [69] "Gene mutation"                                                                                                                         
#> [70] "in vitro mammalian cell transformation assay"                                                                                          
#> [71] "Cell transformation"                                                                                                                   
#> [72] "Tryptophan reverse gene mutation"                                                                                                      
#> [73] "Cell Transformation"                                                                                                                   
#> [74] "DNA Damage/Repair"                                                                                                                     
#> [75] "In Vitro Chromosome Aberration"                                                                                                        
#> [76] "Mutation"                                                                                                                              
#> [77] "DNA Covalent Binding"                                                                                                                  
#> [78] "In Vivo Chromosome Aberration"                                                                                                         
#> [79] "In Vivo Mammalian Mutagenesis"                                                                                                         
#> [80] "in vitro chromosomal aberration study in mammalian cells"                                                                              
#> [81] "Evaluation of metabolic activity of acute cytotoxicity"                                                                                
#> [82] "In vitro mammalian chromosomal aberration test"                                                                                        
#> [83] "Mutation Other"                                                                                                                        
#> [84] "Forward and reverse gene mutation, body fluid assay"                                                                                   
#> [85] "Forward and reverse gene mutation, chromosome aberrations, mitotic recombination and gene conversion, DNA effects, host-mediated assay"
#> [86] "Chromosomal aberration assay"                                                                                                          
#> [87] "Mitotic recombination"                                                                                                                 
#> [88] "Aneuploidy, sex chromosome gain, chromosome aberrations"                                                                               
#> [89] "Aneuploidy, whole sex chromosome loss, chromosome aberrations"                                                                         
#> [90] "fluctuation test"

natadb_genetox_details[, unique(assayType)]
#>   [1] "InVivoMN"                                                                                                                                   
#>   [2] "Overall"                                                                                                                                    
#>   [3] "bacterial reverse mutation assay"                                                                                                           
#>   [4] "micronucleus assay"                                                                                                                         
#>   [5] "Ames"                                                                                                                                       
#>   [6] "InVitroCA"                                                                                                                                  
#>   [7] "InVitroMLA"                                                                                                                                 
#>   [8] "InVitroMN"                                                                                                                                  
#>   [9] "Cell transformation, clonal assay"                                                                                                          
#>  [10] "Forward and reverse gene mutation, host-mediated assay"                                                                                     
#>  [11] "Histidine reverse gene mutation, Ames assay"                                                                                                
#>  [12] "Micronucleus test, chromosome aberrations"                                                                                                  
#>  [13] "Mitotic recombination or gene conversion"                                                                                                   
#>  [14] "Rec-assay, DNA effects (bacterial DNA repair)"                                                                                              
#>  [15] "Rec-assay, spot test, DNA effects (bacterial DNA repair)"                                                                                   
#>  [16] "Sister-chromatid exchange (SCE) in vitro"                                                                                                   
#>  [17] "Unscheduled DNA synthesis (UDS) in vitro, DNA effects"                                                                                      
#>  [18] "In vivo carcinogenicity studies"                                                                                                            
#>  [19] "DNA damage and repair assay, unscheduled DNA synthesis in mammalian cells in vitro"                                                         
#>  [20] "rodent dominant lethal assay"                                                                                                               
#>  [21] "Chromosome aberrations"                                                                                                                     
#>  [22] "Gene mutation"                                                                                                                              
#>  [23] "InVivoUDS"                                                                                                                                  
#>  [24] "InVivoCA"                                                                                                                                   
#>  [25] "transgenic"                                                                                                                                 
#>  [26] "Forward gene mutation at the HPRT locus"                                                                                                    
#>  [27] "Heritable translocation test, chromosome aberrations"                                                                                       
#>  [28] "Reverse gene mutation"                                                                                                                      
#>  [29] "Sex-linked recessive lethal gene mutation"                                                                                                  
#>  [30] "Sister-chromatid exchange (SCE) in vivo"                                                                                                    
#>  [31] "Dominant lethal test"                                                                                                                       
#>  [32] "Unscheduled DNA synthesis (UDS) in vivo; DNA effects"                                                                                       
#>  [33] "Bacterial Mutagenesis"                                                                                                                      
#>  [34] "Cytogenetics Other"                                                                                                                         
#>  [35] "Cytotoxicity"                                                                                                                               
#>  [36] "DNA Damage/Repair"                                                                                                                          
#>  [37] "In Vitro Chromosome Aberration"                                                                                                             
#>  [38] "In Vitro Micronucleus"                                                                                                                      
#>  [39] "In Vivo Non-mammalian Mutagenesis"                                                                                                          
#>  [40] "Mutation"                                                                                                                                   
#>  [41] "In Vivo Chromosome Aberration"                                                                                                              
#>  [42] "In Vivo Mammalian Mutagenesis"                                                                                                              
#>  [43] "In Vivo Micronucleus"                                                                                                                       
#>  [44] "in vitro mammalian chromosome aberration test"                                                                                              
#>  [45] "InVivoDNADamage"                                                                                                                            
#>  [46] "Cell transformation, viral enhanced"                                                                                                        
#>  [47] "mammalian cell gene mutation assay"                                                                                                         
#>  [48] "in vivo micronucleus (mouse)"                                                                                                               
#>  [49] "Sperm morphology"                                                                                                                           
#>  [50] "Forward and reverse gene mutation, mitotic recombination and gene conversion, host-mediated assay"                                          
#>  [51] "Spot test, gene mutation"                                                                                                                   
#>  [52] "bacterial forward mutation assay"                                                                                                           
#>  [53] "sister chromatid exchange assay in mammalian cells"                                                                                         
#>  [54] "DNA Binding"                                                                                                                                
#>  [55] "unscheduled DNA synthesis"                                                                                                                  
#>  [56] "bacteriophage induction in E. coli, gene mutation, UDS in mammalian cells, sex-linked recessive lethal mutations in Drosophila"             
#>  [57] "DNA damage, gene mutation, reverse mutation, gene conversion, DNA repair, chromosomal aberration, chromatid exchange, UDS"                  
#>  [58] "chromosome aberration study in mammalian cells"                                                                                             
#>  [59] "in vitro mammalian cell transformation assay"                                                                                               
#>  [60] "Forward gene mutation at the thymidine kinase (TK) locus; chromosome aberrations"                                                           
#>  [61] "Cell transformation"                                                                                                                        
#>  [62] "Forward and reverse gene mutation, body fluid assay"                                                                                        
#>  [63] "Forward gene mutation at the HPRT or ouabain locus"                                                                                         
#>  [64] "chromosome aberration assay"                                                                                                                
#>  [65] "Drosophila SLRL assay"                                                                                                                      
#>  [66] "Salmonella and Escherichia strains: bacterial reverse mutation assay (e.g. Ames test) ; Bacillus strains: recombination assay"              
#>  [67] "Cytogenetic assay in bone marrow cells"                                                                                                     
#>  [68] "in vivo comet (mouse)"                                                                                                                      
#>  [69] "Chromosome aberrations in vitro"                                                                                                            
#>  [70] "Forward gene mutation"                                                                                                                      
#>  [71] "Chromosome aberrations in vivo"                                                                                                             
#>  [72] "in vitro mammalian cell gene mutation tests using the thymidine kinase gene"                                                                
#>  [73] "in vivo comet (rat)"                                                                                                                        
#>  [74] "in vivo micronucleus (rat)"                                                                                                                 
#>  [75] "mouse spot test"                                                                                                                            
#>  [76] "Aneuploidy, whole sex chromosome loss, chromosome aberrations"                                                                              
#>  [77] "sister chromatid exchange assay"                                                                                                            
#>  [78] "Mouse Lymphoma Forward Mutation Assay"                                                                                                      
#>  [79] "mammalian erythrocyte micronucleus test"                                                                                                    
#>  [80] "Tryptophan reverse gene mutation"                                                                                                           
#>  [81] "bacterial gene mutation assay"                                                                                                              
#>  [82] "yeast forward mutation and mitotic gene conversion assays in Schizosaccharomyces pombe (P1 strain) and Saccharomyces cerevisiae (D4 strain)"
#>  [83] "Micronucleus test in vitro, chromosome aberrations"                                                                                         
#>  [84] "heritable translocation assay"                                                                                                              
#>  [85] "mitotic recombination assay with Saccharomyces cerevisiae"                                                                                  
#>  [86] "Aneuploidy, chromosome aberrations"                                                                                                         
#>  [87] "cell transformation"                                                                                                                        
#>  [88] "in vitro mammalian cell micronucleus test"                                                                                                  
#>  [89] "somatic mutation and recombination test in Drosophila"                                                                                      
#>  [90] "transgenic rodent mutagenicity assay"                                                                                                       
#>  [91] "yeast cytogenetic assay"                                                                                                                    
#>  [92] "Micronucleus and sister chromatid exchange"                                                                                                 
#>  [93] "in vitro mammalian cell gene mutation test using the Hprt and xprt genes"                                                                   
#>  [94] "bone marrow chromosome aberration assay and mammalian germ cell cytogenetic assay"                                                          
#>  [95] "bacterial mutation"                                                                                                                         
#>  [96] "bacterial reverse mutation assay (Salmonella typhimurium and Escherichia coli)"                                                             
#>  [97] "Aneuploidy, partial sex chromosome loss, chromosome aberrations "                                                                           
#>  [98] "Chromosome aberrations, in vivo"                                                                                                            
#>  [99] "in vitro chromosome aberration study"                                                                                                       
#> [100] "Cell transformation, focus assay"                                                                                                           
#> [101] "Forward and reverse gene mutation, mitotic recombination and gene conversion, DNA effects, host-mediated assay"                             
#> [102] "gene mutation assay in fungi"                                                                                                               
#> [103] "DNA adduct formation"                                                                                                                       
#> [104] "Cell Transformation"                                                                                                                        
#> [105] "DNA Covalent Binding"                                                                                                                       
#> [106] "mammalian comet assay"                                                                                                                      
#> [107] "Aneuploidy, sex chromosome gain, chromosome aberrations"                                                                                    
#> [108] "mammalian germ cell cytogenetic assay"                                                                                                      
#> [109] "Forward and reverse gene mutation, chromosome aberrations, mitotic recombination and gene conversion, DNA effects, host-mediated assay"     
#> [110] "E. coli K-12 DNA repair host-mediated assay"                                                                                                
#> [111] "Chromosomal aberration assay"                                                                                                               
#> [112] "forward mutation"                                                                                                                           
#> [113] "mammalian cell gene mutation test"                                                                                                          
#> [114] "Mitotic recombination"

Determine the number of assays per unique assayCategory value.

ccl4_genetox_details[, .(Assays = length(unique(assayType))), 
                     by = .(assayCategory)]
#>    assayCategory Assays
#>           <char>  <int>
#> 1:      in vitro     65
#> 2:            ND      3
#> 3:       in vivo     22

natadb_genetox_details[, .(Assays = length(unique(assayType))),
                       by = .(assayCategory)]
#>    assayCategory Assays
#>           <char>  <int>
#> 1:      in vitro     83
#> 2:            ND      3
#> 3:       in vivo     28

We can analyze these results more closely, counting the number of assay results and grouping by assayCategory, and assayType. We also examine the different numbers of assayCategory and assayTypes values used.

ccl4_genetox_details[, .N, by = .(assayCategory, assayType, assayResult)]
#>      assayCategory                 assayType assayResult     N
#>             <char>                    <char>      <char> <int>
#>   1:      in vitro                  InVivoMN    negative    10
#>   2:            ND                   Overall    positive     5
#>   3:      in vitro bacterial reverse mutatio    positive    39
#>   4:       in vivo        micronucleus assay    negative    36
#>   5:       in vivo        micronucleus assay   equivocal     1
#>  ---                                                          
#> 149:      in vitro Heritable translocation t    negative     1
#> 150:      in vitro     Mitotic recombination    positive     1
#> 151:      in vitro Aneuploidy, sex chromosom    negative     1
#> 152:      in vitro Aneuploidy, whole sex chr    positive     1
#> 153:      in vitro          fluctuation test    negative     2
ccl4_genetox_details[, .N, by = .(assayCategory)]
#>    assayCategory     N
#>           <char> <int>
#> 1:      in vitro   815
#> 2:            ND    38
#> 3:       in vivo   188

We look at the assayType values and numbers of each for the three different assayCategory values.

ccl4_genetox_details[assayCategory == 'in vitro', .N, by = .(assayType)]
#>                     assayType     N
#>                        <char> <int>
#>  1:                  InVivoMN    28
#>  2: bacterial reverse mutatio   165
#>  3:                      Ames    88
#>  4:                 InVitroCA    31
#>  5:                InVitroMLA    24
#>  6:                 InVitroMN     6
#>  7: Cell transformation, clon     8
#>  8: Forward and reverse gene      7
#>  9: Histidine reverse gene mu    19
#> 10: Micronucleus test, chromo     8
#> 11: Mitotic recombination or     18
#> 12: Rec-assay, DNA effects (b    15
#> 13: Rec-assay, spot test, DNA     2
#> 14: Sister-chromatid exchange    41
#> 15: in vitro mammalian chromo    22
#> 16: mammalian cell gene mutat    42
#> 17: DNA damage and repair ass    21
#> 18:    Chromosome aberrations     2
#> 19: Forward gene mutation at      5
#> 20: Heritable translocation t     5
#> 21:     Reverse gene mutation     9
#> 22: Sex-linked recessive leth     9
#> 23: Sister-chromatid exchange    13
#> 24: chromosome aberration ass    15
#> 25: bacterial forward mutatio     1
#> 26: sister chromatid exchange    11
#> 27:     Bacterial Mutagenesis    27
#> 28:        Cytogenetics Other    26
#> 29:              Cytotoxicity    21
#> 30:     In Vitro Micronucleus     4
#> 31: bacterial gene mutation a     7
#> 32: in vitro mammalian cell m     5
#> 33: Aneuploidy, chromosome ab     5
#> 34: sister chromatid exchange     3
#> 35: Cell transformation, vira    12
#> 36: combined chromosome aberr     1
#> 37: Chromosome aberrations in     2
#> 38:     Forward gene mutation     5
#> 39: Forward gene mutation at      6
#> 40: Forward gene mutation at      2
#> 41: Specific locus test, gene     1
#> 42:  Spot test, gene mutation     1
#> 43: In Vivo Non-mammalian Mut     7
#> 44:           mouse spot test     2
#> 45: transgenic rodent mutagen     1
#> 46:   yeast cytogenetic assay     1
#> 47:             Gene mutation     4
#> 48: in vitro mammalian cell t     2
#> 49:       Cell transformation     5
#> 50: Tryptophan reverse gene m     8
#> 51:       Cell Transformation     2
#> 52:         DNA Damage/Repair     8
#> 53: In Vitro Chromosome Aberr    11
#> 54:                  Mutation     3
#> 55: in vitro chromosomal aber     1
#> 56: Evaluation of metabolic a     1
#> 57: In vitro mammalian chromo     2
#> 58:            Mutation Other     4
#> 59: Forward and reverse gene      2
#> 60: Forward and reverse gene      1
#> 61: Chromosomal aberration as     2
#> 62:     Mitotic recombination     1
#> 63: Aneuploidy, sex chromosom     1
#> 64: Aneuploidy, whole sex chr     1
#> 65:          fluctuation test     2
#>                     assayType     N
ccl4_genetox_details[assayCategory == 'ND', .N, by = .(assayType)]
#>                    assayType     N
#>                       <char> <int>
#> 1:                   Overall     5
#> 2: In vivo carcinogenicity s    23
#> 3:                transgenic    10
ccl4_genetox_details[assayCategory == 'in vivo', .N, by = .(assayType)]
#>                     assayType     N
#>                        <char> <int>
#>  1:        micronucleus assay    46
#>  2: Unscheduled DNA synthesis     9
#>  3: in vivo micronucleus (mou    19
#>  4: in vivo micronucleus (rat     9
#>  5:          Sperm morphology     9
#>  6:                  InVivoCA    14
#>  7:                 InVivoUDS    11
#>  8:      Dominant lethal test     5
#>  9: Unscheduled DNA synthesis     3
#> 10: mammalian germ cell cytog     2
#> 11:               DNA Binding     1
#> 12: rodent dominant lethal as    15
#> 13: unscheduled DNA synthesis     6
#> 14: Chromosome aberrations in     2
#> 15:           InVivoDNADamage     7
#> 16:      In Vivo Micronucleus     1
#> 17: Micronucleus and sister c     2
#> 18:     in vivo comet (mouse)     1
#> 19:       in vivo comet (rat)     3
#> 20:      DNA Covalent Binding    12
#> 21: In Vivo Chromosome Aberra     4
#> 22: In Vivo Mammalian Mutagen     7
#>                     assayType     N

Now we repeat this for NATADB.

natadb_genetox_details[, .N, by = .(assayCategory, assayType, assayResult)]
#>      assayCategory                 assayType assayResult     N
#>             <char>                    <char>      <char> <int>
#>   1:      in vitro                  InVivoMN    negative    40
#>   2:            ND                   Overall    positive    16
#>   3:      in vitro bacterial reverse mutatio    positive    93
#>   4:       in vivo        micronucleus assay    negative    76
#>   5:       in vivo        micronucleus assay   equivocal     4
#>  ---                                                          
#> 194:      in vitro Heritable translocation t    negative     2
#> 195:       in vivo     mammalian comet assay   equivocal     1
#> 196:      in vitro mammalian cell gene mutat    positive     1
#> 197:      in vitro in vitro mammalian cell t    positive     1
#> 198:      in vitro     Mitotic recombination    positive     1
natadb_genetox_details[, .N, by = .(assayCategory)]
#>    assayCategory     N
#>           <char> <int>
#> 1:      in vitro  2112
#> 2:            ND   100
#> 3:       in vivo   435

Examine the number of rows for each assayType value by each assaycategory value.

natadb_genetox_details[assayCategory == 'in vitro', .N, by = .(assayType)]
#>                     assayType     N
#>                        <char> <int>
#>  1:                  InVivoMN    89
#>  2: bacterial reverse mutatio   362
#>  3:                      Ames   258
#>  4:                 InVitroCA    98
#>  5:                InVitroMLA    85
#>  6:                 InVitroMN    20
#>  7: Cell transformation, clon    14
#>  8: Forward and reverse gene     17
#>  9: Histidine reverse gene mu    55
#> 10: Micronucleus test, chromo    33
#> 11: Mitotic recombination or     47
#> 12: Rec-assay, DNA effects (b    34
#> 13: Rec-assay, spot test, DNA     6
#> 14: Sister-chromatid exchange    98
#> 15: DNA damage and repair ass    50
#> 16:    Chromosome aberrations    25
#> 17:             Gene mutation    20
#> 18: Forward gene mutation at     12
#> 19: Heritable translocation t    10
#> 20:     Reverse gene mutation    30
#> 21: Sex-linked recessive leth    26
#> 22: Sister-chromatid exchange    31
#> 23:     Bacterial Mutagenesis    41
#> 24:        Cytogenetics Other    41
#> 25:              Cytotoxicity    20
#> 26:         DNA Damage/Repair    27
#> 27: In Vitro Chromosome Aberr     6
#> 28:     In Vitro Micronucleus     8
#> 29: In Vivo Non-mammalian Mut     7
#> 30:                  Mutation     6
#> 31: in vitro mammalian chromo    91
#> 32: Cell transformation, vira    46
#> 33: mammalian cell gene mutat   104
#> 34: Forward and reverse gene      4
#> 35:  Spot test, gene mutation     4
#> 36: bacterial forward mutatio     4
#> 37: sister chromatid exchange    50
#> 38: bacteriophage induction i     1
#> 39: DNA damage, gene mutation     1
#> 40: chromosome aberration stu     1
#> 41: in vitro mammalian cell t     2
#> 42: Forward gene mutation at      6
#> 43:       Cell transformation    11
#> 44: Forward and reverse gene      7
#> 45: Forward gene mutation at     10
#> 46: chromosome aberration ass    24
#> 47:     Drosophila SLRL assay    20
#> 48: Salmonella and Escherichi     4
#> 49: Cytogenetic assay in bone     1
#> 50: Chromosome aberrations in     7
#> 51:     Forward gene mutation    18
#> 52: in vitro mammalian cell g     2
#> 53:           mouse spot test     7
#> 54: Aneuploidy, whole sex chr     4
#> 55: sister chromatid exchange     7
#> 56: Mouse Lymphoma Forward Mu     1
#> 57: Tryptophan reverse gene m    18
#> 58: bacterial gene mutation a    11
#> 59: yeast forward mutation an     4
#> 60: Micronucleus test in vitr     2
#> 61: mitotic recombination ass     6
#> 62: Aneuploidy, chromosome ab     8
#> 63:       cell transformation     2
#> 64: in vitro mammalian cell m    13
#> 65: somatic mutation and reco     3
#> 66: transgenic rodent mutagen     2
#> 67:   yeast cytogenetic assay     2
#> 68: in vitro mammalian cell g     2
#> 69:        bacterial mutation     1
#> 70: bacterial reverse mutatio     5
#> 71: Aneuploidy, partial sex c     2
#> 72: in vitro chromosome aberr     1
#> 73: Cell transformation, focu     2
#> 74: Forward and reverse gene      1
#> 75: gene mutation assay in fu     5
#> 76:       Cell Transformation     1
#> 77: Aneuploidy, sex chromosom     1
#> 78: Forward and reverse gene      1
#> 79: E. coli K-12 DNA repair h     1
#> 80: Chromosomal aberration as     2
#> 81:          forward mutation     1
#> 82: mammalian cell gene mutat     1
#> 83:     Mitotic recombination     1
#>                     assayType     N
natadb_genetox_details[assayCategory == 'ND', .N, by = .(assayType)]
#>                    assayType     N
#>                       <char> <int>
#> 1:                   Overall    16
#> 2: In vivo carcinogenicity s    66
#> 3:                transgenic    18
natadb_genetox_details[assayCategory == 'in vivo', .N, by = .(assayType)]
#>                     assayType     N
#>                        <char> <int>
#>  1:        micronucleus assay   105
#>  2: Unscheduled DNA synthesis    27
#>  3: rodent dominant lethal as    31
#>  4:                 InVivoUDS    33
#>  5:                  InVivoCA    37
#>  6:      Dominant lethal test    14
#>  7: Unscheduled DNA synthesis     5
#>  8: In Vivo Chromosome Aberra     5
#>  9: In Vivo Mammalian Mutagen     6
#> 10:      In Vivo Micronucleus    11
#> 11:           InVivoDNADamage    23
#> 12: in vivo micronucleus (mou    51
#> 13:          Sperm morphology    25
#> 14:               DNA Binding     1
#> 15: unscheduled DNA synthesis    18
#> 16:     in vivo comet (mouse)     4
#> 17: Chromosome aberrations in     9
#> 18:       in vivo comet (rat)     3
#> 19: in vivo micronucleus (rat     9
#> 20: mammalian erythrocyte mic     2
#> 21: heritable translocation a     2
#> 22: Micronucleus and sister c     2
#> 23: bone marrow chromosome ab     1
#> 24: Chromosome aberrations, i     2
#> 25:      DNA adduct formation     1
#> 26:      DNA Covalent Binding     1
#> 27:     mammalian comet assay     6
#> 28: mammalian germ cell cytog     1
#>                     assayType     N

Answer to Environmental Health Question 3

From these initial explorations of the data, we can answer Environmental Health Question 3: After pulling the genotoxicity data for the different environmental contaminant data sets, list the assays associated with the chemicals in each data set. How many unique assays are there in each data set? What are the different assay categories and how many unique assays for each assay category are there?

Answer: There are 90 unique assays for CCl4 and 114 unique assays for NATADB. The different assay categories are “in vitro”, “ND”, and “in vivo”, with 65 unique “in vitro” assays for CCl4 and 83 for NATADB, 3 unique “ND” assays for CCL4 and 3 for NATADB, and 22 unique “in vivo” assays for CCL4 and 28 for NATADB.

Next, we dig into the results of the assays. One may be interested in looking at the number of chemicals for which an assay resulted in a positive or negative result for instance. We group by assayResult and determine the number of unique dtxsid values associated with each assayResult value.

ccl4_genetox_details[, .(DTXSIDs = length(unique(dtxsid))), by = .(assayResult)]
#>    assayResult DTXSIDs
#>         <char>   <int>
#> 1:    negative      64
#> 2:    positive      53
#> 3:   equivocal      15
natadb_genetox_details[, .(DTXSIDs = length(unique(dtxsid))), 
                       by = .(assayResult)]
#>    assayResult DTXSIDs
#>         <char>   <int>
#> 1:    negative     141
#> 2:    positive     130
#> 3:   equivocal      48

Answer to Environmental Health Question 4

With this data we may now answer Environmental Health Question 4: The genotoxicity data contains information on which assays have been conducted for different chemicals and the results of those assays. How many chemicals in each data set have a ‘positive’, ‘negative’, and ‘equivocal’ value for the assay result?

Answer: For CCL4, there are 64 unique chemicals that have a negative assay result, 53 that have a positive result, and 15 that have an equivocal result. For NATADB, there are 141 unique chemicals that have a negative assay result, 130 that have a positive result, and 48 that have an equivocal result. Observe that since there are 72 unique dtxsid values with assay results in CCL4 and 153 in NATADB, there are several chemicals that have multiple assay results.

We now determine the chemicals from each data set that are known to have genotoxic effects. For this, we look to see which chemicals produce at least one positive response in the assayResult column.

ccl4_genetox_details[, .(is_positive = any(assayResult == 'positive')), 
                     by = .(dtxsid)][is_positive == TRUE, dtxsid]
#>  [1] "DTXSID0020153" "DTXSID0020573" "DTXSID0020600" "DTXSID0020814" "DTXSID0021464" "DTXSID0021541"
#>  [7] "DTXSID0024341" "DTXSID1021407" "DTXSID1021740" "DTXSID1021798" "DTXSID1024338" "DTXSID1026164"
#> [13] "DTXSID1031040" "DTXSID2021028" "DTXSID2021317" "DTXSID2021731" "DTXSID3020203" "DTXSID3020702"
#> [19] "DTXSID3020833" "DTXSID3024869" "DTXSID3031864" "DTXSID4020533" "DTXSID4021503" "DTXSID4022361"
#> [25] "DTXSID4022367" "DTXSID5020023" "DTXSID5020576" "DTXSID5020601" "DTXSID5021207" "DTXSID5024182"
#> [31] "DTXSID5039224" "DTXSID6020301" "DTXSID6021030" "DTXSID6021032" "DTXSID6022422" "DTXSID7020005"
#> [37] "DTXSID7020215" "DTXSID7020637" "DTXSID7021029" "DTXSID8020044" "DTXSID8020090" "DTXSID8020832"
#> [43] "DTXSID8021062" "DTXSID8023846" "DTXSID8023848" "DTXSID8025541" "DTXSID8031865" "DTXSID9020243"
#> [49] "DTXSID9021390" "DTXSID9021427" "DTXSID9022366" "DTXSID9023380" "DTXSID9023914"
natadb_genetox_details[, .(is_positive = any(assayResult == 'positive')),
                       by = .(dtxsid)][is_positive == TRUE, dtxsid]
#>   [1] "DTXSID0020153" "DTXSID0020448" "DTXSID0020523" "DTXSID0020529" "DTXSID0020600"
#>   [6] "DTXSID0020868" "DTXSID0021381" "DTXSID0021383" "DTXSID0021541" "DTXSID0021834"
#>  [11] "DTXSID0021965" "DTXSID0024187" "DTXSID0039227" "DTXSID0039229" "DTXSID1020148"
#>  [16] "DTXSID1020302" "DTXSID1020306" "DTXSID1020431" "DTXSID1020512" "DTXSID1020516"
#>  [21] "DTXSID1020566" "DTXSID1021374" "DTXSID1021798" "DTXSID1021827" "DTXSID1022057"
#>  [26] "DTXSID1023786" "DTXSID1024045" "DTXSID1026164" "DTXSID1049641" "DTXSID2020137"
#>  [31] "DTXSID2020262" "DTXSID2020507" "DTXSID2020682" "DTXSID2020844" "DTXSID2021284"
#>  [36] "DTXSID2021286" "DTXSID2021319" "DTXSID2021658" "DTXSID2021731" "DTXSID2021781"
#>  [41] "DTXSID3020203" "DTXSID3020257" "DTXSID3020413" "DTXSID3020415" "DTXSID3020596"
#>  [46] "DTXSID3020679" "DTXSID3020702" "DTXSID3020833" "DTXSID3021431" "DTXSID3025091"
#>  [51] "DTXSID3039242" "DTXSID4020161" "DTXSID4020298" "DTXSID4020402" "DTXSID4020533"
#>  [56] "DTXSID4020583" "DTXSID4020874" "DTXSID4020901" "DTXSID4021006" "DTXSID4021056"
#>  [61] "DTXSID4021395" "DTXSID4039231" "DTXSID5020023" "DTXSID5020027" "DTXSID5020029"
#>  [66] "DTXSID5020071" "DTXSID5020316" "DTXSID5020449" "DTXSID5020491" "DTXSID5020601"
#>  [71] "DTXSID5020607" "DTXSID5020865" "DTXSID5021124" "DTXSID5021207" "DTXSID5021380"
#>  [76] "DTXSID5021386" "DTXSID5024055" "DTXSID5024059" "DTXSID5039224" "DTXSID6020145"
#>  [81] "DTXSID6020307" "DTXSID6020353" "DTXSID6020432" "DTXSID6020438" "DTXSID6020515"
#>  [86] "DTXSID6020569" "DTXSID6020981" "DTXSID6021828" "DTXSID6022422" "DTXSID6023947"
#>  [91] "DTXSID6023949" "DTXSID7020005" "DTXSID7020009" "DTXSID7020267" "DTXSID7020637"
#>  [96] "DTXSID7020689" "DTXSID7020710" "DTXSID7020716" "DTXSID7021029" "DTXSID7021100"
#> [101] "DTXSID7021106" "DTXSID7021318" "DTXSID7021360" "DTXSID7021368" "DTXSID7021948"
#> [106] "DTXSID7024166" "DTXSID7024370" "DTXSID7024532" "DTXSID7025180" "DTXSID7026156"
#> [111] "DTXSID8020090" "DTXSID8020173" "DTXSID8020250" "DTXSID8020599" "DTXSID8020759"
#> [116] "DTXSID8020832" "DTXSID8021195" "DTXSID8021197" "DTXSID8021432" "DTXSID8021434"
#> [121] "DTXSID8021438" "DTXSID8024286" "DTXSID9020168" "DTXSID9020243" "DTXSID9020247"
#> [126] "DTXSID9020293" "DTXSID9020827" "DTXSID9021138" "DTXSID9021261" "DTXSID9041522"

With so much genotoxicity data, let us explore this data for one chemical more deeply to get a sense of the assays and results present for it. We will explore the chemical with DTXSID0020153. We will look at the assays, the number of each type of result, and which correspond to “positive” results. To determine this, we group by assayResult and calculate .N for each group. We also isolate which were positive and output a data.table with the number of each type.

ccl4_genetox_details[dtxsid == 'DTXSID0020153', .(Number = .N), 
                     by = .(assayResult)]
#>    assayResult Number
#>         <char>  <int>
#> 1:    negative      5
#> 2:    positive     22
#> 3:   equivocal      1
ccl4_genetox_details[dtxsid == 'DTXSID0020153' & assayResult == 'positive', 
                     .(Number_of_assays = .N), by = .(assayType)][order(-Number_of_assays),]
#>                     assayType Number_of_assays
#>                        <char>            <int>
#>  1: bacterial reverse mutatio                3
#>  2:                      Ames                3
#>  3:                 InVitroCA                2
#>  4:                InVitroMLA                2
#>  5: Rec-assay, DNA effects (b                2
#>  6: Sister-chromatid exchange                2
#>  7:                   Overall                1
#>  8:                 InVitroMN                1
#>  9: Cell transformation, clon                1
#> 10: Histidine reverse gene mu                1
#> 11: Mitotic recombination or                 1
#> 12: Rec-assay, spot test, DNA                1
#> 13: Unscheduled DNA synthesis                1
#> 14: In vivo carcinogenicity s                1

Answer to Environmental Health Question 5

With these data.tables, we may answer Environmental Health Question 5: Based on the genotoxicity data reported for the chemical with DTXSID identifier DTXSID0020153, how many assays resulted in a positive/equivocal/negative value? Which of the assays were positive and how many of each were there for the most reported assays?

Answer: There were five assays that produced a negative result, 22 that produced a positive result, and one that produced an equivocal result. Of the 22 positive assays, “bacterial reverse mutation assay” and “Ames” were the most numerous, with three each.

Hazard Resource

Finally, we examine the hazard data associated with the chemicals in each data set. For each chemical, there will be potentially hundreds of rows of hazard data, so the returned results will be much larger than in most other API endpoints.

ccl4_hazard <- get_hazard_by_dtxsid_batch(DTXSID = ccl4$dtxsid)
natadb_hazard <- get_hazard_by_dtxsid_batch(DTXSID = natadb$dtxsid)

We do some preliminary exploration of the data. First we determine the dimensions of the data sets.

dim(ccl4_hazard)
#> [1] 39541    33
dim(natadb_hazard)
#> [1] 68500    33

Next we record the column names and display the first six results in the CCL4 hazard data.

colnames(ccl4_hazard)
#>  [1] "id"                     "source"                 "year"                  
#>  [4] "dtxsid"                 "exposureRoute"          "toxvalNumeric"         
#>  [7] "toxvalNumericQualifier" "toxvalUnits"            "studyType"             
#> [10] "studyDurationClass"     "studyDuractionValue"    "studyDurationUnits"    
#> [13] "strain"                 "sex"                    "population"            
#> [16] "exposureMethod"         "exposureForm"           "media"                 
#> [19] "lifestage"              "generation"             "criticalEffect"        
#> [22] "detailText"             "supercategory"          "speciesCommon"         
#> [25] "humanEcoNt"             "priorityId"             "subsource"             
#> [28] "sourceUrl"              "subsourceUrl"           "riskAssessmentClass"   
#> [31] "toxvalType"             "toxvalSubtype"          "studyDurationValue"
head(ccl4_hazard)
#>        id source   year        dtxsid exposureRoute toxvalNumeric toxvalNumericQualifier
#>     <int> <char> <char>        <char>        <char>         <num>                 <char>
#> 1: 150792 COSMOS   1978 DTXSID7020215             -      18.02470                      =
#> 2:  11820 COSMOS   1986 DTXSID5039224             -       7.48901                      =
#> 3:  11603 COSMOS   1980 DTXSID5039224             -     440.53000                      =
#> 4: 150947 COSMOS   1980 DTXSID7020215             -       0.75000                      =
#> 5: 150950 COSMOS   1978 DTXSID7020215             -       0.75000                      =
#> 6: 150945 COSMOS   1979 DTXSID7020215             -   90123.50000                      =
#> 26 variable(s) not shown: [toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, exposureForm <char>, ...]

We determine the number of unique values in the criticalEffect, supercategory, and toxvalType columns for each data set.

The number of unique values for criticalEffect.

length(ccl4_hazard[, unique(criticalEffect)])
#> [1] 3252
length(natadb_hazard[, unique(criticalEffect)])
#> [1] 2802

The number of unique values of supercategory.

length(ccl4_hazard[, unique(supercategory)])
#> [1] 12
length(natadb_hazard[, unique(supercategory)])
#> [1] 12

The number of unique values for toxvalType.

length(ccl4_hazard[, unique(toxvalType)])
#> [1] 207
length(natadb_hazard[, unique(toxvalType)])
#> [1] 463

Now we look at the number of entries per supercategory.

ccl4_hazard[, .N, by = .(supercategory)]
#>                supercategory     N
#>                       <char> <int>
#>  1:       Point of Departure 23011
#>  2:          Screening Level   564
#>  3:     Effect Concentration  3098
#>  4:             Effect Level   743
#>  5:   Lethality Effect Level  8119
#>  6:                    other  1084
#>  7: Inhibition Concentration   356
#>  8:  Effective Residue Level    14
#>  9:              Effect Dose     5
#> 10:              Effect Time   250
#> 11:           Exposure Limit   648
#> 12:           Toxicity Value  1649
natadb_hazard[, .N, by = .(supercategory)]
#>                supercategory     N
#>                       <char> <int>
#>  1:       Point of Departure 22524
#>  2:          Screening Level  2012
#>  3:   Lethality Effect Level 24536
#>  4:              Effect Dose    19
#>  5:     Effect Concentration  6538
#>  6:             Effect Level  1393
#>  7:           Exposure Limit  2271
#>  8: Inhibition Concentration   556
#>  9:                    other  2760
#> 10:  Effective Residue Level    18
#> 11:              Effect Time  1043
#> 12:           Toxicity Value  4830

With over 22,000 results for the supercategory value “Point of Departure” for each data set, we dig into this further.

We determine the number of rows grouped by toxvalType that have the “Point of Departure” supercategory value, and display this descending.

ccl4_hazard[ supercategory %in% 'Point of Departure', .N, 
             by = .(toxvalType)][order(-N),]
#>                    toxvalType     N
#>                        <char> <int>
#>  1:                      NOEC  8295
#>  2:                      LOEC  6092
#>  3:                     NOAEL  1858
#>  4:                       LEL  1519
#>  5:                     LOAEL  1484
#>  6:                      NOEL  1211
#>  7:                       NEL   842
#>  8:                      LOEL   688
#>  9:                     NOAEC   365
#> 10:                     LOAEC   180
#> 11:                      HNEL   130
#> 12:                      BMDL   101
#> 13:                       BMD    71
#> 14:                 BMDL (10)    69
#> 15:                       LEC    30
#> 16:                      BMCL    26
#> 17:                 BMDL (05)    15
#> 18:                 BMCL (10)     9
#> 19:   POD (screening chronic)     6
#> 20: POD (screening subchronic     5
#> 21:             BMCL (10 HEC)     3
#> 22:                  BMC (05)     2
#> 23:                       NEC     2
#> 24:                BMDL (1SD)     2
#> 25:                BMCL (HEC)     2
#> 26:                  BMC (20)     1
#> 27:                BMDL (ADJ)     1
#> 28:             BMDL (10 HED)     1
#> 29:             BMDL (05 HED)     1
#>                    toxvalType     N
natadb_hazard[ supercategory %in% 'Point of Departure', .N, 
               by = .(toxvalType)][order(-N),]
#>                    toxvalType     N
#>                        <char> <int>
#>  1:                      NOEC  7329
#>  2:                      LOEC  4412
#>  3:                     NOAEL  3150
#>  4:                     LOAEL  1778
#>  5:                     NOAEC  1037
#>  6:                       LEL  1031
#>  7:                      NOEL   934
#>  8:                       NEL   668
#>  9:                      LOEL   638
#> 10:                     LOAEC   492
#> 11:                      BMDL   360
#> 12:                       BMD   245
#> 13:                      HNEL   219
#> 14:                 BMDL (10)    51
#> 15:                      BMCL    50
#> 16:                       LEC    43
#> 17:             BMCL (10 HEC)    13
#> 18: POD (screening subchronic    11
#> 19:             BMDL (10 HED)    10
#> 20:                 BMCL (10)     9
#> 21:   POD (screening chronic)     9
#> 22:                       BMC     7
#> 23:                BMDL (1SD)     6
#> 24:                 BMDL (05)     5
#> 25:                  BMC (20)     3
#> 26:              BMC (10 HEC)     3
#> 27:               LOAEL (HED)     3
#> 28:                       NEC     2
#> 29:                BMDL (HED)     2
#> 30:             BMDL (05 HEC)     1
#> 31:              BMD (2x-ADJ)     1
#> 32:                BMCL (HEC)     1
#> 33:        BMCL (Adjusted HE)     1
#>                    toxvalType     N

We explore “NOEC” and “LOEC” further. Let us look at the the case when media value is either “salt water” or “fresh water”. For this, we will recover the minimum value of “NOEC” and “LOEC” for each chemical in each data set.

First, we look at soil. We order by toxvalType and by the minimum toxvalNumeric value in each group, descending.

ccl4_hazard[media %in% 'salt water' & toxvalType %in% c('LOEC', 'NOEC'), 
            .(toxvalNumeric = min(toxvalNumeric)), 
            by = .(toxvalType, toxvalUnits, dtxsid)][order(toxvalType,
                                                           -toxvalNumeric)]
#>     toxvalType toxvalUnits        dtxsid toxvalNumeric
#>         <char>      <char>        <char>         <num>
#>  1:       LOEC       mg/m3 DTXSID2021731   1.18556e+07
#>  2:       LOEC       mg/m3 DTXSID8020597   5.00000e+06
#>  3:       LOEC       mg/m3 DTXSID4022991   1.51000e+04
#>  4:       LOEC       mg/m3 DTXSID1021740   1.00000e+04
#>  5:       LOEC       mg/m3 DTXSID9024142   1.90000e+03
#>  6:       LOEC       mg/m3 DTXSID8023846   1.40000e+03
#>  7:       LOEC       mg/m3 DTXSID1024207   1.00000e+03
#>  8:       LOEC       mL/m3 DTXSID7020637   2.00000e+02
#>  9:       LOEC   mg/kg-day DTXSID8031865   1.00000e+02
#> 10:       LOEC       mg/m3 DTXSID3074313   9.37000e+01
#> 11:       LOEC         ppm DTXSID7020637   5.50000e+01
#> 12:       LOEC       mg/m3 DTXSID9032113   4.30000e+01
#> 13:       LOEC       mg/m3 DTXSID4034948   2.20000e+01
#> 14:       LOEC       mg/m3 DTXSID4022448   1.50000e+01
#> 15:       LOEC       % v/v DTXSID8020597   1.00000e+01
#> 16:       LOEC       mg/m3 DTXSID5020023   8.00000e+00
#> 17:       LOEC       mg/m3 DTXSID9023914   6.00000e+00
#> 18:       LOEC       mg/m3 DTXSID7020637   5.00000e+00
#> 19:       LOEC       % v/v DTXSID2021731   5.00000e+00
#> 20:       LOEC           % DTXSID2021731   4.00000e+00
#> 21:       LOEC       mg/m3 DTXSID3021857   1.00000e+00
#> 22:       LOEC       mg/m3 DTXSID6024177   1.00000e+00
#> 23:       LOEC   mg/kg-day DTXSID5020576   1.00000e+00
#> 24:       LOEC       mg/m3 DTXSID2040282   1.00000e+00
#> 25:       LOEC       mg/m3 DTXSID4032611   6.20000e-01
#> 26:       LOEC       mg/m3 DTXSID4022367   4.84000e-01
#> 27:       LOEC       mg/m3 DTXSID1024174   3.40000e-01
#> 28:       LOEC       mg/m3 DTXSID3032464   3.00000e-01
#> 29:       LOEC        nM/g DTXSID0020573   3.00000e-01
#> 30:       LOEC       mg/m3 DTXSID3031864   1.00000e-01
#> 31:       LOEC       mg/m3 DTXSID1021409   9.00000e-02
#> 32:       LOEC       mg/m3 DTXSID8031865   8.10000e-02
#> 33:       LOEC   mg/kg-day DTXSID3021857   5.00000e-02
#> 34:       LOEC       mg/m3 DTXSID8022292   5.00000e-02
#> 35:       LOEC       mg/m3 DTXSID4022361   3.00000e-02
#> 36:       LOEC        mL/L DTXSID8020597   2.60000e-02
#> 37:       LOEC       mg/m3 DTXSID0020446   1.00000e-02
#> 38:       LOEC   mg/kg-day DTXSID0020573   3.50000e-03
#> 39:       LOEC       mg/m3 DTXSID0020573   3.00000e-03
#> 40:       LOEC        ml/g DTXSID7020637   1.00000e-03
#> 41:       LOEC       mg/m3 DTXSID5020576   1.00000e-03
#> 42:       LOEC   mg/kg-day DTXSID3032464   3.50000e-04
#> 43:       LOEC           L DTXSID0020573   2.00000e-04
#> 44:       NOEC       mg/m3 DTXSID1024207   1.00000e+05
#> 45:       NOEC       mg/m3 DTXSID2040282   3.00000e+04
#> 46:       NOEC       mg/m3 DTXSID1021740   1.00000e+04
#> 47:       NOEC       mg/m3 DTXSID4022991   4.90000e+03
#> 48:       NOEC       mg/m3 DTXSID9024142   1.90000e+03
#> 49:       NOEC       mg/m3 DTXSID8025541   1.00000e+03
#> 50:       NOEC       mg/m3 DTXSID8023846   5.80000e+02
#> 51:       NOEC       mg/m3 DTXSID3020964   3.60000e+02
#> 52:       NOEC       mL/m3 DTXSID7020637   2.00000e+02
#> 53:       NOEC       mg/m3 DTXSID2021731   1.00000e+02
#> 54:       NOEC   mg/kg-day DTXSID8031865   1.00000e+02
#> 55:       NOEC       mg/m3 DTXSID3074313   9.37000e+01
#> 56:       NOEC   mg/kg-day DTXSID3021857   2.50000e+01
#> 57:       NOEC       mg/m3 DTXSID4034948   2.20000e+01
#> 58:       NOEC         ppm DTXSID7020637   2.00000e+01
#> 59:       NOEC       mg/m3 DTXSID9032113   1.90000e+01
#> 60:       NOEC       mg/m3 DTXSID6024177   1.00000e+01
#> 61:       NOEC       mg/m3 DTXSID5020023   8.00000e+00
#> 62:       NOEC       mg/m3 DTXSID7020637   5.00000e+00
#> 63:       NOEC       % v/v DTXSID8020597   5.00000e+00
#> 64:       NOEC       mg/m3 DTXSID9020243   3.10000e+00
#> 65:       NOEC       mg/m3 DTXSID9023914   3.00000e+00
#> 66:       NOEC   mg/kg-day DTXSID4022367   2.50000e+00
#> 67:       NOEC       mg/m3 DTXSID4022448   2.00000e+00
#> 68:       NOEC       % sat DTXSID7020637   2.00000e+00
#> 69:       NOEC   mg/kg-day DTXSID5020576   1.00000e+00
#> 70:       NOEC        nM/g DTXSID0020573   5.00000e-01
#> 71:       NOEC       mg/m3 DTXSID4032611   3.60000e-01
#> 72:       NOEC       mg/m3 DTXSID1024174   3.40000e-01
#> 73:       NOEC       mg/m3 DTXSID4022361   3.00000e-01
#> 74:       NOEC       mg/m3 DTXSID0021464   2.65000e-01
#> 75:       NOEC       mg/m3 DTXSID3021857   1.80000e-01
#> 76:       NOEC       mg/m3 DTXSID4022367   1.00000e-01
#> 77:       NOEC       mg/m3 DTXSID3031864   1.00000e-01
#> 78:       NOEC       mg/m3 DTXSID3032464   1.00000e-01
#> 79:       NOEC       mg/m3 DTXSID1021409   4.70000e-02
#> 80:       NOEC       mg/m3 DTXSID8022292   2.50000e-02
#> 81:       NOEC       % v/v DTXSID2021731   2.00000e-02
#> 82:       NOEC       mg/m3 DTXSID8031865   1.00000e-02
#> 83:       NOEC           % DTXSID2021731   1.00000e-02
#> 84:       NOEC       mg/m3 DTXSID0020446   2.00000e-03
#> 85:       NOEC       mg/m3 DTXSID0020573   1.00000e-03
#> 86:       NOEC        ml/g DTXSID7020637   1.00000e-03
#> 87:       NOEC         ppm DTXSID3020702   5.00000e-04
#> 88:       NOEC   mg/kg-day DTXSID3032464   2.20000e-04
#> 89:       NOEC           L DTXSID0020573   2.00000e-04
#> 90:       NOEC       mg/m3 DTXSID5020576   1.00000e-04
#> 91:       NOEC   mg/kg-day DTXSID0020573   4.00000e-05
#>     toxvalType toxvalUnits        dtxsid toxvalNumeric
natadb_hazard[media %in% 'salt water' & toxvalType %in% c('LOEC', 'NOEC'), 
              .(toxvalNumeric = min(toxvalNumeric)), 
              by = .(toxvalType, toxvalUnits, dtxsid)][order(toxvalType,
                                                             -toxvalNumeric)]
#>      toxvalType toxvalUnits        dtxsid toxvalNumeric
#>          <char>      <char>        <char>         <num>
#>   1:       LOEC       mg/m3 DTXSID2021731   1.18556e+07
#>   2:       LOEC       mg/m3 DTXSID8020597   5.00000e+06
#>   3:       LOEC       mg/m3 DTXSID7026156   5.08300e+05
#>   4:       LOEC       mg/m3 DTXSID8020759   1.56000e+05
#>   5:       LOEC       mg/m3 DTXSID3020415   2.50000e+04
#>  ---                                                   
#> 130:       NOEC       mg/m3 DTXSID9020827   5.00000e-03
#> 131:       NOEC       mg/m3 DTXSID7021100   3.10000e-03
#> 132:       NOEC        ml/g DTXSID7020637   1.00000e-03
#> 133:       NOEC         ppm DTXSID3020702   5.00000e-04
#> 134:       NOEC       mg/m3 DTXSID5020607   1.00000e-04

Next we look at fresh water, repeating the same grouping and ordering as in the previous case.

ccl4_hazard[media %in% 'fresh water' & toxvalType %in% c('LOEC', 'NOEC'), 
            .(toxvalNumeric = min(toxvalNumeric)), 
            by = .(toxvalType, toxvalUnits, dtxsid)][order(toxvalType,
                                                           -toxvalNumeric)]
#>      toxvalType toxvalUnits          dtxsid toxvalNumeric
#>          <char>      <char>          <char>         <num>
#>   1:       LOEC       mg/m3   DTXSID7020005   6.20000e+06
#>   2:       LOEC       mg/m3   DTXSID4020533   5.75000e+05
#>   3:       LOEC       mg/m3   DTXSID5024182   5.00000e+05
#>   4:       LOEC       mg/m3   DTXSID1020437   5.00000e+05
#>   5:       LOEC       mg/m3   DTXSID3020833   4.40748e+05
#>  ---                                                     
#> 185:       NOEC       % vol   DTXSID2021731   4.00000e-04
#> 186:       NOEC       mg/m3   DTXSID5020576   2.96409e-05
#> 187:       NOEC       mg/m3   DTXSID0020573   2.70000e-05
#> 188:       NOEC           L   DTXSID2021731   1.00000e-06
#> 189:       NOEC   mg/kg-day DTXSID001024118   6.00000e-08
natadb_hazard[media %in% 'fresh water' & toxvalType %in% c('LOEC', 'NOEC'), 
              .(toxvalNumeric = min(toxvalNumeric)), 
              by = .(toxvalType, toxvalUnits, dtxsid)][order(toxvalType,
                                                             -toxvalNumeric)]
#>      toxvalType toxvalUnits        dtxsid toxvalNumeric
#>          <char>      <char>        <char>         <num>
#>   1:       LOEC       mg/m3 DTXSID7020005     6.200e+06
#>   2:       LOEC       mg/m3 DTXSID6020515     3.000e+06
#>   3:       LOEC       mg/m3 DTXSID6020981     6.125e+05
#>   4:       LOEC       mg/m3 DTXSID4020533     5.750e+05
#>   5:       LOEC       mg/m3 DTXSID1020437     5.000e+05
#>  ---                                                   
#> 284:       NOEC           % DTXSID2021731     5.000e-04
#> 285:       NOEC       % vol DTXSID2021731     4.000e-04
#> 286:       NOEC           L DTXSID0021383     2.340e-04
#> 287:       NOEC           L DTXSID2021319     2.300e-05
#> 288:       NOEC           L DTXSID2021731     1.000e-06

Now, let us restrict our attention to human hazard and focus on the exposure routes given by inhalation and oral.

First, let us determine the exposure routes in general.

ccl4_hazard[humanEcoNt %in% 'human health', unique(exposureRoute)]
#>  [1] "oral"               "dermal"             "inhalation"         "radiation"         
#>  [5] "-"                  "soil"               "in vitro"           "intraperitoneal"   
#>  [9] "inhalation, dermal" "subcutaneous"       "intramuscular"      "injection"         
#> [13] "implant"            "minipump"           "intraveneous"
natadb_hazard[humanEcoNt %in% 'human health', unique(exposureRoute)]
#>  [1] "dermal"                        "inhalation"                    "oral"                         
#>  [4] "-"                             "soil"                          "tdermal"                      
#>  [7] "injection"                     "subcutaneous"                  "intravenous"                  
#> [10] "intraperitoneal, subcutaneous" "inhalation, dermal"            "other"

Then, let’s focus on the inhalation and oral exposure routes for human hazard.

To answer this, filter the data into the corresponding exposure routes, then group by exposureRoute and riskAssessmentClass, and finally count the number of instances for each grouping. To determine the most represented class, one can order the results descending.

ccl4_hazard[humanEcoNt %in% 'human health' & 
              exposureRoute %in% c('inhalation', 'oral'), .(Hits = .N), 
            by = .(exposureRoute, riskAssessmentClass)][order(exposureRoute, 
                                                              -Hits)]
#>     exposureRoute       riskAssessmentClass  Hits
#>            <char>                    <char> <int>
#>  1:    inhalation                     acute   705
#>  2:    inhalation      air quality standard   580
#>  3:    inhalation                   chronic   489
#>  4:    inhalation                subchronic   321
#>  5:    inhalation                short-term   236
#>  6:    inhalation              reproduction   110
#>  7:    inhalation    water quality standard    90
#>  8:    inhalation             developmental    41
#>  9:    inhalation             neurotoxicity    34
#> 10:    inhalation         repeat dose other    32
#> 11:    inhalation            immunotoxicity    14
#> 12:    inhalation                  clinical     4
#> 13:          oral                   chronic  2332
#> 14:          oral                subchronic  1081
#> 15:          oral              reproduction   840
#> 16:          oral                     acute   759
#> 17:          oral                short-term   512
#> 18:          oral             developmental   451
#> 19:          oral    water quality standard   332
#> 20:          oral              uterotrophic    78
#> 21:          oral   drinking water standard    62
#> 22:          oral             neurotoxicity    55
#> 23:          oral               Hershberger    38
#> 24:          oral            exposure limit    20
#> 25:          oral         repeat dose other    16
#> 26:          oral  special toxicology study    13
#> 27:          oral                  clinical     8
#> 28:          oral  neurotoxicity subchronic     4
#> 29:          oral            immunotoxicity     4
#> 30:          oral            dose selection     3
#> 31:          oral reproduction developmenta     3
#> 32:          oral                     human     2
#> 33:          oral  neurotoxicity short-term     1
#> 34:          oral       neurotoxicity acute     1
#>     exposureRoute       riskAssessmentClass  Hits
natadb_hazard[humanEcoNt %in% 'human health' & 
                exposureRoute %in% c('inhalation', 'oral'), .(Hits = .N), 
              by = .(exposureRoute, riskAssessmentClass)][order(exposureRoute,
                                                                -Hits)]
#>     exposureRoute       riskAssessmentClass  Hits
#>            <char>                    <char> <int>
#>  1:    inhalation                     acute  2125
#>  2:    inhalation      air quality standard  1865
#>  3:    inhalation                   chronic  1665
#>  4:    inhalation                subchronic   738
#>  5:    inhalation                short-term   481
#>  6:    inhalation    water quality standard   177
#>  7:    inhalation              reproduction   150
#>  8:    inhalation         repeat dose other    83
#>  9:    inhalation             developmental    74
#> 10:    inhalation             neurotoxicity    73
#> 11:    inhalation            immunotoxicity    23
#> 12:    inhalation                  clinical    12
#> 13:    inhalation reproduction developmenta     5
#> 14:    inhalation               repeat dose     2
#> 15:          oral                   chronic  3705
#> 16:          oral                     acute  1746
#> 17:          oral                subchronic  1512
#> 18:          oral    water quality standard  1043
#> 19:          oral                short-term   598
#> 20:          oral             developmental   289
#> 21:          oral   drinking water standard   226
#> 22:          oral              reproduction   199
#> 23:          oral         repeat dose other    49
#> 24:          oral             neurotoxicity    41
#> 25:          oral              uterotrophic    27
#> 26:          oral  special toxicology study    20
#> 27:          oral            exposure limit    12
#> 28:          oral            dose selection    11
#> 29:          oral               Hershberger    11
#> 30:          oral                  clinical     8
#> 31:          oral  neurotoxicity subchronic     6
#> 32:          oral            immunotoxicity     6
#> 33:          oral  neurotoxicity short-term     5
#> 34:          oral reproduction developmenta     4
#> 35:          oral       neurotoxicity acute     2
#> 36:          oral                 mortality     2
#> 37:          oral               repeat dose     2
#> 38:          oral      air quality standard     1
#>     exposureRoute       riskAssessmentClass  Hits

Answer to Environmental Health Question 6

With these results we may answer Environmental Health Question 6: After pulling the hazard data for the different data sets, list the different exposure routes for which there is data. What are the unique risk assessment classes for hazard values for the oral route and for the inhalation exposure route? For each such exposure route, which risk assessment class is most represented by the data sets?

Answer: We listed the general exposure routes above for the hazard data associated with the chemicals in each data set. Restricting our attention to human hazard data, the “acute” riskAssessmentClass is most represented by the inhalation exposure route and “chronic” for the oral exposure route for both the CCL4 and NATADB data sets.

We now drill down a little further before moving into a different path for data exploration. We explore the different types of toxicity values present in each data set for the inhalation and oral exposure routes, and then see which of these are common to both exposure routes for each data set.

To answer this, we filter the rows to the “human health” humanEcoNT value and “inhalation” or “oral” exposureRoute value. Then we return the unique values that toxvalType takes.

First we look at CCL4.

ccl4_hazard[humanEcoNt %in% 'human health' &
              exposureRoute %in% c('inhalation'), unique(toxvalType)]
#>  [1] "cancer unit risk"               "RfC"                           
#>  [3] "RfD"                            "cancer slope factor"           
#>  [5] "NOAEL"                          "MRL"                           
#>  [7] "OEHHA MADL"                     "REL"                           
#>  [9] "OEHHA NSRL"                     "LC50"                          
#> [11] "MEG"                            "PAC-2"                         
#> [13] "PAC-1"                          "PAC-3"                         
#> [15] "-"                              "NOAEC"                         
#> [17] "LOAEC"                          "NOEC"                          
#> [19] "LOEL"                           "LC100"                         
#> [21] "LC0"                            "LCLo"                          
#> [23] "TClo"                           "LEC"                           
#> [25] "LOAEL"                          "LC"                            
#> [27] "BMCL (10)"                      "NEL"                           
#> [29] "NOEL"                           "LT50"                          
#> [31] "IC50"                           "LOEC"                          
#> [33] "RD50"                           "LEL"                           
#> [35] "MT0"                            "LD"                            
#> [37] "AEGL"                           "tolerable concentration in air"
#> [39] "BMDL (1SD)"                     "BMCL (HEC)"                    
#> [41] "BMCL (10 HEC)"                  "IDLH"                          
#> [43] "air contaminant limit"          "medium spec. conc."            
#> [45] "POD (screening chronic)"        "POD (screening subchronic)"    
#> [47] "BMDL"                           "BMCL"                          
#> [49] "SRfCi"                          "BMD"
ccl4_hazard[humanEcoNt %in% 'human health' &
              exposureRoute %in% c('oral'), unique(toxvalType)]
#>  [1] "cancer slope factor"        "RfD"                        "NOAEL"                     
#>  [4] "MRL"                        "LOAEL"                      "MCL California"            
#>  [7] "OEHHA PHG"                  "OEHHA MADL"                 "MCL"                       
#> [10] "REL"                        "LD50"                       "BMDL"                      
#> [13] "LEL"                        "HNEL"                       "MEG"                       
#> [16] "LD0"                        "-"                          "NOEL"                      
#> [19] "LOEL"                       "LD100"                      "ALD"                       
#> [22] "LEC"                        "BMDL (10)"                  "ND50"                      
#> [25] "TClo"                       "UL"                         "TDI"                       
#> [28] "BMDL (ADJ)"                 "BMDL (05)"                  "BMDL (10 HED)"             
#> [31] "BMDL (1SD)"                 "BMDL (05 HED)"              "health advisory"           
#> [34] "DWEL"                       "medium spec. conc."         "POD (screening subchronic)"
#> [37] "POD (screening chronic)"    "SRfDo"                      "NEL"                       
#> [40] "HBSL"                       "HHBP"                       "BMD"
intersect(ccl4_hazard[humanEcoNt %in% 'human health' & exposureRoute %in% 'inhalation', unique(toxvalType)], ccl4_hazard[humanEcoNt %in% 'human health' & exposureRoute %in% 'oral', unique(toxvalType)])
#>  [1] "RfD"                        "cancer slope factor"        "NOAEL"                     
#>  [4] "MRL"                        "OEHHA MADL"                 "REL"                       
#>  [7] "MEG"                        "-"                          "LOEL"                      
#> [10] "TClo"                       "LEC"                        "LOAEL"                     
#> [13] "NEL"                        "NOEL"                       "LEL"                       
#> [16] "BMDL (1SD)"                 "medium spec. conc."         "POD (screening chronic)"   
#> [19] "POD (screening subchronic)" "BMDL"                       "BMD"

Then we look at NATADB.

natadb_hazard[humanEcoNt %in% 'human health' & 
                exposureRoute %in% c('inhalation'), unique(toxvalType)]
#>  [1] "cancer unit risk"                          "cancer slope factor"                      
#>  [3] "RfD"                                       "RfC"                                      
#>  [5] "NOAEL"                                     "MRL"                                      
#>  [7] "OEHHA MADL"                                "REL"                                      
#>  [9] "OEHHA NSRL"                                "LC50"                                     
#> [11] "MEG"                                       "PAC-2"                                    
#> [13] "PAC-3"                                     "PAC-1"                                    
#> [15] "LOAEC"                                     "LOEC"                                     
#> [17] "NOEC"                                      "NOAEC"                                    
#> [19] "NOEL"                                      "-"                                        
#> [21] "LOEL"                                      "LC0"                                      
#> [23] "LC100"                                     "TClo"                                     
#> [25] "LOAEL"                                     "LCLo"                                     
#> [27] "LT50"                                      "LEL"                                      
#> [29] "TDLo"                                      "LC95"                                     
#> [31] "ID50"                                      "LEC"                                      
#> [33] "MT0"                                       "T25"                                      
#> [35] "IC50"                                      "LT0"                                      
#> [37] "NEL"                                       "RD50"                                     
#> [39] "EC50"                                      "BMDL (10)"                                
#> [41] "LC80"                                      "LD50"                                     
#> [43] "LC10"                                      "LD"                                       
#> [45] "LD100"                                     "LD0"                                      
#> [47] "LC"                                        "LD67"                                     
#> [49] "LC30"                                      "BMC"                                      
#> [51] "LT100"                                     "SCE50"                                    
#> [53] "LD75"                                      "RD10"                                     
#> [55] "LT"                                        "LC60"                                     
#> [57] "ALC"                                       "AEGL"                                     
#> [59] "tolerable concentration in air"            "BMDL (10 HED)"                            
#> [61] "BMCL (10 HEC)"                             "BMC (10 HEC)"                             
#> [63] "BMDL (05 HEC)"                             "BMDL (HED)"                               
#> [65] "BMDL (1SD)"                                "BMCL"                                     
#> [67] "BMCL (HEC)"                                "IDLH"                                     
#> [69] "air contaminant limit"                     "medium spec. conc."                       
#> [71] "POD (screening chronic)"                   "cancer unit risk (provisional, screening)"
#> [73] "POD (screening subchronic)"                "BMCL (Adjusted HE)"                       
#> [75] "BMDL"                                      "LOAEL (HED)"                              
#> [77] "BMCL (10)"                                 "SRfCi"                                    
#> [79] "BMD"
natadb_hazard[humanEcoNt %in% 'human health' & 
                exposureRoute %in% c('oral'), unique(toxvalType)]
#>  [1] "RfD"                            "cancer slope factor"           
#>  [3] "NOAEL"                          "MRL"                           
#>  [5] "MCL California"                 "MCL Federal"                   
#>  [7] "OEHHA PHG"                      "MCL"                           
#>  [9] "REL"                            "OEHHA MADL"                    
#> [11] "OEHHA NSRL"                     "LD50"                          
#> [13] "BMDL"                           "LEL"                           
#> [15] "HNEL"                           "LOAEL"                         
#> [17] "MEG"                            "NOEL"                          
#> [19] "NOAEC"                          "LD100"                         
#> [21] "LEC"                            "LOEL"                          
#> [23] "-"                              "ALD"                           
#> [25] "LD0"                            "TClo"                          
#> [27] "LC50"                           "LOAEC"                         
#> [29] "TDLo"                           "LD37"                          
#> [31] "% mortality"                    "T25"                           
#> [33] "NEL"                            "BMDL (10)"                     
#> [35] "LT50"                           "UL"                            
#> [37] "TDI"                            "tolerable concentration in air"
#> [39] "ADI"                            "BMDL (1SD)"                    
#> [41] "BMD (2x-ADJ)"                   "BMDL (HED)"                    
#> [43] "BMDL (10 HED)"                  "BMDL (05)"                     
#> [45] "DWEL"                           "health advisory"               
#> [47] "medium spec. conc."             "POD (screening subchronic)"    
#> [49] "POD (screening chronic)"        "BMC"                           
#> [51] "SRfDo"                          "HBSL"                          
#> [53] "HHBP"                           "BMD"
intersect(natadb_hazard[humanEcoNt %in% 'human health' & exposureRoute %in% 'inhalation', unique(toxvalType)], natadb_hazard[humanEcoNt %in% 'human health' & exposureRoute %in% 'oral', unique(toxvalType)])
#>  [1] "cancer slope factor"            "RfD"                           
#>  [3] "NOAEL"                          "MRL"                           
#>  [5] "OEHHA MADL"                     "REL"                           
#>  [7] "OEHHA NSRL"                     "LC50"                          
#>  [9] "MEG"                            "LOAEC"                         
#> [11] "NOAEC"                          "NOEL"                          
#> [13] "-"                              "LOEL"                          
#> [15] "TClo"                           "LOAEL"                         
#> [17] "LT50"                           "LEL"                           
#> [19] "TDLo"                           "LEC"                           
#> [21] "T25"                            "NEL"                           
#> [23] "BMDL (10)"                      "LD50"                          
#> [25] "LD100"                          "LD0"                           
#> [27] "BMC"                            "tolerable concentration in air"
#> [29] "BMDL (10 HED)"                  "BMDL (HED)"                    
#> [31] "BMDL (1SD)"                     "medium spec. conc."            
#> [33] "POD (screening chronic)"        "POD (screening subchronic)"    
#> [35] "BMDL"                           "BMD"

Answer to Environmental Health Question 7

With the results above, we may answer Environmental Health Question 7: There are several types of toxicity values for each exposure route. List the unique toxicity values for the oral and inhalation routes. What are the unique types of toxicity values for the oral route and for the inhalation route? How many of these are common to both the oral and inhalation routes for each data set?

Answer: There are 21 toxicity value types shared between the oral and inhalation exposure routes for CCL4 and 36 for NATADB. The lists above indicate the variety of toxicity values present in the hazard data for the two different exposure routes we have considered.

For the next data exploration, We will turn to the riskAssessmentClass value of “developmental”. We will examine the “NOAEL” and “LOAEL” values for chemicals with oral exposure, human hazard, and a riskAssessmentClass value of “developmental”. We also examine the units to determine whether any unit conversions are necessary to compare numeric values.

ccl4_hazard[humanEcoNt %in% 'human health' & exposureRoute %in% 'oral' &
              riskAssessmentClass %in% 'developmental' & 
              toxvalType %in% c('NOAEL', 'LOAEL'), ]
#>          id     source   year        dtxsid exposureRoute toxvalNumeric toxvalNumericQualifier
#>       <int>     <char> <char>        <char>        <char>         <num>                 <char>
#>   1:  69722 ATSDR PFAS   2010 DTXSID8031865          oral           5.0                      =
#>   2:  70774 ATSDR PFAS   2009 DTXSID8031865          oral           8.7                      =
#>   3:  71567 ATSDR PFAS   1982 DTXSID8031865          oral          50.0                      =
#>   4:  72842 ATSDR PFAS   2011 DTXSID8031865          oral           0.3                      =
#>   5:  75135 ATSDR PFAS   2010 DTXSID8031865          oral           5.0                      =
#>  ---                                                                                          
#> 218: 293837   ToxRefDB   1985 DTXSID9032329          oral          80.0                      =
#> 219: 124993   ToxRefDB   1996 DTXSID6024177          oral           0.2                      =
#> 220: 166218   ToxRefDB   1979 DTXSID0032578          oral         100.0                     >=
#> 221: 194758   ToxRefDB   1995 DTXSID4022448          oral         100.0                      =
#> 222: 195159   ToxRefDB   1985 DTXSID4022448          oral        1000.0                      =
#> 26 variable(s) not shown: [toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, exposureForm <char>, ...]
ccl4_hazard[humanEcoNt %in% 'human health' & exposureRoute %in% 'oral' &
              riskAssessmentClass %in% 'developmental' &
              toxvalType %in% c('NOAEL', 'LOAEL'), unique(toxvalUnits)]
#> [1] "mg/kg-day" "mg/kg"     "-"
natadb_hazard[humanEcoNt %in% 'human health' & exposureRoute %in% 'oral' &
                riskAssessmentClass %in% 'developmental' &
                toxvalType %in% c('NOAEL', 'LOAEL'), ]
#>          id   source   year        dtxsid exposureRoute toxvalNumeric toxvalNumericQualifier
#>       <int>   <char> <char>        <char>        <char>         <num>                 <char>
#>   1:  86910    EFSA2      - DTXSID8020913          oral         400.0                      =
#>   2:  86732    EFSA2      - DTXSID8020913          oral         120.0                      =
#>   3:  87341    EFSA2      - DTXSID8020913          oral          50.0                      =
#>   4: 149371    HPVIS      - DTXSID4020161          oral         500.0                      =
#>   5:  69436    HPVIS      - DTXSID0021834          oral          27.6                      =
#>  ---                                                                                        
#> 138: 330469 ToxRefDB   1984 DTXSID4021395          oral         500.0                     >=
#> 139: 330750 ToxRefDB   1984 DTXSID4021395          oral         100.0                      =
#> 140: 229449 ToxRefDB   1993 DTXSID2021781          oral         630.0                      =
#> 141: 330299 ToxRefDB   1983 DTXSID4021395          oral         500.0                     >=
#> 142: 330226 ToxRefDB   1983 DTXSID4021395          oral         500.0                      =
#> 26 variable(s) not shown: [toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, exposureForm <char>, ...]
natadb_hazard[humanEcoNt %in% 'human health' & exposureRoute %in% 'oral' &
                riskAssessmentClass %in% 'developmental' & 
                toxvalType %in% c('NOAEL', 'LOAEL'), unique(toxvalUnits)]
#> [1] "mg/kg-day" "mg/kg"     "-"

Observe that for both CCL4 and NATADB, the units are given by “mg/kg-day”, “mg/kg” and “-”. In this case, we treat “mg/kg-day” and “mg/kg” the same and exclude “-”. We group by DTXSID to find the lowest or highest value.

ccl4_hazard[humanEcoNt %in% 'human health' & exposureRoute %in% 'oral' &
            riskAssessmentClass %in% 'developmental' & 
            toxvalType %in% c('NOAEL', 'LOAEL') & (toxvalUnits != '-'),
            .(numeric_value = min(toxvalNumeric), 
            units = toxvalUnits[[which.min(toxvalNumeric)]]), 
            by = .(dtxsid, toxvalType)]
#>            dtxsid toxvalType numeric_value     units
#>            <char>     <char>         <num>    <char>
#>  1: DTXSID8031865      LOAEL         0.300 mg/kg-day
#>  2: DTXSID8031865      NOAEL         1.000 mg/kg-day
#>  3: DTXSID3031864      LOAEL         0.800 mg/kg-day
#>  4: DTXSID3031864      NOAEL         0.100 mg/kg-day
#>  5: DTXSID1024174      NOAEL         1.000     mg/kg
#>  6: DTXSID7021029      LOAEL         0.025 mg/kg-day
#>  7: DTXSID8023846      NOAEL         3.000 mg/kg-day
#>  8: DTXSID3032464      NOAEL        30.000 mg/kg-day
#>  9: DTXSID8022292      LOAEL       150.000 mg/kg-day
#> 10: DTXSID8022292      NOAEL        50.000 mg/kg-day
#> 11: DTXSID9020243      NOAEL        10.000 mg/kg-day
#> 12: DTXSID4022361      NOAEL        50.000 mg/kg-day
#> 13: DTXSID9023914      LOAEL         1.000 mg/kg-day
#> 14: DTXSID9020243      LOAEL        30.000 mg/kg-day
#> 15: DTXSID6024177      NOAEL         0.200 mg/kg-day
#> 16: DTXSID7024241      NOAEL        10.000 mg/kg-day
#> 17: DTXSID8023846      LOAEL        10.000 mg/kg-day
#> 18: DTXSID7024241      LOAEL        30.000 mg/kg-day
#> 19: DTXSID8025541      NOAEL         0.100 mg/kg-day
#> 20: DTXSID8025541      LOAEL         0.100 mg/kg-day
#> 21: DTXSID1024338      LOAEL         6.000 mg/kg-day
#> 22: DTXSID9032329      NOAEL         5.500 mg/kg-day
#> 23: DTXSID9032329      LOAEL        23.000 mg/kg-day
#> 24: DTXSID6024177      LOAEL         0.650 mg/kg-day
#> 25: DTXSID1024174      LOAEL         7.000 mg/kg-day
#> 26: DTXSID8023848      LOAEL       190.000 mg/kg-day
#> 27: DTXSID4022448      NOAEL        20.000 mg/kg-day
#> 28: DTXSID0021464      LOAEL         7.500 mg/kg-day
#> 29: DTXSID4022448      LOAEL       100.000 mg/kg-day
#> 30: DTXSID3034458      NOAEL        25.000 mg/kg-day
#> 31: DTXSID3034458      LOAEL       100.000 mg/kg-day
#> 32: DTXSID9023914      NOAEL         0.500 mg/kg-day
#> 33: DTXSID1021409      LOAEL         0.300 mg/kg-day
#> 34: DTXSID4022361      LOAEL       200.000 mg/kg-day
#> 35: DTXSID0020446      LOAEL        50.000 mg/kg-day
#> 36: DTXSID0032578      LOAEL         1.000 mg/kg-day
#> 37: DTXSID0032578      NOAEL         0.500 mg/kg-day
#> 38: DTXSID1021409      NOAEL         0.100 mg/kg-day
#> 39: DTXSID1024338      NOAEL         2.000 mg/kg-day
#> 40: DTXSID3032464      LOAEL        60.000 mg/kg-day
#> 41: DTXSID0021464      NOAEL         3.000 mg/kg-day
#> 42: DTXSID4032611      LOAEL         9.000 mg/kg-day
#> 43: DTXSID8023848      NOAEL        50.000 mg/kg-day
#> 44: DTXSID0020446      NOAEL        10.000 mg/kg-day
#> 45: DTXSID4032611      NOAEL         2.000 mg/kg-day
#>            dtxsid toxvalType numeric_value     units
natadb_hazard[humanEcoNt %in% 'human health' & exposureRoute %in% 'oral' &
              riskAssessmentClass %in% 'developmental' & 
              toxvalType %in% c('NOAEL', 'LOAEL') & (toxvalUnits != '-'), 
              .(numeric_value = min(toxvalNumeric), 
              units = toxvalUnits[[which.min(toxvalNumeric)]]), 
              by = .(dtxsid, toxvalType)]
#>            dtxsid toxvalType numeric_value     units
#>            <char>     <char>         <num>    <char>
#>  1: DTXSID8020913      NOAEL        50.000 mg/kg-day
#>  2: DTXSID4020161      NOAEL       500.000     mg/kg
#>  3: DTXSID0021834      NOAEL        13.800     mg/kg
#>  4: DTXSID0020448      NOAEL        30.000     mg/kg
#>  5: DTXSID3022455      NOAEL       840.000     mg/kg
#>  6: DTXSID5020027      NOAEL         2.500     mg/kg
#>  7: DTXSID7021318      NOAEL       987.000     mg/kg
#>  8: DTXSID6020515      NOAEL        44.100     mg/kg
#>  9: DTXSID7021029      LOAEL         0.025 mg/kg-day
#> 10: DTXSID4020874      NOAEL         1.000 mg/kg-day
#> 11: DTXSID4021395      NOAEL       100.000 mg/kg-day
#> 12: DTXSID2021781      LOAEL       630.000 mg/kg-day
#> 13: DTXSID4021395      LOAEL       225.000 mg/kg-day
#> 14: DTXSID2021781      NOAEL       500.000 mg/kg-day
#> 15: DTXSID7021106      LOAEL        30.000 mg/kg-day
#> 16: DTXSID9020247      NOAEL         4.000 mg/kg-day
#> 17: DTXSID7021106      NOAEL        15.000 mg/kg-day
#> 18: DTXSID7021948      LOAEL         9.000 mg/kg-day
#> 19: DTXSID5020607      LOAEL       666.000 mg/kg-day
#> 20: DTXSID5020449      NOAEL         0.100 mg/kg-day
#> 21: DTXSID5020449      LOAEL         2.500 mg/kg-day
#> 22: DTXSID3022455      LOAEL      3570.000 mg/kg-day
#> 23: DTXSID2021105      LOAEL       125.000 mg/kg-day
#> 24: DTXSID2021105      NOAEL        12.500 mg/kg-day
#> 25: DTXSID9020247      LOAEL        30.000 mg/kg-day
#> 26: DTXSID5020607      NOAEL       357.000 mg/kg-day
#> 27: DTXSID7021948      NOAEL         3.000 mg/kg-day
#> 28: DTXSID8020913      LOAEL        50.000 mg/kg-day
#> 29: DTXSID9020243      NOAEL        10.000 mg/kg-day
#> 30: DTXSID9020243      LOAEL        30.000 mg/kg-day
#>            dtxsid toxvalType numeric_value     units

Now, we also explore the values of “RfD”, “RfC”, and “cancer slope factor” of the toxvalType rows. We first determine the set of units for each, make appropriate conversions if necessary, and then make comparisons.

ccl4_hazard[humanEcoNt %in% 'human health' & toxvalType %in% 
            c('cancer slope factor', 'RfD', 'RfC'), .N, 
            by = .(toxvalType, toxvalUnits)][order(toxvalType, -N)]
#>             toxvalType   toxvalUnits     N
#>                 <char>        <char> <int>
#> 1:                 RfC         mg/m3   178
#> 2:                 RfC          g/m3     2
#> 3:                 RfC         ug/m3     2
#> 4:                 RfD     mg/kg-day   300
#> 5:                 RfD         mg/kg     1
#> 6: cancer slope factor (mg/kg-day)-1   149
#> 7: cancer slope factor     mg/kg-day     1
natadb_hazard[humanEcoNt %in% 'human health' & toxvalType %in%
              c('cancer slope factor', 'RfD', 'RfC'), .N, 
              by = .(toxvalType, toxvalUnits)][order(toxvalType, -N)]
#>             toxvalType   toxvalUnits     N
#>                 <char>        <char> <int>
#> 1:                 RfC         mg/m3   476
#> 2:                 RfC           ppm     2
#> 3:                 RfD     mg/kg-day   776
#> 4:                 RfD         mg/kg     1
#> 5: cancer slope factor (mg/kg-day)-1   460
#> 6: cancer slope factor     mg/kg-day     2

For CCL4, there are three inequivalent sets of units that need conversions. We convert to “mg/m3”, which means scaling values given in “g/m3” by 1E3 and values given in “ug/m3” by 1E-3. The Rfd units and cancer slope factor units require no conversions although we do remove the cancer slope factor unit of “mg/kg-day”.

For NATADB, we need to convert RfC values from ppm to mg/m3, with a conversion factor that relies on the molecular weight of the chemical in question. We will remove these from consideration for now. The units for RfD and cancer slope factor require no conversions although we do remove the cancer slope factor unit of “mg/kg-day”.

First, we filter and separate out the relevant data subsets.

# Separate out into relevant data subsets
ccl4_csf <- ccl4_hazard[humanEcoNt %in% 'human health' & 
                          toxvalType %in% c('cancer slope factor') & (toxvalUnits != 'mg/kg-day'), ]
ccl4_rfc <- ccl4_hazard[humanEcoNt %in% 'human health' & 
                          toxvalType %in% c('RfC'), ]
ccl4_rfd <- ccl4_hazard[humanEcoNt %in% 'human health' & 
                          toxvalType %in% c('RfD'), ]

Then we start to handle the unit conversions.

# Set mass by volume units to mg/m3, so scale g/m3 by 1E3 and ug/m3 by 1E-3
ccl4_rfc[toxvalUnits == 'mg/m3', conversion := 1]
#>          id     source   year        dtxsid exposureRoute toxvalNumeric toxvalNumericQualifier
#>       <int>     <char> <char>        <char>        <char>         <num>                 <char>
#>   1: 397149 Alaska DEC      - DTXSID0021541    inhalation       9.0e-02                      =
#>   2: 394651 Alaska DEC      - DTXSID3042219    inhalation       3.5e-02                      =
#>   3: 394579 Alaska DEC      - DTXSID3020964    inhalation       2.0e-03                      =
#>   4: 390390 Alaska DEC      - DTXSID8020832    inhalation       5.0e-03                      =
#>   5: 390677 Alaska DEC      - DTXSID3020833    inhalation       3.0e+00                      =
#>  ---                                                                                          
#> 178: 382357        RSL      - DTXSID2024169             -       5.0e-05                      =
#> 179: 395836        RSL      - DTXSID5021207             -       3.0e-02                      =
#> 180: 380861        RSL      - DTXSID2021731             -       2.0e+01                      =
#> 181: 391656        RSL      - DTXSID6022422             -       2.0e-02                      =
#> 182: 391637        RSL      - DTXSID6022422    inhalation       2.0e-02                      =
#> 27 variable(s) not shown: [toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, exposureForm <char>, ...]
ccl4_rfc[toxvalUnits == 'g/m3', conversion := 1E3]
#>          id     source   year        dtxsid exposureRoute toxvalNumeric toxvalNumericQualifier
#>       <int>     <char> <char>        <char>        <char>         <num>                 <char>
#>   1: 397149 Alaska DEC      - DTXSID0021541    inhalation       9.0e-02                      =
#>   2: 394651 Alaska DEC      - DTXSID3042219    inhalation       3.5e-02                      =
#>   3: 394579 Alaska DEC      - DTXSID3020964    inhalation       2.0e-03                      =
#>   4: 390390 Alaska DEC      - DTXSID8020832    inhalation       5.0e-03                      =
#>   5: 390677 Alaska DEC      - DTXSID3020833    inhalation       3.0e+00                      =
#>  ---                                                                                          
#> 178: 382357        RSL      - DTXSID2024169             -       5.0e-05                      =
#> 179: 395836        RSL      - DTXSID5021207             -       3.0e-02                      =
#> 180: 380861        RSL      - DTXSID2021731             -       2.0e+01                      =
#> 181: 391656        RSL      - DTXSID6022422             -       2.0e-02                      =
#> 182: 391637        RSL      - DTXSID6022422    inhalation       2.0e-02                      =
#> 27 variable(s) not shown: [toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, exposureForm <char>, ...]
ccl4_rfc[toxvalUnits == 'ug/m3', conversion := 1E-3]
#>          id     source   year        dtxsid exposureRoute toxvalNumeric toxvalNumericQualifier
#>       <int>     <char> <char>        <char>        <char>         <num>                 <char>
#>   1: 397149 Alaska DEC      - DTXSID0021541    inhalation       9.0e-02                      =
#>   2: 394651 Alaska DEC      - DTXSID3042219    inhalation       3.5e-02                      =
#>   3: 394579 Alaska DEC      - DTXSID3020964    inhalation       2.0e-03                      =
#>   4: 390390 Alaska DEC      - DTXSID8020832    inhalation       5.0e-03                      =
#>   5: 390677 Alaska DEC      - DTXSID3020833    inhalation       3.0e+00                      =
#>  ---                                                                                          
#> 178: 382357        RSL      - DTXSID2024169             -       5.0e-05                      =
#> 179: 395836        RSL      - DTXSID5021207             -       3.0e-02                      =
#> 180: 380861        RSL      - DTXSID2021731             -       2.0e+01                      =
#> 181: 391656        RSL      - DTXSID6022422             -       2.0e-02                      =
#> 182: 391637        RSL      - DTXSID6022422    inhalation       2.0e-02                      =
#> 27 variable(s) not shown: [toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, exposureForm <char>, ...]
ccl4_rfc[toxvalUnits %in% c('mg/m3', 'g/m3', 'ug/m3'), units := 'mg/m3']
#>          id     source   year        dtxsid exposureRoute toxvalNumeric toxvalNumericQualifier
#>       <int>     <char> <char>        <char>        <char>         <num>                 <char>
#>   1: 397149 Alaska DEC      - DTXSID0021541    inhalation       9.0e-02                      =
#>   2: 394651 Alaska DEC      - DTXSID3042219    inhalation       3.5e-02                      =
#>   3: 394579 Alaska DEC      - DTXSID3020964    inhalation       2.0e-03                      =
#>   4: 390390 Alaska DEC      - DTXSID8020832    inhalation       5.0e-03                      =
#>   5: 390677 Alaska DEC      - DTXSID3020833    inhalation       3.0e+00                      =
#>  ---                                                                                          
#> 178: 382357        RSL      - DTXSID2024169             -       5.0e-05                      =
#> 179: 395836        RSL      - DTXSID5021207             -       3.0e-02                      =
#> 180: 380861        RSL      - DTXSID2021731             -       2.0e+01                      =
#> 181: 391656        RSL      - DTXSID6022422             -       2.0e-02                      =
#> 182: 391637        RSL      - DTXSID6022422    inhalation       2.0e-02                      =
#> 28 variable(s) not shown: [toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, exposureForm <char>, ...]
# Set mass by mass units to mg/kg
ccl4_rfd[toxvalUnits %in% c('mg/kg-day', 'mg/kg'), conversion := 1]
#>          id     source   year        dtxsid exposureRoute toxvalNumeric toxvalNumericQualifier
#>       <int>     <char> <char>        <char>        <char>         <num>                 <char>
#>   1: 385099 Alaska DEC      - DTXSID6021030        dermal        0.0050                      =
#>   2: 376617 Alaska DEC      - DTXSID1021740        dermal        0.0500                      =
#>   3: 381896 Alaska DEC      - DTXSID8020597        dermal        1.0000                      =
#>   4: 385648 Alaska DEC      - DTXSID9024142        dermal        0.0030                      =
#>   5: 392604 Alaska DEC      - DTXSID1020437        dermal        0.2000                      =
#>  ---                                                                                          
#> 297: 387188        RSL      - DTXSID0024052             -        0.0220                      =
#> 298: 374437        RSL      - DTXSID4022448          oral        0.1500                      =
#> 299: 381916        RSL      - DTXSID2021731          oral        2.0000                      =
#> 300: 389619        RSL      - DTXSID1031040          oral        0.0003                      =
#> 301: 376317        RSL      - DTXSID1021740             -        0.1000                      =
#> 27 variable(s) not shown: [toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, exposureForm <char>, ...]
ccl4_rfd[toxvalUnits %in% c('mg/kg-day', 'mg/kg'), units := 'mg/kg']
#>          id     source   year        dtxsid exposureRoute toxvalNumeric toxvalNumericQualifier
#>       <int>     <char> <char>        <char>        <char>         <num>                 <char>
#>   1: 385099 Alaska DEC      - DTXSID6021030        dermal        0.0050                      =
#>   2: 376617 Alaska DEC      - DTXSID1021740        dermal        0.0500                      =
#>   3: 381896 Alaska DEC      - DTXSID8020597        dermal        1.0000                      =
#>   4: 385648 Alaska DEC      - DTXSID9024142        dermal        0.0030                      =
#>   5: 392604 Alaska DEC      - DTXSID1020437        dermal        0.2000                      =
#>  ---                                                                                          
#> 297: 387188        RSL      - DTXSID0024052             -        0.0220                      =
#> 298: 374437        RSL      - DTXSID4022448          oral        0.1500                      =
#> 299: 381916        RSL      - DTXSID2021731          oral        2.0000                      =
#> 300: 389619        RSL      - DTXSID1031040          oral        0.0003                      =
#> 301: 376317        RSL      - DTXSID1021740             -        0.1000                      =
#> 28 variable(s) not shown: [toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, exposureForm <char>, ...]

Then aggregate the data.

# Run data aggregations grouping by dtxsid and taking either the max or the min
# depending on the toxvalType we are considering.
ccl4_csf[,.(numeric_value = max(toxvalNumeric), 
            units = toxvalUnits[which.max(toxvalNumeric)]), 
         by = .(dtxsid)][order(-numeric_value),]
#>            dtxsid numeric_value         units
#>            <char>         <num>        <char>
#>  1: DTXSID2021028      1.50e+02 (mg/kg-day)-1
#>  2: DTXSID7021029      1.02e+02 (mg/kg-day)-1
#>  3: DTXSID0020573      3.90e+01 (mg/kg-day)-1
#>  4: DTXSID9021390      3.00e+01 (mg/kg-day)-1
#>  5: DTXSID6021032      2.80e+01 (mg/kg-day)-1
#>  6: DTXSID3020702      1.72e+01 (mg/kg-day)-1
#>  7: DTXSID2020684      6.49e+00 (mg/kg-day)-1
#>  8: DTXSID3020203      3.40e+00 (mg/kg-day)-1
#>  9: DTXSID1021798      3.00e+00 (mg/kg-day)-1
#> 10: DTXSID8021062      2.10e+00 (mg/kg-day)-1
#> 11: DTXSID1021409      1.83e+00 (mg/kg-day)-1
#> 12: DTXSID6022422      1.60e+00 (mg/kg-day)-1
#> 13: DTXSID9021427      1.00e+00 (mg/kg-day)-1
#> 14: DTXSID0020600      3.10e-01 (mg/kg-day)-1
#> 15: DTXSID5021207      2.40e-01 (mg/kg-day)-1
#> 16: DTXSID1026164      1.80e-01 (mg/kg-day)-1
#> 17: DTXSID0020153      1.70e-01 (mg/kg-day)-1
#> 18: DTXSID9024142      1.10e-01 (mg/kg-day)-1
#> 19: DTXSID4020533      1.00e-01 (mg/kg-day)-1
#> 20: DTXSID7024241      7.32e-02 (mg/kg-day)-1
#> 21: DTXSID7020005      7.00e-02 (mg/kg-day)-1
#> 22: DTXSID5020601      4.50e-02 (mg/kg-day)-1
#> 23: DTXSID0024341      3.90e-02 (mg/kg-day)-1
#> 24: DTXSID1021407      3.40e-02 (mg/kg-day)-1
#> 25: DTXSID4032611      2.81e-02 (mg/kg-day)-1
#> 26: DTXSID2021317      2.60e-02 (mg/kg-day)-1
#> 27: DTXSID7020637      2.10e-02 (mg/kg-day)-1
#> 28: DTXSID6021030      1.96e-02 (mg/kg-day)-1
#> 29: DTXSID0021541      1.63e-02 (mg/kg-day)-1
#> 30: DTXSID1024338      1.20e-02 (mg/kg-day)-1
#> 31: DTXSID5039224      1.00e-02 (mg/kg-day)-1
#> 32: DTXSID8023846      8.70e-03 (mg/kg-day)-1
#> 33: DTXSID8020090      5.70e-03 (mg/kg-day)-1
#> 34: DTXSID1020437      5.70e-03 (mg/kg-day)-1
#> 35: DTXSID9020243      2.30e-03 (mg/kg-day)-1
#> 36: DTXSID3020833      2.25e-03 (mg/kg-day)-1
#> 37: DTXSID7020215      2.00e-04 (mg/kg-day)-1
#>            dtxsid numeric_value         units
ccl4_rfc[,.(numeric_value = min(toxvalNumeric*conversion), 
            units = units[which.min(toxvalNumeric*conversion)]), 
         by = .(dtxsid)][order(numeric_value),]
#>            dtxsid numeric_value  units
#>            <char>         <num> <char>
#>  1: DTXSID1031040       6.0e-12  mg/m3
#>  2: DTXSID5020023       2.0e-05  mg/m3
#>  3: DTXSID3020702       3.0e-05  mg/m3
#>  4: DTXSID7021029       4.0e-05  mg/m3
#>  5: DTXSID2024169       5.0e-05  mg/m3
#>  6: DTXSID0024341       7.0e-05  mg/m3
#>  7: DTXSID8020044       1.0e-04  mg/m3
#>  8: DTXSID2040282       1.0e-04  mg/m3
#>  9: DTXSID9021390       3.0e-04  mg/m3
#> 10: DTXSID8020090       1.0e-03  mg/m3
#> 11: DTXSID0020153       1.0e-03  mg/m3
#> 12: DTXSID3020964       2.0e-03  mg/m3
#> 13: DTXSID3020203       2.0e-03  mg/m3
#> 14: DTXSID1024207       2.0e-03  mg/m3
#> 15: DTXSID8020832       5.0e-03  mg/m3
#> 16: DTXSID3024366       7.0e-03  mg/m3
#> 17: DTXSID5024182       7.0e-03  mg/m3
#> 18: DTXSID5039224       9.0e-03  mg/m3
#> 19: DTXSID7020637       9.8e-03  mg/m3
#> 20: DTXSID6022422       2.0e-02  mg/m3
#> 21: DTXSID5021207       3.0e-02  mg/m3
#> 22: DTXSID4020533       3.0e-02  mg/m3
#> 23: DTXSID0020600       3.0e-02  mg/m3
#> 24: DTXSID3042219       3.5e-02  mg/m3
#> 25: DTXSID2022333       3.5e-02  mg/m3
#> 26: DTXSID4021503       4.0e-02  mg/m3
#> 27: DTXSID0021541       9.0e-02  mg/m3
#> 28: DTXSID8020597       4.0e-01  mg/m3
#> 29: DTXSID1020437       5.0e-01  mg/m3
#> 30: DTXSID0021917       7.0e-01  mg/m3
#> 31: DTXSID3020833       3.0e+00  mg/m3
#> 32: DTXSID2021731       4.0e+00  mg/m3
#> 33: DTXSID6020301       5.0e+01  mg/m3
#>            dtxsid numeric_value  units
ccl4_rfd[,.(numeric_value = min(toxvalNumeric*conversion), 
            units = units[which.min(toxvalNumeric*conversion)]), 
         by = .(dtxsid)][order(numeric_value),]
#>            dtxsid numeric_value  units
#>            <char>         <num> <char>
#>  1: DTXSID7021029      0.000004  mg/kg
#>  2: DTXSID3031864      0.000020  mg/kg
#>  3: DTXSID8031865      0.000020  mg/kg
#>  4: DTXSID9023914      0.000030  mg/kg
#>  5: DTXSID1024174      0.000030  mg/kg
#>  6: DTXSID6024177      0.000050  mg/kg
#>  7: DTXSID4032611      0.000065  mg/kg
#>  8: DTXSID2040282      0.000070  mg/kg
#>  9: DTXSID5020601      0.000080  mg/kg
#> 10: DTXSID8025541      0.000100  mg/kg
#> 11: DTXSID1021407      0.000100  mg/kg
#> 12: DTXSID3032464      0.000120  mg/kg
#> 13: DTXSID8023846      0.000300  mg/kg
#> 14: DTXSID1031040      0.000300  mg/kg
#> 15: DTXSID3020964      0.000485  mg/kg
#> 16: DTXSID5020023      0.000500  mg/kg
#> 17: DTXSID9024142      0.000800  mg/kg
#> 18: DTXSID3020702      0.000900  mg/kg
#> 19: DTXSID8020832      0.001000  mg/kg
#> 20: DTXSID5021207      0.001000  mg/kg
#> 21: DTXSID1021409      0.001000  mg/kg
#> 22: DTXSID4022361      0.001200  mg/kg
#> 23: DTXSID0032578      0.001500  mg/kg
#> 24: DTXSID0020446      0.002000  mg/kg
#> 25: DTXSID0020153      0.002000  mg/kg
#> 26: DTXSID9021390      0.004000  mg/kg
#> 27: DTXSID8020044      0.004000  mg/kg
#> 28: DTXSID6021030      0.005000  mg/kg
#> 29: DTXSID9032329      0.005000  mg/kg
#> 30: DTXSID1024207      0.005000  mg/kg
#> 31: DTXSID5024182      0.005000  mg/kg
#> 32: DTXSID8020090      0.007000  mg/kg
#> 33: DTXSID2020684      0.008000  mg/kg
#> 34: DTXSID3042219      0.010000  mg/kg
#> 35: DTXSID2022333      0.010000  mg/kg
#> 36: DTXSID4021503      0.010000  mg/kg
#> 37: DTXSID0021464      0.016000  mg/kg
#> 38: DTXSID8023848      0.020000  mg/kg
#> 39: DTXSID4034948      0.020000  mg/kg
#> 40: DTXSID1026164      0.020000  mg/kg
#> 41: DTXSID0024052      0.021800  mg/kg
#> 42: DTXSID2024169      0.024000  mg/kg
#> 43: DTXSID0021541      0.025700  mg/kg
#> 44: DTXSID9032113      0.029000  mg/kg
#> 45: DTXSID2021317      0.030000  mg/kg
#> 46: DTXSID4020533      0.030000  mg/kg
#> 47: DTXSID7024241      0.040000  mg/kg
#> 48: DTXSID1021740      0.050000  mg/kg
#> 49: DTXSID8022292      0.050000  mg/kg
#> 50: DTXSID0021917      0.060000  mg/kg
#> 51: DTXSID9020243      0.100000  mg/kg
#> 52: DTXSID4022448      0.100000  mg/kg
#> 53: DTXSID1020437      0.143000  mg/kg
#> 54: DTXSID1024338      0.160000  mg/kg
#> 55: DTXSID7020637      0.200000  mg/kg
#> 56: DTXSID3034458      0.300000  mg/kg
#> 57: DTXSID2021731      0.500000  mg/kg
#> 58: DTXSID8020597      0.800000  mg/kg
#> 59: DTXSID3020833      0.857000  mg/kg
#>            dtxsid numeric_value  units

Repeat the process for NATADB, first separating out the relevant subsets of the data.

# Separate out into relevant data subsets
natadb_csf <- natadb_hazard[humanEcoNt %in% 'human health' & 
                              toxvalType %in% c('cancer slope factor') & (toxvalUnits != 'mg/kg-day'), ]
natadb_rfc <- natadb_hazard[humanEcoNt %in% 'human health' &
                              toxvalType %in% c('RfC'), ]
natadb_rfd <- natadb_hazard[humanEcoNt %in% 'human health' & 
                              toxvalType %in% c('RfD'), ]

Now handle the unit conversions.

# Set mass by mass units to mg/kg. Note that ppm is already in mg/kg
natadb_rfc <- natadb_rfc[toxvalUnits != 'ppm',]
natadb_rfd[, units := 'mg/kg-day']
#>          id     source   year        dtxsid exposureRoute toxvalNumeric toxvalNumericQualifier
#>       <int>     <char> <char>        <char>        <char>         <num>                 <char>
#>   1: 384500 Alaska DEC      - DTXSID5020607        dermal        0.0038                      =
#>   2: 398757 Alaska DEC      - DTXSID3020679          oral        0.0005                      =
#>   3: 392604 Alaska DEC      - DTXSID1020437        dermal        0.2000                      =
#>   4: 376366 Alaska DEC      - DTXSID2021284          oral        0.2000                      =
#>   5: 380884 Alaska DEC      - DTXSID8020597          oral        2.0000                      =
#>  ---                                                                                          
#> 773: 399356        RSL      - DTXSID3020596          oral        0.0500                      =
#> 774: 392719        RSL      - DTXSID2021993          oral        0.0010                      =
#> 775: 395204        RSL      - DTXSID5020316          oral        0.0200                      =
#> 776: 395902        RSL      - DTXSID2021159             -        2.0000                      =
#> 777: 396090        RSL      - DTXSID8021438          oral        0.0500                      =
#> 27 variable(s) not shown: [toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, exposureForm <char>, ...]

Finally, aggregate the data.

# Run data aggregations grouping by dtxsid and taking either the max or the min
# depending on the toxvalType we are considering.
natadb_csf[, .(numeric_value = max(toxvalNumeric), 
               units = toxvalUnits[which.max(toxvalNumeric)]), 
           by = .(dtxsid)][order(-numeric_value),]
#>            dtxsid numeric_value         units
#>            <char>         <num>        <char>
#>  1: DTXSID2020137      5.00e+02 (mg/kg-day)-1
#>  2: DTXSID8020173      2.20e+02 (mg/kg-day)-1
#>  3: DTXSID4021006      1.20e+02 (mg/kg-day)-1
#>  4: DTXSID7021029      1.02e+02 (mg/kg-day)-1
#>  5: DTXSID8020599      6.50e+01 (mg/kg-day)-1
#>  6: DTXSID9020168      4.00e+01 (mg/kg-day)-1
#>  7: DTXSID5020071      2.10e+01 (mg/kg-day)-1
#>  8: DTXSID3020702      1.72e+01 (mg/kg-day)-1
#>  9: DTXSID8021197      1.40e+01 (mg/kg-day)-1
#> 10: DTXSID1020512      1.30e+01 (mg/kg-day)-1
#> 11: DTXSID1020148      1.30e+01 (mg/kg-day)-1
#> 12: DTXSID5024059      1.10e+01 (mg/kg-day)-1
#> 13: DTXSID3020413      7.00e+00 (mg/kg-day)-1
#> 14: DTXSID4021056      6.70e+00 (mg/kg-day)-1
#> 15: DTXSID3020679      6.25e+00 (mg/kg-day)-1
#> 16: DTXSID5020491      4.60e+00 (mg/kg-day)-1
#> 17: DTXSID5020027      4.50e+00 (mg/kg-day)-1
#> 18: DTXSID4020402      4.00e+00 (mg/kg-day)-1
#> 19: DTXSID7020687      4.00e+00 (mg/kg-day)-1
#> 20: DTXSID0039227      3.80e+00 (mg/kg-day)-1
#> 21: DTXSID3020203      3.40e+00 (mg/kg-day)-1
#> 22: DTXSID2020682      3.20e+00 (mg/kg-day)-1
#> 23: DTXSID1021798      3.00e+00 (mg/kg-day)-1
#> 24: DTXSID0021383      2.67e+00 (mg/kg-day)-1
#> 25: DTXSID3020415      2.50e+00 (mg/kg-day)-1
#> 26: DTXSID8021195      2.40e+00 (mg/kg-day)-1
#> 27: DTXSID6020307      2.40e+00 (mg/kg-day)-1
#> 28: DTXSID7021368      2.20e+00 (mg/kg-day)-1
#> 29: DTXSID5024267      2.00e+00 (mg/kg-day)-1
#> 30: DTXSID6022422      1.60e+00 (mg/kg-day)-1
#> 31: DTXSID3025091      1.60e+00 (mg/kg-day)-1
#> 32: DTXSID8021434      1.50e+00 (mg/kg-day)-1
#> 33: DTXSID5020865      1.50e+00 (mg/kg-day)-1
#> 34: DTXSID7020267      1.30e+00 (mg/kg-day)-1
#> 35: DTXSID6020432      1.20e+00 (mg/kg-day)-1
#> 36: DTXSID5020029      1.00e+00 (mg/kg-day)-1
#> 37: DTXSID7020710      8.70e-01 (mg/kg-day)-1
#> 38: DTXSID0020529      8.00e-01 (mg/kg-day)-1
#> 39: DTXSID8021438      6.00e-01 (mg/kg-day)-1
#> 40: DTXSID2021319      5.40e-01 (mg/kg-day)-1
#> 41: DTXSID7021106      4.00e-01 (mg/kg-day)-1
#> 42: DTXSID0020600      3.10e-01 (mg/kg-day)-1
#> 43: DTXSID5020449      2.90e-01 (mg/kg-day)-1
#> 44: DTXSID7021318      2.86e-01 (mg/kg-day)-1
#> 45: DTXSID2021105      2.60e-01 (mg/kg-day)-1
#> 46: DTXSID5021207      2.40e-01 (mg/kg-day)-1
#> 47: DTXSID8020250      2.00e-01 (mg/kg-day)-1
#> 48: DTXSID1022057      1.82e-01 (mg/kg-day)-1
#> 49: DTXSID1026164      1.80e-01 (mg/kg-day)-1
#> 50: DTXSID0020153      1.70e-01 (mg/kg-day)-1
#> 51: DTXSID2021286      1.60e-01 (mg/kg-day)-1
#> 52: DTXSID7020683      1.56e-01 (mg/kg-day)-1
#> 53: DTXSID8020913      1.20e-01 (mg/kg-day)-1
#> 54: DTXSID9020299      1.10e-01 (mg/kg-day)-1
#> 55: DTXSID3039242      1.00e-01 (mg/kg-day)-1
#> 56: DTXSID4020533      1.00e-01 (mg/kg-day)-1
#> 57: DTXSID0020448      9.19e-02 (mg/kg-day)-1
#> 58: DTXSID6020438      9.10e-02 (mg/kg-day)-1
#> 59: DTXSID1020306      8.05e-02 (mg/kg-day)-1
#> 60: DTXSID1020566      8.00e-02 (mg/kg-day)-1
#> 61: DTXSID5020607      7.37e-02 (mg/kg-day)-1
#> 62: DTXSID5021380      7.04e-02 (mg/kg-day)-1
#> 63: DTXSID5021386      7.00e-02 (mg/kg-day)-1
#> 64: DTXSID7020005      7.00e-02 (mg/kg-day)-1
#> 65: DTXSID7020716      6.00e-02 (mg/kg-day)-1
#> 66: DTXSID4020583      4.80e-02 (mg/kg-day)-1
#> 67: DTXSID5020601      4.50e-02 (mg/kg-day)-1
#> 68: DTXSID7020689      4.00e-02 (mg/kg-day)-1
#> 69: DTXSID1020431      4.00e-02 (mg/kg-day)-1
#> 70: DTXSID7026156      3.90e-02 (mg/kg-day)-1
#> 71: DTXSID0021965      2.90e-02 (mg/kg-day)-1
#> 72: DTXSID2020507      2.70e-02 (mg/kg-day)-1
#> 73: DTXSID4039231      2.10e-02 (mg/kg-day)-1
#> 74: DTXSID7020637      2.10e-02 (mg/kg-day)-1
#> 75: DTXSID0021541      1.63e-02 (mg/kg-day)-1
#> 76: DTXSID0020868      1.40e-02 (mg/kg-day)-1
#> 77: DTXSID1021374      1.32e-02 (mg/kg-day)-1
#> 78: DTXSID3020596      1.10e-02 (mg/kg-day)-1
#> 79: DTXSID5039224      1.00e-02 (mg/kg-day)-1
#> 80: DTXSID4020161      8.00e-03 (mg/kg-day)-1
#> 81: DTXSID4021395      7.70e-03 (mg/kg-day)-1
#> 82: DTXSID8020090      5.70e-03 (mg/kg-day)-1
#> 83: DTXSID1020437      5.70e-03 (mg/kg-day)-1
#> 84: DTXSID1020302      3.63e-03 (mg/kg-day)-1
#> 85: DTXSID7021948      3.52e-03 (mg/kg-day)-1
#> 86: DTXSID9020243      2.30e-03 (mg/kg-day)-1
#> 87: DTXSID3020833      2.25e-03 (mg/kg-day)-1
#> 88: DTXSID8020759      1.90e-03 (mg/kg-day)-1
#>            dtxsid numeric_value         units
natadb_rfc[, .(numeric_value = min(toxvalNumeric), 
               units = toxvalUnits[which.min(toxvalNumeric)]), 
           by = .(dtxsid)][order(numeric_value),]
#>            dtxsid numeric_value  units
#>            <char>         <num> <char>
#>  1: DTXSID1020516    0.00000200  mg/m3
#>  2: DTXSID7026156    0.00000800  mg/m3
#>  3: DTXSID4024143    0.00001000  mg/m3
#>  4: DTXSID7025180    0.00002000  mg/m3
#>  5: DTXSID5020023    0.00002000  mg/m3
#>  6: DTXSID4020874    0.00002000  mg/m3
#>  7: DTXSID9020293    0.00003000  mg/m3
#>  8: DTXSID3020702    0.00003000  mg/m3
#>  9: DTXSID7021029    0.00004000  mg/m3
#> 10: DTXSID8042476    0.00010000  mg/m3
#> 11: DTXSID1020273    0.00014501  mg/m3
#> 12: DTXSID2020688    0.00020000  mg/m3
#> 13: DTXSID0039229    0.00020000  mg/m3
#> 14: DTXSID3020413    0.00020000  mg/m3
#> 15: DTXSID5021380    0.00020000  mg/m3
#> 16: DTXSID3021932    0.00020000  mg/m3
#> 17: DTXSID2021157    0.00030000  mg/m3
#> 18: DTXSID0024260    0.00030000  mg/m3
#> 19: DTXSID4020161    0.00040000  mg/m3
#> 20: DTXSID5020449    0.00050000  mg/m3
#> 21: DTXSID7020267    0.00070000  mg/m3
#> 22: DTXSID7024166    0.00070000  mg/m3
#> 23: DTXSID8020090    0.00100000  mg/m3
#> 24: DTXSID4039231    0.00100000  mg/m3
#> 25: DTXSID1020566    0.00100000  mg/m3
#> 26: DTXSID1023786    0.00100000  mg/m3
#> 27: DTXSID0020153    0.00100000  mg/m3
#> 28: DTXSID3020964    0.00200000  mg/m3
#> 29: DTXSID0021383    0.00200000  mg/m3
#> 30: DTXSID0021965    0.00200000  mg/m3
#> 31: DTXSID3020415    0.00200000  mg/m3
#> 32: DTXSID5020029    0.00200000  mg/m3
#> 33: DTXSID3020203    0.00200000  mg/m3
#> 34: DTXSID8020913    0.00300000  mg/m3
#> 35: DTXSID8021432    0.00300000  mg/m3
#> 36: DTXSID0020448    0.00400000  mg/m3
#> 37: DTXSID8020832    0.00500000  mg/m3
#> 38: DTXSID1020148    0.00500000  mg/m3
#> 39: DTXSID5020027    0.00600000  mg/m3
#> 40: DTXSID3024366    0.00700000  mg/m3
#> 41: DTXSID6020438    0.00700000  mg/m3
#> 42: DTXSID2021658    0.00800000  mg/m3
#> 43: DTXSID4020583    0.00800000  mg/m3
#> 44: DTXSID5039224    0.00900000  mg/m3
#> 45: DTXSID7020637    0.00980000  mg/m3
#> 46: DTXSID1049641    0.01400000  mg/m3
#> 47: DTXSID1022057    0.02000000  mg/m3
#> 48: DTXSID5020316    0.02000000  mg/m3
#> 49: DTXSID2021159    0.02000000  mg/m3
#> 50: DTXSID6020981    0.02000000  mg/m3
#> 51: DTXSID6020569    0.02000000  mg/m3
#> 52: DTXSID2020711    0.02000000  mg/m3
#> 53: DTXSID6022422    0.02000000  mg/m3
#> 54: DTXSID9021261    0.02000000  mg/m3
#> 55: DTXSID3039242    0.03000000  mg/m3
#> 56: DTXSID5021207    0.03000000  mg/m3
#> 57: DTXSID7020689    0.03000000  mg/m3
#> 58: DTXSID4020533    0.03000000  mg/m3
#> 59: DTXSID6020515    0.03000000  mg/m3
#> 60: DTXSID0020600    0.03000000  mg/m3
#> 61: DTXSID2021319    0.04000000  mg/m3
#> 62: DTXSID4020298    0.05000000  mg/m3
#> 63: DTXSID7020009    0.06000000  mg/m3
#> 64: DTXSID8021434    0.08000000  mg/m3
#> 65: DTXSID1021827    0.09000000  mg/m3
#> 66: DTXSID0021541    0.09000000  mg/m3
#> 67: DTXSID1020306    0.09765240  mg/m3
#> 68: DTXSID2021446    0.10000000  mg/m3
#> 69: DTXSID8020250    0.10000000  mg/m3
#> 70: DTXSID6023949    0.10000000  mg/m3
#> 71: DTXSID8021438    0.20000000  mg/m3
#> 72: DTXSID3021431    0.20000000  mg/m3
#> 73: DTXSID5021124    0.20000000  mg/m3
#> 74: DTXSID8020597    0.40000000  mg/m3
#> 75: DTXSID1020437    0.50000000  mg/m3
#> 76: DTXSID0020868    0.60000000  mg/m3
#> 77: DTXSID6023947    0.70000000  mg/m3
#> 78: DTXSID0021917    0.70000000  mg/m3
#> 79: DTXSID2020844    0.70000000  mg/m3
#> 80: DTXSID5021889    0.80000000  mg/m3
#> 81: DTXSID1020431    0.80000000  mg/m3
#> 82: DTXSID3020596    1.00000000  mg/m3
#> 83: DTXSID2021284    1.00000000  mg/m3
#> 84: DTXSID8020759    2.00000000  mg/m3
#> 85: DTXSID0021381    2.20000000  mg/m3
#> 86: DTXSID3020833    3.00000000  mg/m3
#> 87: DTXSID1020302    4.00000000  mg/m3
#> 88: DTXSID2021731    4.00000000  mg/m3
#> 89: DTXSID7021360    5.00000000  mg/m3
#>            dtxsid numeric_value  units
natadb_rfd[, .(numeric_value = min(toxvalNumeric), 
               units = units[which.min(toxvalNumeric)]), 
           by = .(dtxsid)][order(numeric_value),]
#>             dtxsid numeric_value     units
#>             <char>         <num>    <char>
#>   1: DTXSID7021029      0.000004 mg/kg-day
#>   2: DTXSID2020682      0.000010 mg/kg-day
#>   3: DTXSID9020827      0.000020 mg/kg-day
#>   4: DTXSID1024382      0.000020 mg/kg-day
#>   5: DTXSID3020679      0.000030 mg/kg-day
#>  ---                                      
#> 103: DTXSID5021124      0.270000 mg/kg-day
#> 104: DTXSID2021731      0.500000 mg/kg-day
#> 105: DTXSID8020597      0.800000 mg/kg-day
#> 106: DTXSID3020833      0.857000 mg/kg-day
#> 107: DTXSID2021159      2.000000 mg/kg-day

Answer to Environmental Health Question 8

With these results, we may answer Environmental Health Question 8: When examining different toxicity values, the data may be reported in multiple units. To assess the relative hazard from this data, it is important to take into account the different units and adjust accordingly. List the units reported for the cancer slope factor, reference dose, and reference concentration values associated with the oral and inhalation exposure routes for human hazard. Which chemicals in each data set have the highest cancer slope factor, lowest reference dose, and lowest reference concentration values?

Answer: The units for these three toxicity value types for CCL4 are given by “mg/m3”, “g/m3”, “ug/m3” for RfC, “mg/kg-day”, “mg/kg” for RfD, and “(mg/kg-day)-1” for Cancer Slope Factor. For NATADB, the units for RfC are given by “mg/m3” and “ppm”, for RfD by “mg/kg-day”, “mg/kg”, and for Cancer Slope Factor by “(mg/kg-day)-1”. For CCL4, the chemical DTXSID2021028 has the highest CsF at 150 (mg/kg-day)-1, the chemical DTXSID1031040 has the lowest RfC value at 6.0e-12 mg/m3, and the chemical DTXSID7021029 has the lowest RfD value at 4e-6 mg/kg. For NATADB, the chemical DTXSID2020137 has the highest CsF at 500 (mg/kg-day)-1, the chemical DTXSID1020516 has the lowest RfC value at 2.0e-6 mg/m3, and the chemical DTXSID7021029 had the lowest RfD at 4e-6 mg/kg-day.

Concluding Remarks

In conclusion, we explored how one can access publicly available data from the CompTox Chemicals Dashboard programmatically using the CTX APIs via the ctxR R package. In the examples above, we investigated different types of data associated with chemicals, visualized and aggregated the data, and employed different data wrangling techniques using data.tables to answer the proposed environmental health questions. With these tools, one can build workflows that take in a list of chemicals and gather and process data associated with those chemicals through the CTX APIs. Consider how you might use this functionality for building models in your own work.


Try running the same analysis of physical-chemical properties, genotoxicity data, and hazard data on a different pair of data sets available from the CCD. For instance, try pulling the data set ‘BIOSOLIDS2021’ and work through the same steps we completed in this module to investigate the chemicals in this list to gain a better understanding of the associated data.