7.3 CompTox Dashboard Data through APIs
This training module was developed by Paul Kruse and Caroline Ring, with contributions from Julia E. Rager. It was updated and edited November 2025.
All input files (script, data, and figures) can be downloaded from the UNC-SRP TAME2 GitHub website.
Disclaimer: The views expressed in this document are those of the authors and do not necessarily reflect the views or policies of the U.S. EPA.
Introduction to Training Module
Environmental health research related to chemical exposures often requires accessing and wrangling chemical-specific data. The CompTox Chemicals Dashboard (CCD), developed by the United States Environmental Protection Agency, is a publicly-accessible database that integrates chemical data from multiple domains. Chemical data available on the CCD include physicochemical, environmental fate and transport, exposure, toxicokinetics, functional use, in vivo toxicity, in vitro bioassay, and mass spectra data. The CCD was first described in Williams et al. (2017), and has been continuously expanded since. The CCD is heavily used by researchers who do cheminformatics work of various kinds – computational toxicology, computational exposure science, analytical chemistry, chemical safety assessment, etc. The CCD is used by cheminformaticians not only at EPA, but across governmental agencies both within the U.S. and worldwide; in private industry; in non-governmental organizations; in academia; and others. It has become an indispensable tool for many researchers.
This training module provides an overview of the physico-chemical, hazard, and bioactivity data available through the CCD; different ways to access these data; and some examples of how these data may be used. We will first introduce the CCD and how to access it. Then we will focus on an automated, programmatic method for retrieving data from the CCD using the ctxR R package. Through some basic data visualization and analysis using the R programming language, we will explore some data retrieved from the CCD, and gain insights both in how to wrangle the data and combine different methods of accessing the data to build automated pipelines for use in more complex settings.
Note, as the ctxR package accesses data that is periodically updated, some code chunks will produce numbers that may change slightly with data updates. Keep this in mind when running these code chunks in the future.
Training Module’s Environmental Health Questions
This training module was specifically developed to answer the following questions:
After automatically pulling the fourth Drinking Water Contaminant Candidate List from the CompTox Chemicals Dashboard, list the properties and property types present in the data. What are the mean values for a specific property when grouped by property type and when ungrouped?
The physico-chemical property data are reported with both experimental and predicted values present for many chemicals. Are there differences between the mean predicted and experimental results for a variety of physico-chemical properties?
After pulling the genotoxicity data for the different environmental contaminant data sets, list the assays associated with the chemicals in each data set. How many unique assays are there in each data set? What are the different assay categories and how many unique assays for each assay category are there?
The genotoxicity data contains information on which assays have been conducted for different chemicals and the results of those assays. How many chemicals in each data set have a ‘positive’, ‘negative’, and ‘equivocal’ value for the assay result?
Based on the genotoxicity data reported for the chemical with DTXSID identifier DTXSID0020153, how many assays resulted in a positive/equivocal/negative value? Which of the assays were positive and how many of each were there for the most reported assays?
After pulling the hazard data for the different data sets, list the different exposure routes for which there is data. What are the unique risk assessment classes for hazard values for the oral route and for the inhalation exposure route? For each such exposure route, which risk assessment class is most represented by the data sets?
There are several types of toxicity values for each exposure route. List the unique toxicity values for the oral and inhalation routes. What are the unique types of toxicity values for the oral route and for the inhalation route? How many of these are common to both the oral and inhalation routes for each data set?
When examining different toxicity values, the data may be reported in multiple units. To assess the relative hazard from this data, it is important to take into account the different units and adjust accordingly. List the units reported for the cancer slope factor, reference dose, and reference concentration values associated with the oral and inhalation exposure routes for human hazard. Which chemicals in each data set have the highest cancer slope factor, lowest reference dose, and lowest reference concentration values?
Script Preparations
Loading R Packages
# Used to interface with CompTox Chemicals Dashboard
library(ctxR)
#> Warning: package 'ctxR' was built under R version 4.4.1
#> ℹ CCTE's Terms of Service: <https://www.epa.gov/comptox-tools/computational-toxicology-and-exposure-apis>
#> ℹ Please cite ctxR if you use it! Use `citation('ctxR')` for details.
#> No config file or environment variable found: API access unlikely.
# Used to visualize data in a variety of plot designs
library(ggplot2)Introduction to CompTox Chemicals Dashboard
Accessing chemical data and wrangling it is a vital step in many types of workflows related to chemical, biological, and environmental modeling. While there are many resources available from which one can pull data, the CompTox Chemicals Dashboard built and maintained by the United States Environmental Protection Agency is particularly well-designed and suitable for these purposes. Originally introduced in The CompTox Chemistry Dashboard: a community data resource for environmental chemistry, the CCD contains information on over 1.2 million chemicals as of December 2023.
The CCD includes chemical information from many different domains, including physicochemical, environmental fate and transport, exposure, usage, in vivo toxicity, and in vitro bioassay data (Williams et al., 2017).
The CCD can be searched either one chemical at a time, or using a batch search.
Searching One Chemical at a Time (Single-substance Search)
In single-substance search, the user types a full or partial chemical identifier (name, CASRN, InChiKey, or DSSTox ID) into a search box on the CCD homepage. Autocomplete provides a list of possible matches; the user selects one by clicking on it, and is then taken to the CCD page for that substance. Here is an example of the CCD page for the chemical Bisphenol A:

The different domains of data available for this chemical are shown by the tabs on the left side of the page: for example, “Physchem Prop.” (physico-chemical properties), “Env. Fate/Transport” (environmental fate and transport data), and “Hazard Data” (in vivo hazard and toxicity data), among others.
Batch Search
In batch search, the user enters a list of search inputs, separated by newlines, into a batch-search box on https://comptox.epa.gov/dashboard/batch-search . The user selects the type(s) of inputs by selecting one or more checkboxes – these may include chemical identifiers, monoisotopic masses, or molecular formulas. Then, the user selects “Display All Chemicals” to display the list of substances matching the batch-search inputs, or “Choose Export Options” to choose options for exporting the batch-search results as a spreadsheet. The exported spreadsheet may include data from most of the domains available on an individual substance’s CCD page.

The user can download the selected information in various formats, such as Excel (.xlsx), comma-separated values (.csv), or different types of chemical table files (.e.g, MOL).

The web interface for batch search only allows input of 10,000 identifiers at a time. If a user needs to retrieve information for more than 10,000 chemicals, they will need to separate their identifiers into multiple batches and search each one separately.
Challenges of Web-based Dashboard Search
Practicing researchers typically end up with a Dashboard workflow that looks something like this:
- Start with a dataset that includes your chemical identifiers of interest. These may include chemical names, Chemical Abstract Service Registry Numbers (CASRNs), Distributed Searchable Structure-Toxicity Database (DSSTox) identifiers, or InChIKeys.
- Export the chemical identifiers to a spreadsheet. Often, this is done by importing the data into an environment such as R or Python, in order to do some data wrangling (e.g., to select only the unique substance identfiers; to clean up improperly-formatted CASRNs; etc.). Then, the identifiers are saved in a spreadsheet (an Excel, .csv, or .txt file), one chemical identifier per row.
- Copy and paste the chemical identifiers from the spreadsheet into the CCD Batch Search box. If there are more than 10,000 total chemical identifiers, divide them into batches of 10,000 or less, and search each batch separately.
- Choose your desired export options on the CCD Batch Search page.
- Download the exported spreadsheet of CCD data. By default, the downloaded spreadsheet will be given a file name that includes the timestamp of the download.
- Repeat steps 3-5 for each batch of 10,000 identifiers produced in step 2.
- Import the downloaded spreadsheet(s) of CCD data into the analysis tool you are using (e.g. R or Python).
- Merge the table(s) of downloaded CCD data with your original dataset of interest.
- Proceed with research-related data analysis using the chemical data downloaded from the CCD (e.g., statistical modeling, visualization, etc.)
Because each of these workflow steps requires manual interaction with the search and download process, the risk of human error inevitably creeps in. Here are a few real-world possibilities (the authors can neither confirm nor deny that they have personally committed any of these errors):
- Researchers could copy/paste the wrong identifiers into the CCD batch search, especially if they have more than 10,000 identifiers and have to divide them into batches.
- Chemical identifiers could be corrupted during the process of exporting to a spreadsheet. For example, if a researcher opens and resaves a CSV file using Microsoft Excel, any information that appears to be in date-like format will be automatically converted to a date (unless the researcher has the most recently-updated version of Excel and has selected the option in Settings that will stop Excel from auto-detecting dates). This behavior has long been identified as a problem in genomics, where gene names can appear date-like to Excel (Abeysooriya et al. 2021). It also affects cheminformatics, where chemical identifiers can appear date-like to Excel. For example, the valid CASRN “1990-07-4” would automatically be converted to “07/04/1990” (if Excel is set to use MM/DD/YYYY date formats). CCD batch search cannot recognize “07/04/1990” as a valid chemical identifier and will be unable to return any chemical data.
- Researchers could accidentally rename a downloaded CCD data file to overwrite a previous download (for example, when searching multiple batches of identifiers).
- Researchers could mistakenly import the wrong CCD download file back into their analysis environment (for example, when searching multiple batches of identifiers).
Moreover, the manual stages of this kind of workflow are also non-transparent and not easily reproducible.
CCTE’s CTX Application Programming Interfaces (APIs) for Automated Batch Search of the CCD
Recently, the Center for Computational Toxicology and Exposure (CCTE) developed a set of Application Programming Interfaces (APIs) that allows programmatic access to the CCD, bypassing the manual steps of the web-based batch search workflow. The Computational Toxicology and Exposure (CTX) APIs effectively automate the process of accessing and downloading data from the web pages that make up the CCD.
The CTX APIs are publicly available at no cost to the user. However, in order to use the CTX APIs, you must have an API key. The API key uniquely identifies you to the CTX servers and verifies that you have permission to access the database. Getting an API key is free, but requires contacting the API support team at ccte_api@epa.gov.
For more information on the data accessible through the CTX APIs and related tools, please visit the US EPA page on Computational Toxicology and Exposure Online Resources. The CTX APIs are one of many resources developed within this research realm and make available many of the data resources beyond the CCD.
The APIs are organized into four sets of “endpoints” (chemical data domains): Chemical, Hazard, Bioactivity, and Exposure. Pictured below is what the Chemical section looks like and can be found at CTX API Chemical Endpoints.

The APIs can be explored through the pictured web interface at https://api-ccte.epa.gov/docs/chemical.html .
CTX API Authentication
Authentication is the first tab on the left. Authentication is required to use the APIs. To authenticate yourself in the API web interface, input your unique API key.

CTX API Endpoints
On the left of the API web interface, there are several different tabs, one for each endpoint in the Chemical domain. The endpoints are organized by the type of information provided. For instance, the Chemical Details Resource endpoint provides basic chemical information; the Chemical Property Resource endpoint provides more comprehensive physico-chemical property information; the Chemical Fate Resource endpoint provides chemical fate and transport information; and so on.
Constructing CTX API Requests
As mentioned above, APIs effectively automate the process of accessing and downloading data from the web pages that make up the CCD. APIs do this by automatically constructing requests using the Hypertext Transfer Protocol (HTTP) that enables communication between clients (e.g. your computer) and servers (e.g. the CCD).
In the CTX API web interface, the colored boxes next to each endpoint indicate the type of the associated HTTP method: either a GET request (“GET”, blue) or a a POST request (“POS”, green). GET is used to request data from a specific web resource (e.g. a specific URL); POST is used to send data to a server to create or update a web resource. For the CTX APIs, POST requests are used to perform multiple (batch) searches in a single API call; GET requests are used for non-batch searches. You do not need to understand the details of POST and GET requests in order to use the API.
Click on the second item under Chemical Details Resource, the tab labeled Get data by dtxsid. The following page will appear.

This page has two subheadings: “Path Parameters” and “Query-String Parameters”. “Path Parameters” contains user-specified parameters that are required in order to tell the API what URL (web address) to access. In this case, the required parameter is a string for the DTXSID identifying the chemical to be searched.
“Query-String Parameters” contain user-specific parameters (usually optional) that tell the API what specific type(s) of information to download from the specified URL. In this case, the optional parameter is a projection parameter, a string that can take one of five values (chemicaldetailall, chemicaldetailstandard, chemicalidentifier, chemicalstructure, ntatoolkit). Depending on the value of this string, the API can return different sets of information about the chemical. If the projection parameter is left blank, then a default set of chemical information is returned.
The default return format is displayed below and includes a variety of fields with data types represented.

We show what reRturned data from searching Bisphenol A looks like using this endpoint with the chemicaldetailstandard value for projection selected.

Formatting an http request is not necessarily intuitive nor worth the time for someone not already familiar with the process, so these endpoints may provide a resource that for many would require a significant investment in time and energy to learn how to use. However, there is a solution to this in the form of the R package ctxR.
ctxR was developed to streamline the process of accessing the information available through the CTX APIs without requiring prior knowledge of how to use APIs. The ctxR package is available in stable form on CRAN and a development version may be found at the USEPA ctxR GitHub repository. As an example, we demonstrate the ease with which one may retrieve the information given by this endpoint for Bisphenol A using the ctxR approach and contrast it with the approach using the CCD website or CTX Chemical API Endpoint website.
Setting, using, and storing the API key
We store the API key required to access the APIs. To do this for the current session, run the first command. If you want to store your key across multiple sessions, run the second command.
# This stores the key in the current session
register_ctx_api_key(key = '<YOUR API KEY>')
# This stores the key across multiple sessions and only needs to be run once.
# If the key changes, rerun this with the new key.
register_ctx_api_key(key = '<YOUR API KEY>', write = TRUE)To check that your key has successfully been stored for the session, run the following command.
Retrieving chemical details
Now, we demonstrate how to retrieve the information for BPA given by the Chemical Detail Resource endpoint under the chemicaldetailstandard value for projection. Note, this projection value is the default value for the function get_chemical_details().
BPA_chemical_detail <- get_chemical_details(DTXSID = 'DTXSID7020182')
dim(BPA_chemical_detail)
#> [1] 1 37
class(BPA_chemical_detail)
#> [1] "data.table" "data.frame"
names(BPA_chemical_detail)
#> [1] "id" "cpdataCount" "inchikey"
#> [4] "wikipediaArticle" "monoisotopicMass" "percentAssays"
#> [7] "pubchemCount" "pubmedCount" "sourcesCount"
#> [10] "qcLevel" "qcLevelDesc" "isotope"
#> [13] "multicomponent" "totalAssays" "pubchemCid"
#> [16] "relatedSubstanceCount" "relatedStructureCount" "casrn"
#> [19] "compoundId" "genericSubstanceId" "preferredName"
#> [22] "activeAssays" "molFormula" "hasStructureImage"
#> [25] "iupacName" "smiles" "inchiString"
#> [28] "qcNotes" "qsarReadySmiles" "msReadySmiles"
#> [31] "irisLink" "pprtvLink" "descriptorStringTsv"
#> [34] "isMarkush" "dtxsid" "dtxcid"
#> [37] "toxcastSelect"Comparing Physico-chemical Properties between Two Important Environmental Contaminant Lists
We study two different data sets contained in the CCD and observe how they relate and how they differ. The two data sets that we will explore are a water contaminant priority list and an air toxics list.
The fourth Drinking Water Contaminant Candidate List (CCL4) is a set of chemicals that “…are not subject to any proposed or promulgated national primary drinking water regulations, but are known or anticipated to occur in public water systems….” Moreover, this list “…was announced on November 17, 2016. The CCL 4 includes 97 chemicals or chemical groups and 12 microbial contaminants….” The National-Scale Air Toxics Assessments (NATA) is “… EPA’s ongoing comprehensive evaluation of air toxics in the United States… a state-of-the-science screening tool for State/Local/Tribal agencies to prioritize pollutants, emission sources and locations of interest for further study in order to gain a better understanding of risks… use general information about sources to develop estimates of risks which are more likely to overestimate impacts than underestimate them….”
These lists can be found in the CCD at CCL4 with additional information at CCL4 information and NATADB with additional information at NATA information. The quotes from the previous paragraph were excerpted from list detail descriptions found using the CCD links.
We explore details about these two lists of chemicals before diving into analyzing the data contained in each list.
options(width = 100)
ccl4_information <- get_public_chemical_list_by_name('CCL4')
print(ccl4_information, trunc.cols = TRUE)
#> visibility id type label
#> 1 PUBLIC 443 federal WATER|EPA: Chemical Contaminants - CCL 4
#> longDescription
#> 1 The Contaminant Candidate List (CCL) is a list of contaminants that, at the time of publication, are not subject to any proposed or promulgated national primary drinking water regulations, but are known or anticipated to occur in public water systems. Contaminants listed on the CCL may require future regulation under the Safe Drinking Water Act (SDWA). EPA announced the <a href='https://www.epa.gov/ccl/contaminant-candidate-list-4-ccl-4-0' target='_blank'>fourth Drinking Water Contaminant Candidate List (CCL 4)</a> on November 17, 2016. The CCL 4 includes 97 chemicals or chemical groups and 12 microbial contaminants. The group of cyanotoxins on CCL 4 includes, but is not limited to: anatoxin-a, cylindrospermopsin, microcystins, and saxitoxin. The CCL Chemical Candidate Lists are versioned iteratively and this description navigates between the various versions of the lists. The list of substances displayed below represents only the chemical CCL 4 contaminants. For the versioned lists, please use the hyperlinked lists below.<br/><br/>\r\n\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL5' target='_blank'>CCL5 - November 2022</a> <br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL4' target='_blank'>CCL4 - November 2016</a>\r\n This list<br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL3' target='_blank'>CCL3 - October 2009</a> <br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL2' target='_blank'>CCL2 - February 2005</a><br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL1' target='_blank'>CCL1 - March 1998</a><br/><br/>
#> listName chemicalCount createdAt updatedAt
#> 1 CCL4 100 2017-12-28T17:58:36Z 2022-10-26T21:14:27Z
#> shortDescription
#> 1 The Contaminant Candidate List (CCL) is a list of contaminants that are known or anticipated to occur in public water systems. Version 4 is known as CCL 4.
natadb_information <- get_public_chemical_list_by_name('NATADB')
print(natadb_information, trunc.cols = TRUE)
#> visibility id type label
#> 1 PUBLIC 454 federal EPA: National-Scale Air Toxics Assessment (NATA)
#> longDescription
#> 1 The National-Scale Air Toxics Assessment (NATA) is EPA's ongoing comprehensive evaluation of air toxics in the United States. EPA developed the NATA as a state-of-the-science screening tool for State/Local/Tribal Agencies to prioritize pollutants, emission sources and locations of interest for further study in order to gain a better understanding of risks. NATA assessments do not incorporate refined information about emission sources but, rather, use general information about sources to develop estimates of risks which are more likely to overestimate impacts than underestimate them.\r\n\r\nNATA provides estimates of the risk of cancer and other serious health effects from breathing (inhaling) air toxics in order to inform both national and more localized efforts to identify and prioritize air toxics, emission source types and locations which are of greatest potential concern in terms of contributing to population risk. This in turn helps air pollution experts focus limited analytical resources on areas and or populations where the potential for health risks are highest. Assessments include estimates of cancer and non-cancer health effects based on chronic exposure from outdoor sources, including assessments of non-cancer health effects for Diesel Particulate Matter (PM). Assessments provide a snapshot of the outdoor air quality and the risks to human health that would result if air toxic emissions levels remained unchanged.
#> listName chemicalCount createdAt updatedAt
#> 1 NATADB 163 2018-02-21T12:04:16Z 2018-11-16T21:42:01Z
#> shortDescription
#> 1 The National-Scale Air Toxics Assessment (NATA) is EPA's ongoing comprehensive evaluation of air toxics in the United States.Now we pull the actual chemicals contained in the lists using the APIs.
ccl4 <- get_chemicals_in_list('ccl4')
ccl4 <- data.table::as.data.table(ccl4)
natadb <- get_chemicals_in_list('NATADB')
natadb <- data.table::as.data.table(natadb)We examine the dimensions of the data, the column names, and display a single row for illustrative purposes.
dim(ccl4)
#> [1] 100 37
dim(natadb)
#> [1] 163 37
colnames(ccl4)
#> [1] "id" "cpdataCount" "inchikey" "wikipediaArticle"
#> [5] "monoisotopicMass" "percentAssays" "pubchemCount" "pubmedCount"
#> [9] "sourcesCount" "qcLevel" "qcLevelDesc" "isotope"
#> [13] "multicomponent" "totalAssays" "pubchemCid" "relatedSubstanceCount"
#> [17] "relatedStructureCount" "casrn" "compoundId" "genericSubstanceId"
#> [21] "preferredName" "activeAssays" "molFormula" "hasStructureImage"
#> [25] "iupacName" "smiles" "inchiString" "qcNotes"
#> [29] "qsarReadySmiles" "msReadySmiles" "irisLink" "pprtvLink"
#> [33] "descriptorStringTsv" "isMarkush" "dtxsid" "dtxcid"
#> [37] "toxcastSelect"
head(ccl4, 1)
#> id cpdataCount inchikey wikipediaArticle monoisotopicMass percentAssays pubchemCount
#> <char> <int> <char> <char> <num> <num> <int>
#> 1: 2544 NA <NA> <NA> NA NA NA
#> 30 variables not shown: [pubmedCount <num>, sourcesCount <int>, qcLevel <int>, qcLevelDesc <char>, isotope <int>, multicomponent <int>, totalAssays <int>, pubchemCid <int>, relatedSubstanceCount <int>, relatedStructureCount <int>, ...]Accessing the Physico-chemical Property Data
Once we have the chemicals in each list, we access their physico-chemical properties. We will use the batch search forms of the function get_chem_info(), to which we supply a list of DTXSIDs.
ccl4$dtxsid
#> [1] "DTXSID001024118" "DTXSID0020153" "DTXSID0020446" "DTXSID0020573" "DTXSID0020600"
#> [6] "DTXSID0020814" "DTXSID0021464" "DTXSID0021541" "DTXSID0021917" "DTXSID0024052"
#> [11] "DTXSID0024341" "DTXSID0032578" "DTXSID1020437" "DTXSID1021407" "DTXSID1021409"
#> [16] "DTXSID1021740" "DTXSID1021798" "DTXSID1024174" "DTXSID1024207" "DTXSID1024338"
#> [21] "DTXSID1026164" "DTXSID1031040" "DTXSID1037484" "DTXSID1037486" "DTXSID1037567"
#> [26] "DTXSID2020684" "DTXSID2021028" "DTXSID2021317" "DTXSID2021731" "DTXSID2022333"
#> [31] "DTXSID2024169" "DTXSID2031083" "DTXSID2037506" "DTXSID2040282" "DTXSID2052156"
#> [36] "DTXSID3020203" "DTXSID3020702" "DTXSID3020833" "DTXSID3020964" "DTXSID3021857"
#> [41] "DTXSID3024366" "DTXSID3024869" "DTXSID3031864" "DTXSID3032464" "DTXSID3034458"
#> [46] "DTXSID3042219" "DTXSID3073137" "DTXSID3074313" "DTXSID4020533" "DTXSID4021503"
#> [51] "DTXSID4022361" "DTXSID4022367" "DTXSID4022448" "DTXSID4022991" "DTXSID4032611"
#> [56] "DTXSID4034948" "DTXSID5020023" "DTXSID5020576" "DTXSID5020601" "DTXSID5021207"
#> [61] "DTXSID5024182" "DTXSID5039224" "DTXSID50867064" "DTXSID6020301" "DTXSID6020856"
#> [66] "DTXSID6021030" "DTXSID6021032" "DTXSID6022422" "DTXSID6024177" "DTXSID6037483"
#> [71] "DTXSID6037485" "DTXSID6037568" "DTXSID7020005" "DTXSID7020215" "DTXSID7020637"
#> [76] "DTXSID7021029" "DTXSID7024241" "DTXSID7047433" "DTXSID8020044" "DTXSID8020090"
#> [81] "DTXSID8020597" "DTXSID8020832" "DTXSID8021062" "DTXSID8022292" "DTXSID8022377"
#> [86] "DTXSID8023846" "DTXSID8023848" "DTXSID8025541" "DTXSID8031865" "DTXSID8052483"
#> [91] "DTXSID9020243" "DTXSID9021390" "DTXSID9021427" "DTXSID9022366" "DTXSID9023380"
#> [96] "DTXSID9023914" "DTXSID9024142" "DTXSID9032113" "DTXSID9032119" "DTXSID9032329"
natadb$dtxsid
#> [1] "DTXSID0020153" "DTXSID0020448" "DTXSID0020523" "DTXSID0020529" "DTXSID0020600"
#> [6] "DTXSID0020868" "DTXSID0021381" "DTXSID0021383" "DTXSID0021541" "DTXSID0021834"
#> [11] "DTXSID0021917" "DTXSID0021965" "DTXSID0024187" "DTXSID0024260" "DTXSID0039227"
#> [16] "DTXSID0039229" "DTXSID00872421" "DTXSID1020148" "DTXSID1020273" "DTXSID1020302"
#> [21] "DTXSID1020306" "DTXSID1020431" "DTXSID1020437" "DTXSID1020512" "DTXSID1020516"
#> [26] "DTXSID1020566" "DTXSID1021374" "DTXSID1021798" "DTXSID1021827" "DTXSID1022057"
#> [31] "DTXSID1023786" "DTXSID1024045" "DTXSID1024382" "DTXSID1026164" "DTXSID1049641"
#> [36] "DTXSID10872417" "DTXSID2020137" "DTXSID2020262" "DTXSID2020507" "DTXSID2020682"
#> [41] "DTXSID2020688" "DTXSID2020711" "DTXSID2020844" "DTXSID2021105" "DTXSID2021157"
#> [46] "DTXSID2021159" "DTXSID2021284" "DTXSID2021286" "DTXSID2021319" "DTXSID2021446"
#> [51] "DTXSID2021658" "DTXSID2021731" "DTXSID2021781" "DTXSID2021993" "DTXSID3020203"
#> [56] "DTXSID3020257" "DTXSID3020413" "DTXSID3020415" "DTXSID3020596" "DTXSID3020679"
#> [61] "DTXSID3020702" "DTXSID3020833" "DTXSID3020964" "DTXSID3021431" "DTXSID3021932"
#> [66] "DTXSID3022455" "DTXSID3024366" "DTXSID3025091" "DTXSID3039242" "DTXSID30872414"
#> [71] "DTXSID30872419" "DTXSID4020161" "DTXSID4020298" "DTXSID4020402" "DTXSID4020533"
#> [76] "DTXSID4020583" "DTXSID4020874" "DTXSID4020901" "DTXSID4021006" "DTXSID4021056"
#> [81] "DTXSID4021395" "DTXSID4024143" "DTXSID4024359" "DTXSID4039231" "DTXSID40872425"
#> [86] "DTXSID5020023" "DTXSID5020027" "DTXSID5020029" "DTXSID5020071" "DTXSID5020316"
#> [91] "DTXSID5020449" "DTXSID5020491" "DTXSID5020601" "DTXSID5020607" "DTXSID5020865"
#> [96] "DTXSID5021124" "DTXSID5021207" "DTXSID5021380" "DTXSID5021386" "DTXSID5021889"
#> [101] "DTXSID5024055" "DTXSID5024059" "DTXSID5024267" "DTXSID5039224" "DTXSID6020145"
#> [106] "DTXSID6020307" "DTXSID6020353" "DTXSID6020432" "DTXSID6020438" "DTXSID6020515"
#> [111] "DTXSID6020569" "DTXSID6020981" "DTXSID6021828" "DTXSID6022422" "DTXSID6023947"
#> [116] "DTXSID6023949" "DTXSID7020005" "DTXSID7020009" "DTXSID7020267" "DTXSID7020637"
#> [121] "DTXSID7020683" "DTXSID7020687" "DTXSID7020689" "DTXSID7020710" "DTXSID7020716"
#> [126] "DTXSID7021029" "DTXSID7021100" "DTXSID7021106" "DTXSID7021318" "DTXSID7021360"
#> [131] "DTXSID7021368" "DTXSID7021948" "DTXSID7023984" "DTXSID7024166" "DTXSID7024370"
#> [136] "DTXSID7024532" "DTXSID7025180" "DTXSID7026156" "DTXSID8020090" "DTXSID8020173"
#> [141] "DTXSID8020250" "DTXSID8020597" "DTXSID8020599" "DTXSID8020759" "DTXSID8020832"
#> [146] "DTXSID8020913" "DTXSID8021195" "DTXSID8021197" "DTXSID8021432" "DTXSID8021434"
#> [151] "DTXSID8021438" "DTXSID8024286" "DTXSID8042476" "DTXSID9020168" "DTXSID9020243"
#> [156] "DTXSID9020247" "DTXSID9020293" "DTXSID9020299" "DTXSID9020827" "DTXSID9021138"
#> [161] "DTXSID9021261" "DTXSID9041522" "DTXSID90872415"
ccl4_phys_chem <- get_chem_info_batch(ccl4$dtxsid)
natadb_phys_chem <- get_chem_info_batch(natadb$dtxsid)Observe that this returns a single data.table for each query, and the data.table contains the physico-chemical properties available from the CompTox Chemicals Dashboard for each chemical in the query. Note, a warning message was triggered, Warning: Setting type to ''!, which indicates the the parameter type was not given a value. A default value is set within the function and more information can be found in the associated documentation. We examine the set of physico-chemical properties for the first chemical in CCL4.
Before any deeper analysis, let’s take a look at the dimensions of the data and the column names.
dim(ccl4_phys_chem)
#> [1] 9470 53
colnames(ccl4_phys_chem)
#> [1] "id" "smiles"
#> [3] "dtxcid" "dtxsid"
#> [5] "sourceName" "propValue"
#> [7] "lscitation" "propValueText"
#> [9] "expDetailsPh" "directUrl"
#> [11] "publicSourceUrl" "propValueId"
#> [13] "briefCitation" "propName"
#> [15] "propUnit" "propValueOriginal"
#> [17] "expDetailsTemperatureC" "expDetailsPressureMmhg"
#> [19] "publicSourceName" "publicSourceOriginalUrl"
#> [21] "expDetailsSpeciesLatin" "expDetailsResponseSite"
#> [23] "expDetailsSpeciesCommon" "sourceDescription"
#> [25] "publicSourceDescription" "expDetailsSpeciesSupercategory"
#> [27] "publicSourceOriginalDescription" "publicSourceOriginalName"
#> [29] "lsDoi" "lsName"
#> [31] "dataset" "lsCitation"
#> [33] "propType" "canonQsarSmiles"
#> [35] "genericSubstanceUpdatedAt" "propCategory"
#> [37] "propDescription" "modelName"
#> [39] "modelId" "propValueExperimental"
#> [41] "propValueExperimentalString" "propValueString"
#> [43] "propValueError" "adMethod"
#> [45] "adValue" "adConclusion"
#> [47] "adReasoning" "adMethodGlobal"
#> [49] "adValueGlobal" "adConclusionGlobal"
#> [51] "adReasoningGlobal" "hasQmrf"
#> [53] "qmrfUrl"Next, we display the unique values for the columns propertyID and propType.
ccl4_phys_chem[, unique(propName)]
#> [1] "Boiling Point" "Density"
#> [3] "Flash Point" "Henry's Law Constant"
#> [5] "LogKow: Octanol-Water" "Melting Point"
#> [7] "Vapor Pressure" "Water Solubility"
#> [9] "pKa Acidic Apparent" "Androgen Receptor Agonist"
#> [11] "Androgen Receptor Antagonist" "Androgen Receptor Binding"
#> [13] "Atmos. Hydroxylation Rate" "pKa Basic Apparent"
#> [15] "Bioconcentration Factor" "Biodeg. Half-Life"
#> [17] "Caco-2 Permeability (Papp)" "Dielectric Constant"
#> [19] "Estrogen Receptor Agonist" "Estrogen Receptor Antagonist"
#> [21] "Estrogen Receptor Binding" "Fish Biotrans. Half-Life (Km)"
#> [23] "Fraction Unbound in Human Plasma" "In Vitro Intrinsic Hepatic Clearance"
#> [25] "Index of Refraction" "Liquid Chromatography Retention Time"
#> [27] "LogD5.5" "LogD7.4"
#> [29] "LogKoa: Octanol-Air" "Molar Refractivity"
#> [31] "Molar Volume" "Oral Rat LD50"
#> [33] "Polarizability" "Ready Binary Biodegradability"
#> [35] "Soil Adsorp. Coeff. (Koc)" "Surface Tension"
ccl4_phys_chem[, unique(propType)]
#> [1] "predicted" "experimental"Let’s explore this further by examining the mean of the “boiling-point” and “melting-point” data.
ccl4_phys_chem[propName == 'Boiling Point', .(Mean = mean(propValue, na.rm = TRUE))]
#> Mean
#> <num>
#> 1: 237.9223
ccl4_phys_chem[propName == 'Boiling Point', .(Mean = mean(propValue, na.rm = TRUE)),
by = .(propType)]
#> propType Mean
#> <char> <num>
#> 1: predicted 232.7907
#> 2: experimental 249.5600
ccl4_phys_chem[propName == 'Melting Point', .(Mean = mean(propValue, na.rm = TRUE))]
#> Mean
#> <num>
#> 1: 50.29227
ccl4_phys_chem[propName == 'Melting Point', .(Mean = mean(propValue, na.rm = TRUE)),
by = .(propType)]
#> propType Mean
#> <char> <num>
#> 1: predicted 50.48745
#> 2: experimental 47.95455These results tell us about some of the reported physico-chemical properties of the data sets.
Answer to Environmental Health Question 1
With this, we can answer Environmental Health Question 1: After automatically pulling the fourth Drinking Water Contaminant Candidate List from the CompTox Chemicals Dashboard, list the properties and property types present in the data. What are the mean values for a specific property when grouped by property type and when ungrouped?
Answer: The mean “Boiling Point” is 237.9223 degrees Celsius for CCL4, with mean values of 249.5600 and 232.7907 for experimental and predicted, respectively. The mean “Melting Point” is 50.29227 degrees Celsius for CCL4, with mean values of 47.95455 and 50.48745 for experimental and predicted, respectively.
To explore all the values of the physico-chemical properties and calculate their means, we can do the following procedure. First we look at all the physico-chemical properties individually, then group them by each property (“Boiling Point”, “Melting Point”, etc…), and then additionally group those by property type (“experimental” vs “predicted”). In the grouping, we look at the columns propValue, unit, propName and propType. We also demonstrate how take the mean of the values for each grouping. We examine the chemical with DTXSID “DTXSID0020153” from CCL4.
head(ccl4_phys_chem[dtxsid == 'DTXSID0020153', ])
#> id smiles dtxcid dtxsid sourceName propValue lscitation
#> <int> <char> <char> <char> <char> <num> <char>
#> 1: 3611 ClCC1=CC=CC=C1 DTXCID00153 DTXSID0020153 eChemPortalAPI 179.4000 <NA>
#> 2: 3612 ClCC1=CC=CC=C1 DTXCID00153 DTXSID0020153 OChem_2024_04_03 179.0000 <NA>
#> 3: 3613 ClCC1=CC=CC=C1 DTXCID00153 DTXSID0020153 OChem_2024_04_03 179.0000 <NA>
#> 4: 3614 ClCC1=CC=CC=C1 DTXCID00153 DTXSID0020153 OPERA2.8 179.0000 <NA>
#> 5: 3615 ClCC1=CC=CC=C1 DTXCID00153 DTXSID0020153 PubChem_2024_11_27 174.0000 <NA>
#> 6: 3616 ClCC1=CC=CC=C1 DTXCID00153 DTXSID0020153 PubChem_2024_11_27 178.8889 <NA>
#> 46 variables not shown: [propValueText <char>, expDetailsPh <int>, directUrl <char>, publicSourceUrl <char>, propValueId <int>, briefCitation <char>, propName <char>, propUnit <char>, propValueOriginal <char>, expDetailsTemperatureC <int>, ...]
ccl4_phys_chem[dtxsid == 'DTXSID0020153', .(propType, propValue, propUnit),
by = .(propName)]
#> propName propType propValue propUnit
#> <char> <char> <num> <char>
#> 1: Boiling Point predicted 179.40000 °C
#> 2: Boiling Point predicted 179.00000 °C
#> 3: Boiling Point predicted 179.00000 °C
#> 4: Boiling Point predicted 179.00000 °C
#> 5: Boiling Point predicted 174.00000 °C
#> ---
#> 140: Oral Rat LD50 experimental 804.00000 mg/kg
#> 141: Polarizability experimental 14.27900 Å^3
#> 142: Ready Binary Biodegradabi experimental 0.00000 Binary 0/1
#> 143: Soil Adsorp. Coeff. (Koc) experimental 75.85776 L/kg
#> 144: Surface Tension experimental 33.85300 dyn/cm
ccl4_phys_chem[dtxsid == 'DTXSID0020153', .(propValue, propUnit),
by = .(propName, propType)]
#> propName propType propValue propUnit
#> <char> <char> <num> <char>
#> 1: Boiling Point predicted 1.794000e+02 °C
#> 2: Boiling Point predicted 1.790000e+02 °C
#> 3: Boiling Point predicted 1.790000e+02 °C
#> 4: Boiling Point predicted 1.790000e+02 °C
#> 5: Boiling Point predicted 1.740000e+02 °C
#> ---
#> 140: Surface Tension experimental 3.385300e+01 dyn/cm
#> 141: Vapor Pressure experimental 1.230269e+00 mmHg
#> 142: Vapor Pressure experimental 1.277000e+00 mmHg
#> 143: Water Solubility experimental 4.786301e-03 mol/L
#> 144: Water Solubility experimental 1.000000e-03 mol/L
ccl4_phys_chem[dtxsid == 'DTXSID0020153', .(Mean_value = sapply(.SD, function(t){mean(t, na.rm = TRUE)})),
by = .(propName, propUnit), .SDcols = c("propValue")]
#> propName propUnit Mean_value
#> <char> <char> <num>
#> 1: Boiling Point °C 1.785878e+02
#> 2: Density g/cm^3 1.098140e+00
#> 3: Flash Point °C 6.772668e+01
#> 4: Henry's Law Constant atm-m3/mole 7.608625e-04
#> 5: LogKow: Octanol-Water Log10 unitless 2.326000e+00
#> 6: Melting Point °C -4.213864e+01
#> 7: Vapor Pressure mmHg 1.231121e+00
#> 8: Water Solubility mol/L 4.690942e-01
#> 9: pKa Acidic Apparent Log10 unitless NaN
#> 10: Androgen Receptor Agonist Binary 0/1 0.000000e+00
#> 11: Androgen Receptor Antagon Binary 0/1 0.000000e+00
#> 12: Androgen Receptor Binding Binary 0/1 0.000000e+00
#> 13: Atmos. Hydroxylation Rate cm^3/molecule*sec 2.884032e-12
#> 14: pKa Basic Apparent Log10 unitless NaN
#> 15: Bioconcentration Factor L/kg 6.760830e+01
#> 16: Biodeg. Half-Life days 4.897788e+00
#> 17: Caco-2 Permeability (Papp cm/sec 5.011872e-05
#> 18: Dielectric Constant Dimensionless NaN
#> 19: Estrogen Receptor Agonist Binary 0/1 0.000000e+00
#> 20: Estrogen Receptor Antagon Binary 0/1 0.000000e+00
#> 21: Estrogen Receptor Binding Binary 0/1 0.000000e+00
#> 22: Fish Biotrans. Half-Life days 1.318257e-01
#> 23: Fraction Unbound in Human Dimensionless 2.000000e-01
#> 24: In Vitro Intrinsic Hepati uL/min/million hepatocyte 2.713000e+01
#> 25: Index of Refraction Dimensionless 1.527000e+00
#> 26: Liquid Chromatography Ret minutes 8.700000e+00
#> 27: LogD5.5 Log10 unitless 2.469000e+00
#> 28: LogD7.4 Log10 unitless 2.469000e+00
#> 29: LogKoa: Octanol-Air Log10 unitless 4.160000e+00
#> 30: Molar Refractivity cm^3/mol 3.601800e+01
#> 31: Molar Volume cm^3/mol 1.171290e+02
#> 32: Oral Rat LD50 mg/kg 8.040000e+02
#> 33: Polarizability Å^3 1.427900e+01
#> 34: Ready Binary Biodegradabi Binary 0/1 0.000000e+00
#> 35: Soil Adsorp. Coeff. (Koc) L/kg 7.585776e+01
#> 36: Surface Tension dyn/cm 3.385300e+01
#> propName propUnit Mean_value
ccl4_phys_chem[dtxsid == 'DTXSID0020153', .(Mean_value = sapply(.SD, function(t){mean(t, na.rm = TRUE)})),
by = .(propName, propUnit, propType),
.SDcols = c("propValue")][order(propName)]
#> propName propUnit propType Mean_value
#> <char> <char> <char> <num>
#> 1: Androgen Receptor Agonist Binary 0/1 experimental 0.000000e+00
#> 2: Androgen Receptor Antagon Binary 0/1 experimental 0.000000e+00
#> 3: Androgen Receptor Binding Binary 0/1 experimental 0.000000e+00
#> 4: Atmos. Hydroxylation Rate cm^3/molecule*sec experimental 2.884032e-12
#> 5: Bioconcentration Factor L/kg experimental 6.760830e+01
#> 6: Biodeg. Half-Life days experimental 4.897788e+00
#> 7: Boiling Point °C predicted 1.784519e+02
#> 8: Boiling Point °C experimental 1.791995e+02
#> 9: Caco-2 Permeability (Papp cm/sec experimental 5.011872e-05
#> 10: Density g/cm^3 predicted 1.100044e+00
#> 11: Density g/cm^3 experimental 1.081000e+00
#> 12: Dielectric Constant Dimensionless experimental NaN
#> 13: Estrogen Receptor Agonist Binary 0/1 experimental 0.000000e+00
#> 14: Estrogen Receptor Antagon Binary 0/1 experimental 0.000000e+00
#> 15: Estrogen Receptor Binding Binary 0/1 experimental 0.000000e+00
#> 16: Fish Biotrans. Half-Life days experimental 1.318257e-01
#> 17: Flash Point °C predicted 6.738433e+01
#> 18: Flash Point °C experimental 7.388900e+01
#> 19: Fraction Unbound in Human Dimensionless experimental 2.000000e-01
#> 20: Henry's Law Constant atm-m3/mole predicted 4.870692e-04
#> 21: Henry's Law Constant atm-m3/mole experimental 2.951209e-03
#> 22: In Vitro Intrinsic Hepati uL/min/million hepatocyte experimental 2.713000e+01
#> 23: Index of Refraction Dimensionless experimental 1.527000e+00
#> 24: Liquid Chromatography Ret minutes experimental 8.700000e+00
#> 25: LogD5.5 Log10 unitless experimental 2.469000e+00
#> 26: LogD7.4 Log10 unitless experimental 2.469000e+00
#> 27: LogKoa: Octanol-Air Log10 unitless experimental 4.160000e+00
#> 28: LogKow: Octanol-Water Log10 unitless predicted 2.300000e+00
#> 29: LogKow: Octanol-Water Log10 unitless experimental 2.469000e+00
#> 30: Melting Point °C predicted -4.202915e+01
#> 31: Melting Point °C experimental -4.400000e+01
#> 32: Molar Refractivity cm^3/mol experimental 3.601800e+01
#> 33: Molar Volume cm^3/mol experimental 1.171290e+02
#> 34: Oral Rat LD50 mg/kg experimental 8.040000e+02
#> 35: Polarizability Å^3 experimental 1.427900e+01
#> 36: Ready Binary Biodegradabi Binary 0/1 experimental 0.000000e+00
#> 37: Soil Adsorp. Coeff. (Koc) L/kg experimental 7.585776e+01
#> 38: Surface Tension dyn/cm experimental 3.385300e+01
#> 39: Vapor Pressure mmHg predicted 1.227658e+00
#> 40: Vapor Pressure mmHg experimental 1.253634e+00
#> 41: Water Solubility mol/L predicted 5.312543e-01
#> 42: Water Solubility mol/L experimental 2.893150e-03
#> 43: pKa Acidic Apparent Log10 unitless experimental NaN
#> 44: pKa Basic Apparent Log10 unitless experimental NaN
#> propName propUnit propType Mean_valueAnalyzing and Visualizing Physico-chemical Properties from Two Environmental Contaminant Lists
We consider exploring the differences in mean predicted and experimental values for a variety of physico-chemical properties in an effort to understand better the CCL4 and NATADB lists. In particular, we examine “Vapor Pressure”, “Henry’s Law Constant”, and “Boiling Point” and plot the means by chemical for these using boxplots. We then compare the values by grouping by both data set and propType value.
We first examine the vapor pressures for all the chemicals in each list. We then graph these, grouped by propType and pooled together in separate plots. For this we will use boxplots.
Group first by DTXSID.
ccl4_vapor_all <- ccl4_phys_chem[propName %in% 'Vapor Pressure',
.(mean_vapor_pressure = sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'), by = .(dtxsid)]
natadb_vapor_all <- natadb_phys_chem[propName %in% 'Vapor Pressure',
.(mean_vapor_pressure = sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'), by = .(dtxsid)]Then group by DTXSID and then by property type.
ccl4_vapor_grouped <- ccl4_phys_chem[propName %in% 'Vapor Pressure',
.(mean_vapor_pressure = sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'),
by = .(dtxsid, propType)]
natadb_vapor_grouped <- natadb_phys_chem[propName %in% 'Vapor Pressure',
.(mean_vapor_pressure =
sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'),
by = .(dtxsid, propType)]Then examine the summary statistics of the data.
summary(ccl4_vapor_all)
#> dtxsid mean_vapor_pressure
#> Length:99 Min. : 0.00
#> Class :character 1st Qu.: 0.00
#> Mode :character Median : 0.00
#> Mean : 824.79
#> 3rd Qu.: 3.53
#> Max. :56601.99
#> NA's :5
summary(ccl4_vapor_grouped)
#> dtxsid propType mean_vapor_pressure
#> Length:174 Length:174 Min. : 0.00
#> Class :character Class :character 1st Qu.: 0.00
#> Mode :character Mode :character Median : 0.01
#> Mean : 646.97
#> 3rd Qu.: 5.76
#> Max. :65677.69
#> NA's :7
summary(natadb_vapor_all)
#> dtxsid mean_vapor_pressure
#> Length:154 Min. : 0.00
#> Class :character 1st Qu.: 0.01
#> Mode :character Median : 1.26
#> Mean : 1010.69
#> 3rd Qu.: 95.28
#> Max. :56601.99
summary(natadb_vapor_grouped)
#> dtxsid propType mean_vapor_pressure
#> Length:304 Length:304 Min. : 0.00
#> Class :character Class :character 1st Qu.: 0.01
#> Mode :character Mode :character Median : 1.29
#> Mean : 890.07
#> 3rd Qu.: 103.59
#> Max. :65677.69
#> NA's :3With such a large range of values covering several orders of magnitude, we log transform the data. Since some of these value are non-positive, some transformations may result in non-numeric values. These will be removed when plotting. We expect these values to be positive in general so we go ahead with these transformations.
ccl4_vapor_all[, log_transform_mean_vapor_pressure := log(mean_vapor_pressure)]
#> dtxsid mean_vapor_pressure log_transform_mean_vapor_pressure
#> <char> <num> <num>
#> 1: DTXSID0020153 1.231121e+00 0.2079252
#> 2: DTXSID0020446 4.627502e-07 -14.5860784
#> 3: DTXSID0020573 2.048720e-09 -20.0060505
#> 4: DTXSID0020600 1.240685e+03 7.1234187
#> 5: DTXSID0020814 1.080882e-08 -18.3429035
#> 6: DTXSID0021464 4.763659e-02 -3.0441542
#> 7: DTXSID0021541 3.979100e+03 8.2888110
#> 8: DTXSID0021917 1.465800e+02 4.9875712
#> 9: DTXSID0024052 3.204027e-07 -14.9536872
#> 10: DTXSID0024341 1.420000e-01 -1.9519282
#> 11: DTXSID0032578 3.894750e-05 -10.1532960
#> 12: DTXSID1020437 2.155800e+02 5.3733319
#> 13: DTXSID1021407 6.054889e-04 -7.4094743
#> 14: DTXSID1021409 2.482333e-07 -15.2088967
#> 15: DTXSID1021740 7.416327e+00 2.0036840
#> 16: DTXSID1021798 7.541886e-02 -2.5846979
#> 17: DTXSID1024174 4.270935e-06 -12.3636779
#> 18: DTXSID1024207 NaN NaN
#> 19: DTXSID1024338 7.011771e-08 -16.4730904
#> 20: DTXSID1026164 2.975461e-01 -1.2121860
#> 21: DTXSID1031040 0.000000e+00 -Inf
#> 22: DTXSID1037484 2.918868e-07 -15.0468999
#> 23: DTXSID1037486 2.508690e-07 -15.1983349
#> 24: DTXSID1037567 4.677351e-08 -16.8779487
#> 25: DTXSID2020684 1.566647e-03 -6.4588176
#> 26: DTXSID2021028 2.143330e+00 0.7623605
#> 27: DTXSID2021317 1.162623e+01 2.4532639
#> 28: DTXSID2021731 1.187968e+02 4.7774141
#> 29: DTXSID2022333 1.732809e+00 0.5497436
#> 30: DTXSID2024169 NaN NaN
#> 31: DTXSID2031083 2.238721e-09 -19.9173611
#> 32: DTXSID2037506 8.421185e-06 -11.6847600
#> 33: DTXSID2040282 NaN NaN
#> 34: DTXSID2052156 9.300947e-10 -20.7957347
#> 35: DTXSID3020203 1.986532e+03 7.5941457
#> 36: DTXSID3020702 1.336021e+01 2.5922806
#> 37: DTXSID3020833 2.454570e+02 5.5031219
#> 38: DTXSID3020964 2.320372e-01 -1.4608574
#> 39: DTXSID3021857 5.754399e-03 -5.1577906
#> 40: DTXSID3024366 5.666073e+01 4.0370813
#> 41: DTXSID3024869 1.010761e-01 -2.2918817
#> 42: DTXSID3031864 5.671483e-03 -5.1723046
#> 43: DTXSID3032464 3.736029e-06 -12.4974872
#> 44: DTXSID3034458 3.492983e-08 -17.1699246
#> 45: DTXSID3042219 3.879904e+00 1.3558104
#> 46: DTXSID3073137 NaN NaN
#> 47: DTXSID3074313 1.475665e-11 -24.9393272
#> 48: DTXSID4020533 3.588835e+01 3.5804128
#> 49: DTXSID4021503 1.354290e+02 4.9084476
#> 50: DTXSID4022361 1.866647e-06 -13.1913666
#> 51: DTXSID4022367 8.248239e-09 -18.6132661
#> 52: DTXSID4022448 2.587154e-05 -10.5623670
#> 53: DTXSID4022991 2.182579e-10 -22.2453437
#> 54: DTXSID4032611 4.887793e-04 -7.6235995
#> 55: DTXSID4034948 1.771162e-08 -17.8490448
#> 56: DTXSID5020023 2.466158e+02 5.5078318
#> 57: DTXSID5020576 2.240655e-09 -19.9164975
#> 58: DTXSID5020601 3.747130e+00 1.3209902
#> 59: DTXSID5021207 5.202845e+02 6.2543758
#> 60: DTXSID5024182 8.315175e+00 2.1180821
#> 61: DTXSID5039224 8.510203e+02 6.7464360
#> 62: DTXSID50867064 1.523564e-03 -6.4867028
#> 63: DTXSID6020301 7.462236e+03 8.9176103
#> 64: DTXSID6020856 3.157284e-01 -1.1528729
#> 65: DTXSID6021030 4.710846e-05 -9.9630580
#> 66: DTXSID6021032 6.854189e-01 -0.3777251
#> 67: DTXSID6022422 6.825767e-04 -7.2896356
#> 68: DTXSID6024177 2.643413e-02 -3.6330993
#> 69: DTXSID6037483 4.897788e-08 -16.8318970
#> 70: DTXSID6037485 5.011872e-08 -16.8088712
#> 71: DTXSID6037568 1.048258e-07 -16.0709660
#> 72: DTXSID7020005 1.386060e-01 -1.9761200
#> 73: DTXSID7020215 2.480000e-03 -5.9994967
#> 74: DTXSID7020637 5.660199e+04 10.9437995
#> 75: DTXSID7021029 3.588699e+00 1.2777897
#> 76: DTXSID7024241 1.658902e-06 -13.3093544
#> 77: DTXSID7047433 2.915760e-09 -19.6531354
#> 78: DTXSID8020044 2.340058e+01 3.1527609
#> 79: DTXSID8020090 5.431363e-01 -0.6103950
#> 80: DTXSID8020597 7.238398e-02 -2.6257702
#> 81: DTXSID8020832 3.599552e+03 8.1885647
#> 82: DTXSID8021062 9.393437e-02 -2.3651590
#> 83: DTXSID8022292 2.285895e-08 -17.5939230
#> 84: DTXSID8022377 5.079422e-09 -19.0980683
#> 85: DTXSID8023846 5.049610e-04 -7.5910293
#> 86: DTXSID8023848 9.256033e-06 -11.5902350
#> 87: DTXSID8025541 2.461996e-05 -10.6119530
#> 88: DTXSID8031865 5.472289e-01 -0.6028881
#> 89: DTXSID8052483 0.000000e+00 -Inf
#> 90: DTXSID9020243 7.906419e-05 -9.4452505
#> 91: DTXSID9021390 3.342026e+00 1.2065772
#> 92: DTXSID9021427 2.979643e-01 -1.2107817
#> 93: DTXSID9022366 7.831102e-10 -20.9677477
#> 94: DTXSID9023380 1.057641e-07 -16.0620544
#> 95: DTXSID9023914 1.672137e-04 -8.6962381
#> 96: DTXSID9024142 3.274791e-09 -19.5370119
#> 97: DTXSID9032113 1.177032e-08 -18.2576848
#> 98: DTXSID9032119 NaN NaN
#> 99: DTXSID9032329 6.988594e-07 -14.1738163
#> dtxsid mean_vapor_pressure log_transform_mean_vapor_pressure
ccl4_vapor_grouped[, log_transform_mean_vapor_pressure :=
log(mean_vapor_pressure)]
#> dtxsid propType mean_vapor_pressure log_transform_mean_vapor_pressure
#> <char> <char> <num> <num>
#> 1: DTXSID0020153 predicted 1.227658e+00 0.2051079
#> 2: DTXSID0020153 experimental 1.253634e+00 0.2260468
#> 3: DTXSID0020446 predicted 1.878700e-07 -15.4875156
#> 4: DTXSID0020446 experimental 1.974592e-06 -13.1351490
#> 5: DTXSID0020573 predicted 2.825234e-11 -24.2898450
#> ---
#> 170: DTXSID9032113 predicted 1.301896e-08 -18.1568589
#> 171: DTXSID9032113 experimental 6.775748e-09 -18.8099161
#> 172: DTXSID9032119 experimental NaN NaN
#> 173: DTXSID9032329 predicted 8.494045e-07 -13.9787303
#> 174: DTXSID9032329 experimental 3.977691e-07 -14.7373941
natadb_vapor_all[, log_transform_mean_vapor_pressure :=
log(mean_vapor_pressure)]
#> dtxsid mean_vapor_pressure log_transform_mean_vapor_pressure
#> <char> <num> <num>
#> 1: DTXSID0020153 1.231121e+00 0.2079252
#> 2: DTXSID0020448 5.632346e+01 4.0311111
#> 3: DTXSID0020523 2.723553e-04 -8.2084031
#> 4: DTXSID0020529 1.432584e-01 -1.9431052
#> 5: DTXSID0020600 1.240685e+03 7.1234187
#> ---
#> 150: DTXSID9020299 1.699063e-06 -13.2854336
#> 151: DTXSID9020827 2.985317e-06 -12.7218047
#> 152: DTXSID9021138 3.907251e-03 -5.5449212
#> 153: DTXSID9021261 7.500620e-04 -7.1953547
#> 154: DTXSID9041522 8.775678e-05 -9.3409414
natadb_vapor_grouped[, log_transform_mean_vapor_pressure :=
log(mean_vapor_pressure)]
#> dtxsid propType mean_vapor_pressure log_transform_mean_vapor_pressure
#> <char> <char> <num> <num>
#> 1: DTXSID0020153 predicted 1.227658e+00 0.2051079
#> 2: DTXSID0020153 experimental 1.253634e+00 0.2260468
#> 3: DTXSID0020448 predicted 5.663676e+01 4.0366582
#> 4: DTXSID0020448 experimental 5.381709e+01 3.9855911
#> 5: DTXSID0020523 predicted 2.446884e-04 -8.3155250
#> ---
#> 300: DTXSID9021138 predicted 3.359335e-03 -5.6960121
#> 301: DTXSID9021138 experimental 4.729125e-03 -5.3540151
#> 302: DTXSID9021261 predicted 7.500620e-04 -7.1953547
#> 303: DTXSID9021261 experimental NaN NaN
#> 304: DTXSID9041522 experimental 8.775678e-05 -9.3409414Now we plot the log transformed data.
First plot the CCL4 data.
ggplot(ccl4_vapor_all, aes(log_transform_mean_vapor_pressure)) +
geom_boxplot() +
coord_flip()
#> Warning: Removed 7 rows containing non-finite outside the scale range (`stat_boxplot()`).
ggplot(ccl4_vapor_grouped, aes(propType, log_transform_mean_vapor_pressure)) +
geom_boxplot()
#> Warning: Removed 9 rows containing non-finite outside the scale range (`stat_boxplot()`).
Then plot the NATA data.

ggplot(natadb_vapor_grouped, aes(propType, log_transform_mean_vapor_pressure)) +
geom_boxplot()
#> Warning: Removed 3 rows containing non-finite outside the scale range (`stat_boxplot()`).
Finally, we compare both sets simultaneously. We add in a column to each data.table denoting to which data set the rows correspond and then combine the rows from both data sets together using the function rbind().
ccl4_vapor_grouped[, set := 'CCL4']
#> dtxsid propType mean_vapor_pressure log_transform_mean_vapor_pressure set
#> <char> <char> <num> <num> <char>
#> 1: DTXSID0020153 predicted 1.227658e+00 0.2051079 CCL4
#> 2: DTXSID0020153 experimental 1.253634e+00 0.2260468 CCL4
#> 3: DTXSID0020446 predicted 1.878700e-07 -15.4875156 CCL4
#> 4: DTXSID0020446 experimental 1.974592e-06 -13.1351490 CCL4
#> 5: DTXSID0020573 predicted 2.825234e-11 -24.2898450 CCL4
#> ---
#> 170: DTXSID9032113 predicted 1.301896e-08 -18.1568589 CCL4
#> 171: DTXSID9032113 experimental 6.775748e-09 -18.8099161 CCL4
#> 172: DTXSID9032119 experimental NaN NaN CCL4
#> 173: DTXSID9032329 predicted 8.494045e-07 -13.9787303 CCL4
#> 174: DTXSID9032329 experimental 3.977691e-07 -14.7373941 CCL4
natadb_vapor_grouped[, set := 'NATADB']
#> dtxsid propType mean_vapor_pressure log_transform_mean_vapor_pressure set
#> <char> <char> <num> <num> <char>
#> 1: DTXSID0020153 predicted 1.227658e+00 0.2051079 NATADB
#> 2: DTXSID0020153 experimental 1.253634e+00 0.2260468 NATADB
#> 3: DTXSID0020448 predicted 5.663676e+01 4.0366582 NATADB
#> 4: DTXSID0020448 experimental 5.381709e+01 3.9855911 NATADB
#> 5: DTXSID0020523 predicted 2.446884e-04 -8.3155250 NATADB
#> ---
#> 300: DTXSID9021138 predicted 3.359335e-03 -5.6960121 NATADB
#> 301: DTXSID9021138 experimental 4.729125e-03 -5.3540151 NATADB
#> 302: DTXSID9021261 predicted 7.500620e-04 -7.1953547 NATADB
#> 303: DTXSID9021261 experimental NaN NaN NATADB
#> 304: DTXSID9041522 experimental 8.775678e-05 -9.3409414 NATADB
all_vapor_grouped <- rbind(ccl4_vapor_grouped, natadb_vapor_grouped)Now we plot the combined data. First we color the boxplots based on the property type, with mean log transformed vapor pressure plotted for each data set and property type.
vapor_box <- ggplot(all_vapor_grouped,
aes(set, log_transform_mean_vapor_pressure)) +
geom_boxplot(aes(color = propType))
vapor_box
#> Warning: Removed 12 rows containing non-finite outside the scale range (`stat_boxplot()`).
Next we color the boxplots based on the data set.
vapor <- ggplot(all_vapor_grouped, aes(log_transform_mean_vapor_pressure)) +
geom_boxplot((aes(color = set))) +
coord_flip()
vapor
#> Warning: Removed 12 rows containing non-finite outside the scale range (`stat_boxplot()`).
In the plots above, when we graph the data separated both by data set and property type as well as just by data set, we observe the general trend that the NATADB chemicals have a higher mean vapor pressure than the CCL4 chemicals.
We also explore Henry’s Law constant and boiling point in a similar fashion.
Group by DTXSID.
ccl4_hlc_all <- ccl4_phys_chem[propName %in% "Henry's Law Constant",
.(mean_hlc = sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'), by = .(dtxsid)]
natadb_hlc_all <- natadb_phys_chem[propName %in% "Henry's Law Constant",
.(mean_hlc = sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'), by = .(dtxsid)]Group by DTXSID and property type.
ccl4_hlc_grouped <- ccl4_phys_chem[propName %in% "Henry's Law Constant",
.(mean_hlc = sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'),
by = .(dtxsid, propType)]
natadb_hlc_grouped <- natadb_phys_chem[propName %in% "Henry's Law Constant",
.(mean_hlc = sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'),
by = .(dtxsid, propType)]Examine summary statistics.
summary(ccl4_hlc_all)
#> dtxsid mean_hlc
#> Length:90 Min. :0.0000000
#> Class :character 1st Qu.:0.0000000
#> Mode :character Median :0.0000011
#> Mean :0.0210178
#> 3rd Qu.:0.0000966
#> Max. :1.5013626
summary(ccl4_hlc_grouped)
#> dtxsid propType mean_hlc
#> Length:157 Length:157 Min. :0.0000000
#> Class :character Class :character 1st Qu.:0.0000000
#> Mode :character Mode :character Median :0.0000015
#> Mean :0.0166674
#> 3rd Qu.:0.0001291
#> Max. :1.5473437
#> NA's :2
summary(natadb_hlc_all)
#> dtxsid mean_hlc
#> Length:152 Min. : 0.000
#> Class :character 1st Qu.: 0.000
#> Mode :character Median : 0.000
#> Mean : 25.195
#> 3rd Qu.: 0.002
#> Max. :3824.293
summary(natadb_hlc_grouped)
#> dtxsid propType mean_hlc
#> Length:274 Length:274 Min. : 0.000
#> Class :character Class :character 1st Qu.: 0.000
#> Mode :character Mode :character Median : 0.000
#> Mean : 14.855
#> 3rd Qu.: 0.002
#> Max. :4063.311Again, we log transform the data as it covers several orders of magnitude. We expect these values to be positive in general so we go ahead with these transformations.
ccl4_hlc_all[, log_transform_mean_hlc := log(mean_hlc)]
#> dtxsid mean_hlc log_transform_mean_hlc
#> <char> <num> <num>
#> 1: DTXSID0020153 7.608625e-04 -7.1810579
#> 2: DTXSID0020446 5.917914e-08 -16.6426967
#> 3: DTXSID0020573 3.715352e-06 -12.5030371
#> 4: DTXSID0020600 1.493028e-04 -8.8095341
#> 5: DTXSID0020814 2.041738e-07 -15.4042943
#> 6: DTXSID0021464 1.678272e-08 -17.9029160
#> 7: DTXSID0021541 9.965029e-03 -4.6086735
#> 8: DTXSID0021917 1.501363e+00 0.4063731
#> 9: DTXSID0024052 9.335794e-11 -23.0945802
#> 10: DTXSID0032578 5.708485e-07 -14.3761419
#> 11: DTXSID1020437 5.910373e-03 -5.1310463
#> 12: DTXSID1021407 9.105946e-07 -13.9091681
#> 13: DTXSID1021409 4.290971e-07 -14.6615827
#> 14: DTXSID1021740 1.284597e-05 -11.2624803
#> 15: DTXSID1021798 2.071620e-06 -13.0871795
#> 16: DTXSID1024174 1.509734e-06 -13.4035773
#> 17: DTXSID1024338 3.665514e-09 -19.4242972
#> 18: DTXSID1026164 1.978642e-06 -13.1330997
#> 19: DTXSID1037484 1.202264e-09 -20.5390590
#> 20: DTXSID1037486 1.174898e-09 -20.5620849
#> 21: DTXSID1037567 4.365158e-10 -21.5521965
#> 22: DTXSID2020684 8.526210e-06 -11.6723656
#> 23: DTXSID2021028 1.677186e-04 -8.6932227
#> 24: DTXSID2021317 2.391691e-03 -6.0357547
#> 25: DTXSID2021731 1.585469e-05 -11.0520451
#> 26: DTXSID2022333 1.488889e-02 -4.2071400
#> 27: DTXSID2031083 5.623413e-11 -23.6014972
#> 28: DTXSID2037506 3.176631e-09 -19.5674445
#> 29: DTXSID2052156 3.801894e-10 -21.6903516
#> 30: DTXSID3020203 2.795640e-01 -1.2745241
#> 31: DTXSID3020702 6.168270e-07 -14.2986772
#> 32: DTXSID3020833 6.305823e-04 -7.3688669
#> 33: DTXSID3020964 4.483844e-05 -10.0124448
#> 34: DTXSID3021857 2.089296e-03 -6.1709280
#> 35: DTXSID3024366 1.476024e-04 -8.8209882
#> 36: DTXSID3024869 6.240907e-07 -14.2869701
#> 37: DTXSID3031864 1.819701e-11 -24.7297639
#> 38: DTXSID3032464 2.241663e-06 -13.0082925
#> 39: DTXSID3034458 8.128305e-09 -18.6279134
#> 40: DTXSID3042219 1.016034e-02 -4.5892630
#> 41: DTXSID3074313 2.041738e-11 -24.6146346
#> 42: DTXSID4020533 5.273916e-06 -12.1527374
#> 43: DTXSID4021503 1.902843e-03 -6.2644060
#> 44: DTXSID4022361 3.267786e-08 -17.2365681
#> 45: DTXSID4022367 1.047129e-09 -20.6772141
#> 46: DTXSID4022448 2.132508e-08 -17.6633821
#> 47: DTXSID4022991 1.230269e-11 -25.1212034
#> 48: DTXSID4032611 3.812004e-06 -12.4773554
#> 49: DTXSID4034948 5.061593e-09 -19.1015846
#> 50: DTXSID5020023 1.844509e-04 -8.5981272
#> 51: DTXSID5020576 9.332543e-08 -16.1871732
#> 52: DTXSID5020601 2.032776e-07 -15.4086934
#> 53: DTXSID5021207 8.762638e-05 -9.3424285
#> 54: DTXSID5024182 7.477348e-03 -4.8958771
#> 55: DTXSID5039224 7.530076e-05 -9.4940203
#> 56: DTXSID50867064 1.174898e-08 -18.2594998
#> 57: DTXSID6020301 3.266831e-02 -3.4213498
#> 58: DTXSID6020856 4.568074e-04 -7.6912486
#> 59: DTXSID6021030 7.565670e-04 -7.1867194
#> 60: DTXSID6021032 8.403150e-05 -9.3843188
#> 61: DTXSID6022422 5.360194e-08 -16.7416806
#> 62: DTXSID6024177 1.285277e-07 -15.8671213
#> 63: DTXSID6037483 5.495409e-10 -21.3219380
#> 64: DTXSID6037485 5.623413e-10 -21.2989121
#> 65: DTXSID6037568 8.317638e-09 -18.6048876
#> 66: DTXSID7020005 2.450917e-08 -17.5242187
#> 67: DTXSID7020637 4.882250e-07 -14.5324894
#> 68: DTXSID7021029 1.281830e-05 -11.2646365
#> 69: DTXSID7024241 1.297787e-06 -13.5548503
#> 70: DTXSID7047433 6.760830e-08 -16.5095351
#> 71: DTXSID8020044 4.981622e-06 -12.2097550
#> 72: DTXSID8020090 7.725040e-03 -4.8632882
#> 73: DTXSID8020597 2.350772e-07 -15.2633520
#> 74: DTXSID8020832 1.144615e-02 -4.4701020
#> 75: DTXSID8021062 3.866425e-06 -12.4631803
#> 76: DTXSID8022292 1.451240e-06 -13.4430924
#> 77: DTXSID8022377 3.715352e-06 -12.5030371
#> 78: DTXSID8023846 1.632925e-09 -20.2328929
#> 79: DTXSID8023848 1.032272e-08 -18.3889189
#> 80: DTXSID8025541 4.168694e-07 -14.6904929
#> 81: DTXSID8031865 9.965663e-05 -9.2137800
#> 82: DTXSID9020243 1.907474e-06 -13.1697307
#> 83: DTXSID9021390 3.108242e-04 -8.0762831
#> 84: DTXSID9021427 2.270435e-07 -15.2981240
#> 85: DTXSID9022366 5.128614e-09 -19.0884304
#> 86: DTXSID9023380 4.168694e-09 -19.2956631
#> 87: DTXSID9023914 5.003672e-11 -23.7182639
#> 88: DTXSID9024142 1.345777e-06 -13.5185387
#> 89: DTXSID9032113 7.734149e-08 -16.3750352
#> 90: DTXSID9032329 5.382367e-07 -14.4349674
#> dtxsid mean_hlc log_transform_mean_hlc
ccl4_hlc_grouped[, log_transform_mean_hlc := log(mean_hlc)]
#> dtxsid propType mean_hlc log_transform_mean_hlc
#> <char> <char> <num> <num>
#> 1: DTXSID0020153 predicted 4.870692e-04 -7.627104
#> 2: DTXSID0020153 experimental 2.951209e-03 -5.825540
#> 3: DTXSID0020446 predicted 7.359553e-08 -16.424682
#> 4: DTXSID0020446 experimental 1.513561e-09 -20.308801
#> 5: DTXSID0020573 experimental 3.715352e-06 -12.503037
#> ---
#> 153: DTXSID9024142 experimental 2.691535e-06 -12.825399
#> 154: DTXSID9032113 predicted 1.121455e-10 -22.911224
#> 155: DTXSID9032113 experimental 3.090295e-07 -14.989829
#> 156: DTXSID9032329 predicted 1.490847e-08 -18.021336
#> 157: DTXSID9032329 experimental 1.584893e-06 -13.354994
natadb_hlc_all[, log_transform_mean_hlc := log(mean_hlc)]
#> dtxsid mean_hlc log_transform_mean_hlc
#> <char> <num> <num>
#> 1: DTXSID0020153 7.608625e-04 -7.181058
#> 2: DTXSID0020448 2.710185e-03 -5.910739
#> 3: DTXSID0020523 1.068282e-07 -16.052044
#> 4: DTXSID0020529 2.793630e-07 -15.090754
#> 5: DTXSID0020600 1.493028e-04 -8.809534
#> ---
#> 148: DTXSID9020293 2.951209e-06 -12.733296
#> 149: DTXSID9020299 9.828547e-08 -16.135390
#> 150: DTXSID9020827 1.813010e-06 -13.220522
#> 151: DTXSID9021138 5.623413e-08 -16.693742
#> 152: DTXSID9041522 8.912509e-06 -11.628055
natadb_hlc_grouped[, log_transform_mean_hlc := log(mean_hlc)]
#> dtxsid propType mean_hlc log_transform_mean_hlc
#> <char> <char> <num> <num>
#> 1: DTXSID0020153 predicted 4.870692e-04 -7.627104
#> 2: DTXSID0020153 experimental 2.951209e-03 -5.825540
#> 3: DTXSID0020448 predicted 2.707002e-03 -5.911913
#> 4: DTXSID0020448 experimental 2.818383e-03 -5.871592
#> 5: DTXSID0020523 predicted 1.107746e-07 -16.015769
#> ---
#> 270: DTXSID9020299 experimental 4.897788e-10 -21.437067
#> 271: DTXSID9020827 predicted 2.134778e-06 -13.057148
#> 272: DTXSID9020827 experimental 2.041738e-07 -15.404294
#> 273: DTXSID9021138 experimental 5.623413e-08 -16.693742
#> 274: DTXSID9041522 experimental 8.912509e-06 -11.628055We compare both sets simultaneously. We add in a column to each data.table denoting to which set the rows correspond and then rbind() the rows together.
Label and combine data.
ccl4_hlc_grouped[, set := 'CCL4']
#> dtxsid propType mean_hlc log_transform_mean_hlc set
#> <char> <char> <num> <num> <char>
#> 1: DTXSID0020153 predicted 4.870692e-04 -7.627104 CCL4
#> 2: DTXSID0020153 experimental 2.951209e-03 -5.825540 CCL4
#> 3: DTXSID0020446 predicted 7.359553e-08 -16.424682 CCL4
#> 4: DTXSID0020446 experimental 1.513561e-09 -20.308801 CCL4
#> 5: DTXSID0020573 experimental 3.715352e-06 -12.503037 CCL4
#> ---
#> 153: DTXSID9024142 experimental 2.691535e-06 -12.825399 CCL4
#> 154: DTXSID9032113 predicted 1.121455e-10 -22.911224 CCL4
#> 155: DTXSID9032113 experimental 3.090295e-07 -14.989829 CCL4
#> 156: DTXSID9032329 predicted 1.490847e-08 -18.021336 CCL4
#> 157: DTXSID9032329 experimental 1.584893e-06 -13.354994 CCL4
natadb_hlc_grouped[, set := 'NATADB']
#> dtxsid propType mean_hlc log_transform_mean_hlc set
#> <char> <char> <num> <num> <char>
#> 1: DTXSID0020153 predicted 4.870692e-04 -7.627104 NATADB
#> 2: DTXSID0020153 experimental 2.951209e-03 -5.825540 NATADB
#> 3: DTXSID0020448 predicted 2.707002e-03 -5.911913 NATADB
#> 4: DTXSID0020448 experimental 2.818383e-03 -5.871592 NATADB
#> 5: DTXSID0020523 predicted 1.107746e-07 -16.015769 NATADB
#> ---
#> 270: DTXSID9020299 experimental 4.897788e-10 -21.437067 NATADB
#> 271: DTXSID9020827 predicted 2.134778e-06 -13.057148 NATADB
#> 272: DTXSID9020827 experimental 2.041738e-07 -15.404294 NATADB
#> 273: DTXSID9021138 experimental 5.623413e-08 -16.693742 NATADB
#> 274: DTXSID9041522 experimental 8.912509e-06 -11.628055 NATADB
all_hlc_grouped <- rbind(ccl4_hlc_grouped, natadb_hlc_grouped)Plot data. Some rows are removed due to transformations above that result in non-valid values.
hlc_box <- ggplot(all_hlc_grouped, aes(set, log_transform_mean_hlc)) +
geom_boxplot(aes(color = propType))
hlc_box
#> Warning: Removed 2 rows containing non-finite outside the scale range (`stat_boxplot()`).
hlc <- ggplot(all_hlc_grouped, aes(log_transform_mean_hlc)) +
geom_boxplot(aes(color = set)) +
coord_flip()
hlc
#> Warning: Removed 2 rows containing non-finite outside the scale range (`stat_boxplot()`).
Again, we observe that in both grouping by propType and aggregating all results together by data set, that the chemicals in NATADB have a generally higher mean Henry’s Law Constant value than those in CCL4.
Finally, we consider boiling point.
Group by DTXSID.
ccl4_boiling_all <- ccl4_phys_chem[propName %in% 'Boiling Point',
.(mean_boiling_point = sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'), by = .(dtxsid)]
natadb_boiling_all <- natadb_phys_chem[propName %in% 'Boiling Point',
.(mean_boiling_point =
sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'), by = .(dtxsid)]Group by DTXSID and property type.
ccl4_boiling_grouped <- ccl4_phys_chem[propName %in% 'Boiling Point',
.(mean_boiling_point =
sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'),
by = .(dtxsid, propType)]
natadb_boiling_grouped <- natadb_phys_chem[propName %in% 'Boiling Point',
.(mean_boiling_point =
sapply(.SD, function(t) {mean(t, na.rm = TRUE)})),
.SDcols = c('propValue'),
by = .(dtxsid, propType)]Calculate summary statistics.
summary(ccl4_boiling_all)
#> dtxsid mean_boiling_point
#> Length:99 Min. : -40.49
#> Class :character 1st Qu.: 173.57
#> Mode :character Median : 270.50
#> Mean : 372.31
#> 3rd Qu.: 378.49
#> Max. :4745.20
#> NA's :2
summary(ccl4_boiling_grouped)
#> dtxsid propType mean_boiling_point
#> Length:158 Length:158 Min. : -40.71
#> Class :character Class :character 1st Qu.: 117.76
#> Mode :character Mode :character Median : 210.70
#> Mean : 299.05
#> 3rd Qu.: 346.45
#> Max. :4745.20
#> NA's :7
summary(natadb_boiling_all)
#> dtxsid mean_boiling_point
#> Length:155 Min. :-87.60
#> Class :character 1st Qu.: 78.45
#> Mode :character Median :180.75
#> Mean :178.62
#> 3rd Qu.:268.35
#> Max. :685.00
summary(natadb_boiling_grouped)
#> dtxsid propType mean_boiling_point
#> Length:298 Length:298 Min. :-87.64
#> Class :character Class :character 1st Qu.: 74.68
#> Mode :character Mode :character Median :177.59
#> Mean :171.81
#> 3rd Qu.:255.22
#> Max. :685.00
#> NA's :3Since some of the boiling point values have negative values, we cannot log transform these values. If we try, as you will see below, there will be warnings of NaNs produced.
ccl4_boiling_all[, log_transform := log(mean_boiling_point)]
#> Warning in log(mean_boiling_point): NaNs produced
#> dtxsid mean_boiling_point log_transform
#> <char> <num> <num>
#> 1: DTXSID0020153 178.587788 5.185080
#> 2: DTXSID0020446 312.294000 5.743945
#> 3: DTXSID0020573 410.789000 6.018080
#> 4: DTXSID0020600 10.694795 2.369757
#> 5: DTXSID0020814 394.674500 5.978061
#> 6: DTXSID0021464 177.000000 5.176150
#> 7: DTXSID0021541 -24.134679 NaN
#> 8: DTXSID0021917 68.796359 4.231151
#> 9: DTXSID0024052 386.788500 5.957878
#> 10: DTXSID0024341 193.018519 5.262786
#> 11: DTXSID0032578 382.420000 5.946519
#> 12: DTXSID1020437 57.244333 4.047329
#> 13: DTXSID1021407 253.445500 5.535149
#> 14: DTXSID1021409 428.540000 6.060384
#> 15: DTXSID1021740 117.692246 4.768073
#> 16: DTXSID1021798 237.331678 5.469459
#> 17: DTXSID1024174 320.600667 5.770196
#> 18: DTXSID1024207 4745.200000 8.464889
#> 19: DTXSID1024338 265.000000 5.579730
#> 20: DTXSID1026164 201.854624 5.307548
#> 21: DTXSID1031040 2870.000000 7.962067
#> 22: DTXSID1037484 355.717000 5.874135
#> 23: DTXSID1037486 356.701000 5.876898
#> 24: DTXSID1037567 314.000000 5.749393
#> 25: DTXSID2020684 270.666333 5.600887
#> 26: DTXSID2021028 175.747349 5.169047
#> 27: DTXSID2021317 130.753506 4.873314
#> 28: DTXSID2021731 72.591563 4.284849
#> 29: DTXSID2022333 173.565800 5.156557
#> 30: DTXSID2024169 1962.111111 7.581776
#> 31: DTXSID2031083 265.000000 5.579730
#> 32: DTXSID2037506 328.332500 5.794027
#> 33: DTXSID2040282 NaN NaN
#> 34: DTXSID2052156 410.311000 6.016915
#> 35: DTXSID3020203 -4.346433 NaN
#> 36: DTXSID3020702 115.631944 4.750412
#> 37: DTXSID3020833 55.103032 4.009205
#> 38: DTXSID3020964 210.604407 5.349982
#> 39: DTXSID3021857 281.000000 5.638355
#> 40: DTXSID3024366 89.343205 4.492485
#> 41: DTXSID3024869 207.710000 5.336143
#> 42: DTXSID3031864 242.666667 5.491689
#> 43: DTXSID3032464 382.914500 5.947812
#> 44: DTXSID3034458 404.806000 6.003408
#> 45: DTXSID3042219 159.151500 5.069857
#> 46: DTXSID3073137 NaN NaN
#> 47: DTXSID3074313 423.281500 6.048037
#> 48: DTXSID4020533 101.287786 4.617966
#> 49: DTXSID4021503 67.711957 4.215263
#> 50: DTXSID4022361 364.965000 5.899801
#> 51: DTXSID4022367 391.085000 5.968925
#> 52: DTXSID4022448 344.409500 5.841831
#> 53: DTXSID4022991 603.700500 6.403078
#> 54: DTXSID4032611 280.169000 5.635393
#> 55: DTXSID4034948 329.000000 5.796058
#> 56: DTXSID5020023 52.685656 3.964343
#> 57: DTXSID5020576 423.611500 6.048817
#> 58: DTXSID5020601 297.406250 5.695099
#> 59: DTXSID5021207 34.225691 3.532977
#> 60: DTXSID5024182 124.452605 4.823925
#> 61: DTXSID5039224 20.316171 3.011417
#> 62: DTXSID50867064 270.497500 5.600263
#> 63: DTXSID6020301 -40.485711 NaN
#> 64: DTXSID6020856 202.478620 5.310634
#> 65: DTXSID6021030 215.131750 5.371251
#> 66: DTXSID6021032 206.036870 5.328055
#> 67: DTXSID6022422 398.217426 5.986998
#> 68: DTXSID6024177 223.850000 5.410976
#> 69: DTXSID6037483 315.000000 5.752573
#> 70: DTXSID6037485 315.000000 5.752573
#> 71: DTXSID6037568 365.223000 5.900508
#> 72: DTXSID7020005 221.587122 5.400816
#> 73: DTXSID7020215 268.000000 5.590987
#> 74: DTXSID7020637 -9.793602 NaN
#> 75: DTXSID7021029 152.598654 5.027811
#> 76: DTXSID7024241 365.386750 5.900956
#> 77: DTXSID7047433 399.043000 5.989069
#> 78: DTXSID8020044 96.330772 4.567788
#> 79: DTXSID8020090 184.070222 5.215317
#> 80: DTXSID8020597 197.400567 5.285235
#> 81: DTXSID8020832 3.601822 1.281440
#> 82: DTXSID8021062 213.962963 5.365803
#> 83: DTXSID8022292 355.628333 5.873886
#> 84: DTXSID8022377 409.458500 6.014836
#> 85: DTXSID8023846 287.000000 5.659482
#> 86: DTXSID8023848 287.818667 5.662331
#> 87: DTXSID8025541 339.798000 5.828351
#> 88: DTXSID8031865 190.491583 5.249608
#> 89: DTXSID8052483 2700.000000 7.901007
#> 90: DTXSID9020243 308.081500 5.730364
#> 91: DTXSID9021390 156.429344 5.052604
#> 92: DTXSID9021427 184.386569 5.217034
#> 93: DTXSID9022366 425.510000 6.053288
#> 94: DTXSID9023380 378.490000 5.936190
#> 95: DTXSID9023914 369.577000 5.912359
#> 96: DTXSID9024142 427.009000 6.056805
#> 97: DTXSID9032113 399.426500 5.990030
#> 98: DTXSID9032119 1089.950000 6.993887
#> 99: DTXSID9032329 436.091500 6.077852
#> dtxsid mean_boiling_point log_transform
ccl4_boiling_grouped[, log_transform := log(mean_boiling_point)]
#> Warning in log(mean_boiling_point): NaNs produced
#> dtxsid propType mean_boiling_point log_transform
#> <char> <char> <num> <num>
#> 1: DTXSID0020153 predicted 178.4519 5.184319
#> 2: DTXSID0020153 experimental 179.1995 5.188500
#> 3: DTXSID0020446 predicted 270.5000 5.600272
#> 4: DTXSID0020446 experimental 354.0880 5.869545
#> 5: DTXSID0020573 predicted 413.4500 6.024537
#> ---
#> 154: DTXSID9024142 experimental 501.5135 6.217631
#> 155: DTXSID9032113 experimental 399.4265 5.990030
#> 156: DTXSID9032119 predicted 1089.9500 6.993887
#> 157: DTXSID9032119 experimental NaN NaN
#> 158: DTXSID9032329 experimental 436.0915 6.077852
natadb_boiling_all[, log_transform := log(mean_boiling_point)]
#> Warning in log(mean_boiling_point): NaNs produced
#> dtxsid mean_boiling_point log_transform
#> <char> <num> <num>
#> 1: DTXSID0020153 178.58779 5.185080
#> 2: DTXSID0020448 95.90601 4.563369
#> 3: DTXSID0020523 331.04850 5.802265
#> 4: DTXSID0020529 299.97740 5.703707
#> 5: DTXSID0020600 10.69479 2.369757
#> ---
#> 151: DTXSID9020299 401.17400 5.994395
#> 152: DTXSID9020827 376.07600 5.929791
#> 153: DTXSID9021138 267.87083 5.590505
#> 154: DTXSID9021261 685.00000 6.529419
#> 155: DTXSID9041522 339.85714 5.828525
natadb_boiling_grouped[, log_transform := log(mean_boiling_point)]
#> Warning in log(mean_boiling_point): NaNs produced
#> dtxsid propType mean_boiling_point log_transform
#> <char> <char> <num> <num>
#> 1: DTXSID0020153 predicted 178.45185 5.184319
#> 2: DTXSID0020153 experimental 179.19950 5.188500
#> 3: DTXSID0020448 predicted 96.07302 4.565108
#> 4: DTXSID0020448 experimental 94.73700 4.551105
#> 5: DTXSID0020523 experimental 331.04850 5.802265
#> ---
#> 294: DTXSID9021138 experimental 267.19650 5.587984
#> 295: DTXSID9021261 predicted 685.00000 6.529419
#> 296: DTXSID9021261 experimental NaN NaN
#> 297: DTXSID9041522 predicted 340.00000 5.828946
#> 298: DTXSID9041522 experimental 339.50000 5.827474We compare both sets simultaneously. We add in a column to each data.table denoting to which set the rows correspond and then rbind() the rows together. We use the values as is rather than transforming them.
Label and combine data.
ccl4_boiling_grouped[, set := 'CCL4']
#> dtxsid propType mean_boiling_point log_transform set
#> <char> <char> <num> <num> <char>
#> 1: DTXSID0020153 predicted 178.4519 5.184319 CCL4
#> 2: DTXSID0020153 experimental 179.1995 5.188500 CCL4
#> 3: DTXSID0020446 predicted 270.5000 5.600272 CCL4
#> 4: DTXSID0020446 experimental 354.0880 5.869545 CCL4
#> 5: DTXSID0020573 predicted 413.4500 6.024537 CCL4
#> ---
#> 154: DTXSID9024142 experimental 501.5135 6.217631 CCL4
#> 155: DTXSID9032113 experimental 399.4265 5.990030 CCL4
#> 156: DTXSID9032119 predicted 1089.9500 6.993887 CCL4
#> 157: DTXSID9032119 experimental NaN NaN CCL4
#> 158: DTXSID9032329 experimental 436.0915 6.077852 CCL4
natadb_boiling_grouped[, set := 'NATADB']
#> dtxsid propType mean_boiling_point log_transform set
#> <char> <char> <num> <num> <char>
#> 1: DTXSID0020153 predicted 178.45185 5.184319 NATADB
#> 2: DTXSID0020153 experimental 179.19950 5.188500 NATADB
#> 3: DTXSID0020448 predicted 96.07302 4.565108 NATADB
#> 4: DTXSID0020448 experimental 94.73700 4.551105 NATADB
#> 5: DTXSID0020523 experimental 331.04850 5.802265 NATADB
#> ---
#> 294: DTXSID9021138 experimental 267.19650 5.587984 NATADB
#> 295: DTXSID9021261 predicted 685.00000 6.529419 NATADB
#> 296: DTXSID9021261 experimental NaN NaN NATADB
#> 297: DTXSID9041522 predicted 340.00000 5.828946 NATADB
#> 298: DTXSID9041522 experimental 339.50000 5.827474 NATADB
all_boiling_grouped <- rbind(ccl4_boiling_grouped, natadb_boiling_grouped)Plot the data.
boiling_box <- ggplot(all_boiling_grouped, aes(set, mean_boiling_point)) +
geom_boxplot(aes(color = propType))
boiling_box
#> Warning: Removed 10 rows containing non-finite outside the scale range (`stat_boxplot()`).
boiling <- ggplot(all_boiling_grouped, aes(mean_boiling_point)) +
geom_boxplot(aes(color = set)) +
coord_flip()
boiling
#> Warning: Removed 10 rows containing non-finite outside the scale range (`stat_boxplot()`).
A visual inspection of this set of graphs is not as clear as in the previous cases. Note that the experimental values for each data set tend to be higher than the predicted. The mean of CCL4, by predicted and experimental appears to be greater than the corresponding means for NATADB, as does the overall mean, but the interquartile ranges of these different groupings yield slightly different results. This gives us a sense that the picture for boiling point is not as clear cut between experimental and predicted for these two data sets as it was in the previous cases of physico-chemical properties we investigated.
Answer to Environmental Health Question 2
Through inspecting the last several plots, we can answer Environmental Health Question 2: The physico-chemical property data are reported with both experimental and predicted values present for many chemicals. Are there differences between the mean predicted and experimental results for a variety of physico-chemical properties?
Answer: There are indeed differences between the mean values of various physico-chemical properties when grouped by predicted or experimental. In the case of “Vapor Pressure”, the means of experimental values tend to be a little lower than predicted, though they are much closer in the case of NATADB than CCL4. The trend of lower predicted means compared to experimental means is more clearly demonstrated for “Henry’s Law Constant” values in both data sets. In the case of “Boiling Point”, the experimental values are greater than the predicted values, though this is much more pronounced in CCL4 while the set of means for NATADB are again fairly close.
Hazard Data: Genotoxicity
Now, having examined some of the distributions of the physico-chemical properties of the two lists, aggregated between predicted and experimental, we move towards learning more about these chemicals beyond physico-chemical properties. Specifically, we will examine their genotoxicity.
Using the standard CompTox Chemicals Dashboard approach to access genotoxicity, one would again navigate to the individual chemical page

Once one navigates to the genotoxicity tab highlighted in the previous page, the following is displayed as seen here:

This page includes two sets of information, the first of which provides a summary of available genotoxicity data while the second provides the individual reports and samples of such data.
We again use the CTX APIs to streamline the process of retrieving this information in a programmatic fashion. To this end, we will use the genotoxicity endpoints found within the Hazard endpoints of the CTX APIs. Pictured below is the particular set of genotoxicity resources available in the Hazard endpoints of the CTX APIs.

There are both summary and detail resources, reflecting the information one can find on the CompTox Chemicals Dashboard Genotoxicity page for a given chemical.
To access the genetox endpoint, we will use the function get_genetox_summary(). Since we have a list of chemicals, rather than searching individually for each chemical, we use the batch search version of the function, named get_genetox_summary_batch(). We will examine this and then access the details.
Grab the data using the APIs.
ccl4_genotox <- get_genetox_summary_batch(DTXSID = ccl4$dtxsid)
natadb_genetox <- get_genetox_summary_batch(DTXSID = natadb$dtxsid)Examine the dimensions.
Examine the column names and data from the first six chemicals with genetox data from CCL4.
colnames(ccl4_genotox)
#> [1] "id" "dtxsid" "reportsPositive" "reportsNegative" "reportsOther"
#> [6] "ames" "micronucleus" "clowderDocId" "genetoxCall" "genetoxSummaryId"
head(ccl4_genotox)
#> id dtxsid reportsPositive reportsNegative reportsOther ames micronucleus
#> <int> <char> <int> <int> <int> <char> <char>
#> 1: 92 DTXSID0020153 20 5 1 positive positive
#> 2: 4399 DTXSID0020446 0 8 0 negative negative
#> 3: 930 DTXSID0020573 3 9 0 negative negative
#> 4: 93 DTXSID0020600 20 0 1 positive positive
#> 5: 2079 DTXSID0020814 1 0 0 <NA> <NA>
#> 6: 320 DTXSID0021464 8 6 0 positive positive
#> 3 variables not shown: [clowderDocId <char>, genetoxCall <char>, genetoxSummaryId <int>]The information returned is of the first variety highlighted in the image above, that is, the summary data on the available genotoxicity data for each chemical.
Observe that we have information on 71 chemicals from the CCL4 data and 153 from the NATA data. We note the chemicals not included in the results and then dig into the returned results.
ccl4[!(dtxsid %in% ccl4_genotox$dtxsid),
.(dtxsid, casrn, preferredName, molFormula)]
#> dtxsid casrn preferredName molFormula
#> <char> <char> <char> <char>
#> 1: DTXSID001024118 77238-39-2 Microcystin <NA>
#> 2: DTXSID0024052 55290-64-7 Dimethipin C6H10O4S2
#> 3: DTXSID0032578 59669-26-0 Thiodicarb C10H18N4O4S3
#> 4: DTXSID1037484 194992-44-4 Acetochlor OA C14H19NO4
#> 5: DTXSID1037486 171262-17-2 2-[(2,6-Diethylphenyl)(me C14H19NO4
#> 6: DTXSID1037567 171118-09-5 Metolachlor ESA C15H23NO5S
#> 7: DTXSID2022333 135-98-8 sec-Butylbenzene C10H14
#> 8: DTXSID2031083 143545-90-8 Cylindrospermopsin C15H21N5O7S
#> 9: DTXSID2037506 16655-82-6 3-Hydroxycarbofuran C12H15NO4
#> 10: DTXSID2052156 517-09-9 Equilenin C18H18O2
#> 11: DTXSID3021857 25154-52-3 Nonylphenol C15H24O
#> 12: DTXSID3034458 99129-21-2 Clethodim C17H26ClNO3S
#> 13: DTXSID3042219 103-65-1 Propylbenzene C9H12
#> 14: DTXSID3073137 14866-68-3 Chlorate ClO3
#> 15: DTXSID3074313 35523-89-8 Saxitoxin C10H17N7O4
#> 16: DTXSID4022448 51218-45-2 Metolachlor C15H22ClNO2
#> 17: DTXSID4032611 13194-48-4 Ethoprop C8H19O2PS2
#> 18: DTXSID4034948 112410-23-8 Tebufenozide C22H28N2O2
#> 19: DTXSID50867064 64285-06-9 Anatoxin a C10H15NO
#> 20: DTXSID6024177 10265-92-6 Methamidophos C2H8NO2PS
#> 21: DTXSID6037483 187022-11-3 Acetochlor ESA C14H21NO5S
#> 22: DTXSID6037485 142363-53-9 Alachlor ESA C14H21NO5S
#> 23: DTXSID6037568 152019-73-3 Metolachlor OA C15H21NO4
#> 24: DTXSID7024241 42874-03-3 Oxyfluorfen C15H11ClF3NO4
#> 25: DTXSID7047433 474-86-2 Equilin C18H20O2
#> 26: DTXSID8022377 57-91-0 17alpha-Estradiol C18H24O2
#> 27: DTXSID8052483 7440-56-4 Germanium Ge
#> 28: DTXSID9032113 107534-96-3 Tebuconazole C16H22ClN3O
#> 29: DTXSID9032329 741-58-2 Bensulide C14H24NO4PS3
#> dtxsid casrn preferredName molFormula
natadb[!(dtxsid %in% natadb_genetox$dtxsid),
.(dtxsid, casrn, preferredName, molFormula)]
#> dtxsid casrn preferredName molFormula
#> <char> <char> <char> <char>
#> 1: DTXSID00872421 NOCAS_872421 Lead & Lead Compounds <NA>
#> 2: DTXSID1020273 7782-50-5 Chlorine Cl2
#> 3: DTXSID10872417 NOCAS_872417 Cadmium & Cadmium Compoun <NA>
#> 4: DTXSID30872414 NOCAS_872414 Antimony & Antimony Compo <NA>
#> 5: DTXSID30872419 NOCAS_872419 Cobalt & Cobalt Compounds <NA>
#> 6: DTXSID40872425 NOCAS_872425 Nickel & Nickel Compounds <NA>
#> 7: DTXSID5024267 1336-36-3 Polychlorinated biphenyls C12
#> 8: DTXSID7020687 608-73-1 1,2,3,4,5,6-Hexachlorocyc C6H6Cl6
#> 9: DTXSID7023984 NOCAS_23984 Coke oven emissions <NA>
#> 10: DTXSID90872415 NOCAS_872415 Arsenic & Arsenic Compoun <NA>Now, we access the genotoxicity details of the chemicals in each data set using the function get_genetox_details_batch(). We explore the dimensions of the returned queries, the column names, and the first few lines of the data.
Grab the data from the CTX APIs.
ccl4_genetox_details <- get_genetox_details_batch(DTXSID = ccl4$dtxsid)
natadb_genetox_details <- get_genetox_details_batch(DTXSID = natadb$dtxsid)Examine the dimensions.
Look at the column names and the first six rows of the data from the CCL4 chemicals.
colnames(ccl4_genetox_details)
#> [1] "id" "source" "year" "dtxsid"
#> [5] "strain" "species" "assayCategory" "assayType"
#> [9] "metabolicActivation" "assayResult"
head(ccl4_genetox_details)
#> Empty data.table (0 rows and 10 cols): id,source,year,dtxsid,strain,species...We examine the information returned for the first chemical in each set of results, which is DTXSID0020153. Notice that the information is identical in each case as this information is chemical specific and not data set specific.
Look at the dimensions first.
dim(ccl4_genetox_details[dtxsid %in% 'DTXSID0020153', ])
#> [1] 0 10
dim(natadb_genetox_details[dtxsid %in% 'DTXSID0020153', ])
#> [1] 0 10Now examine the first few rows.
head(ccl4_genetox_details[dtxsid %in% 'DTXSID0020153', ])
#> Empty data.table (0 rows and 10 cols): id,source,year,dtxsid,strain,species...Observe that the data is the same for each data set when restricting to the same chemical. This is because the information we are retrieving is specific to the chemical and not dependent on the chemical lists to which the chemical may belong.
We now explore the assays present for chemicals in each data set. We first determine the unique values of the assayCategory column and then group by these values and determine the number of unique assays for each assayCategory value.
Determine the unique assay categories.
ccl4_genetox_details[, unique(assayCategory)]
#> character(0)
natadb_genetox_details[, unique(assayCategory)]
#> character(0)Determine the unique assays for each data set and list them.
ccl4_genetox_details[, unique(assayType)]
#> character(0)
natadb_genetox_details[, unique(assayType)]
#> character(0)Determine the number of assays per unique assayCategory value.
ccl4_genetox_details[, .(Assays = length(unique(assayType))),
by = .(assayCategory)]
#> Empty data.table (0 rows and 2 cols): assayCategory,Assays
natadb_genetox_details[, .(Assays = length(unique(assayType))),
by = .(assayCategory)]
#> Empty data.table (0 rows and 2 cols): assayCategory,AssaysWe can analyze these results more closely, counting the number of assay results and grouping by assayCategory, and assayType. We also examine the different numbers of assayCategory and assayTypes values used.
ccl4_genetox_details[, .N, by = .(assayCategory, assayType, assayResult)]
#> Empty data.table (0 rows and 4 cols): assayCategory,assayType,assayResult,N
ccl4_genetox_details[, .N, by = .(assayCategory)]
#> Empty data.table (0 rows and 2 cols): assayCategory,NWe look at the assayType values and numbers of each for the three different assayCategory values.
ccl4_genetox_details[assayCategory == 'in vitro', .N, by = .(assayType)]
#> Empty data.table (0 rows and 2 cols): assayType,N
ccl4_genetox_details[assayCategory == 'ND', .N, by = .(assayType)]
#> Empty data.table (0 rows and 2 cols): assayType,N
ccl4_genetox_details[assayCategory == 'in vivo', .N, by = .(assayType)]
#> Empty data.table (0 rows and 2 cols): assayType,NNow we repeat this for NATADB.
natadb_genetox_details[, .N, by = .(assayCategory, assayType, assayResult)]
#> Empty data.table (0 rows and 4 cols): assayCategory,assayType,assayResult,N
natadb_genetox_details[, .N, by = .(assayCategory)]
#> Empty data.table (0 rows and 2 cols): assayCategory,NExamine the number of rows for each assayType value by each assaycategory value.
natadb_genetox_details[assayCategory == 'in vitro', .N, by = .(assayType)]
#> Empty data.table (0 rows and 2 cols): assayType,N
natadb_genetox_details[assayCategory == 'ND', .N, by = .(assayType)]
#> Empty data.table (0 rows and 2 cols): assayType,N
natadb_genetox_details[assayCategory == 'in vivo', .N, by = .(assayType)]
#> Empty data.table (0 rows and 2 cols): assayType,NAnswer to Environmental Health Question 3
From these initial explorations of the data, we can answer Environmental Health Question 3: After pulling the genotoxicity data for the different environmental contaminant data sets, list the assays associated with the chemicals in each data set. How many unique assays are there in each data set? What are the different assay categories and how many unique assays for each assay category are there?
Answer: There are 87 unique assays for CCl4 and 113 unique assays for NATADB. The different assay categories are “in vitro”, “ND”, and “in vivo”, with 62 unique “in vitro” assays for CCl4 and 82 for NATADB, 2 unique “ND” assays for CCL4 and 2 for NATADB, and 23 unique “in vivo” assays for CCL4 and 29 for NATADB.
Next, we dig into the results of the assays. One may be interested in looking at the number of chemicals for which an assay resulted in a positive or negative result for instance. We group by assayResult and determine the number of unique dtxsid values associated with each assayResult value.
Answer to Environmental Health Question 4
With this data we may now answer Environmental Health Question 4: The genotoxicity data contains information on which assays have been conducted for different chemicals and the results of those assays. How many chemicals in each data set have a ‘positive’, ‘negative’, and ‘equivocal’ value for the assay result?
Answer: For CCL4, there are 63 unique chemicals that have a negative assay result, 53 that have a positive result, and 14 that have an equivocal result. For NATADB, there are 139 unique chemicals that have a negative assay result, 129 that have a positive result, and 47 that have an equivocal result. Observe that since there are 71 unique dtxsid values with assay results in CCL4 and 153 in NATADB, there are several chemicals that have multiple assay results.
We now determine the chemicals from each data set that are known to have genotoxic effects. For this, we look to see which chemicals produce at least one positive response in the assayResult column.
ccl4_genetox_details[, .(is_positive = any(assayResult == 'positive')),
by = .(dtxsid)][is_positive == TRUE, dtxsid]
#> character(0)
natadb_genetox_details[, .(is_positive = any(assayResult == 'positive')),
by = .(dtxsid)][is_positive == TRUE, dtxsid]
#> character(0)With so much genotoxicity data, let us explore this data for one chemical more deeply to get a sense of the assays and results present for it. We will explore the chemical with DTXSID0020153. We will look at the assays, the number of each type of result, and which correspond to “positive” results. To determine this, we group by assayResult and calculate .N for each group. We also isolate which were positive and output a data.table with the number of each type.
ccl4_genetox_details[dtxsid == 'DTXSID0020153', .(Number = .N),
by = .(assayResult)]
#> Empty data.table (0 rows and 2 cols): assayResult,Number
ccl4_genetox_details[dtxsid == 'DTXSID0020153' & assayResult == 'positive',
.(Number_of_assays = .N), by = .(assayType)][order(-Number_of_assays),]
#> Empty data.table (0 rows and 2 cols): assayType,Number_of_assaysAnswer to Environmental Health Question 5
With these data.tables, we may answer Environmental Health Question 5: Based on the genotoxicity data reported for the chemical with DTXSID identifier DTXSID0020153, how many assays resulted in a positive/equivocal/negative value? Which of the assays were positive and how many of each were there for the most reported assays?
Answer: There were five assays that produced a negative result, 20 that produced a positive result, and one that produced an equivocal result. Of the 20 positive assays, “InVitroCA”, “InVitroMLA”, “Ames”, “Sister-chromatid exchange (SCE) in vitro”, bacterial reverse mutation assay” and “Rec-assay, DNA effects (bacterial DNA repair)” were the most numerous, with two each.
Hazard Resource
Finally, we examine the hazard data associated with the chemicals in each data set. For each chemical, there will be potentially hundreds of rows of hazard data, so the returned results will be much larger than in most other API endpoints.
ccl4_hazard <- get_hazard_by_dtxsid_batch(DTXSID = ccl4$dtxsid)
natadb_hazard <- get_hazard_by_dtxsid_batch(DTXSID = natadb$dtxsid)We do some preliminary exploration of the data. First we determine the dimensions of the data sets.
Next we record the column names and display the first six results in the CCL4 hazard data.
colnames(ccl4_hazard)
#> [1] "id" "source" "year"
#> [4] "dtxsid" "exposureRoute" "toxvalNumeric"
#> [7] "toxvalNumericQualifier" "toxvalUnits" "studyType"
#> [10] "studyDurationClass" "studyDuractionValue" "studyDurationUnits"
#> [13] "strain" "sex" "population"
#> [16] "exposureMethod" "exposureForm" "media"
#> [19] "lifestage" "generation" "criticalEffect"
#> [22] "detailText" "supercategory" "speciesCommon"
#> [25] "humanEcoNt" "priorityId" "subsource"
#> [28] "sourceUrl" "subsourceUrl" "riskAssessmentClass"
#> [31] "toxvalType" "toxvalSubtype" "casrn"
#> [34] "name" "toxvalTypeDefinition" "toxvalTypeSuperCategory"
#> [37] "qualifier" "humanEco" "studyDurationValue"
#> [40] "latinName" "speciesSupercategory" "toxicologicalEffect"
#> [43] "experimentalRecord" "studyGroup" "longRef"
#> [46] "doi" "title" "author"
#> [49] "guideline" "quality" "qcCategory"
#> [52] "sourceHash" "externalSourceId" "externalSourceIdDesc"
#> [55] "storedSourceRecord" "toxvalTypeOriginal" "toxvalSubtypeOriginal"
#> [58] "toxvalNumericOriginal" "toxvalUnitsOriginal" "studyTypeOriginal"
#> [61] "studyDurationClassOriginal" "studyDurationValueOriginal" "studyDurationUnitsOriginal"
#> [64] "speciesOriginal" "strainOriginal" "sexOriginal"
#> [67] "generationOriginal" "lifestageOriginal" "exposureRouteOriginal"
#> [70] "exposureMethodOriginal" "exposureFormOriginal" "mediaOriginal"
#> [73] "toxicologicalEffectOriginal" "originalYear"
head(ccl4_hazard)
#> id source year dtxsid exposureRoute toxvalNumeric toxvalNumericQualifier
#> <int> <char> <char> <char> <char> <num> <char>
#> 1: 43659 Cal OEHHA 2023 DTXSID001024118 oral 0.00003 <NA>
#> 2: 43660 Cal OEHHA 2023 DTXSID001024118 oral 0.00003 <NA>
#> 3: 43661 Cal OEHHA 2023 DTXSID001024118 oral 0.00300 <NA>
#> 4: 43662 Cal OEHHA 2023 DTXSID001024118 oral 0.00300 <NA>
#> 5: 233997 NIOSH 1994 DTXSID0020153 inhalation 51.77120 <NA>
#> 6: 258405 PPRTV (CPHEA) - DTXSID0020153 oral 6.40000 <NA>
#> 67 variables not shown: [toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, exposureForm <char>, ...]We determine the number of unique values in the criticalEffect, toxvalTypeSuperCategory, and toxvalType columns for each data set.
The number of unique values for criticalEffect.
length(ccl4_hazard[, unique(criticalEffect)])
#> [1] 1
length(natadb_hazard[, unique(criticalEffect)])
#> [1] 1The number of unique values of toxvalTypeSuperCategory.
length(ccl4_hazard[, unique(toxvalTypeSuperCategory)])
#> [1] 5
length(natadb_hazard[, unique(toxvalTypeSuperCategory)])
#> [1] 5The number of unique values for toxvalType.
length(ccl4_hazard[, unique(toxvalType)])
#> [1] 131
length(natadb_hazard[, unique(toxvalType)])
#> [1] 158Now we look at the number of entries per toxvalTypeSuperCategory.
ccl4_hazard[, .N, by = .(toxvalTypeSuperCategory)]
#> toxvalTypeSuperCategory N
#> <char> <int>
#> 1: Media Exposure Guidelines 2209
#> 2: Acute Exposure Guidelines 733
#> 3: Dose Response Summary Val 7489
#> 4: Toxicity Value 1236
#> 5: Mortality Response Summar 550
natadb_hazard[, .N, by = .(toxvalTypeSuperCategory)]
#> toxvalTypeSuperCategory N
#> <char> <int>
#> 1: Media Exposure Guidelines 5607
#> 2: Acute Exposure Guidelines 2380
#> 3: Toxicity Value 3516
#> 4: Dose Response Summary Val 8045
#> 5: Mortality Response Summar 991With over 7,000 results for the toxvalTypeSuperCategory value “Dose Response Summary Value” for each data set, we dig into this further.
We determine the number of rows grouped by toxvalType that have the “Dose Response Summary Value” toxvalTypeSuperCategory value, and display this descending.
ccl4_hazard[toxvalTypeSuperCategory %in% 'Dose Response Summary Value', .N,
by = .(toxvalType)][order(-N),]
#> toxvalType N
#> <char> <int>
#> 1: NOEL 1830
#> 2: LOEL 1580
#> 3: NOAEL 1175
#> 4: LEL 935
#> 5: LOAEL 930
#> 6: NEL 469
#> 7: BMDL (05) 96
#> 8: NOAEC 71
#> 9: LOAEL (HEC) 48
#> 10: LOAEC 43
#> 11: NOAEL (HEC) 40
#> 12: NOAEL (ADJ) 35
#> 13: LOAEL (ADJ) 32
#> 14: BMDL (10) 24
#> 15: NOEL (TAD) 22
#> 16: BMCL (HEC) 20
#> 17: NOEC 18
#> 18: LOEL (TAD) 16
#> 19: LOEC 15
#> 20: BMCL (10) 14
#> 21: BMDL (1SD) 14
#> 22: BMCL (10 HEC) 12
#> 23: NOAEL (HED) 6
#> 24: BMCL (ADJ) 4
#> 25: BMC (10) 4
#> 26: FEL 4
#> 27: BMDL (5RD HED) 4
#> 28: BMCL (05 RD) 2
#> 29: BMCL 2
#> 30: BMC 2
#> 31: AEL 2
#> 32: BMCL (1SD) 2
#> 33: FEL (HEC) 2
#> 34: FEL (ADJ) 2
#> 35: BMD (2.5) 2
#> 36: BMDL 2
#> 37: LOAEL (HED) 2
#> 38: BMDL (0.5SD HED) 2
#> 39: BMDL (ADJ) 2
#> 40: BMDL (HED) 2
#> 41: BMDL (05 HED) 2
#> toxvalType N
natadb_hazard[toxvalTypeSuperCategory %in% 'Dose Response Summary Value', .N,
by = .(toxvalType)][order(-N),]
#> toxvalType N
#> <char> <int>
#> 1: NOEL 1660
#> 2: NOAEL 1611
#> 3: LOEL 1570
#> 4: LOAEL 1141
#> 5: LEL 581
#> 6: NOAEC 317
#> 7: NEL 219
#> 8: LOAEL (HEC) 126
#> 9: NOAEL (HEC) 114
#> 10: LOAEL (ADJ) 86
#> 11: LOAEC 80
#> 12: NOAEL (ADJ) 80
#> 13: BMDL (10) 69
#> 14: BMDL (1SD) 40
#> 15: NOEC 37
#> 16: BMCL (10 HEC) 36
#> 17: BMCL (10) 36
#> 18: BMCL (HEC) 30
#> 19: BMD (50) 16
#> 20: LOEC 14
#> 21: BMDL 12
#> 22: BMDL (HEC) 10
#> 23: BMC (10) 10
#> 24: BMDL (HED) 8
#> 25: BMCL (10 ADJ) 8
#> 26: BMCL (1SD) 8
#> 27: BMDL (05) 8
#> 28: BMDL (10 HED) 8
#> 29: BMDL (01) 6
#> 30: LOAEL (99 HED) 6
#> 31: BMDL (10 ADJ) 6
#> 32: BMCL (1 SD HEC) 6
#> 33: BMC (10 HEC) 6
#> 34: FEL 6
#> 35: NOAEL (HED) 6
#> 36: BMDL (05 HED) 4
#> 37: BMDL (01 HED99) 4
#> 38: BMC 4
#> 39: BMCL (ADJ) 4
#> 40: BMCL 4
#> 41: BMDL (10 HEC) 4
#> 42: BMD (2.5) 4
#> 43: LOAEL (HED) 4
#> 44: FEL (ADJ) 4
#> 45: FEL (HEC) 4
#> 46: BMCL (0.25 SD) 4
#> 47: BMD 4
#> 48: BMCL (05 RD) 2
#> 49: BMDL (01 99 HEC) 2
#> 50: LOAEL (99 HEC) 2
#> 51: BMDL (01 99 HED) 2
#> 52: BMCL (1SD HEC) 2
#> 53: BMDL (ADJ) 2
#> 54: BMDL (5 ADJ) 2
#> 55: BMC (HEC) 2
#> 56: BMD (2X) 2
#> 57: BMDL (2X) 2
#> toxvalType NWe explore “NOAEL”, “LOAEL”, and “NOEL” further. Let us look at the the case when media value is either “food” or “culture”. For this, we will recover the minimum value of “NOAEL”, “LOAEL”, and “NOEL” for each chemical in each data set.
First, we look at “food”. We order by toxvalType and by the minimum toxvalNumeric value in each group, descending.
ccl4_hazard[media %in% 'food' & toxvalType %in% c('LOAEL', 'NOAEL', 'NOEL'),
.(toxvalNumeric = min(toxvalNumeric)),
by = .(toxvalType, toxvalUnits, dtxsid)][order(toxvalType,
-toxvalNumeric)]
#> toxvalType toxvalUnits dtxsid toxvalNumeric
#> <char> <char> <char> <num>
#> 1: LOAEL mg/kg-day DTXSID2021731 250.0
#> 2: LOAEL mg/kg-day DTXSID4020533 1.0
#> 3: NOAEL mg/kg-day DTXSID2021731 50.0
#> 4: NOAEL mg/kg-day DTXSID7020637 9.4
#> 5: NOAEL mg/kg-day DTXSID4020533 0.5
natadb_hazard[media %in% 'food' & toxvalType %in% c('LOAEL', 'NOAEL', 'NOEL'),
.(toxvalNumeric = min(toxvalNumeric)),
by = .(toxvalType, toxvalUnits, dtxsid)][order(toxvalType,
-toxvalNumeric)]
#> toxvalType toxvalUnits dtxsid toxvalNumeric
#> <char> <char> <char> <num>
#> 1: LOAEL mg/kg-day DTXSID2021781 1833.000
#> 2: LOAEL mg/kg-day DTXSID3039242 263.600
#> 3: LOAEL mg/kg-day DTXSID7021360 260.000
#> 4: LOAEL mg/kg-day DTXSID2021731 250.000
#> 5: LOAEL mg/kg-day DTXSID5020607 183.000
#> 6: LOAEL mg/kg-day DTXSID2021105 70.700
#> 7: LOAEL mg/kg-day DTXSID0020868 50.000
#> 8: LOAEL mg/kg-day DTXSID1020306 41.000
#> 9: LOAEL mg/kg-day DTXSID6020438 34.400
#> 10: LOAEL mg/kg-day DTXSID9020827 8.000
#> 11: LOAEL mg/kg-day DTXSID0021383 7.000
#> 12: LOAEL mg/kg-day DTXSID2021319 7.000
#> 13: LOAEL mg/kg-day DTXSID2021446 2.600
#> 14: LOAEL mg/kg-day DTXSID7021106 2.400
#> 15: LOAEL mg/kg-day DTXSID8021434 1.700
#> 16: LOAEL mg/kg-day DTXSID3020679 1.000
#> 17: LOAEL mg/kg-day DTXSID4020533 1.000
#> 18: LOAEL mg/kg-day DTXSID7020687 0.140
#> 19: NOAEL mg/kg-day DTXSID0021381 1000.000
#> 20: NOAEL mg/kg-day DTXSID2021781 550.000
#> 21: NOAEL mg/kg-day DTXSID2021731 50.000
#> 22: NOAEL mg/kg-day DTXSID3039242 26.360
#> 23: NOAEL mg/kg-day DTXSID7021360 26.000
#> 24: NOAEL mg/kg-day DTXSID5021889 25.000
#> 25: NOAEL mg/kg-day DTXSID5020607 18.300
#> 26: NOAEL mg/kg-day DTXSID6020438 17.200
#> 27: NOAEL mg/kg-day DTXSID8020250 16.000
#> 28: NOAEL mg/kg-day DTXSID1020306 15.000
#> 29: NOAEL mg/kg-day DTXSID7020637 9.400
#> 30: NOAEL mg/kg-day DTXSID7021368 8.000
#> 31: NOAEL mg/kg-day DTXSID2021105 7.070
#> 32: NOAEL mg/kg-day DTXSID0020868 5.850
#> 33: NOAEL mg/kg-day DTXSID9020827 4.000
#> 34: NOAEL mg/kg-day DTXSID8021438 2.500
#> 35: NOAEL mg/kg-day DTXSID2021446 2.100
#> 36: NOAEL mg/kg-day DTXSID2021319 1.400
#> 37: NOAEL mg/kg-day DTXSID0021383 0.700
#> 38: NOAEL mg/kg-day DTXSID4020533 0.500
#> 39: NOAEL mg/kg-day DTXSID7021106 0.240
#> 40: NOAEL mg/kg-day DTXSID8021434 0.170
#> 41: NOAEL mg/kg-day DTXSID3020679 0.100
#> 42: NOAEL mg/kg-day DTXSID7020687 0.014
#> toxvalType toxvalUnits dtxsid toxvalNumericNext we look at “culture”, repeating the same grouping and ordering as in the previous case.
ccl4_hazard[media %in% 'culture' & toxvalType %in% c('LOAEL', 'NOAEL', 'NOEL'),
.(toxvalNumeric = min(toxvalNumeric)),
by = .(toxvalType, toxvalUnits, dtxsid)][order(toxvalType,
-toxvalNumeric)]
#> toxvalType toxvalUnits dtxsid toxvalNumeric
#> <char> <char> <char> <num>
#> 1: NOEL mg/L DTXSID9021427 300.0
#> 2: NOEL mg/L DTXSID5020601 10.0
#> 3: NOEL mg/mL DTXSID2021731 2.0
#> 4: NOEL mM DTXSID5020601 0.1
natadb_hazard[media %in% 'culture' & toxvalType %in% c('LOAEL', 'NOAEL', 'NOEL'),
.(toxvalNumeric = min(toxvalNumeric)),
by = .(toxvalType, toxvalUnits, dtxsid)][order(toxvalType,
-toxvalNumeric)]
#> toxvalType toxvalUnits dtxsid toxvalNumeric
#> <char> <char> <char> <num>
#> 1: NOEL mg/L DTXSID0039227 30.0
#> 2: NOEL mg/L DTXSID5020601 10.0
#> 3: NOEL mg/mL DTXSID2021731 2.0
#> 4: NOEL mM DTXSID5020601 0.1Now, let us restrict our attention to human hazard and focus on the exposure routes given by inhalation and oral.
First, let us determine the exposure routes in general.
ccl4_hazard[humanEco %in% 'human health', unique(exposureRoute)]
#> [1] "oral" "inhalation" "Inhalation" "multiple" "not specified"
#> [6] "injection" "dermal" "environmental" "implant" "osmotic minipump"
#> [11] "maternal"
natadb_hazard[humanEco %in% 'human health', unique(exposureRoute)]
#> [1] "oral" "inhalation" "multiple" "Inhalation" "not specified" "environmental"
#> [7] "injection" "dermal"Then, let’s focus on the inhalation and oral exposure routes for human hazard.
To answer this, filter the data into the corresponding exposure routes, then group by exposureRoute and riskAssessmentClass, and finally count the number of instances for each grouping. To determine the most represented class, one can order the results descending.
ccl4_hazard[humanEco %in% 'human health' &
exposureRoute %in% c('inhalation', 'oral'), .(Hits = .N),
by = .(exposureRoute, riskAssessmentClass)][order(exposureRoute,
-Hits)]
#> exposureRoute riskAssessmentClass Hits
#> <char> <char> <int>
#> 1: inhalation Air 987
#> 2: inhalation - 796
#> 3: inhalation Non-cancer 362
#> 4: inhalation Cancer 200
#> 5: inhalation Developmental/Reproductiv 15
#> 6: oral - 6133
#> 7: oral Water 852
#> 8: oral Non-cancer 278
#> 9: oral Cancer 115
#> 10: oral Developmental/Reproductiv 11
natadb_hazard[humanEco %in% 'human health' &
exposureRoute %in% c('inhalation', 'oral'), .(Hits = .N),
by = .(exposureRoute, riskAssessmentClass)][order(exposureRoute,
-Hits)]
#> exposureRoute riskAssessmentClass Hits
#> <char> <char> <int>
#> 1: inhalation Air 3036
#> 2: inhalation - 2027
#> 3: inhalation Non-cancer 1090
#> 4: inhalation Cancer 715
#> 5: inhalation Developmental/Reproductiv 23
#> 6: oral - 5669
#> 7: oral Water 2278
#> 8: oral Non-cancer 721
#> 9: oral Cancer 448
#> 10: oral Developmental/Reproductiv 39Answer to Environmental Health Question 6
With these results we may answer Environmental Health Question 6: After pulling the hazard data for the different data sets, list the different exposure routes for which there is data. What are the unique risk assessment classes for hazard values for the oral route and for the inhalation exposure route? For each such exposure route, which risk assessment class is most represented by the data sets?
Answer: We listed the general exposure routes above for the hazard data associated with the chemicals in each data set. Restricting our attention to human hazard data, the “air” riskAssessmentClass is most represented by the inhalation exposure route and “water” for the oral exposure route for both the CCL4 and NATADB data sets.
We now drill down a little further before moving into a different path for data exploration. We explore the different types of toxicity values present in each data set for the inhalation and oral exposure routes, and then see which of these are common to both exposure routes for each data set.
To answer this, we filter the rows to the “human health” humanEco value and “inhalation” or “oral” exposureRoute value. Then we return the unique values that toxvalType takes.
First we look at CCL4.
ccl4_hazard[humanEco %in% 'human health' &
exposureRoute %in% c('inhalation'), unique(toxvalType)]
#> [1] "IDLH" "Reference Exposure Level"
#> [3] "cancer slope factor" "cancer unit risk"
#> [5] "air contaminant limit" "screening level (industrial air)"
#> [7] "screening level (residential air)" "LC50"
#> [9] "LEL" "PAC-3"
#> [11] "PAC-2" "PAC-1"
#> [13] "RfC (provisional)" "BMCL (HEC)"
#> [15] "DNEL systemic" "NEL"
#> [17] "NOAEL" "LOAEL"
#> [19] "RfC" "AEGL 2 - 60 min (final)"
#> [21] "AEGL 2 - 10 min (final)" "AEGL 2 - 30 min (final)"
#> [23] "MRL" "BMCL (05 RD)"
#> [25] "AEGL 2 - 4 hr (final)" "AEGL 2 - 8 hr (final)"
#> [27] "AEGL 3 - 60 min (final)" "AEGL 3 - 30 min (final)"
#> [29] "NOAEL (HEC)" "AEGL 3 - 4 hr (final)"
#> [31] "AEGL 3 - 10 min (final)" "AEGL 3 - 8 hr (final)"
#> [33] "DNEL local" "NOAEC"
#> [35] "LOAEL (HEC)" "LOAEL (ADJ)"
#> [37] "NOAEL (ADJ)" "RfD"
#> [39] "BMCL" "BMCL (ADJ)"
#> [41] "BMC" "AEL"
#> [43] "BMCL (10)" "LOAEC"
#> [45] "Level of Distinct Odor Awareness (LOA)" "AEGL 3 - 8 hr (interim)"
#> [47] "AEGL 3 - 4 hr (interim)" "AEGL 3 - 60 min (interim)"
#> [49] "AEGL 3 - 30 min (interim)" "AEGL 3 - 10 min (interim)"
#> [51] "AEGL 2 - 8 hr (interim)" "AEGL 2 - 4 hr (interim)"
#> [53] "AEGL 2 - 60 min (interim)" "AEGL 2 - 30 min (interim)"
#> [55] "OEHHA MADL" "BMDL (1SD)"
#> [57] "AEGL 1 - 10 min (interim)" "AEGL 1 - 30 min (interim)"
#> [59] "AEGL 1 - 60 min (interim)" "AEGL 1 - 4 hr (interim)"
#> [61] "AEGL 1 - 8 hr (interim)" "AEGL 2 - 10 min (interim)"
#> [63] "BMCL (10 HEC)" "BMC (10)"
#> [65] "AEGL 1 - 4 hr (final)" "AEGL 1 - 60 min (final)"
#> [67] "AEGL 1 - 10 min (final)" "AEGL 1 - 30 min (final)"
#> [69] "AEGL 1 - 8 hr (final)" "tolerable concentration in air"
#> [71] "BMCL (1SD)" "NOEC"
#> [73] "FEL (HEC)" "FEL"
#> [75] "FEL (ADJ)" "LOEC"
#> [77] "OEHHA NSRL" "AEGL 2 - 8 hr (proposed)"
#> [79] "AEGL 2 - 4 hr (proposed)" "AEGL 2 - 60 min (proposed)"
#> [81] "AEGL 2 - 30 min (proposed)" "AEGL 2 - 10 min (proposed)"
#> [83] "AEGL 1 - 8 hr (proposed)" "AEGL 1 - 4 hr (proposed)"
#> [85] "AEGL 1 - 60 min (proposed)" "AEGL 1 - 30 min (proposed)"
#> [87] "AEGL 1 - 10 min (proposed)" "AEGL 3 - 8 hr (proposed)"
#> [89] "AEGL 3 - 4 hr (proposed)" "AEGL 3 - 60 min (proposed)"
#> [91] "AEGL 3 - 30 min (proposed)" "AEGL 3 - 10 min (proposed)"
#> [93] "LOEL" "NOEL"
ccl4_hazard[humanEco %in% 'human health' &
exposureRoute %in% c('oral'), unique(toxvalType)]
#> [1] "drinking water quality guideline" "LOAEL (ADJ)"
#> [3] "cancer slope factor" "RfD (provisional)"
#> [5] "risk-based SSL, groundwater" "LD50"
#> [7] "Medium-Specific Concentration" "LOEL"
#> [9] "DWEL" "MEG"
#> [11] "NEL" "LOAEL"
#> [13] "NOAEL" "LEL"
#> [15] "health advisory" "RfD"
#> [17] "NOEL" "HBSL"
#> [19] "LC50" "ADI"
#> [21] "cancer unit risk" "ORSG"
#> [23] "MCL" "NOAEL (ADJ)"
#> [25] "OEHHA PHG" "BMDL (10)"
#> [27] "MRL" "LOEC"
#> [29] "OEHHA MADL" "BMDL (05)"
#> [31] "BMDL (1SD)" "TDI"
#> [33] "NOEL (TAD)" "LOEL (TAD)"
#> [35] "BMDL (5RD HED)" "NOAEL (HED)"
#> [37] "NOEC" "FEL"
#> [39] "BMD (2.5)" "BMDL"
#> [41] "LOAEL (HED)" "BMDL (0.5SD HED)"
#> [43] "BMDL (ADJ)" "BMDL (HED)"
#> [45] "BMDL (05 HED)"
intersect(ccl4_hazard[humanEco %in% 'human health' & exposureRoute %in% 'inhalation', unique(toxvalType)], ccl4_hazard[humanEco %in% 'human health' & exposureRoute %in% 'oral', unique(toxvalType)])
#> [1] "cancer slope factor" "cancer unit risk" "LC50" "LEL"
#> [5] "NEL" "NOAEL" "LOAEL" "MRL"
#> [9] "LOAEL (ADJ)" "NOAEL (ADJ)" "RfD" "OEHHA MADL"
#> [13] "BMDL (1SD)" "NOEC" "FEL" "LOEC"
#> [17] "LOEL" "NOEL"Then we look at NATADB.
natadb_hazard[humanEco %in% 'human health' &
exposureRoute %in% c('inhalation'), unique(toxvalType)]
#> [1] "IDLH" "cancer unit risk"
#> [3] "screening level (industrial air)" "screening level (residential air)"
#> [5] "LEL" "PAC-3"
#> [7] "PAC-2" "PAC-1"
#> [9] "Reference Exposure Level" "cancer slope factor"
#> [11] "LC50" "air contaminant limit"
#> [13] "BMCL (HEC)" "RfC (provisional)"
#> [15] "BMCL (10 HEC)" "RfD"
#> [17] "LOAEL (ADJ)" "LOAEL"
#> [19] "MRL" "LOAEC"
#> [21] "LOAEL (HEC)" "BMCL (10)"
#> [23] "DNEL systemic" "RfC"
#> [25] "LOEC" "NOEC"
#> [27] "AEGL 3 - 60 min (final)" "AEGL 2 - 4 hr (final)"
#> [29] "AEGL 3 - 30 min (final)" "AEGL 2 - 8 hr (final)"
#> [31] "AEGL 3 - 8 hr (final)" "AEGL 3 - 10 min (final)"
#> [33] "AEGL 3 - 4 hr (final)" "AEGL 2 - 60 min (final)"
#> [35] "NEL" "NOAEL"
#> [37] "AEGL 2 - 30 min (final)" "AEGL 2 - 10 min (final)"
#> [39] "NOAEL (HEC)" "BMCL (05 RD)"
#> [41] "NOAEC" "BMDL (HEC)"
#> [43] "OEHHA NSRL" "AEGL 3 - 8 hr (interim)"
#> [45] "AEGL 3 - 4 hr (interim)" "AEGL 3 - 60 min (interim)"
#> [47] "AEGL 3 - 30 min (interim)" "AEGL 3 - 10 min (interim)"
#> [49] "AEGL 2 - 8 hr (interim)" "AEGL 2 - 4 hr (interim)"
#> [51] "AEGL 2 - 60 min (interim)" "AEGL 2 - 30 min (interim)"
#> [53] "AEGL 2 - 10 min (interim)" "AEGL 1 - 60 min (interim)"
#> [55] "AEGL 1 - 30 min (interim)" "AEGL 1 - 10 min (interim)"
#> [57] "AEGL 1 - 4 hr (interim)" "AEGL 1 - 8 hr (interim)"
#> [59] "NOAEL (ADJ)" "NOEL"
#> [61] "BMC" "BMCL (ADJ)"
#> [63] "BMCL" "tolerable concentration in air"
#> [65] "BMCL (1SD HEC)" "AEGL 1 - 30 min (proposed)"
#> [67] "AEGL 3 - 8 hr (proposed)" "AEGL 3 - 4 hr (proposed)"
#> [69] "AEGL 2 - 8 hr (proposed)" "AEGL 3 - 10 min (proposed)"
#> [71] "AEGL 2 - 30 min (proposed)" "AEGL 2 - 60 min (proposed)"
#> [73] "AEGL 1 - 8 hr (proposed)" "AEGL 3 - 60 min (proposed)"
#> [75] "AEGL 2 - 10 min (proposed)" "AEGL 1 - 4 hr (proposed)"
#> [77] "AEGL 3 - 30 min (proposed)" "AEGL 2 - 4 hr (proposed)"
#> [79] "AEGL 1 - 10 min (proposed)" "AEGL 1 - 60 min (proposed)"
#> [81] "DNEL local" "BMDL (10 HEC)"
#> [83] "BMDL (10)" "AEGL 1 - 30 min (final)"
#> [85] "AEGL 1 - 10 min (final)" "AEGL 1 - 60 min (final)"
#> [87] "AEGL 1 - 8 hr (final)" "AEGL 1 - 4 hr (final)"
#> [89] "BMCL (10 ADJ)" "BMCL (1 SD HEC)"
#> [91] "BMCL (1SD)" "Level of Distinct Odor Awareness (LOA)"
#> [93] "LD50" "BMC (10 HEC)"
#> [95] "BMC (10)" "LC100"
#> [97] "OEHHA MADL" "BMDL (1SD)"
#> [99] "FEL (ADJ)" "FEL"
#> [101] "FEL (HEC)" "RfD (provisional)"
#> [103] "BMDL (ADJ)" "BMCL (0.25 SD)"
#> [105] "LC0" "BMC (HEC)"
#> [107] "LOEL"
natadb_hazard[humanEco %in% 'human health' &
exposureRoute %in% c('oral'), unique(toxvalType)]
#> [1] "Medium-Specific Concentration" "risk-based SSL, groundwater"
#> [3] "cancer slope factor" "RfD (provisional)"
#> [5] "LOAEL (ADJ)" "LD50"
#> [7] "LOAEL" "MCL"
#> [9] "NOAEL" "NOEL"
#> [11] "LOEL" "LEL"
#> [13] "MRL" "MCL-based SSL, groundwater"
#> [15] "MEG" "BMDL (05 HED)"
#> [17] "BMDL (1SD)" "NEL"
#> [19] "cancer unit risk" "MMCL"
#> [21] "OEHHA PHG" "RfD"
#> [23] "HBSL" "BMDL (HED)"
#> [25] "TDI" "BMDL (10)"
#> [27] "BMDL (01 99 HEC)" "LOAEL (99 HEC)"
#> [29] "RfC" "BMDL (01)"
#> [31] "BMDL (01 HED99)" "LOAEL (99 HED)"
#> [33] "BMDL (01 99 HED)" "OEHHA NSRL"
#> [35] "health advisory" "DWEL"
#> [37] "MRDL" "BMDL (10 ADJ)"
#> [39] "ORSG" "NOAEL (ADJ)"
#> [41] "BMD (2.5)" "NOAEC"
#> [43] "tolerable concentration in air" "Reference Exposure Level"
#> [45] "LOAEL (HED)" "LC50"
#> [47] "PMTDI" "BMDL"
#> [49] "BMDL (05)" "OEHHA MADL"
#> [51] "BMCL (0.25 SD)" "ADI"
#> [53] "BMDL (5 ADJ)" "BMDL (10 HED)"
#> [55] "LD100" "drinking water quality guideline"
#> [57] "FEL" "NOAEL (HED)"
#> [59] "NOEC" "LOEC"
#> [61] "LD0" "BMD"
#> [63] "BMD (50)" "BMD (2X)"
#> [65] "BMDL (2X)" "NOAEL (HEC)"
#> [67] "LOAEL (HEC)"
intersect(natadb_hazard[humanEco %in% 'human health' & exposureRoute %in% 'inhalation', unique(toxvalType)], natadb_hazard[humanEco %in% 'human health' & exposureRoute %in% 'oral', unique(toxvalType)])
#> [1] "cancer unit risk" "LEL"
#> [3] "Reference Exposure Level" "cancer slope factor"
#> [5] "LC50" "RfD"
#> [7] "LOAEL (ADJ)" "LOAEL"
#> [9] "MRL" "LOAEL (HEC)"
#> [11] "RfC" "LOEC"
#> [13] "NOEC" "NEL"
#> [15] "NOAEL" "NOAEL (HEC)"
#> [17] "NOAEC" "OEHHA NSRL"
#> [19] "NOAEL (ADJ)" "NOEL"
#> [21] "tolerable concentration in air" "BMDL (10)"
#> [23] "LD50" "OEHHA MADL"
#> [25] "BMDL (1SD)" "FEL"
#> [27] "RfD (provisional)" "BMCL (0.25 SD)"
#> [29] "LOEL"Answer to Environmental Health Question 7
With the results above, we may answer Environmental Health Question 7: There are several types of toxicity values for each exposure route. List the unique toxicity values for the oral and inhalation routes. What are the unique types of toxicity values for the oral route and for the inhalation route? How many of these are common to both the oral and inhalation routes for each data set?
Answer: There are 18 toxicity value types shared between the oral and inhalation exposure routes for CCL4 and 29 for NATADB. The lists above indicate the variety of toxicity values present in the hazard data for the two different exposure routes we have considered.
For the next data exploration, we will examine the “NOAEL” and “LOAEL” values for chemicals with oral exposure and human hazard. We also examine the units to determine whether any unit conversions are necessary to compare numeric values.
ccl4_hazard[humanEco %in% 'human health' & exposureRoute %in% 'oral' &
toxvalType %in% c('NOAEL', 'LOAEL'), ]
#> id source year dtxsid exposureRoute toxvalNumeric toxvalNumericQualifier
#> <int> <char> <char> <char> <char> <num> <char>
#> 1: 62837 ToxRefDB 1990 DTXSID0020446 oral 139.0 <NA>
#> 2: 62835 ToxRefDB 1990 DTXSID0020446 oral 22.1 <NA>
#> 3: 62833 ToxRefDB 1990 DTXSID0020446 oral 157.0 <NA>
#> 4: 62831 ToxRefDB 1990 DTXSID0020446 oral 18.9 <NA>
#> 5: 62829 ToxRefDB 1990 DTXSID0020446 oral 139.0 <NA>
#> ---
#> 1722: 71703 ToxRefDB 1996 DTXSID9032329 oral 86.5 <NA>
#> 1723: 71697 ToxRefDB 1996 DTXSID9032329 oral 2.3 <NA>
#> 1724: 71701 ToxRefDB 1996 DTXSID9032329 oral 15.4 <NA>
#> 1725: 71723 ToxRefDB 1995 DTXSID9032329 oral 15.0 <NA>
#> 1726: 71725 ToxRefDB 1995 DTXSID9032329 oral 1.0 <NA>
#> 67 variables not shown: [toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, exposureForm <char>, ...]
ccl4_hazard[humanEco %in% 'human health' & exposureRoute %in% 'oral' &
toxvalType %in% c('NOAEL', 'LOAEL'), unique(toxvalUnits)]
#> [1] "mg/kg-day" "ppm" "mg/L"
natadb_hazard[humanEco %in% 'human health' & exposureRoute %in% 'oral' &
toxvalType %in% c('NOAEL', 'LOAEL'), ]
#> id source year dtxsid exposureRoute toxvalNumeric toxvalNumericQualifier
#> <int> <char> <char> <char> <char> <num> <char>
#> 1: 112819 ECHA IUCLID 2003 DTXSID0020448 oral 125.000 <NA>
#> 2: 98982 ECHA IUCLID 2003 DTXSID0020448 oral 250.000 <NA>
#> 3: 98983 ECHA IUCLID 2003 DTXSID0020448 oral 500.000 <NA>
#> 4: 112820 ECHA IUCLID 2003 DTXSID0020448 oral 250.000 <NA>
#> 5: 112817 ECHA IUCLID 2003 DTXSID0020448 oral 62.000 <NA>
#> ---
#> 2026: 18168 IRIS - DTXSID9021261 oral 0.015 <NA>
#> 2027: 18169 IRIS - DTXSID9021261 oral 0.015 <NA>
#> 2028: 15104 Health Canada 1975 DTXSID9021261 oral 0.007 <NA>
#> 2029: 15100 Health Canada 2009 DTXSID9021261 oral 0.800 <NA>
#> 2030: 18170 IRIS - DTXSID9021261 oral 0.023 <NA>
#> 67 variables not shown: [toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, exposureForm <char>, ...]
natadb_hazard[humanEco %in% 'human health' & exposureRoute %in% 'oral' &
toxvalType %in% c('NOAEL', 'LOAEL'), unique(toxvalUnits)]
#> [1] "mg/kg-day" "ppm" "mg/L" "mg/day"Observe that for both CCL4 and NATADB, the units are given by “mg/kg-day”, “ppm”, “mg/L” and additionally “-” for NATADB. In this case, we treat “mg/kg-day” and “ppm” the same and exclude “-” and “mg/L”. We group by DTXSID to find the lowest or highest value.
ccl4_hazard[humanEco %in% 'human health' & exposureRoute %in% 'oral' &
toxvalType %in% c('NOAEL', 'LOAEL') & !(toxvalUnits %in% c('-', 'mg/L')),
.(numeric_value = min(toxvalNumeric),
units = toxvalUnits[[which.min(toxvalNumeric)]]),
by = .(dtxsid, toxvalType)]
#> dtxsid toxvalType numeric_value units
#> <char> <char> <num> <char>
#> 1: DTXSID0020446 LOAEL 1.00 mg/kg-day
#> 2: DTXSID0020446 NOAEL 0.66 mg/kg-day
#> 3: DTXSID0020573 NOAEL 300.00 mg/kg-day
#> 4: DTXSID0021464 LOAEL 2.50 mg/kg-day
#> 5: DTXSID0021464 NOAEL 0.70 mg/kg-day
#> ---
#> 107: DTXSID9024142 LOAEL 40.00 mg/kg-day
#> 108: DTXSID9032113 NOAEL 2.94 mg/kg-day
#> 109: DTXSID9032113 LOAEL 4.39 mg/kg-day
#> 110: DTXSID9032329 NOAEL 0.50 mg/kg-day
#> 111: DTXSID9032329 LOAEL 1.00 mg/kg-day
natadb_hazard[humanEco %in% 'human health' & exposureRoute %in% 'oral' &
toxvalType %in% c('NOAEL', 'LOAEL') & !(toxvalUnits %in% c('-', 'mg/L')),
.(numeric_value = min(toxvalNumeric),
units = toxvalUnits[[which.min(toxvalNumeric)]]),
by = .(dtxsid, toxvalType)]
#> dtxsid toxvalType numeric_value units
#> <char> <char> <num> <char>
#> 1: DTXSID0020448 LOAEL 100.000 mg/kg-day
#> 2: DTXSID0020448 NOAEL 20.000 mg/kg-day
#> 3: DTXSID0020523 NOAEL 50.000 ppm
#> 4: DTXSID0020523 LOAEL 0.070 mg/kg-day
#> 5: DTXSID0020529 LOAEL 1.500 mg/kg-day
#> ---
#> 179: DTXSID9020827 LOAEL 5.000 mg/kg-day
#> 180: DTXSID9020827 NOAEL 4.000 mg/kg-day
#> 181: DTXSID9021138 NOAEL 10.000 mg/kg-day
#> 182: DTXSID9021261 NOAEL 0.007 mg/kg-day
#> 183: DTXSID9021261 LOAEL 0.023 mg/kg-dayNow, we also explore the values of “RfD”, “RfC”, and “cancer slope factor” of the toxvalType rows. We first determine the set of units for each, make appropriate conversions if necessary, and then make comparisons.
ccl4_hazard[humanEco %in% 'human health' & toxvalType %in%
c('cancer slope factor', 'RfD', 'RfC'), .N,
by = .(toxvalType, toxvalUnits)][order(toxvalType, -N)]
#> toxvalType toxvalUnits N
#> <char> <char> <int>
#> 1: RfC mg/m3 102
#> 2: RfD mg/kg-day 207
#> 3: cancer slope factor (mg/kg-day)-1 252
natadb_hazard[humanEco %in% 'human health' & toxvalType %in%
c('cancer slope factor', 'RfD', 'RfC'), .N,
by = .(toxvalType, toxvalUnits)][order(toxvalType, -N)]
#> toxvalType toxvalUnits N
#> <char> <char> <int>
#> 1: RfC mg/m3 319
#> 2: RfD mg/kg-day 500
#> 3: cancer slope factor (mg/kg-day)-1 805For CCL4 and NATADB, there is a single unit type for each toxvalType value, so no unit conversions are necessary.
First, we filter and separate out the relevant data subsets.
# Separate out into relevant data subsets
ccl4_csf <- ccl4_hazard[humanEco %in% 'human health' &
toxvalType %in% c('cancer slope factor') & (toxvalUnits != 'mg/kg-day'), ]
ccl4_rfc <- ccl4_hazard[humanEco %in% 'human health' &
toxvalType %in% c('RfC'), ]
ccl4_rfd <- ccl4_hazard[humanEco %in% 'human health' &
toxvalType %in% c('RfD'), ]While there are no unit conversions needed, we demonstrate how we would convert units if they were required.
# Set mass by volume units to mg/m3, so scale g/m3 by 1E3 and ug/m3 by 1E-3
ccl4_rfc[toxvalUnits == 'mg/m3', conversion := 1]
#> id source year dtxsid exposureRoute toxvalNumeric
#> <int> <char> <char> <char> <char> <num>
#> 1: 236092 Pennsylvania DEP ToxValue - DTXSID0020153 not specified 0.0010
#> 2: 252910 RSL - DTXSID0020600 inhalation 0.1300
#> 3: 252909 RSL - DTXSID0020600 inhalation 0.1300
#> 4: 252908 RSL - DTXSID0020600 inhalation 0.1300
#> 5: 252907 RSL - DTXSID0020600 inhalation 0.1300
#> ---
#> 98: 16359 IRIS - DTXSID8020832 inhalation 0.0050
#> 99: 16358 IRIS - DTXSID8020832 inhalation 0.0050
#> 100: 236659 Pennsylvania DEP ToxValue - DTXSID9021390 not specified 0.0003
#> 101: 15292 IRIS - DTXSID9021390 inhalation 0.0003
#> 102: 15293 IRIS - DTXSID9021390 inhalation 0.0003
#> 69 variables not shown: [toxvalNumericQualifier <char>, toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, ...]
ccl4_rfc[toxvalUnits == 'g/m3', conversion := 1E3]
#> id source year dtxsid exposureRoute toxvalNumeric
#> <int> <char> <char> <char> <char> <num>
#> 1: 236092 Pennsylvania DEP ToxValue - DTXSID0020153 not specified 0.0010
#> 2: 252910 RSL - DTXSID0020600 inhalation 0.1300
#> 3: 252909 RSL - DTXSID0020600 inhalation 0.1300
#> 4: 252908 RSL - DTXSID0020600 inhalation 0.1300
#> 5: 252907 RSL - DTXSID0020600 inhalation 0.1300
#> ---
#> 98: 16359 IRIS - DTXSID8020832 inhalation 0.0050
#> 99: 16358 IRIS - DTXSID8020832 inhalation 0.0050
#> 100: 236659 Pennsylvania DEP ToxValue - DTXSID9021390 not specified 0.0003
#> 101: 15292 IRIS - DTXSID9021390 inhalation 0.0003
#> 102: 15293 IRIS - DTXSID9021390 inhalation 0.0003
#> 69 variables not shown: [toxvalNumericQualifier <char>, toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, ...]
ccl4_rfc[toxvalUnits == 'ug/m3', conversion := 1E-3]
#> id source year dtxsid exposureRoute toxvalNumeric
#> <int> <char> <char> <char> <char> <num>
#> 1: 236092 Pennsylvania DEP ToxValue - DTXSID0020153 not specified 0.0010
#> 2: 252910 RSL - DTXSID0020600 inhalation 0.1300
#> 3: 252909 RSL - DTXSID0020600 inhalation 0.1300
#> 4: 252908 RSL - DTXSID0020600 inhalation 0.1300
#> 5: 252907 RSL - DTXSID0020600 inhalation 0.1300
#> ---
#> 98: 16359 IRIS - DTXSID8020832 inhalation 0.0050
#> 99: 16358 IRIS - DTXSID8020832 inhalation 0.0050
#> 100: 236659 Pennsylvania DEP ToxValue - DTXSID9021390 not specified 0.0003
#> 101: 15292 IRIS - DTXSID9021390 inhalation 0.0003
#> 102: 15293 IRIS - DTXSID9021390 inhalation 0.0003
#> 69 variables not shown: [toxvalNumericQualifier <char>, toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, ...]
ccl4_rfc[toxvalUnits %in% c('mg/m3', 'g/m3', 'ug/m3'), units := 'mg/m3']
#> id source year dtxsid exposureRoute toxvalNumeric
#> <int> <char> <char> <char> <char> <num>
#> 1: 236092 Pennsylvania DEP ToxValue - DTXSID0020153 not specified 0.0010
#> 2: 252910 RSL - DTXSID0020600 inhalation 0.1300
#> 3: 252909 RSL - DTXSID0020600 inhalation 0.1300
#> 4: 252908 RSL - DTXSID0020600 inhalation 0.1300
#> 5: 252907 RSL - DTXSID0020600 inhalation 0.1300
#> ---
#> 98: 16359 IRIS - DTXSID8020832 inhalation 0.0050
#> 99: 16358 IRIS - DTXSID8020832 inhalation 0.0050
#> 100: 236659 Pennsylvania DEP ToxValue - DTXSID9021390 not specified 0.0003
#> 101: 15292 IRIS - DTXSID9021390 inhalation 0.0003
#> 102: 15293 IRIS - DTXSID9021390 inhalation 0.0003
#> 70 variables not shown: [toxvalNumericQualifier <char>, toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, ...]
# Set mass by mass units to mg/kg
ccl4_rfd[toxvalUnits %in% c('mg/kg-day', 'mg/kg'), conversion := 1]
#> id source year dtxsid exposureRoute toxvalNumeric
#> <int> <char> <char> <char> <char> <num>
#> 1: 236090 Pennsylvania DEP ToxValue - DTXSID0020153 not specified 0.002
#> 2: 235305 OW Drinking Water Standar 2018 DTXSID0020446 oral 0.003
#> 3: 235306 OW Drinking Water Standar 2018 DTXSID0020446 oral 0.003
#> 4: 16919 IRIS - DTXSID0020446 oral 0.002
#> 5: 16918 IRIS - DTXSID0020446 oral 0.002
#> ---
#> 203: 236579 Pennsylvania DEP ToxValue - DTXSID9024142 not specified 0.004
#> 204: 235302 OW Drinking Water Standar 2018 DTXSID9024142 oral 0.003
#> 205: 17306 IRIS - DTXSID9024142 oral 0.004
#> 206: 17307 IRIS - DTXSID9024142 oral 0.004
#> 207: 144633 Alaska DEC - DTXSID9024142 oral 0.003
#> 69 variables not shown: [toxvalNumericQualifier <char>, toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, ...]
ccl4_rfd[toxvalUnits %in% c('mg/kg-day', 'mg/kg'), units := 'mg/kg']
#> id source year dtxsid exposureRoute toxvalNumeric
#> <int> <char> <char> <char> <char> <num>
#> 1: 236090 Pennsylvania DEP ToxValue - DTXSID0020153 not specified 0.002
#> 2: 235305 OW Drinking Water Standar 2018 DTXSID0020446 oral 0.003
#> 3: 235306 OW Drinking Water Standar 2018 DTXSID0020446 oral 0.003
#> 4: 16919 IRIS - DTXSID0020446 oral 0.002
#> 5: 16918 IRIS - DTXSID0020446 oral 0.002
#> ---
#> 203: 236579 Pennsylvania DEP ToxValue - DTXSID9024142 not specified 0.004
#> 204: 235302 OW Drinking Water Standar 2018 DTXSID9024142 oral 0.003
#> 205: 17306 IRIS - DTXSID9024142 oral 0.004
#> 206: 17307 IRIS - DTXSID9024142 oral 0.004
#> 207: 144633 Alaska DEC - DTXSID9024142 oral 0.003
#> 70 variables not shown: [toxvalNumericQualifier <char>, toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, ...]Then aggregate the data.
# Run data aggregations grouping by dtxsid and taking either the max or the min
# depending on the toxvalType we are considering.
ccl4_csf[,.(numeric_value = max(toxvalNumeric),
units = toxvalUnits[which.max(toxvalNumeric)]),
by = .(dtxsid)][order(-numeric_value),]
#> dtxsid numeric_value units
#> <char> <num> <char>
#> 1: DTXSID2021028 1.50e+02 (mg/kg-day)-1
#> 2: DTXSID7021029 1.02e+02 (mg/kg-day)-1
#> 3: DTXSID0020573 3.90e+01 (mg/kg-day)-1
#> 4: DTXSID9021390 3.00e+01 (mg/kg-day)-1
#> 5: DTXSID6021032 2.80e+01 (mg/kg-day)-1
#> 6: DTXSID3020702 1.72e+01 (mg/kg-day)-1
#> 7: DTXSID2020684 6.49e+00 (mg/kg-day)-1
#> 8: DTXSID1021798 3.00e+00 (mg/kg-day)-1
#> 9: DTXSID8021062 2.10e+00 (mg/kg-day)-1
#> 10: DTXSID1021409 1.83e+00 (mg/kg-day)-1
#> 11: DTXSID6022422 1.60e+00 (mg/kg-day)-1
#> 12: DTXSID9021427 1.00e+00 (mg/kg-day)-1
#> 13: DTXSID3020203 6.00e-01 (mg/kg-day)-1
#> 14: DTXSID0020600 3.10e-01 (mg/kg-day)-1
#> 15: DTXSID5021207 2.40e-01 (mg/kg-day)-1
#> 16: DTXSID1026164 1.80e-01 (mg/kg-day)-1
#> 17: DTXSID0020153 1.70e-01 (mg/kg-day)-1
#> 18: DTXSID9024142 1.10e-01 (mg/kg-day)-1
#> 19: DTXSID4020533 1.00e-01 (mg/kg-day)-1
#> 20: DTXSID7024241 7.32e-02 (mg/kg-day)-1
#> 21: DTXSID3031864 7.00e-02 (mg/kg-day)-1
#> 22: DTXSID7020005 7.00e-02 (mg/kg-day)-1
#> 23: DTXSID8031865 7.00e-02 (mg/kg-day)-1
#> 24: DTXSID5020601 4.50e-02 (mg/kg-day)-1
#> 25: DTXSID0024341 3.90e-02 (mg/kg-day)-1
#> 26: DTXSID1021407 3.40e-02 (mg/kg-day)-1
#> 27: DTXSID4032611 2.81e-02 (mg/kg-day)-1
#> 28: DTXSID2021317 2.60e-02 (mg/kg-day)-1
#> 29: DTXSID7020637 2.10e-02 (mg/kg-day)-1
#> 30: DTXSID6021030 1.96e-02 (mg/kg-day)-1
#> 31: DTXSID0021541 1.63e-02 (mg/kg-day)-1
#> 32: DTXSID1024338 1.20e-02 (mg/kg-day)-1
#> 33: DTXSID5039224 1.00e-02 (mg/kg-day)-1
#> 34: DTXSID1020437 5.70e-03 (mg/kg-day)-1
#> 35: DTXSID8020090 5.70e-03 (mg/kg-day)-1
#> 36: DTXSID9020243 2.30e-03 (mg/kg-day)-1
#> 37: DTXSID3020833 2.25e-03 (mg/kg-day)-1
#> 38: DTXSID7020215 2.00e-04 (mg/kg-day)-1
#> dtxsid numeric_value units
ccl4_rfc[,.(numeric_value = min(toxvalNumeric*conversion),
units = units[which.min(toxvalNumeric*conversion)]),
by = .(dtxsid)][order(numeric_value),]
#> dtxsid numeric_value units
#> <char> <num> <char>
#> 1: DTXSID5020023 2.0e-05 mg/m3
#> 2: DTXSID3020702 3.0e-05 mg/m3
#> 3: DTXSID7021029 4.0e-05 mg/m3
#> 4: DTXSID2024169 5.0e-05 mg/m3
#> 5: DTXSID0024341 7.0e-05 mg/m3
#> 6: DTXSID2040282 1.0e-04 mg/m3
#> 7: DTXSID8020044 1.0e-04 mg/m3
#> 8: DTXSID9021390 3.0e-04 mg/m3
#> 9: DTXSID0020153 1.0e-03 mg/m3
#> 10: DTXSID8020090 1.0e-03 mg/m3
#> 11: DTXSID3020203 2.0e-03 mg/m3
#> 12: DTXSID3020964 2.0e-03 mg/m3
#> 13: DTXSID8020832 5.0e-03 mg/m3
#> 14: DTXSID3024366 7.0e-03 mg/m3
#> 15: DTXSID5039224 9.0e-03 mg/m3
#> 16: DTXSID7020637 9.0e-03 mg/m3
#> 17: DTXSID5024182 2.0e-02 mg/m3
#> 18: DTXSID6022422 2.0e-02 mg/m3
#> 19: DTXSID0020600 3.0e-02 mg/m3
#> 20: DTXSID4020533 3.0e-02 mg/m3
#> 21: DTXSID5021207 3.0e-02 mg/m3
#> 22: DTXSID2022333 3.5e-02 mg/m3
#> 23: DTXSID3042219 3.5e-02 mg/m3
#> 24: DTXSID4021503 4.0e-02 mg/m3
#> 25: DTXSID0021541 9.0e-02 mg/m3
#> 26: DTXSID8020597 4.0e-01 mg/m3
#> 27: DTXSID1020437 5.0e-01 mg/m3
#> 28: DTXSID0021917 7.0e-01 mg/m3
#> 29: DTXSID3020833 3.0e+00 mg/m3
#> 30: DTXSID2021731 2.0e+01 mg/m3
#> 31: DTXSID6020301 5.0e+01 mg/m3
#> dtxsid numeric_value units
ccl4_rfd[,.(numeric_value = min(toxvalNumeric*conversion),
units = units[which.min(toxvalNumeric*conversion)]),
by = .(dtxsid)][order(numeric_value),]
#> dtxsid numeric_value units
#> <char> <num> <char>
#> 1: DTXSID8031865 3.00e-08 mg/kg
#> 2: DTXSID3031864 1.00e-07 mg/kg
#> 3: DTXSID7021029 4.00e-06 mg/kg
#> 4: DTXSID1024174 3.00e-05 mg/kg
#> 5: DTXSID9023914 3.00e-05 mg/kg
#> 6: DTXSID6024177 5.00e-05 mg/kg
#> 7: DTXSID5020601 8.00e-05 mg/kg
#> 8: DTXSID1021407 1.00e-04 mg/kg
#> 9: DTXSID3020964 4.85e-04 mg/kg
#> 10: DTXSID5020023 5.00e-04 mg/kg
#> 11: DTXSID3020702 9.00e-04 mg/kg
#> 12: DTXSID5021207 1.00e-03 mg/kg
#> 13: DTXSID8020832 1.00e-03 mg/kg
#> 14: DTXSID4022361 1.20e-03 mg/kg
#> 15: DTXSID8023846 1.20e-03 mg/kg
#> 16: DTXSID0020153 2.00e-03 mg/kg
#> 17: DTXSID0020446 2.00e-03 mg/kg
#> 18: DTXSID7024241 3.00e-03 mg/kg
#> 19: DTXSID9024142 3.00e-03 mg/kg
#> 20: DTXSID9021390 4.00e-03 mg/kg
#> 21: DTXSID1024207 5.00e-03 mg/kg
#> 22: DTXSID2040282 5.00e-03 mg/kg
#> 23: DTXSID5024182 5.00e-03 mg/kg
#> 24: DTXSID6021030 5.00e-03 mg/kg
#> 25: DTXSID8020044 5.00e-03 mg/kg
#> 26: DTXSID8020090 7.00e-03 mg/kg
#> 27: DTXSID2020684 8.00e-03 mg/kg
#> 28: DTXSID2022333 1.00e-02 mg/kg
#> 29: DTXSID3042219 1.00e-02 mg/kg
#> 30: DTXSID4021503 1.00e-02 mg/kg
#> 31: DTXSID0024052 2.00e-02 mg/kg
#> 32: DTXSID1026164 2.00e-02 mg/kg
#> 33: DTXSID8023848 2.00e-02 mg/kg
#> 34: DTXSID0021541 2.57e-02 mg/kg
#> 35: DTXSID2021317 3.00e-02 mg/kg
#> 36: DTXSID2024169 3.00e-02 mg/kg
#> 37: DTXSID4020533 3.00e-02 mg/kg
#> 38: DTXSID1021740 5.00e-02 mg/kg
#> 39: DTXSID8022292 5.00e-02 mg/kg
#> 40: DTXSID0021917 6.00e-02 mg/kg
#> 41: DTXSID1024338 8.00e-02 mg/kg
#> 42: DTXSID6022422 8.00e-02 mg/kg
#> 43: DTXSID4022448 1.00e-01 mg/kg
#> 44: DTXSID9020243 1.30e-01 mg/kg
#> 45: DTXSID1020437 1.43e-01 mg/kg
#> 46: DTXSID7020637 2.00e-01 mg/kg
#> 47: DTXSID3020833 4.00e-01 mg/kg
#> 48: DTXSID8020597 1.00e+00 mg/kg
#> 49: DTXSID2021731 2.00e+00 mg/kg
#> dtxsid numeric_value unitsRepeat the process for NATADB, first separating out the relevant subsets of the data.
# Separate out into relevant data subsets
natadb_csf <- natadb_hazard[humanEco %in% 'human health' &
toxvalType %in% c('cancer slope factor') & (toxvalUnits != 'mg/kg-day'), ]
natadb_rfc <- natadb_hazard[humanEco %in% 'human health' &
toxvalType %in% c('RfC'), ]
natadb_rfd <- natadb_hazard[humanEco %in% 'human health' &
toxvalType %in% c('RfD'), ]Now handle the unit conversions.
# Set mass by mass units to mg/kg. Note that ppm is already in mg/kg
natadb_rfc <- natadb_rfc[toxvalUnits != 'ppm',]
natadb_rfd[, units := 'mg/kg-day']
#> id source year dtxsid exposureRoute toxvalNumeric
#> <int> <char> <char> <char> <char> <num>
#> 1: 236090 Pennsylvania DEP ToxValue - DTXSID0020153 not specified 0.00200
#> 2: 144955 Alaska DEC - DTXSID0020448 inhalation 0.00114
#> 3: 236285 Pennsylvania DEP ToxValue - DTXSID0020448 not specified 0.04000
#> 4: 236320 Pennsylvania DEP ToxValue - DTXSID0020523 not specified 0.00200
#> 5: 15643 IRIS - DTXSID0020523 oral 0.00200
#> ---
#> 496: 253058 RSL - DTXSID9021138 oral 0.01000
#> 497: 244593 RSL - DTXSID9021138 oral 0.00100
#> 498: 244592 RSL - DTXSID9021138 oral 0.00100
#> 499: 18173 IRIS - DTXSID9021261 oral 0.00500
#> 500: 18172 IRIS - DTXSID9021261 oral 0.00500
#> 69 variables not shown: [toxvalNumericQualifier <char>, toxvalUnits <char>, studyType <char>, studyDurationClass <char>, studyDuractionValue <num>, studyDurationUnits <char>, strain <char>, sex <char>, population <char>, exposureMethod <char>, ...]Finally, aggregate the data.
# Run data aggregations grouping by dtxsid and taking either the max or the min
# depending on the toxvalType we are considering.
natadb_csf[, .(numeric_value = max(toxvalNumeric),
units = toxvalUnits[which.max(toxvalNumeric)]),
by = .(dtxsid)][order(-numeric_value),]
#> dtxsid numeric_value units
#> <char> <num> <char>
#> 1: DTXSID2020137 5.00e+02 (mg/kg-day)-1
#> 2: DTXSID8020173 2.20e+02 (mg/kg-day)-1
#> 3: DTXSID4021006 1.20e+02 (mg/kg-day)-1
#> 4: DTXSID7021029 1.02e+02 (mg/kg-day)-1
#> 5: DTXSID8020599 6.50e+01 (mg/kg-day)-1
#> 6: DTXSID9020168 4.00e+01 (mg/kg-day)-1
#> 7: DTXSID5020071 2.10e+01 (mg/kg-day)-1
#> 8: DTXSID3020702 1.72e+01 (mg/kg-day)-1
#> 9: DTXSID8021197 1.40e+01 (mg/kg-day)-1
#> 10: DTXSID1020148 1.30e+01 (mg/kg-day)-1
#> 11: DTXSID1020512 1.30e+01 (mg/kg-day)-1
#> 12: DTXSID5024059 1.10e+01 (mg/kg-day)-1
#> 13: DTXSID3020413 7.00e+00 (mg/kg-day)-1
#> 14: DTXSID4021056 6.70e+00 (mg/kg-day)-1
#> 15: DTXSID3020679 6.25e+00 (mg/kg-day)-1
#> 16: DTXSID5020491 4.60e+00 (mg/kg-day)-1
#> 17: DTXSID5020027 4.50e+00 (mg/kg-day)-1
#> 18: DTXSID4020402 4.00e+00 (mg/kg-day)-1
#> 19: DTXSID7020687 4.00e+00 (mg/kg-day)-1
#> 20: DTXSID0039227 3.80e+00 (mg/kg-day)-1
#> 21: DTXSID2020682 3.20e+00 (mg/kg-day)-1
#> 22: DTXSID1021798 3.00e+00 (mg/kg-day)-1
#> 23: DTXSID0021383 2.67e+00 (mg/kg-day)-1
#> 24: DTXSID3020415 2.50e+00 (mg/kg-day)-1
#> 25: DTXSID6020307 2.40e+00 (mg/kg-day)-1
#> 26: DTXSID8021195 2.40e+00 (mg/kg-day)-1
#> 27: DTXSID5024267 2.22e+00 (mg/kg-day)-1
#> 28: DTXSID7021368 2.20e+00 (mg/kg-day)-1
#> 29: DTXSID7023984 2.20e+00 (mg/kg-day)-1
#> 30: DTXSID3025091 1.60e+00 (mg/kg-day)-1
#> 31: DTXSID6022422 1.60e+00 (mg/kg-day)-1
#> 32: DTXSID5020865 1.50e+00 (mg/kg-day)-1
#> 33: DTXSID8021434 1.50e+00 (mg/kg-day)-1
#> 34: DTXSID7020267 1.30e+00 (mg/kg-day)-1
#> 35: DTXSID6020432 1.20e+00 (mg/kg-day)-1
#> 36: DTXSID5020029 1.00e+00 (mg/kg-day)-1
#> 37: DTXSID7020710 8.70e-01 (mg/kg-day)-1
#> 38: DTXSID0020529 8.00e-01 (mg/kg-day)-1
#> 39: DTXSID3020203 6.00e-01 (mg/kg-day)-1
#> 40: DTXSID8021438 6.00e-01 (mg/kg-day)-1
#> 41: DTXSID2021319 5.40e-01 (mg/kg-day)-1
#> 42: DTXSID7021106 4.00e-01 (mg/kg-day)-1
#> 43: DTXSID0020600 3.10e-01 (mg/kg-day)-1
#> 44: DTXSID5020449 2.90e-01 (mg/kg-day)-1
#> 45: DTXSID7021318 2.86e-01 (mg/kg-day)-1
#> 46: DTXSID2021105 2.60e-01 (mg/kg-day)-1
#> 47: DTXSID5021207 2.40e-01 (mg/kg-day)-1
#> 48: DTXSID8020250 2.00e-01 (mg/kg-day)-1
#> 49: DTXSID1022057 1.82e-01 (mg/kg-day)-1
#> 50: DTXSID1026164 1.80e-01 (mg/kg-day)-1
#> 51: DTXSID0020153 1.70e-01 (mg/kg-day)-1
#> 52: DTXSID2021286 1.60e-01 (mg/kg-day)-1
#> 53: DTXSID7020683 1.56e-01 (mg/kg-day)-1
#> 54: DTXSID8020913 1.20e-01 (mg/kg-day)-1
#> 55: DTXSID9020299 1.10e-01 (mg/kg-day)-1
#> 56: DTXSID3039242 1.00e-01 (mg/kg-day)-1
#> 57: DTXSID4020533 1.00e-01 (mg/kg-day)-1
#> 58: DTXSID0020448 9.19e-02 (mg/kg-day)-1
#> 59: DTXSID6020438 9.10e-02 (mg/kg-day)-1
#> 60: DTXSID1020306 8.05e-02 (mg/kg-day)-1
#> 61: DTXSID1020566 8.00e-02 (mg/kg-day)-1
#> 62: DTXSID5020607 7.37e-02 (mg/kg-day)-1
#> 63: DTXSID5021380 7.04e-02 (mg/kg-day)-1
#> 64: DTXSID5021386 7.00e-02 (mg/kg-day)-1
#> 65: DTXSID7020005 7.00e-02 (mg/kg-day)-1
#> 66: DTXSID7020716 6.00e-02 (mg/kg-day)-1
#> 67: DTXSID4020583 4.80e-02 (mg/kg-day)-1
#> 68: DTXSID5020601 4.50e-02 (mg/kg-day)-1
#> 69: DTXSID1020431 4.00e-02 (mg/kg-day)-1
#> 70: DTXSID7020689 4.00e-02 (mg/kg-day)-1
#> 71: DTXSID7026156 3.90e-02 (mg/kg-day)-1
#> 72: DTXSID0021965 2.90e-02 (mg/kg-day)-1
#> 73: DTXSID2020507 2.70e-02 (mg/kg-day)-1
#> 74: DTXSID4039231 2.10e-02 (mg/kg-day)-1
#> 75: DTXSID7020637 2.10e-02 (mg/kg-day)-1
#> 76: DTXSID0021541 1.63e-02 (mg/kg-day)-1
#> 77: DTXSID0020868 1.40e-02 (mg/kg-day)-1
#> 78: DTXSID1021374 1.32e-02 (mg/kg-day)-1
#> 79: DTXSID3020596 1.10e-02 (mg/kg-day)-1
#> 80: DTXSID5039224 1.00e-02 (mg/kg-day)-1
#> 81: DTXSID4020161 8.00e-03 (mg/kg-day)-1
#> 82: DTXSID4021395 7.70e-03 (mg/kg-day)-1
#> 83: DTXSID1020437 5.70e-03 (mg/kg-day)-1
#> 84: DTXSID8020090 5.70e-03 (mg/kg-day)-1
#> 85: DTXSID1020302 3.63e-03 (mg/kg-day)-1
#> 86: DTXSID7021948 3.52e-03 (mg/kg-day)-1
#> 87: DTXSID9020243 2.30e-03 (mg/kg-day)-1
#> 88: DTXSID3020833 2.25e-03 (mg/kg-day)-1
#> 89: DTXSID8020759 1.90e-03 (mg/kg-day)-1
#> dtxsid numeric_value units
natadb_rfc[, .(numeric_value = min(toxvalNumeric),
units = toxvalUnits[which.min(toxvalNumeric)]),
by = .(dtxsid)][order(numeric_value),]
#> dtxsid numeric_value units
#> <char> <num> <char>
#> 1: DTXSID1020516 0.00000200 mg/m3
#> 2: DTXSID7026156 0.00000800 mg/m3
#> 3: DTXSID4024143 0.00001000 mg/m3
#> 4: DTXSID4020874 0.00002000 mg/m3
#> 5: DTXSID5020023 0.00002000 mg/m3
#> 6: DTXSID3020702 0.00003000 mg/m3
#> 7: DTXSID9020293 0.00003000 mg/m3
#> 8: DTXSID7021029 0.00004000 mg/m3
#> 9: DTXSID8042476 0.00010000 mg/m3
#> 10: DTXSID1020273 0.00014501 mg/m3
#> 11: DTXSID2020688 0.00020000 mg/m3
#> 12: DTXSID3020413 0.00020000 mg/m3
#> 13: DTXSID3021932 0.00020000 mg/m3
#> 14: DTXSID5021380 0.00020000 mg/m3
#> 15: DTXSID0024260 0.00030000 mg/m3
#> 16: DTXSID2021157 0.00030000 mg/m3
#> 17: DTXSID4020161 0.00040000 mg/m3
#> 18: DTXSID5020449 0.00050000 mg/m3
#> 19: DTXSID7025180 0.00060000 mg/m3
#> 20: DTXSID7020267 0.00070000 mg/m3
#> 21: DTXSID0020153 0.00100000 mg/m3
#> 22: DTXSID0039229 0.00100000 mg/m3
#> 23: DTXSID1020566 0.00100000 mg/m3
#> 24: DTXSID1023786 0.00100000 mg/m3
#> 25: DTXSID4039231 0.00100000 mg/m3
#> 26: DTXSID8020090 0.00100000 mg/m3
#> 27: DTXSID8020173 0.00141055 mg/m3
#> 28: DTXSID0021383 0.00200000 mg/m3
#> 29: DTXSID0021965 0.00200000 mg/m3
#> 30: DTXSID3020203 0.00200000 mg/m3
#> 31: DTXSID3020964 0.00200000 mg/m3
#> 32: DTXSID5020029 0.00200000 mg/m3
#> 33: DTXSID8020913 0.00300000 mg/m3
#> 34: DTXSID8021432 0.00300000 mg/m3
#> 35: DTXSID5020607 0.00319000 mg/m3
#> 36: DTXSID0020448 0.00400000 mg/m3
#> 37: DTXSID1020148 0.00500000 mg/m3
#> 38: DTXSID8020832 0.00500000 mg/m3
#> 39: DTXSID5020027 0.00600000 mg/m3
#> 40: DTXSID3024366 0.00700000 mg/m3
#> 41: DTXSID6020438 0.00700000 mg/m3
#> 42: DTXSID2021658 0.00800000 mg/m3
#> 43: DTXSID4020583 0.00800000 mg/m3
#> 44: DTXSID3020415 0.00900000 mg/m3
#> 45: DTXSID5039224 0.00900000 mg/m3
#> 46: DTXSID7020637 0.00900000 mg/m3
#> 47: DTXSID1049641 0.01400000 mg/m3
#> 48: DTXSID1022057 0.02000000 mg/m3
#> 49: DTXSID2020711 0.02000000 mg/m3
#> 50: DTXSID2021159 0.02000000 mg/m3
#> 51: DTXSID5020316 0.02000000 mg/m3
#> 52: DTXSID6020569 0.02000000 mg/m3
#> 53: DTXSID6020981 0.02000000 mg/m3
#> 54: DTXSID6022422 0.02000000 mg/m3
#> 55: DTXSID0020600 0.03000000 mg/m3
#> 56: DTXSID3039242 0.03000000 mg/m3
#> 57: DTXSID4020533 0.03000000 mg/m3
#> 58: DTXSID5021207 0.03000000 mg/m3
#> 59: DTXSID6020515 0.03000000 mg/m3
#> 60: DTXSID7020689 0.03000000 mg/m3
#> 61: DTXSID2021319 0.04000000 mg/m3
#> 62: DTXSID4020298 0.05000000 mg/m3
#> 63: DTXSID7020009 0.06000000 mg/m3
#> 64: DTXSID0021541 0.09000000 mg/m3
#> 65: DTXSID2021446 0.10000000 mg/m3
#> 66: DTXSID8020250 0.10000000 mg/m3
#> 67: DTXSID8021434 0.10000000 mg/m3
#> 68: DTXSID9020168 0.11698200 mg/m3
#> 69: DTXSID3021431 0.20000000 mg/m3
#> 70: DTXSID5021124 0.20000000 mg/m3
#> 71: DTXSID8021438 0.20000000 mg/m3
#> 72: DTXSID1020306 0.30000001 mg/m3
#> 73: DTXSID1021827 0.40000001 mg/m3
#> 74: DTXSID8020597 0.40000001 mg/m3
#> 75: DTXSID1020437 0.50000000 mg/m3
#> 76: DTXSID0020868 0.60000002 mg/m3
#> 77: DTXSID0021917 0.69999999 mg/m3
#> 78: DTXSID2020844 0.69999999 mg/m3
#> 79: DTXSID6023947 0.69999999 mg/m3
#> 80: DTXSID1020431 0.80000001 mg/m3
#> 81: DTXSID2021284 1.00000000 mg/m3
#> 82: DTXSID3020596 1.00000000 mg/m3
#> 83: DTXSID8020759 2.00000000 mg/m3
#> 84: DTXSID0021381 2.20000005 mg/m3
#> 85: DTXSID3020833 3.00000000 mg/m3
#> 86: DTXSID5021889 3.00000000 mg/m3
#> 87: DTXSID7021360 5.00000000 mg/m3
#> 88: DTXSID1020302 10.00000000 mg/m3
#> 89: DTXSID2021731 20.00000000 mg/m3
#> dtxsid numeric_value units
natadb_rfd[, .(numeric_value = min(toxvalNumeric),
units = units[which.min(toxvalNumeric)]),
by = .(dtxsid)][order(numeric_value),]
#> dtxsid numeric_value units
#> <char> <num> <char>
#> 1: DTXSID7021029 4.0e-06 mg/kg-day
#> 2: DTXSID1024382 2.0e-05 mg/kg-day
#> 3: DTXSID9020827 2.0e-05 mg/kg-day
#> 4: DTXSID3020679 3.0e-05 mg/kg-day
#> 5: DTXSID7021100 3.0e-05 mg/kg-day
#> ---
#> 101: DTXSID0039229 5.0e-01 mg/kg-day
#> 102: DTXSID8020597 1.0e+00 mg/kg-day
#> 103: DTXSID2020844 1.4e+00 mg/kg-day
#> 104: DTXSID2021159 2.0e+00 mg/kg-day
#> 105: DTXSID2021731 2.0e+00 mg/kg-dayAnswer to Environmental Health Question 8
With these results, we may answer Environmental Health Question 8: When examining different toxicity values, the data may be reported in multiple units. To assess the relative hazard from this data, it is important to take into account the different units and adjust accordingly. List the units reported for the cancer slope factor, reference dose, and reference concentration values associated with the oral and inhalation exposure routes for human hazard. Which chemicals in each data set have the highest cancer slope factor, lowest reference dose, and lowest reference concentration values?
Answer: The units for these three toxicity value types for CCL4 are given by “mg/m3” for RfC, “mg/kg-day”for RfD, and “(mg/kg-day)-1” for Cancer Slope Factor. For NATADB, the units for RfC are given by “mg/m3”, for RfD by “mg/kg-day”, and for Cancer Slope Factor by “(mg/kg-day)-1”. For CCL4, the chemical DTXSID2021028 has the highest CsF at 150 (mg/kg-day)-1, the chemical DTXSID5020023 has the lowest RfC value at 2.0e-5 mg/m3, and the chemical DTXSID8031865 has the lowest RfD value at 3e-8 mg/kg. For NATADB, the chemical DTXSID2020137 has the highest CsF at 500 (mg/kg-day)-1, the chemical DTXSID1020516 has the lowest RfC value at 2.0e-6 mg/m3, and the chemical DTXSID7021029 had the lowest RfD at 4e-6 mg/kg-day.
Concluding Remarks
In conclusion, we explored how one can access publicly available data from the CompTox Chemicals Dashboard programmatically using the CTX APIs via the ctxR R package. In the examples above, we investigated different types of data associated with chemicals, visualized and aggregated the data, and employed different data wrangling techniques using data.tables to answer the proposed environmental health questions. With these tools, one can build workflows that take in a list of chemicals and gather and process data associated with those chemicals through the CTX APIs. Consider how you might use this functionality for building models in your own work.
Try running the same analysis of physical-chemical properties, genotoxicity data, and hazard data on a different pair of data sets available from the CCD. For instance, try pulling the data set ‘BIOSOLIDS2021’ and work through the same steps we completed in this module to investigate the chemicals in this list to gain a better understanding of the associated data.