Modern chemical toxicology is facing a growing need to Reduce, Refine, and Replace animal tests (Russell 1959) for hazard identification. The most common type of animal assays for acute toxicity assessment of chemicals used as pesticides, pharmaceuticals, or in cosmetic products is known as a “6-pack” battery of tests, including three topical (skin sensitization, skin irritation and corrosion, and eye irritation and corrosion) and three systemic (acute oral toxicity, acute inhalation toxicity, and acute dermal toxicity) end points.
We compiled, curated, and integrated, to the best of our knowledge, the largest publicly available data sets and developed an ensemble of quantitative structure–activity relationship (QSAR) models for all six end points. All models were validated according to the Organisation for Economic Co-operation and Development (OECD) QSAR principles, using data on compounds not included in the training sets.
In addition to high internal accuracy assessed by cross-validation, all models demonstrated an external correct classification rate ranging from 70% to 77%. We established a publicly accessible Systemic and Topical chemical Toxicity (STopTox) web portal (https://stoptox.mml.unc.edu/) integrating all developed models for 6-pack assays.
We developed STopTox, a comprehensive collection of computational models that can be used as an alternative to in vivo 6-pack tests for predicting the toxicity hazard of small organic molecules. Models were established following the best practices for the development and validation of QSAR models. Scientists and regulators can use the STopTox portal to identify putative toxicants or nontoxicants in chemical libraries of interest. https://doi.org/10.1289/EHP9341
Historically, regulatory agencies have required animal testing for hazard categorization and labeling (National Research Council Committee on Animals as Monitors of Environmental Hazards 1991). However, there have been multiple calls, especially in the last two decades, to Reduce, Refine, and Replace (the three R’s) animal tests for hazard identification (Flecknell 2002; Patlewicz and Fitzpatrick 2016). The U.S. EPA estimated that the animal tests required to approve a single pesticide are costly, with carcinogenicity studies in rats or mice being especially expensive (U.S. EPA 2019b). In addition, studies have shown that animal-based assay outcomes do not always equate with human responses (Seok et al. 2013) and that animal models are less reproducible than some alternative methods (Luechtefeld et al. 2016a, 2016b, 2016c). The Strategic Roadmap published in 2018 by the Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM) (ICCVAM 2018) called for the development of alternative “new approach methods” (NAMs) for reducing animal testing of chemical and medical agents. In furthering this call, in September 2019, the U.S. EPA issued a directive to reduce animal testing, including a commitment to “eliminate all mammal study requests and funding by 2035” (U.S. EPA 2019a). This directive creates a critical need to develop robust in vitro and computational tools for accurate and reliable hazard identification in chemical and pharmaceutical products as part of their regulatory assessment.
Computational approaches, such as structural alerts, read-across, and quantitative structure–activity relationship (QSAR) modeling, have earned broad acceptance as part of the weight of evidence for assessing chemical toxicity (ECHA 2017; U.S. EPA 2016). Structural alerts are molecular substructures that are associated with a particular adverse outcome (Norman 2021). Read-across is a technique that identifies potential hazards of untested compounds by associating them with structurally similar compounds that have been tested (Ball et al. 2016). QSAR modeling is a computational approach that employs statistical or machine learning techniques to establish correlations between intrinsic chemical properties (chemical descriptors) and measured properties or toxicological effects (Tropsha and Golbraikh 2007). QSAR modeling has been used extensively to model and predict chemical toxicity, and best practices for model development and validation have been developed to ensure the reliability of the models (Tropsha 2010). Regulators have preferred both structural alerts and read-across approaches because of their ease of use, transparency, and mechanistic interpretability. However, there have been concerns that these tools often do not support a reliable assessment of whether the underlying compounds present a real hazard to humans and the environment. For instance, we previously demonstrated that alerts have a tendency to flag compounds as toxic even when the experimental evidence shows otherwise (Alves et al. 2016b).
In the last several years, both our (Alves et al. 2018a; Borba et al. 2020; Braga et al. 2017) and other (Roberts et al. 2017; Toropova and Toropov 2017) groups have developed reliable computational models for predicting the skin sensitization potential of chemicals. These and other models developed for one or more of the “6-pack” end points are summarized in Table 1, which indicates that the development of reliable computational models for predicting the outcomes of all 6-pack tests is still a significant challenge. To address this challenge, we compiled, integrated, and curated a collection of experimental in vivo data on 6-pack end points, which, to the best of our knowledge, is the largest 6-pack data set in the public domain. Using these compiled data, we developed and rigorously validated QSAR models for all 6-pack assays and demonstrated their utility in identifying potentially safe or unsafe chemicals in industrial products (Figure 1). In addition, we integrated these models into a software package called STopTox (Systemic and Topical chemical Toxicity) and made it publicly available to the research community via a dedicated web portal (https://stoptox.mml.unc.edu/). We especially emphasize, with vivid examples, the importance and impact of data curation on the rigor of our study design and the reliability of the study outcomes.
Table 1 Computational software covering 6-pack end points.
Danish QSAR database
Acute oral, skin irritation and skin sensitization
Consensus model from ACDLabs, Leadscope, CASE Ultra, and SciQSAR
Note: OECD, Organisation for Economic Co-operation and Development; QSAR, quantitative structure–activity relationship; RASAR, read-across structure–activity relationships.
Materials and Methods
We compiled data from multiple publicly available databases and from the literature. These data comprise results of experimental animal tests for the following 6-pack end points: a) skin sensitization; b) skin irritation and corrosion; c) eye irritation and corrosion; and acute systemic toxicity via d) dermal, e) inhalation, and f) oral routes. The literature search was conducted using the PubMed database and Chemotext (Capuzzi et al. 2018) with the following search terms: “Skin sensitization” AND/OR “LLNA” AND/OR “QSAR” AND/OR “Read Across”; eye irritation AND/OR “Draize test” AND/OR “QSAR” AND/OR “Read Across”; skin irritation AND/OR “Draize test” AND/OR “QSAR” AND/OR “Read Across”; “acute oral toxicity” AND/OR “QSAR” AND/OR “Read Across”; “acute dermal toxicity” AND/OR “QSAR” AND/OR “Read Across”; and “acute inhalation toxicity” AND/OR “QSAR” AND/OR “Read Across.” No inclusion/exclusion criteria were used, and the last search was executed in January 2019. All replicate matching was done using only the standardized chemical structures, never identifiers or simplified molecular input line entry specification (SMILES) strings. When CAS numbers were not available, they were retrieved from PubChem (https://pubchem.ncbi.nlm.nih.gov/). All the curated data sets are available in Excel Tables S1–S7.
We extensively cleaned and standardized the data and converted measurements to the same units in each data set employing regular expressions to find essential features for the database that were described in text format; this approach was key to end point classification into Globally Harmonized System of Classification and Labeling of Chemicals (GHS) hazard classes. To convert the data into the binary toxicity calls, we followed the GHS classification criteria: for acute systemic end points GHS classes 1–4 were considered as “toxic,” and class 5 was considered as “not classified.” For skin irritation, classes 1–3 were considered as “irritant or corrosive”; for eye irritation, classes 1–2B were considered “irritants or corrosive”; and for skin sensitization, class 1 was considered “sensitizer.” The criteria for GHS classification are different for each end point and more information can be found elsewhere (UNECE 2019). Following this laborious data preparation and standardization, we conducted both chemical and biological data curation. This requisite attention to detailed data curation at different levels of the data preparation protocol is, unfortunately, uncommon in computational chemical toxicology, as we noted previously (Alves et al. 2019).
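The binarization rule described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the end point keys, the function name `binary_call`, and the assumption that each record carries a GHS label whose first character is the main category number (e.g., "2B", "1A") are all hypothetical.

```python
# Highest GHS main category that still counts as a positive call,
# taken from the criteria stated in the text (keys are hypothetical names).
MAX_POSITIVE_CATEGORY = {
    "acute_systemic": 4,      # classes 1-4 -> "toxic"; class 5 -> "not classified"
    "skin_irritation": 3,     # classes 1-3 -> "irritant or corrosive"
    "eye_irritation": 2,      # classes 1-2B -> "irritant or corrosive"
    "skin_sensitization": 1,  # class 1 -> "sensitizer"
}


def binary_call(end_point: str, ghs_class: str) -> int:
    """Return 1 for a positive call (toxic/irritant/sensitizer), else 0.

    Assumes the label starts with the main category digit; anything else
    (e.g., "Not classified") is treated as a negative call.
    """
    label = ghs_class.strip()
    if not label or not label[0].isdigit():
        return 0
    return int(int(label[0]) <= MAX_POSITIVE_CATEGORY[end_point])
```

For example, `binary_call("eye_irritation", "2B")` gives a positive call, whereas `binary_call("acute_systemic", "5")` does not.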
Data sets were thoroughly curated following the workflows developed by us earlier (Fourches et al. 2016). First, we excluded inconsistent data, which represented a large share of our data sets (Figure 2). Data were categorized as inconsistent if they were not generated following the OECD protocols; if compounds were not tested at multiple concentrations and could not be classified into GHS classes; if they were labeled as nonexperimental (e.g., labeled as obtained using QSAR and/or read-across predictions and/or weight of evidence decisions); or if the measurements differed from the standard protocols for the 6-pack end points. For systemic end points, we used only median lethal dose (LD50) measurements; for skin sensitization, we used EC3 measurements (the effective concentration producing a stimulation index of three); for skin irritation, we used the mean scores for erythema and edema and reversibility information; and for eye irritation, we used corneal and iritis gradings and reversibility information, according to the GHS classification system (UNECE 2019). Biological data curation was followed by chemical structure curation: We removed mixtures, inorganics, and organometallic compounds; cleaned and neutralized salts; normalized specific chemotypes; and treated chemicals with multiple replicated records as follows: a) when replicated records presented the same binary outcome, only one record was kept; b) when the majority of replicate records presented the same binary outcome and one had a different binary outcome, only one record with the most common binary outcome was kept; and c) when replicated records had different binary outcomes, all of them were removed. All the curated data are available in the Supplementary Material in xlsx format (Excel Tables S1–S7) and can also be downloaded in SDF from the STopTox web portal (https://stoptox.mml.unc.edu/) and GitHub (https://github.com/joyvb/stoptox).
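The replicate-handling rules a)–c) can be expressed compactly. The sketch below is one possible reading rather than the study's implementation: the function name `resolve_replicates` and the record layout (a standardized-structure key paired with a binary outcome) are hypothetical, and rule b) is interpreted as "at least two agreeing records and exactly one dissenting record."

```python
from collections import Counter, defaultdict


def resolve_replicates(records):
    """Collapse replicated records into one binary call per structure.

    records: iterable of (structure_key, binary_outcome) pairs, where the
    key identifies a standardized chemical structure (hypothetical layout).
    Returns a dict mapping each retained structure to its outcome.
    """
    groups = defaultdict(list)
    for key, outcome in records:
        groups[key].append(outcome)

    curated = {}
    for key, outcomes in groups.items():
        top, n_top = Counter(outcomes).most_common(1)[0]
        n_other = len(outcomes) - n_top
        unanimous = n_other == 0                             # rule a)
        single_dissent = n_other == 1 and n_top >= 2         # rule b), as interpreted here
        if unanimous or single_dissent:
            curated[key] = top
        # rule c): any other disagreement -> the structure is dropped entirely
    return curated
```

Under this reading, a structure with outcomes (1, 1, 0) is kept as positive, whereas structures with outcomes (1, 0) or (1, 1, 0, 0) are removed.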
Skin sensitization data were compiled from two sources: a) National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods on behalf of ICCVAM (ICCVAM 2013) and b) the publicly available Registration, Evaluation, Authorisation and restriction of Chemicals (REACH) study results database (ECHA and OECD 2019). The ICCVAM database included 1,060 chemical records with local lymph node assay (LLNA) data. Chemicals were classified as sensitizers/nonsensitizers following the GHS (UNECE 2019), where the presence of a dose that produces a stimulation index of three (EC3) was used as a threshold for a positive response. In other words, compounds without a reported EC3 are classified as nonsensitizers, whereas those with a reported EC3 are classified as sensitizers.
After curation, 515 unique compounds (330 sensitizers and 185 nonsensitizers) were retained. The REACH database initially comprised 10,588 records for 9,801 chemicals. The REACH data set is composed of many types of assays and study categories. In vitro and weight of evidence categories were discarded. Data from different OECD skin sensitization assays (OECD guidelines 406, 411, 429 and 442B; OECD 1981, 1992, 2010a, 2010b) were available; only the data corresponding to LLNA assays (429 and 442B) were selected, resulting in 1,275 data points with LLNA records. After curation, 541 compounds (192 sensitizers and 349 nonsensitizers) were retained. Eventually, we merged the curated data from ICCVAM and REACH and examined the content of this combined data. There were 56 groups of replicated chemicals between these two data sets, and the sensitization potential of five of these pairs was different. These discordant records were removed, and only one record for each concordant set of replicates was kept. The merged data set had 1,000 unique compounds (481 sensitizers and 519 nonsensitizers).
Skin irritation and corrosion.
Experimental animal data on skin irritation and corrosion were retrieved from the REACH study results database (ECHA and OECD 2019). After removing inconsistent data, 1,631 out of the original 5,274 data points were left. After removal of mixtures, inorganics, and counter-ions, 1,326 records remained. We followed the GHS (UNECE 2019) to classify the data: If the mean erythema/edema score is greater than 2.3 and the effects are reversible, the chemical is considered an irritant. If the effect is irreversible and corrosive reactions are present, the chemical is considered corrosive to the skin (OECD 2015).
Among 124 replicate groups of chemicals in the data set, 95 had concordant and 29 had discordant toxicity calls. All the discordant replicates were removed, and only one representative of a pair/pool of concordant replicates was kept. The final data set had 1,012 unique chemical compounds, including 40 corrosives, 277 irritants, and 695 nonirritants. Because there were only a few corrosive compounds in our data set, we decided to merge the corrosive and irritant classes and model only irritant vs. nonirritant compounds. We note that these models have limited regulatory value at the moment with respect to compounds predicted to be toxic, because regulators typically would like to see more granular measurement or prediction at the level of specific subcategories of toxicity. However, we highlight and emphasize that our models make accurate predictions of nontoxic compounds, thereby helping both regulators and respective regulated industries to support the development of safer chemicals. Our resulting data set contained 317 irritants vs. 695 nonirritants. Because the data set was imbalanced, we applied an undersampling technique where the majority class was sampled in a way to match the number of records of the minority class. This sampling was done by searching for the compounds in the majority class that had higher similarity (Tanimoto coefficient) with compounds in the minority class. The balanced data set consisted of 554 compounds (277 irritants and 277 nonirritants).
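A minimal sketch of similarity-guided undersampling is given below. The text does not specify the exact selection procedure, so this version makes an explicit assumption: each majority-class compound is ranked by its highest Tanimoto coefficient to any minority-class compound, and the top-ranked compounds are kept until the classes are the same size. Fingerprints are represented as plain Python sets of on-bits, and all names are hypothetical.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def undersample_majority(majority: dict, minority: dict) -> list:
    """Pick as many majority compounds as there are minority compounds.

    majority, minority: dicts mapping a compound name to its set of on-bits.
    Majority compounds are ranked by their best similarity to the minority
    class (an assumed criterion), and the most similar ones are returned.
    """
    def best_similarity(name):
        return max(tanimoto(majority[name], fp) for fp in minority.values())

    ranked = sorted(majority, key=best_similarity, reverse=True)
    return ranked[: len(minority)]
```

Keeping the majority-class compounds closest to the minority class yields a balanced set in which the two classes occupy similar regions of chemical space, which is presumably the motivation for this choice over random undersampling.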
After we removed the inconsistent data, 7,196 out of the original 7,332 experimental animal data points for eye irritation and corrosion remained. After we removed mixtures, inorganics, and counter-ions, 5,985 records remained. All the discordant replicates were removed, and only one representative of a pair/pool of concordant replicates was kept. The final data set had 3,545 unique chemical compounds, including 1,145 irritants and 2,400 nonirritants. Because the data set was imbalanced, we applied an undersampling technique where the majority class was sampled in a way to match the number of records of the minority class. This sampling was achieved by searching for the compounds in the majority class that had higher similarity (Tanimoto coefficient) with compounds in the minority class. The balanced data set consisted of 2,292 compounds (1,146 eye irritants and 1,146 nonirritants).
Acute dermal toxicity.
The acute dermal toxicity data set was retrieved from the REACH study results database (ECHA and OECD 2019), the publicly available database ToxValDB (Judson 2018), and from the literature (Creton et al. 2010). After removing the inconsistent data, 5,259 out of the original 29,824 data points were left; the major reason for compound removal was the presence of many compounds without a defined LD50. The GHS was used to classify the chemicals (UNECE 2019). A chemical was labeled as toxic if its LD50 fell below the GHS threshold, expressed per unit of body weight (BW). After the removal of mixtures, inorganics, and organometallic compounds, 4,601 records remained. Among 1,979 groups of chemical replicates in the data set, 1,836 had concordant toxicity calls, and 143 were discordant. All the discordant replicates were removed, and only one representative of a pair/pool of concordant replicates was kept. The final data set had 2,616 unique chemical compounds, including 382 dermally toxic compounds and 2,234 Not Classified compounds. Because the data set was imbalanced, we applied an undersampling technique where the majority class was sampled to match the number of records of the minority class. This sampling was conducted by searching for the compounds in the majority class that had higher similarity (Tanimoto coefficient) with compounds in the minority class. The balanced data set consisted of 764 compounds, including 382 toxic compounds and 382 Not Classified compounds.
Acute inhalation toxicity.
The acute inhalation toxicity data set was retrieved from the REACH study results database (ECHA and OECD 2019) and from the publicly available database ToxValDB (Judson 2018). The chemicals were classified as toxic according to the GHS thresholds defined separately for gases, vapors, and dusts/mists (UNECE 2019). After removing inconsistent data, only 2,061 out of the original 8,176 data points were left. This dramatic reduction of the data set was mainly because of the presence of many compounds without a defined median lethal concentration and because of the absence of information regarding the exposure method used (gas, dust, or mist), which is essential for GHS classification. After the removal of mixtures, inorganics, and counter-ions, 1,637 records remained. Among 527 groups of chemical replicates in the data set, 501 had concordant toxicity calls, and 26 were discordant. All the discordant replicates were removed, and only one representative of a pair/pool of concordant replicates was kept. The final data set had 681 unique chemical compounds and was balanced because it included 345 toxic compounds and 336 Not Classified compounds.
Acute oral toxicity.
The acute oral toxicity data set was retrieved from the National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM) workshop for the Collaborative Acute Toxicity Modeling Suite (CATMoS) project that our team was part of (ICCVAM 2019; Kleinstreuer et al. 2018; Mansouri et al. 2021). The GHS was used to classify the chemicals (UNECE 2019). A chemical was labeled as toxic if its LD50 fell below the GHS threshold, expressed per unit of body weight (BW). After removing inconsistent data, 8,981 out of the original 8,994 data points were left. After removal of mixtures, inorganics, and counter-ions, 8,978 records remained. A total of 406 groups of chemical replicates were found in the data set. All the discordant replicates were removed, and only one representative of a pair/pool of concordant replicates was kept. The final data set had 8,442 unique chemical compounds, including 4,803 toxic compounds and 3,639 Not Classified compounds.
Cosmetic ingredient database (CosIng).
CosIng is the European Commission database for information on cosmetic substances and ingredients (European Commission 2017). This data set contained 5,166 chemical records with a defined chemical structure. After curation, 3,850 unique chemical substances were kept for virtual screening using the developed models. The virtual screening results are available in Excel Table S8.
The REACH data come from registration dossiers submitted to the European Chemicals Agency (ECHA) by May 2019 (ECHA 2019). The database contained 20,000 substances, of which 15,438 were chemical records with a defined chemical structure. After curation, 10,465 unique chemical substances were kept for prediction purposes. The virtual screening results are available in Excel Table S9.
Binary QSAR models were developed and rigorously validated according to the best practices of QSAR modeling (Tropsha 2010). Two-dimensional Morgan fingerprints (Aslett et al. 2010) and Molecular ACCess System (MACCS) keys (Anderson 1984), calculated with the RDKit package (version 2020.03.1.0), and Mordred descriptors, calculated with the Mordred package (version 1.2.0) for Python (Moriwaki et al. 2018), were combined with the random forest algorithm (Breiman 2001), as implemented in the RandomForestClassifier of scikit-learn (version 1.0) (Pedregosa et al. 2012), for model development.
We followed a proper external 5-fold cross-validation procedure. First, the entire data set was split into five parts of the same size. Then, for each iteration, one of these subsets (20% of compounds) was used as a test set, and the other four sets (80% of compounds) were used as the training set. We repeated this procedure five times until each of the five subsets was used once as a test set. In addition, each training set was internally divided into multiple training and validation sets for model training and hyperparameter tuning. The models were generated using only the training set. The true test sets were never employed to generate or to select the models. We repeated this procedure using three different types of descriptors (Morgan, MACCS, and Mordred). The final statistics were based on the consensus (average prediction) of these models. The consensus model considers the majority rule (at least two out of three) for the final classification.
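The external 5-fold protocol and the majority-rule consensus can be sketched as follows. This is an illustration under stated assumptions: the random (non-stratified) shuffling, the seed, and the function names are hypothetical, and model training itself is omitted; each descriptor-specific model is assumed to return a binary call (1 = toxic, 0 = not classified).

```python
import random


def five_fold_indices(n_compounds: int, n_folds: int = 5, seed: int = 0):
    """Yield (train, test) index lists; each fold serves once as the test set.

    The shuffling scheme is an assumption; the text only states that the
    data set was split into five parts of the same size.
    """
    indices = list(range(n_compounds))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def consensus_call(morgan: int, maccs: int, mordred: int) -> int:
    """Majority rule: positive when at least two of the three models agree."""
    return int(morgan + maccs + mordred >= 2)
```

In each iteration, models would be fitted and tuned on `train` only, and `consensus_call` would then be applied to the three descriptor-specific predictions for every compound in `test`.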
In every case, only the modeling set was used to develop the models, whereas the external sets were used for the evaluation of their predictive power. In addition, 10 rounds of Y-randomization were performed for each data set to ensure that the model performance was not due to chance correlations. The applicability domain (AD) of the models was estimated using the z-cutoff method (Tropsha and Golbraikh 2007) along with Dice similarity. In the STopTox web app, the user can visualize the similarity distribution of the training set and how far the query compound is from the threshold (see Figure S1). If the query compound is below the threshold, then it is outside the model’s applicability domain. If it is above the threshold, then it is inside. All the code used to generate the models is available at https://github.com/joyvb/stoptox.
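One way to realize a similarity-based z-cutoff is sketched below. The details are assumptions rather than the published implementation: the threshold is taken as the mean nearest-neighbor Dice similarity within the training set minus z standard deviations, z = 0.5 is an illustrative value, a query exactly at the threshold is treated as inside the domain, and fingerprints are plain sets of on-bits.

```python
from statistics import mean, pstdev


def dice(a: set, b: set) -> float:
    """Dice similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def ad_threshold(training_fps: list, z: float = 0.5) -> float:
    """Similarity threshold from the training set (assumed z-cutoff form)."""
    nearest = []
    for i, fp in enumerate(training_fps):
        # Highest similarity of this compound to any other training compound.
        nearest.append(max(dice(fp, other)
                           for j, other in enumerate(training_fps) if j != i))
    return mean(nearest) - z * pstdev(nearest)


def inside_domain(query_fp: set, training_fps: list, threshold: float) -> bool:
    """A query is inside the AD if its nearest training neighbor is similar enough."""
    return max(dice(query_fp, fp) for fp in training_fps) >= threshold
```

A larger z lowers the threshold and widens the domain, trading prediction coverage against the expected reliability of predictions for compounds far from the training set.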
The predictive performance of QSAR models was evaluated using correct classification rate (CCR), sensitivity (SE), specificity (SP), and positive (PPV) and negative (NPV) predictive values (Equations 1–5):
where N represents the number of compounds, TP and TN represent the number of true positives and true negatives, and FP and FN represent the number of false positives and false negatives, respectively.
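As a minimal sketch, the five metrics can be computed from the confusion-matrix counts as shown below. The formulas are the conventional definitions and are assumed here to correspond to Equations 1–5; in particular, CCR is taken as the mean of SE and SP (balanced accuracy). The function name is hypothetical.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Return SE, SP, CCR, PPV, and NPV from confusion-matrix counts.

    Uses the conventional definitions (an assumption); raises
    ZeroDivisionError if any denominator is zero.
    """
    se = tp / (tp + fn)   # sensitivity: fraction of true toxicants recovered
    sp = tn / (tn + fp)   # specificity: fraction of true nontoxicants recovered
    return {
        "SE": se,
        "SP": sp,
        "CCR": (se + sp) / 2,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }
```

Because CCR averages the per-class recall, it is not inflated by class imbalance in the way that plain accuracy would be.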
Additional external validation using known toxicants from the literature.
For additional external validation of our models, we conducted a literature search for toxicants that were absent in our database but were later described elsewhere. We used the PubMed database and Chemotext web portal (Capuzzi et al. 2018) with the following search criteria: “Skin sensitization” OR “Skin sensitizers” AND “Clinical studies”/“Skin irritation” OR “Skin irritants” AND “Clinical studies”/“Eye irritation” OR “Eye irritants” AND “Clinical studies”/“Acute oral toxicity” AND “compound” AND “Clinical studies”/“Acute dermal toxicity” AND “compound” AND “Clinical studies”/“Acute inhalation toxicity” AND “compound” AND “Clinical studies.” No inclusion/exclusion criteria were used, and the last search was executed in July 2020. After collecting these data, we curated them (see “Data Curation” section) and verified that the compounds were not included in the modeling data set. Then, we employed STopTox models to predict each toxicant and analyzed whether our models were capable of correctly predicting their toxicity potential.
We applied the developed QSAR models to predict toxicities for compounds included in the CosIng and REACH databases as well as to augment the STopTox data matrix, which is extremely sparse, to identify additional putative toxicants. Both CosIng and REACH databases are described in the “Data Sets” section. The virtual screening results for CosIng, REACH, and STopTox are available in Excel Tables S8, S9, and S10, respectively.
The STopTox web-based application runs machine learning routines written in Python, using Flask (version 2.03; Python Software Foundation), a lightweight web framework, at the back end. Models were developed using Scikit-Learn (version 1.0) (Scikit-Learn Developers). Angular (version 4) and TypeScript were used for the development of the front end, and Docker (version 19.03) and Docker-Compose (version 1.27.0) were used for the orchestration of containers. The developed models and all data sets are publicly available at https://stoptox.mml.unc.edu/.
Statistical characteristics of QSAR models developed in this study are summarized in Table 2. All cross-validated models for the 6-pack end points showed high predictive accuracy on independent external evaluation sets based on several metrics, including CCR, SE, SP, PPV, and NPV. The acute toxicity models showed CCR of 70%–77%; SE of 66%–85%; SP of 66%–80%; PPV of 69%–79%; and NPV of 71%–78%. A literature search executed after models were developed identified toxicants that were absent in our data sets. We used these compounds as an additional validation set for our models (see the section below).
Table 2 Statistical characteristics of QSAR models for 6-pack end points evaluated by 5-fold external cross-validation.
Model Validation with Known Toxicants Not Used in Model Development
For additional external validation of our models, we conducted a literature search for toxicants described in clinical studies or known toxicants for each end point that were absent in our database. We found 45 compounds for skin sensitization, 2 compounds for skin irritation, 3 compounds for eye irritation, 2 compounds for acute dermal toxicity, 5 compounds for acute inhalation toxicity, and 2 compounds for acute oral toxicity.
Altogether, our models correctly predicted 18 out of 25 (72%) of the known toxicants identified in the literature that were not present in our modeling set (see Figure 3 and Excel Table S11).
For skin sensitization, a list of 45 potential skin sensitizers in cosmetic ingredients was compiled by the Norwegian Scientific Committee for Food Safety (Norwegian Scientific Committee for Food Safety 2007). Eleven out of 45 compounds were absent from our skin sensitization data set, and 8 of the 11 chemicals were correctly predicted as sensitizers by our skin sensitization model.
For skin irritation, we found the compounds MS-222, a fish anesthetic commonly used in aquaculture (Park 2019), and sodium lauryl sulfate (De Jongh et al. 2006), a product widely used in personal care products—both known skin irritants that were not present in our skin irritation training data. Our models predicted sodium lauryl sulfate as a skin irritant and MS-222 as not classified (according to OECD Test No. 404 (OECD 2015), chemicals not classified as skin irritants are considered “Not Classified”).
We found three compounds that were not present in our eye irritation data set: glutaraldehyde, glyphosate, and paraquat (1,1’-dimethyl-4,4’-bipyridinium dichloride). Our model predicted glutaraldehyde and glyphosate as eye irritants. Exposure to glutaraldehyde during cataract surgery was associated with the development of toxic eye anterior segment syndrome in six patients (Ünal et al. 2006). Ocular glyphosate exposure was reported to be associated with the development of chemosis, heart palpitations, raised blood pressure, headache, and nausea (Bradberry et al. 2004). In two cases of accidental eye exposure to paraquat, eye damage was reported (Joyce 1969).
For the acute dermal end point, the compounds dichloromethane (Pacheco et al. 2016) and methanol (Kahn and Blum 1979) have been reported as systemic toxicants after dermal exposure and were not present in the modeling set. Our acute dermal toxicity model correctly predicted both compounds as toxic after dermal exposure.
For the acute inhalation end point, 20 chemicals commonly present in occupational inhalation accidents were compiled elsewhere (Miller and Chang 2003). Five organic chemicals in this list were absent from our acute inhalation data set. All five compounds were correctly predicted by the acute inhalation models.
For the acute oral end point, we found one clinical case of accidental oral exposure to the pyrethroid deltamethrin that led to the poisoning of a 4-y-old girl who consumed insecticidal chalk and was found unconscious 20 min after going outside to play (O’Malley 1997). We also found that mephedrone, a psychoactive drug, was reported as toxic in a study describing cases of acute toxicity related to self-reported use of mephedrone (Wood et al. 2010). Our acute oral toxicity model predicted both compounds as toxic if swallowed.
Figure 4 shows the predictions generated for N-phenyl-p-phenylenediamine, a known skin sensitizer usually added to temporary black henna tattoos, leading to many cases of contact allergy (Panfili et al. 2017). We also generated maps showing the relative significance of fragment contributions, providing a graphical interpretation of developed models (Figure 4). Atoms and structural fragments enhancing toxicity are highlighted in pink, and those decreasing toxicity are shown in green. These maps are generated for each of the 6-pack end points independently. Overall, these maps allow the user to analyze the individual contribution of each fragment for acute toxicity, facilitating a mechanistic interpretation of reported predictions.
Virtual Screening of CosIng, REACH, and STopTox Compounds
In the CosIng data set (n = 3,850 compounds), 1,366 compounds were predicted as skin sensitizers; 1,152 compounds were predicted as skin irritants; 1,674 compounds were predicted as eye irritants; 361 compounds were predicted as toxic if swallowed; 301 compounds were predicted as toxic if inhaled; and 257 compounds were predicted as toxic after dermal exposure. Out of 3,850 total compounds, there were 2,695 compounds predicted as toxic in at least one end point and 1,155 compounds predicted as “Not Classified” in all six end points.
In the REACH data set (n = 10,465 compounds), 4,018 compounds were predicted as skin sensitizers; 2,445 compounds were predicted as skin irritants; 4,605 compounds were predicted as eye irritants; 2,679 compounds were predicted as toxic if swallowed; 2,139 compounds were predicted as toxic if inhaled; and 1,899 compounds were predicted as toxic after dermal exposure. There were 7,641 compounds predicted as toxic in at least one end point and 2,824 compounds predicted as “Not Classified” in all six end points.
In the STopTox data set (compounds with missing toxicity values), 4,792 compounds were predicted as skin sensitizers; 2,491 compounds were predicted as skin irritants; 4,766 compounds were predicted as eye irritants; 5,232 compounds were predicted as toxic if swallowed; 2,394 compounds were predicted as toxic if inhaled; and 2,902 compounds were predicted as toxic after dermal exposure. There were 7,641 compounds predicted as toxic in at least one end point and 2,824 compounds predicted as “Not Classified” in all six end points.
Model Implementation in the STopTox Web App
The QSAR models were implemented in the STopTox web app (https://stoptox.mml.unc.edu/). STopTox has an intuitive user interface in which the user may draw a compound of interest in the “molecular editor” box or directly paste the SMILES string of the chemical structure of interest. After clicking the “Predict STopTox” button, the user receives the predicted outcomes (e.g., toxic, nontoxic) from the QSAR models developed for each of the 6-pack acute toxicity end points. For each prediction, we also report whether the compound falls within the model AD, based on how close it is to the AD threshold (Tropsha and Golbraikh 2007), and we provide a visual mechanistic interpretation of the prediction using color-coded maps of predicted fragment contributions (Riniker and Landrum 2013). In this algorithm, the predicted contribution of an atom is obtained by assessing the difference in the prediction when the bits corresponding to that atom/fragment are removed. Then, the normalized contribution is used to color the atoms in a topography-like map. Using these maps, the structural fragments predicted to increase the respective toxicity are highlighted in red, and the fragments predicted to decrease toxicity are highlighted in green. Gray isolines define the frontier between the positive (red) and the negative (green) contributions (see Figures 3 and 4). In addition, the prediction confidence, which is estimated by the majority voting of internal models (number of trees) in the random forest algorithm (Breiman 2001), is also available.
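The bit-removal idea behind the fragment contribution maps can be illustrated with a toy example. The sketch below is not the web app's code: the fingerprint is a plain set of on-bits, `atom_bits` is a hypothetical mapping from each atom index to the bits that atom participates in, and `toy_model` is a stand-in for a trained classifier's predicted probability of toxicity.

```python
def atom_contributions(on_bits: set, atom_bits: dict, predict_proba) -> dict:
    """Normalized per-atom contributions to the predicted probability.

    The contribution of an atom is the drop in predicted probability when
    the bits involving that atom are removed; values are scaled so that
    the largest absolute contribution equals 1.
    """
    base = predict_proba(on_bits)
    raw = {atom: base - predict_proba(on_bits - bits)
           for atom, bits in atom_bits.items()}
    scale = max((abs(w) for w in raw.values()), default=0.0)
    if scale == 0:
        return raw
    return {atom: w / scale for atom, w in raw.items()}


def toy_model(bits: set) -> float:
    """Hypothetical stand-in for a classifier: a clipped linear score."""
    weights = {1: 0.3, 2: -0.2, 3: 0.1}
    score = 0.5 + sum(weights.get(b, 0.0) for b in bits)
    return min(1.0, max(0.0, score))


contributions = atom_contributions({1, 2, 3}, {0: {1}, 1: {2}, 2: {3}}, toy_model)
```

In this toy case, atoms 0 and 2 receive positive contributions (they would be drawn in red, as increasing predicted toxicity), and atom 1 receives a negative contribution (drawn in green).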
Data Curation and QSAR Model Development
Although predictive models have been developed and reported previously for subsets of the 6-pack end points (Table 1), many of these models did not fully comply with the model validation guidelines specified by the OECD (OECD 2004) and, most notably, lacked proper data curation. In our study, we devoted significant effort to the curation of both chemical and biological data using robust protocols established previously by our group (Fourches et al. 2010, 2016). As can be seen in Figure 2, data curation had a dramatic effect on the size of the data: in all but one case, it decreased the size of the available data by about 52%–92%. Our final database comprised a matrix containing 11,941 compounds with activity measurements for at least one of the 6-pack end points (sparsity degree of 76%).
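To make the sparsity figure concrete, the sketch below computes the sparsity of a small compound-by-end-point matrix. It assumes that "sparsity degree" means the fraction of empty compound/end-point cells; the toy matrix and variable names are illustrative only. Under that assumption, and because the reported 76% is rounded, the final line gives only a rough estimate of the number of measured cells.

```python
from typing import List, Optional


def sparsity(matrix: List[List[Optional[int]]]) -> float:
    """Fraction of compound/end-point cells with no measurement (None)."""
    cells = [value for row in matrix for value in row]
    return sum(value is None for value in cells) / len(cells)


# Toy matrix: 2 compounds x 6 end points; 1 = toxic, 0 = nontoxic, None = not measured.
toy_matrix = [
    [1, None, None, 0, None, None],
    [None, None, 1, None, None, None],
]
toy_sparsity = sparsity(toy_matrix)  # 9 of 12 cells are empty

# Rough estimate for the reported matrix (assumes the definition above).
n_compounds, n_end_points, reported_sparsity = 11_941, 6, 0.76
approx_measured_cells = round(n_compounds * n_end_points * (1 - reported_sparsity))
```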
Previously, we built models to predict skin sensitization end points using a combination of animal data (Alves et al. 2015), OECD-validated in vitro assays (Alves et al. 2018a), and human data (Alves et al. 2016a, 2018a; Borba et al. 2020). In this study, we developed QSAR models for predicting skin sensitization testing outcomes using only the LLNA because STopTox is intended as a reliable NAM for the 6-pack assays. The models were thoroughly validated following the best practices for model development and validation suggested by the OECD for the use of QSAR models for regulatory purposes (OECD 2007). The models showed high accuracy when evaluated by 5-fold external cross-validation and by predicting an additional set of known toxicants external to the models (see the “Results” section). All the models reported here were therefore built using only data collected from animal tests that followed the OECD protocols.
Comparative Assessment of New 6-Pack Models vs. Alternative Tools: The Importance of Data Curation
Curation of the ECHA database proved to be extremely laborious and was the most time-consuming part of this work. It is important to emphasize that much of the 6-pack end point data included in the ECHA database could not (and should not) be used for model development. As seen from the summary of data curation (Figure 2), the major reduction in the size of the individual data sets eventually used for QSAR model development was due to a large fraction of inconsistent data in the original ECHA database. Data were categorized as inconsistent if they were not generated following the OECD protocols; if compounds were tested at only one or a few concentrations and could not be classified into GHS classes; if they were labeled as nonexperimental (e.g., obtained using QSAR and/or read-across predictions and/or weight-of-evidence decisions); or if they referred to complex mixtures. In addition, for each end point, we kept only the data containing measurements for the standard OECD protocol (see “Materials and Methods” section). As an example, the report on acute inhalation toxicity for diboron trioxide (ECHA 2020b) lacks very important information, such as the animal species used for testing, route of administration, and duration of exposure. GHS classification for acute inhalation toxicity depends on the route of administration (UNECE 2019), making it difficult to classify this compound as toxic or nontoxic. The OECD guidelines also state that the test must be done in rats with 4 h of exposure to the tested chemical. Calcium iodate, an inorganic compound, was reported as a category 2A eye irritant based on a QSAR prediction (ECHA 2020a). ECHA presented other reports showing clinical evidence of eye irritation in humans. Still, because categorization is done based on animal tests, we could not trust these data for modeling. There were many other examples of incomplete or untrustworthy reports in the databases, which significantly decreased the data set size.
Comparison of our results with models developed for the same end points without rigorous data curation (Luechtefeld et al. 2018) suggests that our extensive data curation procedures resulted in smaller data sets and, formally, lower model performance than reported for those models. Specifically, we compared the models produced in this study with those reported by Luechtefeld et al. (2018), who described the development of a suite of in silico models, termed read-across structure–activity relationships (RASAR), for the 6-pack end points. Because the RASAR model predictions could be accessed only through a fee-based commercial platform (https://www.ulreachacross.com, now defunct), we performed an indirect comparison of the respective statistics (see Table 3). Our models showed, on average, a 10% lower CCR. The amount of data reported in that study (Luechtefeld et al. 2018) was, on average, five times larger than the carefully curated data set used in this study. We have previously expressed concerns that the high accuracy of the models as reported (Luechtefeld et al. 2018) could be the consequence of inadequate data curation, which left many replicate compounds in the modeling and validation data sets (Alves et al. 2021). We posit that our results reflect the actual model performance for these end points more accurately because we eliminated such confounders as replicate entries and predicted or “not reliable” values, and because we conducted more rigorous validation procedures according to the established guidelines. Our exercise reemphasizes the importance of proper data curation and cautions against overinterpreting results from models built on noncurated data sets.
Table 3. Indirect comparison of STopTox (5-fold external cross-validation) and RASAR (as reported in the original publication). Column heading recovered: Number of chemicals.
Note: CCR, correct classification rate; RASAR, read-across structure–activity relationships. *Data retrieved from Luechtefeld et al. (2018).
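The sketch below illustrates the two quantities behind this comparison: the CCR and the overlap between modeling and validation sets. It assumes that CCR is the mean of sensitivity and specificity (a common definition in the QSAR literature; this section only expands the acronym) and that compounds are compared through some standardized identifier. The function names and toy data are hypothetical.

```python
from typing import List, Sequence


def ccr(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Correct classification rate, taken here as (sensitivity + specificity) / 2."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def shared_compounds(train_ids: Sequence[str], external_ids: Sequence[str]) -> List[str]:
    """Identifiers present in both sets; any hit means the external set is not truly external."""
    return sorted(set(train_ids) & set(external_ids))


# Imbalanced toy set: 2 toxic and 6 nontoxic compounds, one toxic compound missed.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
toy_ccr = ccr(y_true, y_pred)
toy_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Identifiers are assumed to be standardized beforehand (e.g., canonical structures).
leaked = shared_compounds(["A", "B", "C"], ["C", "D"])
```

On this toy set, plain accuracy is higher than the CCR because the nontoxic majority dominates, which is why a balanced statistic is preferable for imbalanced end points. Replicate compounds shared between modeling and validation sets inflate either statistic.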
STopTox Usability and Interpretation
It is essential to note that, when a model predicts a compound as toxic or nontoxic, the prediction should be considered only in the context of the specific dose-dependent observation for each assay; increasing the dose of a compound in any assay will often lead to toxic effects. For instance, skin sensitization potency is based on lymph node cell proliferation induced by the test chemical and is expressed as a stimulation index (SI) relative to values obtained with concurrent controls. If SI ≥ 3, the substance is considered a sensitizer at the tested concentration. Similar considerations were applied in transforming the measurement results into binary format for the other end points.
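The transformation of LLNA readouts into binary labels can be sketched as follows. The cutoff of 3 is the standard LLNA criterion from OECD Test Guideline 429 and is stated here as an assumption about the protocol used; the function names and proliferation values are hypothetical.

```python
from typing import Dict, Sequence

# Assumption: standard LLNA cutoff (OECD TG 429); not taken from this section.
SI_THRESHOLD = 3.0


def stimulation_index(treated: Sequence[float], control: Sequence[float]) -> float:
    """Mean proliferation in the treated group relative to concurrent vehicle controls."""
    mean_treated = sum(treated) / len(treated)
    mean_control = sum(control) / len(control)
    return mean_treated / mean_control


def sensitizer_calls(si_by_concentration: Dict[float, float]) -> Dict[float, bool]:
    """Binary call per tested concentration: True where SI reaches the threshold."""
    return {conc: si >= SI_THRESHOLD for conc, si in si_by_concentration.items()}


# Hypothetical proliferation readouts for one dose group and its controls.
si = stimulation_index(treated=[900.0, 1100.0], control=[250.0, 250.0])

# Hypothetical SI values at three tested concentrations (% w/v).
calls = sensitizer_calls({0.5: 1.5, 1.0: 2.9, 2.5: 4.0})
```

The toy compound is labeled a sensitizer only at the highest concentration, which illustrates why a binary prediction has to be read against the conditions under which the training data were generated.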
These considerations are often overlooked when making predictions or assertions concerning the expected chemical toxicity. The ultimate goal of any method for evaluating acute toxicity is to provide an accurate assessment of the potential risk of a chemical to human safety (Basketter et al. 2015). Therefore, we emphasize that the limitations of the assays should influence both the interpretation of the predictions made by the models and the use of these models to help toxicologists in their decision-making. Predictions made with the QSAR models implemented in STopTox (actually, with any models) do not take the dose into account; they merely state whether a chemical is predicted to be toxic or nontoxic in each assay. Thus, users interpreting these predictions should always be familiar with, and keep in mind, the underlying experimental conditions under which compounds in the training sets were denoted as toxic or nontoxic. Further, these models are limited to binary hazard-based predictions and do not provide information on potency or on GHS or U.S. EPA subcategorization. Therefore, they are not directly applicable to the many regulatory classification and labeling requirements that demand a higher level of granularity. However, these models are well suited to assist in hazard assessment and chemical screening/prioritization, and, because of their high accuracy in terms of both sensitivity and specificity, they can be instrumental in identifying nontoxic compounds (tested under the same conditions as those identified as toxic, for which additional subcategorization is indeed necessary).
Virtual Screening of CosIng, REACH, and STopTox Compounds
As a case study illustrating STopTox usability, we applied our QSAR models, including AD estimation, to the European Commission CosIng database, the REACH database, and the STopTox data matrix (sparsity degree of 76%). Most compounds in each of these data sets were predicted as “Not Classified” by each individual model. The acute toxicity predictions for these databases illustrate the utility of QSAR models for prioritizing chemicals of concern for targeted biological testing in different chemical spaces such as cosmetics, pesticides, and industrial chemicals (Alves et al. 2018b). All compounds and corresponding predictions are available in the Supplemental Material.
STopTox is a comprehensive collection of computational models that can be used as an alternative to in vivo 6-pack tests for predicting chemical toxicity hazard. Models were established following the best practices for the development and validation of QSAR models (OECD 2004; Tropsha 2010) using the largest publicly available and carefully curated data sets that we compiled for all 6-pack assays. To the best of our knowledge, STopTox is the first publicly available portal that enables accurate prediction of chemical hazards for all the 6-pack end points at once using models developed with transparent approaches and carefully curated data. Despite their limitations concerning potency classes, the models are reliable for applications that do not require regulatory classification, such as the early stages of drug discovery (Hasselgren and Myatt 2018). We suggest that these models are valuable for both regulatory agencies and the respective industries in helping them identify safer chemicals using inexpensive in silico alternatives to in vivo testing of chemicals of interest. We emphasize that, to build predictive models, it is not enough just to use adequate chemical descriptors and powerful machine learning algorithms (Fourches et al. 2016); STopTox is the only 6-pack end point predictor in the public domain developed with extensively curated data and OECD-compliant modeling approaches. The STopTox web app provides users with access to statistically significant and externally predictive QSAR models of acute toxicity tests. The web app can rapidly evaluate acute toxicity hazards in chemical inventories. STopTox is freely available at https://stoptox.mml.unc.edu/. To the best of our knowledge, STopTox does not have analogs in terms of the level of data curation, validated statistical accuracy of the constituent models, transparency of the data, modeling methods and software tools, and public accessibility.
Supplemental Material includes curated data sets for each of the 6-pack end points and results for the virtual screening of the STopTox matrix and the CosIng and REACH databases in xlsx format.
This study was supported by the National Institutes of Health (NIH) (grants 1U01CA207160, R41ES033589, and 1R43ES032371) and CNPq (grant 400760/2014-2). J.B. thanks the CNPq and the Science without Borders program for the financial support of her visit to the University of North Carolina at Chapel Hill. V.A. thanks the Lush Prize.
Each author has contributed significantly to this work. J.V.B.B., V.M.A., C.H.A., E.N.M., and A.T. conceived and designed the study. J.V.B.B., K.O., A.C.S., S.U.S.H., and E.O. curated the data and developed the models. J.V.B.B., V.M.A., N.K., J.S., D.A., C.H.A., E.N.M., and A.T. analyzed the data. R.B. and D.K. incorporated the models into the STopTox web application. J.V.B.B., V.M.A., and E.N.M. wrote the first draft of the manuscript. All authors read, edited, and approved the final manuscript.
A.T., V.M.A., and E.N.M. are co-founders of Predictive, LLC, which develops computational methodologies and software for toxicity prediction. All other authors declare they have nothing to disclose.
Alves VM, Capuzzi SJ, Braga RC, Borba JVB, Silva AC, Luechtefeld T, et al. 2018a. A perspective and a new integrated computational strategy for skin sensitization assessment. ACS Sustainable Chem Eng 6(3):2845–2859, https://doi.org/10.1021/acssuschemeng.7b04220.
Barroso J, Pfannenbecker U, Adriaens E, Alépée N, Cluzel M, De Smedt A, et al. 2017. Cosmetics Europe compilation of historical serious eye damage/eye irritation in vivo data analysed by drivers of classification to support the selection of chemicals for development and evaluation of alternative methods/strategies: the Draize eye test Reference Database (DRD). Arch Toxicol 91(2):521–547, PMID: 26997338, https://doi.org/10.1007/s00204-016-1679-x.
National Research Council Committee on Animals as Monitors of Environmental Hazards. 1991. Animals as Sentinels of Environmental Health Hazards. Washington, DC: National Academies Press. https://www.ncbi.nlm.nih.gov/books/NBK234944/ [accessed 30 July 2021].
Pacheco C, Magalhães R, Fonseca M, Silveira P, Brandão I. 2016. Accidental intoxication by dichloromethane at work place: clinical case and literature review. J Acute Med 6(2):43–45, https://doi.org/10.1016/j.jacme.2016.03.008.
Panfili E, Esposito S, Di Cara G. 2017. Temporary black henna tattoos and sensitization to para-Phenylenediamine (PPD): two paediatric case reports and a review of the literature. Int J Environ Res Public Health 14(4):421, PMID: 28420106, https://doi.org/10.3390/ijerph14040421.
UNECE (United Nations Economic Commission for Europe). 2019. GHS (Rev.8) (2019): Globally Harmonized System of Classification and Labelling of Chemicals (GHS). Part 3.3. https://unece.org/ghs-rev8-2019 [accessed 30 December 2017].
National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM), National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina, USA
Address correspondence to Eugene N. Muratov and Alexander Tropsha, 100K Beard Hall, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC 27599 USA.