CATMoS: Collaborative Acute Toxicity Modeling Suite

Background: Humans are exposed to tens of thousands of chemical substances that need to be assessed for their potential toxicity. Acute systemic toxicity testing serves as the basis for regulatory hazard classification, labeling, and risk management. However, it is cost- and time-prohibitive to evaluate all new and existing chemicals using traditional rodent acute toxicity tests. In silico models built using existing data facilitate rapid acute toxicity predictions without using animals. Objectives: The U.S. Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM) Acute Toxicity Workgroup organized an international collaboration to develop in silico models for predicting acute oral toxicity based on five different end points: Lethal Dose 50 (LD50) value, U.S. Environmental Protection Agency hazard categories (four), Globally Harmonized System of Classification and Labeling hazard categories (five), very toxic chemicals (LD50 ≤ 50 mg/kg), and nontoxic chemicals (LD50 > 2,000 mg/kg). Methods: An acute oral toxicity data inventory for 11,992 chemicals was compiled, split into training and evaluation sets, and made available to 35 participating international research groups that submitted a total of 139 predictive models. Predictions that fell within the applicability domains of the submitted models were evaluated using external validation sets and then combined into consensus models to leverage the strengths of the individual approaches. Results: The resulting consensus predictions, which leverage the collective strengths of each individual model, form the Collaborative Acute Toxicity Modeling Suite (CATMoS). CATMoS demonstrated high performance in terms of accuracy and robustness when compared with in vivo results. Discussion: CATMoS is being evaluated by regulatory agencies for its utility and applicability as a potential replacement for in vivo rat acute oral toxicity studies.
CATMoS predictions for more than 800,000 chemicals have been made available via the National Toxicology Program’s Integrated Chemical Environment tools and data sets (ice.ntp.niehs.nih.gov). The models are also implemented in a free, standalone, open-source tool, OPERA, which allows predictions of new and untested chemicals to be made. https://doi.org/10.1289/EHP8495


Introduction
Acute systemic toxicity studies are required by regulators around the world to inform chemical hazard classification, labeling, and risk management. The testing to assess acute systemic toxicity is conducted in vivo through a predefined route of exposure (oral, dermal, or via inhalation) during a fixed observation period as described in test guidelines issued by the Organization for Economic Cooperation and Development (OECD) (OECD 2002a, 2002b, 2002c, 2008). Five U.S. agencies [Consumer Product Safety Commission (CPSC), Department of Defense (DoD), Department of Transportation (DoT), Environmental Protection Agency (U.S. EPA), and Occupational Safety and Health Administration (OSHA)], as well as Registration, Evaluation, Authorization, and Restriction of Chemicals (REACH) in Europe, use the median Lethal Dose 50 (LD50; the dose of a substance that would be expected to kill half the animals in a test group) from acute oral toxicity data for the classification and labeling of chemical substances (ECHA 2008; Kleinstreuer et al. 2018; Strickland et al. 2018). However, in vivo acute oral toxicity testing is cost- and time-prohibitive and raises ethical concerns related to the use of many animals. Given the large number of new and existing substances requiring assessment, there is a pressing need for cost-effective and rapid nonanimal alternatives.
Recent technological advances in computational resources and artificial intelligence have increased the accuracy and speed of machine learning algorithms. As a result, in silico approaches such as quantitative structure-activity relationships (QSARs) are increasingly recognized as alternatives that can bridge gaps in knowledge about chemical properties and their biological activities. QSARs are promoted for their ability to accurately predict toxicological end points at low cost, but also for being reliable, reproducible, and broadly applicable to the diversity of chemicals requiring testing (Dearden et al. 2009; Worth et al. 2005). Consequently, the integration of nonanimal methods for assessing chemical toxicity is gaining momentum. In Europe, REACH regulations call for the use of nonanimal methods to assess chemical toxicity (Benfenati et al. 2011; European Commission, Environment Directorate General 2007; Lahl and Hawxwell 2006). Similarly, in 2020, the U.S. EPA created a New Approach Methods (NAMs) Work Plan to prioritize agency efforts and resources toward activities that will reduce the use of animal testing while continuing to protect human health and the environment (U.S. EPA 2020). Furthermore, the Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM), consisting of representatives from 16 U.S. federal agencies, has several workgroups focused on the development or validation of NAMs. These workgroups contribute to the goals of the ICCVAM Strategic Roadmap for Establishing New Approaches to Evaluate the Safety of Chemicals and Medical Products in the United States (Interagency Coordinating Committee on the Validation of Alternative Methods 2018). One of the ICCVAM ad hoc workgroups established was the Acute Toxicity Workgroup (ATWG), which sought to develop an implementation plan for identifying, evaluating, and applying alternative methods for acute systemic toxicity (Kleinstreuer et al. 2018; Lowit et al. 2017).
An initial ATWG study was conducted to assess the acute toxicity data regulatory requirements, needs, and decision contexts of member agencies as well as to understand the current acceptance of alternative methods (Strickland et al. 2018). Subsequent charges of the ATWG were to identify, acquire, and curate high-quality data from reference test methods that could be used to evaluate existing models for acute toxicity as well as investigate the feasibility of developing new models. Focusing initially on the oral route of exposure to evaluate existing in silico models, the ATWG organized an international collaborative project to develop new in silico models for predicting acute oral systemic toxicity (Kleinstreuer et al. 2018;Strickland et al. 2018).
International consortia have successfully developed collaborative computational solutions for challenging toxicological problems. Examples in the area of endocrine disruption screening include the Collaborative Estrogen Receptor Activity Prediction Project (CERAPP) (Mansouri et al. 2016a) and the Collaborative Modeling Project for Androgen Receptor (CoMPARA) (Mansouri et al. 2020). The predictive consensus models from these projects have been integrated to assess the endocrine activity potential of organic chemicals within the EPA's Endocrine Disruptor Screening Program (EDSP) (U.S. EPA-NCCT 2014b). The global network of experts represented by these successful consortia was leveraged for the current acute oral systemic toxicity modeling project, and the legacy workflows from CERAPP and CoMPARA were adapted and applied for the data analysis and modeling conducted herein.
For the current project, the U.S. National Toxicology Program (NTP) Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM) and the U.S. EPA's Center for Computational Toxicology and Exposure (CCTE) collected and curated rat oral LD50 data for more than 15,000 substances from public sources to produce data sets that were used during the project as training and evaluation sets (Kleinstreuer et al. 2018). Thirty-five international collaborators representing various sectors, including government, industry, and academia, participated in this effort, which produced a total of 139 different models. All submitted models were both quantitatively and qualitatively evaluated. A workshop was convened (https://ntp.niehs.nih.gov/go/atwksp-2018) to bring contributing computational modelers and regulatory decision makers together to discuss the feasibility of using in silico predictions for regulatory use in lieu of in vivo acute oral systemic toxicity testing (Kleinstreuer et al. 2018). Ultimately, predictions within the applicability domains of the developed models were combined into consensus predictions based on a weight-of-evidence (WoE) approach, forming the Collaborative Acute Toxicity Modeling Suite (CATMoS). CATMoS was then implemented into the open-source, open-data OPERA [OPEn (q)saR App] tool to enable further screening of new chemicals (Mansouri et al. 2016b). This paper provides a description of the data on which the CATMoS models are based, the evaluation process, and the development of the consensus models. We close with a discussion of the limitations of CATMoS and a description of the implementation and additional evaluation of the model.

U.S. Regulatory Uses for Acute Oral Toxicity Data
Prior to identifying any existing alternative methods or investing in the development, optimization, and validation of new ones, it is important to understand the current regulatory needs and decision contexts, including the use and acceptance of nonanimal data for the toxicological end point of concern. Strickland et al.
(2018) described the use of acute oral toxicity data by ICCVAM regulatory agencies to provide a basis for identifying opportunities for flexibility with regard to replacing or reducing the need for in vivo acute oral toxicity studies (Strickland et al. 2018). The regulatory needs of these agencies require three different types of acute toxicity outcomes, as detailed in Table 1: a) an LD50 value estimate; b) a binary outcome based on a single threshold; and c) a multiclass scheme based on different thresholds. Two binary models were relevant to U.S. agencies: a) the identification of whether a chemical was "very toxic" (i.e., LD50 ≤ 50 mg/kg); and b) identification of whether a chemical was "nontoxic" (i.e., LD50 > 2,000 mg/kg). Multiclass schemes in use by several agencies included hazard categories defined by the U.S. EPA and the U.N. Globally Harmonized System of Classification and Labeling of Chemicals (GHS), which consist of four or five categories, respectively (Table 2) (Strickland et al. 2018).
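The threshold logic behind these outcome types can be illustrated with a short sketch. The very toxic and nontoxic cutoffs are as stated above; the U.S. EPA and GHS category boundaries used here are the standard published thresholds (summarized in Table 2 of the paper) and are included only for illustration:

```python
# Illustrative sketch: mapping a rat oral LD50 (mg/kg) to the project's
# end points. EPA/GHS boundaries below are the commonly published
# thresholds and should be verified against Table 2.
EPA_BOUNDS = [50, 500, 5000]           # U.S. EPA Categories I-IV (mg/kg)
GHS_BOUNDS = [5, 50, 300, 2000, 5000]  # GHS Categories 1-5 (mg/kg)

def classify(ld50):
    """Map a rat oral LD50 (mg/kg) to the five acute oral end points."""
    epa = 1 + sum(ld50 > b for b in EPA_BOUNDS)
    ghs = 1 + sum(ld50 > b for b in GHS_BOUNDS)
    return {
        "very_toxic": ld50 <= 50,    # binary end point 1
        "nontoxic": ld50 > 2000,     # binary end point 2
        "epa_category": epa,         # multiclass, four categories
        "ghs_category": ghs if ld50 <= 5000 else None,  # five categories
    }
```

For example, a chemical with an LD50 of 25 mg/kg would be flagged very toxic, placed in U.S. EPA Category I, and assigned GHS Category 2 under these thresholds.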
Based on this information, for this project we asked participants to develop models to predict one or more of the following end points:
• Very toxic (LD50 ≤ 50 mg/kg)
• Nontoxic (LD50 > 2,000 mg/kg)
• Discrete LD50 value (mg/kg)
• U.S. EPA hazard categories
• GHS hazard categories
Based on ATWG feedback and to define the representative LD50 as a protective value while accounting for the distribution across multiple LD50s, the median of the lowest quartile was computed using only discrete LD50 values (omitting limit test data and range and confidence interval data). A detailed summary of the data compilation is available online on the NICEATM webpage dedicated to the collaborative modeling project (NTP 2020).
To obtain chemical structure information, CASRNs served as identifiers to search the U.S. EPA's DSSTox database hosted in the CompTox Chemicals Dashboard (Grulke et al. 2019; Richard and Williams 2002; U.S. EPA-NCCT 2014a; Williams et al. 2017) as well as other cross-checked online databases: ChemIDPlus (NIH 2016), PubChem (Bolton et al. 2008), and ChemSpider (Royal Society of Chemistry 2015). The collected structures were then processed using a standardization workflow developed for the purpose of generating QSAR-ready structures compatible with most modeling approaches (Mansouri et al. 2016a; McEachran et al. 2018). This workflow was first developed in KNIME (Berthold et al. 2008) for the CERAPP project and was also employed for CoMPARA (Mansouri et al. 2016a, 2020). The workflow is a multistep process that includes:
• a filter to remove inaccurate chemical representations, inorganics/metallo-organics, mixtures, and general representations that are not specific (Markush structures, repeating monomers, connection points);
• a standardization step for ring representations, isomers/mesomers, and other tautomeric forms; and
• a step to identify salts/solvents, counterions, and duplicate structures.
The workflow can be downloaded from GitHub or KNIME Hub as a KNIME workflow or used from the command line within a Docker container (https://github.com/NIEHS/QSAR-ready, https://kni.me/w/_iyTwvXi6U3XTFW1, https://hub.docker.com/r/kamelmansouri). After the standardization process, the final data set included 11,992 chemical structures amenable to the modeling project, encoded in SMILES and SDF formats. This data set, complete with chemical structures and representative LD50 values, was further processed to ensure that each chemical had only one representative value/call for each of the remaining modeling end points (i.e., the different binary and multiclass categories).
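The filtering step of the workflow can be illustrated with a deliberately simplified sketch. The real KNIME workflow does far more (tautomer and mesomer standardization, salt stripping, Markush detection); here only multi-fragment records, crude inorganics, and exact-string duplicates are dropped:

```python
# Toy sketch of the QSAR-ready filtering idea (not the actual KNIME
# workflow). Operates on raw SMILES strings only.

def qsar_ready_filter(smiles_list):
    seen, kept = set(), []
    for smi in smiles_list:
        if "." in smi:               # multi-fragment: mixture, salt, counterion
            continue
        if "C" not in smi.upper():   # crude inorganic filter (no carbon)
            continue
        if smi in seen:              # exact-string duplicate removal
            continue
        seen.add(smi)
        kept.append(smi)
    return kept
```

A real implementation would canonicalize structures (e.g., via InChIKeys) before deduplication rather than comparing raw strings.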
For categorical designations, limit test and range/confidence interval LD50 data were integrated wherever possible. As such, the number of chemicals with end point values/calls varies across the modeling end points (Table 3) based on whether chemicals had representative LD50 values and/or other data. If a chemical had both, the representative value was used to determine categorical calls rather than limit test data. Some data were usable only for certain end points and categories depending on the thresholds. For example, ranges that spanned multiple hazard categories were considered only for the binary (very toxic/nontoxic) end points and omitted from rendering a determination for hazard category assignment.
Training and evaluation sets: source, compilation, splitting. The final data set comprising 11,992 chemicals was split into training and evaluation sets consisting of 75% (8,994 chemicals) and 25% (2,895 chemicals), respectively. This process was performed semirandomly by ordering chemicals based on the five end points (categorical and continuous LD50 values) and partitioning every fourth record into the evaluation set accordingly. This approach was taken to ensure an equivalent distribution of the LD50 values and the different hazard categories between the training and the evaluation sets without supervised sampling of the chemicals based on structures (Figure 1A). The sources of the chemical structures were also kept equivalent between the two sets, with structures obtained from DSSTox, being the highest quality, representing over 75% of each set (Figure 1B).
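The every-fourth-record partitioning described above can be sketched as follows; the record structure and key function are illustrative, not the project's actual data format:

```python
# Sketch of the semirandom 75/25 split: order records by end point value so
# toxicity is spread evenly, then send every fourth record to evaluation.

def split_every_fourth(records, key):
    ordered = sorted(records, key=key)
    train = [r for i, r in enumerate(ordered) if (i + 1) % 4 != 0]
    evaluation = [r for i, r in enumerate(ordered) if (i + 1) % 4 == 0]
    return train, evaluation
```

Because the records are ordered by the end point before partitioning, both sets inherit approximately the same LD50 distribution without any structure-based sampling.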
The training set (Supplemental Material 1: TrainingSet.sdf, TrainingSet.xlsx, TrainingSet_Original.sdf) was made available to collaborators on the project webpage along with an explanation of the proper use of the data for the modeling steps (https://ntp.niehs.nih.gov/iccvam/methods/acutetox/model/qna.pdf). Modelers were encouraged to use the provided training set but were given the flexibility to make any modifications and apply postprocessing to suit their own modeling approaches. For example, modelers might choose to augment the data set with additional toxicity data or use undersampling approaches to reduce the number of low-potency chemicals to achieve a more balanced data set.
The empirical data (Supplemental Material 2: EvaluationSet.sdf, EvaluationSet.xlsx, EvaluationSet_Original.sdf) of the evaluation set were initially withheld from the project website so that NICEATM could perform an independent assessment of the validity of the models submitted by the collaborators. The chemical structures of this evaluation set were, however, provided to the participants to generate model predictions that would serve as an external validation set. These structures were provided as part of a much larger prediction set of chemical structures, as described below, ensuring that participants were not privy to the identities of the evaluation set chemicals during model development.
Prediction set: structure collection and curation. The list of evaluation set chemicals was contained within a comprehensive chemical list for which participants were asked to generate predictions using their optimized models. This prediction set encompassed lists of interest to the ICCVAM ATWG regulatory agencies who use acute oral LD50 data, as well as to stakeholders and other chemical screening programs, including ToxCast™/Tox21, EDSP, the Toxic Substances Control Act (TSCA), and a general list of substances on the market from the U.S. EPA CompTox Chemicals Dashboard (Dix et al. 2007; Grulke et al. 2019, p. 21; Kavlock et al. 2012; U.S. EPA-NCCT 2014b). The QSAR-ready KNIME workflow was applied to standardize the chemical structures and remove duplicates. After integration of the evaluation set, the final prediction set (see Supplemental Material 3) included 48,137 chemical structures (including the hidden evaluation set) and was made available for download on the project webpage (NTP 2020). Participating groups were encouraged to generate predictions for as many chemicals as possible.

Participants and Modeling Methods
Multiple modeling approaches were applied by the 35 international participating groups to predict the above-mentioned acute oral toxicity end points. The list of participating groups, with the abbreviations used in this manuscript to identify the different models, is provided in Table 4. The list of participants is provided in Supplemental Material 4. For transparency, the modelers were encouraged to use the provided training set and apply free and open-source tools to develop new models. However, the use of existing and proprietary commercial tools and/or other data was also permitted. The various molecular descriptors/tools and modeling approaches employed are summarized in Tables 4 and 5. For further information about the methods and detailed descriptions of modeling processes as provided by the participating groups, see https://doi.org/10.22427/NTP-DATA-002-00090-0001-0000-2 and the respective modeling references in Table 4.

Evaluation Procedure
The project timeline and guidelines for submission were published in an online document posted to the project webpage on the NICEATM website (NTP 2020). The guidelines included recommendations about the modeling process as well as detailed instructions about information to be included with each submission. Qualitative and quantitative evaluation procedures for the submitted models and predictions were based on the five OECD principles for QSAR modeling (OECD 2005, 2007, n.d.). Models and predictions were evaluated by an organizing committee of scientists from NICEATM, CCTE, and industry. Qualitative evaluation. The qualitative evaluation process assessed the transparency of the submitted models. The criteria used for this evaluation, shown in Table 6, satisfied four of the five OECD principles and added a category for general documentation. Participants who did not provide sufficient information for analysts to understand and interpret their results were asked either to provide additional clarification until all requirements were met or to withdraw/resubmit the model.
Quantitative evaluation. This step of evaluation satisfied the OECD principle of QSAR validation addressing appropriate measures of goodness-of-fit, robustness, and predictivity. To be fully inclusive for high-and low-throughput modeling approaches, the participating groups were not required to predict the entire prediction set but were encouraged to provide predictions for as many chemicals as possible. This approach was designed to ensure sufficient predictions for the 2,895 evaluation set chemicals that were hidden within the 48,137 structures of the prediction set. Although this flexibility could lead to models being evaluated for varying portions of the evaluation set, the results of the evaluation set were used for comparison purposes and to check for mistakes and mismatches so any corrections could be made prior to consensus modeling.
The quantitative evaluation considered only predictions within the applicability domain (AD) of the models. Models predicting the binary and multiclass end points were evaluated separately from models predicting discrete LD50 values, using appropriate statistical parameters. The parameters of the scoring functions included the three criteria from the OECD principles:
• Goodness of fit: statistics on the training set (Tr)
• Predictivity: statistics on the evaluation set (Eval)
• Robustness: balance between goodness of fit and predictivity
Based on these parameters, each model produced a score (S) ranging from 0 to 1 for predictions of chemicals within its AD.
This score was used in the consensus modeling step as a weighting scheme. The parameter multipliers (for the global and subparameter functions) were assigned based on importance to the evaluation procedure as established in the CoMPARA project (Mansouri et al. 2020):

S = 0.3 × (Goodness of fit) + 0.45 × (Predictivity) + 0.25 × (Robustness) [1]

Quantitative evaluation of binary and multiclass models. The performance of models for binary and multiclass end points was evaluated using statistical indices proposed in the literature (Consonni et al. 2009; Dearden et al. 2009; Todeschini et al. 2016). The indices used were calculated from a confusion matrix, which summarizes the numbers of observed and predicted classes in the rows and columns, respectively. For the current evaluation, classifications based on experimental LD50 data were used as truth. The classification parameters were defined using the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The performance measures calculated for consideration during the evaluation step included balanced accuracy (BA), specificity (Sp), and sensitivity (Sn).
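The scoring of a single classification model can be sketched as follows. The BA/Sn/Sp definitions and the Equation 1 weights come from the text; the robustness term is assumed here to be the balance 1 − |goodness of fit − predictivity|, following the CoMPARA convention, and should be checked against the paper's exact definition:

```python
# Sketch: scoring one binary classifier from confusion-matrix counts
# (tp, tn, fp, fn) on the training and evaluation sets.

def sn_sp_ba(tp, tn, fp, fn):
    sn = tp / (tp + fn)           # sensitivity (true positive rate)
    sp = tn / (tn + fp)           # specificity (true negative rate)
    return sn, sp, (sn + sp) / 2  # balanced accuracy

def model_score(train_counts, eval_counts):
    sn_tr, sp_tr, ba_tr = sn_sp_ba(*train_counts)
    sn_ev, sp_ev, ba_ev = sn_sp_ba(*eval_counts)
    gof = 0.7 * ba_tr + 0.3 * (1 - abs(sn_tr - sp_tr))
    prd = 0.7 * ba_ev + 0.3 * (1 - abs(sn_ev - sp_ev))
    rob = 1 - abs(gof - prd)  # assumed CoMPARA-style balance term
    return 0.3 * gof + 0.45 * prd + 0.25 * rob  # Equation 1 weights
```

A perfect model on both sets scores S = 1; imbalance between Sn and Sp or between training and evaluation performance pulls the score down.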
BA is given by:

BA = (Sn + Sp)/2

Sn, or True Positive Rate (TPR), is given by:

Sn = TP/(TP + FN)

and Sp, or True Negative Rate (TNR), is given by:

Sp = TN/(TN + FP)

For multicategory end points, these parameters were calculated for each category and then averaged. The balance between Sn and Sp was also included in the calculation of goodness of fit and predictivity. The three parameters of the scoring function S were calculated as follows:

Goodness of fit = 0.7 × (BA_Tr) + 0.3 × (1 − |Sn_Tr − Sp_Tr|)

Evaluation set predictivity = 0.7 × (BA_Eval) + 0.3 × (1 − |Sn_Eval − Sp_Eval|)

Quantitative evaluation of discrete LD50 prediction models. The performance of the discrete LD50 value predictions was evaluated using the experimental LD50 values from the embedded evaluation set. The commonly used parameters root mean square error (RMSE) and coefficient of determination (R²) were calculated for all predictions (Consonni et al. 2009; Todeschini et al. 2016).
RMSE = sqrt[(1/n) × Σ(ŷ_i − y_i)²]

R² = 1 − Σ(ŷ_i − y_i)² / Σ(y_i − ȳ)²

where ŷ_i and y_i are the estimated and observed responses of the ith element, respectively; ȳ is the mean of the observed responses; and n is the number of compounds in the corresponding set (n_Tr for the training compounds).
Environmental Health Perspectives 047013-6 129(4) April 2021
The three parameters of the scoring function S were calculated as follows: Evaluation set predictivity = R²_Eval
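The regression metrics used for the discrete LD50 evaluations can be sketched directly from their definitions:

```python
import math

# Sketch: RMSE and coefficient of determination over paired
# (observed, predicted) responses.

def rmse(y_obs, y_pred):
    n = len(y_obs)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(y_obs, y_pred)) / n)

def r_squared(y_obs, y_pred):
    mean = sum(y_obs) / len(y_obs)
    ss_res = sum((p - o) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean) ** 2 for o in y_obs)
    return 1 - ss_res / ss_tot
```

A model that always predicts the mean of the observed values gets R² = 0; perfect predictions give R² = 1 and RMSE = 0.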

Consensus Modeling
After being evaluated according to the defined strategy, each model was assigned a score (S) for the predictions within its AD. This score was used in the consensus modeling step as a weighting scheme to combine the predictions from all the submitted models and produce a single consensus prediction for each end point. The majority rule was applied for the binary and multiclass end points, whereas a weighted average was used to generate the consensus predictions for the discrete LD50 end point, as detailed below. This approach resulted in each chemical in the prediction set being assigned a consensus prediction for each of the five end points. Consensus for binary and multiclass end points. For each chemical in the prediction set, the consensus category/call was decided by the weighted majority rule: the class with the highest average score among the models predicting it. This average score was calculated excluding the models that did not provide a prediction within the AD for the specific chemical.
Consensus for discrete LD50 predictions. For each chemical in the prediction set, the consensus predicted LD50 value was calculated as the average of the within-AD predictions from the different models, weighted by their S scores.
The predicted consensus value (C) of chemical i was calculated as:

C_i = Σ_j (w_j × P_j), j = 1 … N

where N is the number of models that provided predictions within the AD for chemical i, and P_j is the predicted LD50 from each model. The weights (w), summing to 1, are calculated from the model scores as:

w_j = S_j / Σ_k S_k, k = 1 … N

The consensus model predictions for each end point were first evaluated using the same evaluation set used to evaluate the individual models. The defined ADs of the different models were taken into consideration to investigate the accuracy of the final predictions generated by the consensus model. Analyses of the coverage trends and concordance among the individual models' predictions were also conducted as part of the evaluation of the consensus models.
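Both consensus rules can be sketched as follows, with each model contributing its prediction and its score S only when the chemical falls inside that model's AD:

```python
# Sketch of the two consensus rules: a score-weighted average for the
# discrete LD50 end point, and a weighted majority rule for categories.

def consensus_ld50(predictions):
    """predictions: list of (predicted_ld50, model_score) pairs within AD."""
    total = sum(s for _, s in predictions)
    return sum(p * s / total for p, s in predictions)

def consensus_category(votes):
    """votes: list of (predicted_class, model_score) pairs within AD.
    Returns the class with the highest average score among its voters."""
    by_class = {}
    for cls, s in votes:
        by_class.setdefault(cls, []).append(s)
    return max(by_class, key=lambda c: sum(by_class[c]) / len(by_class[c]))
```

Note that the category rule averages scores per class rather than simply counting votes, so one high-scoring model can outweigh several weak ones.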
WoE approach to combine all models. The consensus modeling combined the submitted single models for each of the five end points, resulting in five consensus predictions for each chemical in the prediction set. Because the respective models for the five end points were trained separately on the data set prepared for each end point (Table 3), the consensus predictions of the five end points could disagree for an individual chemical. Examples of discordant predictions were especially likely between the two multiclass end points (U.S. EPA and GHS categorizations), which are based on multiple LD50 thresholds with overlapping ranges. Discrete LD50 value predictions could also be slightly inconsistent with the predicted categories. To produce final consensus predictions that were consistent across all five end points for each chemical, it was important to apply correcting rules for the outlier predictions that over- or underestimated a specific end point. Thus, a WoE approach was developed to optimize the consensus based on the majority rule and obtain more robust predictions. The fact that there was an odd number of end points and consequent predictions (five) also helped when applying the majority rule to determine the predictions in agreement and minimize the needed degree of correction. This WoE approach also served to combine all five consensus model results into a single prediction per chemical for acute oral toxicity.
In an effort to quantify the inherent variability in the animal data used in this work and to determine a confidence interval (CI) representing the uncertainty that should accompany experimentally derived LD50 values, we leveraged our large compendium of rat acute oral LD50 values to compute a 95% CI across standard deviations (SDs) for chemicals having multiple point-estimate LD50 values. SDs for 1,120 chemicals with at least three independent LD50 values (excluding limit test data) were bootstrapped 1 million times, and the result provided a representative SD that takes into account the range of SDs in the data set.
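The bootstrap described above can be sketched as follows; the input is one SD per chemical (for chemicals with at least three discrete LD50 values), and the 10,000 default iterations here merely stand in for the 1 million used in the actual analysis:

```python
import random
import statistics

# Sketch: bootstrap the per-chemical SDs to get a representative SD and a
# 95% CI. n_boot=10_000 is a stand-in for the paper's 1 million iterations.

def bootstrap_sd(per_chemical_sds, n_boot=10_000, seed=0):
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(per_chemical_sds, k=len(per_chemical_sds)))
        for _ in range(n_boot)
    ]
    means.sort()
    lo = means[int(0.025 * n_boot)]   # 2.5th percentile
    hi = means[int(0.975 * n_boot)]   # 97.5th percentile
    return statistics.fmean(means), (lo, hi)
```

The representative SD can then be propagated as an uncertainty band around any experimentally derived LD50 when comparing model predictions against in vivo results.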

Additional Evaluation of the Consensus Models
As a further evaluation step for the consensus predictions, highly curated experimental in vivo acute oral toxicity data were used to assess concordance between predictions and experimental outcomes. Briefly, a data set was generated in tandem with studies of variability in acute oral toxicity animal data (Nelms et al. 2020; Ly Pham et al. 2020). This data set was limited to chemicals associated with multiple LD50 values (including discrete values, limit tests, and ranges/CIs) as well as additional entries pulled from the European Chemicals Agency (ECHA) database for risk assessment (ECHA 2020). The chemicals in this data set were not used in the original CATMoS modeling project. After an initial consistency analysis between the two data sets, thorough manual curation resulted in a total of 916 chemicals with at least two discrete LD50 values to be used for the evaluation of the predicted LD50 values, and a data set of 1,323 chemicals with at least two LD50 entries (including discrete LD50 values, limit tests, ranges, and CIs) to be used for the evaluation of the predicted binary and multiclass categories.
To adapt the curated in vivo experimental data to the five end points studied in this project, the raw formatting was processed using a KNIME workflow to convert the entries into a computer-readable format. For each chemical (unique CASRN), the replicate data were processed to assign hazard categories consistently across replicates. Then, to produce high-confidence data for model evaluation, the entries were grouped by CASRN to determine the final category based on the majority rule where the agreement between the different entries exceeded a defined threshold. For each of the multiclass end points, the agreement was calculated as a concordance percentage, and the threshold for assigning a call was set to 75%. For example, if chemical A was associated with four entries, three of them in U.S. EPA Category II and one in U.S. EPA Category III, the concordance would be 75%, and the overall assignment would be U.S. EPA Category II. For discrete LD50 values, the median of only the discrete LD50 values was taken (within a 1.5 log10 SD threshold to account for outliers). The total number of chemicals having in vivo consensus data, per end point, is summarized in Table 7.
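The concordance rule in the worked example above can be sketched as:

```python
# Sketch: assign a consensus hazard category from replicate calls only when
# the most frequent category reaches the 75% concordance threshold;
# otherwise no call is made (None).

def consensus_call(categories, threshold=0.75):
    counts = {}
    for c in categories:
        counts[c] = counts.get(c, 0) + 1
    best = max(counts, key=counts.get)
    concordance = counts[best] / len(categories)
    return best if concordance >= threshold else None
```

With the chemical A example (three Category II entries, one Category III), the concordance is exactly 75%, so Category II is assigned; a 2-2 split would yield no call.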

Generalization of the Consensus and Implementation in OPERA
To apply the consensus models beyond the initial prediction set, the combined predictions were used to train generalized models capable of replicating the original consensus. This was achieved by applying a weighted k-nearest neighbor (kNN) approach to fit the classification models based on the majority vote of the nearest neighbors (Cover and Hart 1967; Kowalski and Bender 1972; Todeschini et al. 2015). This approach has the advantage of resembling read-across, a broadly accepted data-gap-filling tool within regulatory agencies. kNN also fulfills the OECD principles for QSAR modeling, given its unambiguous algorithm, high accuracy, and interpretability.
To increase the sensitivity of the models for more conservative predictions, all toxic chemicals from the prediction set (LD50 ≤ 500 mg/kg, i.e., U.S. EPA Categories I and II) were included, whereas less toxic chemicals were included with an 85% concordance threshold among the predicting models for the binary models (VT and NT) and 75% for the remaining modeled end points. Each one of the data sets was divided semirandomly into training and test sets representing 75% and 25%, respectively.
PaDEL (version 2.2) and CDK2 (CDK version 2.0) were first used to calculate two-dimensional molecular descriptors. Because PaDEL uses a previous version of CDK (1.5), duplicate descriptors were excluded. The union of the PaDEL descriptors (1,444) and the CDK2 descriptors (287) resulted in a total of 1,616 variables that were later filtered for low variance. Subsequently, kNN was coupled with genetic algorithms (GAs) to select a minimized optimal subset of molecular descriptors (from the combined PaDEL-CDK list) for calculating the similarity in the kNN model based on the Euclidean distance. GAs start with an initial random population of binary vectors representing the presence or absence of molecular descriptors. An evolutionary process is then simulated to optimize a defined fitness function in 5-fold cross-validation, in which new vectors are created by combining the binary vectors of the initial population with genetic operations such as crossover and mutation (Ballabio et al. 2011; Leardi and Lupiáñez González 1998).
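The GA-kNN coupling can be illustrated with a deliberately small sketch. This is not the project's implementation: leave-one-out 1-NN accuracy stands in for the 5-fold cross-validated fitness, the data are tiny synthetic vectors rather than PaDEL/CDK descriptors, and the population and generation counts are toy values:

```python
import random

# Toy sketch of GA-driven descriptor selection for a 1-NN classifier.
# Chromosomes are binary masks over descriptors; fitness is leave-one-out
# 1-NN accuracy using Euclidean distance over the selected descriptors.

def loo_1nn_accuracy(X, y, mask):
    idx = [i for i, m in enumerate(mask) if m]
    if not idx:
        return 0.0
    hits = 0
    for i in range(len(X)):
        dists = [(sum((X[i][k] - X[j][k]) ** 2 for k in idx), j)
                 for j in range(len(X)) if j != i]
        _, nearest = min(dists)
        hits += y[nearest] == y[i]
    return hits / len(X)

def ga_select(X, y, n_desc, pop=10, gens=10, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_desc)] for _ in range(pop)]
    for _ in range(gens):
        scored = sorted(population, key=lambda m: loo_1nn_accuracy(X, y, m),
                        reverse=True)
        parents = scored[: pop // 2]          # elitist selection
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_desc)    # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:            # mutation
                k = rng.randrange(n_desc)
                child[k] = 1 - child[k]
            children.append(child)
        population = parents + children
    return max(population, key=lambda m: loo_1nn_accuracy(X, y, m))
```

The elitist selection keeps the best masks across generations, so informative descriptors tend to persist while noisy ones are pruned from the distance calculation.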
This procedure was applied separately for each of the modeled end points. The best models were selected and implemented in combination using the WoE approach in the free, standalone, open-source QSAR modeling suite OPERA (Mansouri et al. 2016b). Both OPERA's global and local AD approaches, as well as the accuracy estimation procedure, were applied to all predictions. The global AD is a Boolean index based on the leverage approach for the whole training set, whereas the local AD is a continuous index in the 0-1 range based on the most similar chemical structures from the training set (Sahigara et al. 2012). The extended, similarity-based predictive approach as well as the WoE consensus are implemented as the CATMoS consensus model in the OPERA application (version 2.5). OPERA can be downloaded from the National Institute of Environmental Health Sciences GitHub repository (https://github.com/NIEHS/OPERA) and used locally via a command-line interface or a user-friendly graphical interface. The use of this standalone application facilitates the generation of CATMoS predictions by providing different user input options. The simplest way for the user to input chemicals is via a text file with chemical identifiers such as CASRN, DTXSID (U.S. EPA's DSSTox database public identifier), or InChIKey. In fact, OPERA contains an internal database with the complete list of ∼800,000 DSSTox chemical structures (periodically updated) stored in QSAR-ready format and ready to use for prediction with any of its models. In addition, users can provide their own chemical structures in SMILES or SDF formats as described in Mansouri et al. (2018). Since version 2.6, OPERA has been equipped with an embedded version of the above-mentioned structure standardization workflow and can generate QSAR-ready structures prior to prediction.
The local nearest neighbors-based and the global leverage-based AD approaches implemented in OPERA help the user determine whether their chemicals are within the model's interpolation space, where it is safe to generate predictions. In fact, a local AD index of 1 means that the chemical being predicted was one of the 48,137 chemicals of the prediction set and that the initial combined predictions of the single end point models were used to make the final consensus call.
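As an illustration of such a local AD index, the sketch below scores a query structure by its mean similarity to its k nearest training neighbors, scaled to the 0-1 range. The inverse-distance decay is an assumption chosen for illustration; OPERA's actual index is defined differently in detail.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_ad_index(X_train, x_query, k=5):
    """Continuous 0-1 local AD index: mean inverse-distance similarity of the
    query to its k nearest training neighbors.  The index approaches 1 only
    for near-duplicates of training structures and decays toward 0 as the
    query moves away from the training space."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist, _ = nn.kneighbors(np.asarray(x_query).reshape(1, -1))
    return float(np.mean(1.0 / (1.0 + dist)))

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 10))                  # stand-in descriptor matrix
inside = local_ad_index(X_train, X_train[0])          # query from the training set
outside = local_ad_index(X_train, X_train[0] + 10.0)  # query far outside
print(round(inside, 3), round(outside, 3))            # inside index >> outside index
```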

Submitted Consortium Models: Prediction Review
The 35 participating groups submitted a total of 139 models across the five end points, as summarized in Table 8. Each group submitted predictions for the full or partial prediction set for at least one end point. The two binary end points were predicted by the highest number of models; fewer submitted models predicted the multiclass end points and the LD50 value end point. The number of submitted models was likely an indication of the level of difficulty of each end point, a finding consistent with the previous collaborative projects, CERAPP (Mansouri et al. 2016a) and CoMPARA (Mansouri et al. 2020). The main difficulties that participants faced while modeling these end points were generally related to the skewed nature of the training set, a challenge encountered in many toxicity modeling projects. As shown in Figure 1, most of the data represented nontoxic and low-toxicity chemicals (i.e., high LD50 values), which was also reflected in the binary and multicategorical end points. Nevertheless, 11 of the 35 participating groups submitted models for all five requested end points. The full list of submitted prediction files and model details is available at https://doi.org/10.22427/NTP-DATA-002-00090-0001-0000-2. A KNIME workflow (https://doi.org/10.22427/NTP-DATA-002-00090-0001-0000-2) was used to process the predictions from all models and combine them into a single file per end point for further evaluation and analysis.

Results of Qualitative and Quantitative Evaluation
All submitted models were reviewed and evaluated by the organizing committee using the criteria described above. The qualitative evaluation ensured the clarity of the submitted information, confirmed that all models fulfilled the OECD requirements for computational models as well as the goals of this project, and provided a basis for using the predictions in the subsequent consensus modeling (i.e., computing the S score for the weighting scheme). The subsequent quantitative evaluation step assessed the quality of the predictions prior to the consensus modeling, which was the main goal of the project.
This evaluation was not intended to provide comparisons between the models, especially given the uneven coverage of the prediction set (and consequently the hidden evaluation set) due to AD differences. In fact, the total number of provided predictions per model and the predictions within the AD varied substantially depending on the type of model and the employed AD approach. Thus, the quantitative evaluation served mainly as a first checkpoint to reveal data mishandling issues that might have occurred during the initial modeling steps. These issues included mismatches between structures, identifiers, and associated data, as well as misinterpretation and inversion of the different data fields/end points, either of which could lead to a severe decrease in prediction accuracy. Models with such issues were returned to participants, who could withdraw or replace their submissions with updated models. The final evaluated submissions were used to generate summary statistics (Figure 2A; Supplemental Material 5 with detailed parameters).
Most of the models achieved high predictivity scores on the evaluation set (Figure 2A). The relatively low score for the Dalian University of Technology (DUT) binary models reflected the fact that the submitted predictions covered only small portions of the evaluation set, representing only one of the two classes for both modeled end points (VT and NT); this led to a BA of 0.5 that was not directly informative as to the real predictivity of this model. To be inclusive of low-throughput approaches, coverage was not included as a qualifying parameter during evaluation and did not affect quantitative scores; in any case, models with limited coverage had only a marginal influence on the consensus calls and statistics for a prediction set of over 48,000 structures. The remaining models covered most of the prediction set, with median coverage ranging from ∼41,000 to ∼44,000 chemicals (out of 48,137) per end point. This showed that most of the employed AD approaches were rather permissive. The coverage of the different models on the prediction set for the five end points is summarized in Figure 2B and in more detail in Supplemental Material 6.
In general, the binary and multiclass models achieved higher scores (median S scores ranging from 0.74 to 0.82) than the discrete LD50 prediction models (median S score of 0.66). This finding was expected for such a challenging end point, with high variability, a low number of toxic compounds, and different data sources leading to a decrease in precision. As noted with the total number of submitted models per end point, the relatively higher statistics of the binary models confirmed that multiclass end points are more difficult to model. The median S scores for VT and NT reached 0.80.

Consensus Modeling
Prior to combining the predictions into consensus calls for evaluation, it was important to check the coverage and concordance among the models. Figure 3A shows that all chemicals in the prediction set were predicted by at least 10 models. Moreover, most chemical structures were predicted by about 20 models for the multiclass and LD50 value end points and at least 25 models for the binary VT and NT end points. This high coverage for all five end points provided a solid basis for the consensus modeling step and strengthened the statistical relevance of the combined predictions. (Modeling groups along the x-axis of Figure 2A and B are defined in Table 4.) The concordance among the models was an equally important criterion for combining the predictions into consensus calls. In fact, it was demonstrated during the previous collaborative modeling projects, CERAPP and CoMPARA, that higher concordance among numerous models built using different modeling approaches corresponded to higher accuracy (Mansouri et al. 2016a, 2020). Figure 3B shows that the concordance among the binary, multiclass, and LD50 value models was about 0.8 for most of the prediction set chemicals. This high concordance simplified the process of generating consensus calls for the prediction set, especially for the binary and multiclass models, for which the consensus classification was largely driven by the majority rule. The exceptions were chemicals with cross-model concordance near 0.5, for which only a subset of all models would drive the classification. In sum, based on the analyses of coverage and concordance between models, it can be concluded that the data were amenable to combining the different model predictions into consensus predictions.
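The per-chemical concordance examined here is simply the fraction of models agreeing with the majority call; a minimal sketch:

```python
import numpy as np

def concordance(preds):
    """Fraction of models agreeing with the majority call for one chemical."""
    vals, counts = np.unique(preds, return_counts=True)
    return counts.max() / counts.sum()

# 25 hypothetical binary models predicting one chemical: 20 "toxic", 5 "nontoxic"
preds = np.array([1] * 20 + [0] * 5)
print(concordance(preds))  # -> 0.8
```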
Consensus modeling step 1: combining predictions per end point. The predictions from the five modeled end points were first combined independently based on the defined rules. As described, the models had different contributions in terms of chemical coverage, and each prediction from each model was associated with different weights across the prediction set. After consensus calls were generated for each chemical in the prediction set, the same evaluation procedure was applied for each one of the five end points. The resulting statistical details are reported in Tables 9-12 under the main heading "Step 1: End point." The statistics on the consensus predictions for each of the five end points followed the same general trend as for the single models. The two binary models, VT and NT, showed the highest accuracy for both training and evaluation sets. As the end points increased in precision, going from binary to four or five categories and ultimately to discrete LD 50 value prediction, the performance of the consensus predictions decreased accordingly.
Consensus modeling step 2: WoE approach. The second step of consensus modeling was to generate a single consistent acute oral toxicity prediction per chemical by reconciling the five independent consensus end point predictions. The WoE approach combined predictions from the five end points based on a majority rule. To use binary and multiclass end points with different thresholds, the overlapping ranges of LD 50 categories were used as bins, resulting in a total of seven bins (Figure 4).
To extend the range of the discrete LD 50 values, the inherent animal data variability was considered. The resulting statistical details are reported in Tables 9-12 under the main heading "Step 2: WoE." These statistics represent the evaluation of the consensus model for that end point following the weight of evidence integration of all consensus models across the five end points.
After quantifying the inherent variability based on the bootstrap analysis, the resulting margin of ±0.3 log10(mg/kg) was considered the 95% CI for acute oral LD50 values. This approach not only quantified and defined a confidence margin for the experimental values but also informed an acceptable LD50 range to apply around LD50 predictions.
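Applying this margin to a point estimate is a one-line transformation on the log10 scale; the sketch below uses a hypothetical predicted LD50 of 320 mg/kg:

```python
import math

LOG_MARGIN = 0.3  # 95% CI half-width on log10(mg/kg), per the bootstrap analysis

def ld50_ci(ld50_mg_kg):
    """(lower, upper) bounds of the +/-0.3 log10 interval around an LD50."""
    log_val = math.log10(ld50_mg_kg)
    return 10 ** (log_val - LOG_MARGIN), 10 ** (log_val + LOG_MARGIN)

# Hypothetical point estimate of 320 mg/kg
low, high = ld50_ci(320.0)
print(round(low), round(high))  # -> 160 638
```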
This CI was applied to compute a range for every predicted LD50 value. Once the winning bin was determined based on the maximum overlap among the five end points, corrections were made to the outlier prediction(s) by adjusting the corresponding category for the multiclass predictions. For the discrete LD50 value predictions, if the prediction did not fall within the range of the winning bin, a new LD50 was calculated depending on the reach of the extended LD50 CI range. To explain this further, an example is illustrated in Table 13, with corresponding prediction ranges represented by arrows in Figure 4.
To automate this process computationally, the concept was translated into an algorithm that converted the bins within each end point's prediction range to ones and the remaining bins to zeros. The winning bin of the WoE approach was determined by summing across end points and selecting the bin with the maximum total. In this example, the 50-300 mg/kg bin, with a total of five overlapping end point ranges, is the winner. This means that only the LD50 value requires adjustment to ensure that the predicted value falls within the WoE-identified "correct" bin range. In this case, the new LD50 is calculated by taking the average of the lower CI (160 mg/kg) and the upper threshold of the winning bin (300 mg/kg), resulting in an adjusted LD50 of 230 mg/kg. In general, the rule for adjusting an LD50 point estimate that does not fall within the winning bin is to average the covered threshold and the corresponding CI boundary. For example, if the winning bin here had been 500-2,000 mg/kg, the adjusted LD50 would be (500 + 613)/2 = 556.5 mg/kg. If the CI spanned the entire winning bin or was completely nonoverlapping, the adjusted LD50 would be the bin's center. In cases where there was more than one winning bin, the most conservative bin was selected.
The statistics of the WoE-adjusted predictions were recalculated and are summarized in Table 9 (Step 2: WoE). In many cases, the calculated parameters did not show a significant difference. However, the performance for the lower categories (highly toxic) of the U.S. EPA and GHS multicategory end points increased significantly, with the WoE approach demonstrating higher sensitivity (Tables 11-12). This improvement is in part because the available data were skewed toward the upper categories (less toxic). Thus, the difference was more noticeable in categories with a lower number of data points. All results of the consensus analysis are available in Supplemental Material 7.

Table 9. Evaluation parameters for the LD50 consensus predictions after the WoE approach.
Table 10. Evaluation parameters for the VT and NT consensus predictions after the WoE approach.

Generalization of the Consensus and Implementation in OPERA
The two-step approach for combining predictions from the 139 submitted models resulted in a robust consensus model that covered the entire prediction set of 48,137 chemical structures. To make the model applicable for further screening of new chemicals, an additional step was required. A weighted-kNN modeling approach was implemented in OPERA (version 2.5) to mimic the initial consensus predictions and generate new ones. This was achieved by training extended models based on the existing experimental data and predictions with high concordance. To facilitate the training process, the five end points were processed separately. Then, the WoE approach was similarly applied to the generated predictions to make a final consistent consensus call. The prefiltered PaDEL and CDK descriptors were used in a GA-kNN procedure to select the most informative variables in a supervised, end point-dependent approach. The resulting minimized numbers of descriptors selected during the training process and performance of the best kNN models are summarized in Table 14.
The descriptors were selected based on the importance ranking performed by the GA during multiple independent runs of generation optimization. The selection of the best models was simultaneously based on maximizing the performance in 5-fold cross-validation as well as minimizing both the number of meaningful descriptors and the model complexity.
The performance statistics of the models summarized in Table 14 showed a high level of accuracy in training set cross-validation in terms of BA and Q2. This high performance was complemented, and confirmed, by performance on the test set as further validation of the models. The balance, stability, and robustness of the five end point models were sufficient to simulate the original combined predictions without overfitting the initial set. Thus, the resulting models could be combined via the WoE approach and applied to generate predictions for new chemicals that have sufficient similarity to the original prediction set.

Additional Evaluation Using a Highly Curated Data Set
Preparation of the curation data set and consistency analysis. Prior to combining the experimental data for chemicals with multiple entries from the data collected for the CATMoS project and the ECHA data set to produce a curated data set, a review of the different LD50 values revealed a number of inconsistencies between the two data sets. This finding can be partly attributed to the variability of animal data but, in some cases, was also due to errors that may have been introduced and propagated during reporting, publishing, or data retrieval. To estimate the disagreement between replicate LD50 studies per chemical, the different LD50 values and binary/multiclass calls were compared using a representative from each of the two data sets (as described above). As shown in Table 15, the discordance was highest for the very toxic compounds, as represented by the Sn of the VT end point. This is because 25 of the 38 chemicals considered very toxic in the CATMoS data set are associated with an LD50 > 50 mg/kg in the ECHA data set. Discordance between the Sn and Sp parameters was less apparent for the NT end point. However, a closer look at the confusion matrix generated during the calculation of these parameters revealed a disagreement on a total of 126 chemicals, with 109/126 classified as toxic in the CATMoS data but not in the ECHA data set. A similar level of disagreement was observed for the U.S. EPA and GHS categorizations; the confusion matrices in Tables 16 and 17 showed that most of the discordant classifications differed by one category. Even for the agreeing categories, there was disagreement between the corresponding LD50 values that increased with the wider-range categories. These inconsistencies could be due to a number of factors, including the data sources and the data interpretation and processing.
For example, in the ECHA data set, CASRN 14489-75-9 was associated with a range of 50-300 mg/kg, which placed it outside the "Very Toxic" class. However, in the CATMoS data, the same CASRN is associated with a unique point estimate of 50 mg/kg and was consequently classified as very toxic.
To help with the detection of outlier entries and the assessment of the data, CATMoS consensus predictions were also considered during the analysis. It was expected that the predictions would diverge even further from the ECHA data set due to the differences with the CATMoS experimental data (i.e., the data used to train the CATMoS models), as noted above. However, an examination of the LD50 values revealed several chemicals that were associated with low LD50 values in ECHA but were predicted as less toxic by CATMoS with high concordance among the submitted models. A closer look at 28 of these chemicals with available source toxicity reports on the ECHA website revealed a number of reporting errors in the source database, such as unit conversion (grams vs. milligrams), typos, decimal misplacement ("," vs. "."), and doses estimated using read-across, among others. Such findings highlighted how robust and highly concordant the CATMoS predictions are, because they helped to identify where the compiled in vivo data had typographical errors. These types of errors most likely affect other chemicals in the data set and therefore affect the statistics in Tables 15-17 even further. For this reason, a CI was applied to all CATMoS predictions that covered the observed range of the animal data variability (i.e., estimated at ±0.3 log10 units). When this range was applied to the predictions, 96.6% of the ECHA in vivo LD50 values fell within the confidence interval of the CATMoS predictions. This could also be considered an indication that empirical LD50 values should consistently be accompanied by a CI. This assessment step also revealed certain CATMoS predictions that were in high disagreement with the original CATMoS empirical data but turned out to agree with the ECHA in vivo data. For example, CASRN 108-91-8 was predicted by CATMoS to have an LD50 of 352 mg/kg and reported on the ECHA website as 432 mg/kg (https://echa.europa.eu/briefprofile/-/briefprofile/100.003.300). However, in the experimental data underlying the CATMoS training set, the LD50 of this chemical was only 11 mg/kg, which was also the value reported for the chemical on the U.S. EPA CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID1023996#toxicity-values). This chemical, although present in the training set and therefore seen by the fitting algorithms during the learning process, was consistently predicted differently by most models based on its structural features and was initially counted as an inaccurate prediction in the evaluation process.

Table 11. Evaluation parameters for the U.S. EPA category consensus predictions after the WoE approach.

Table 12. Evaluation parameters for the GHS category consensus predictions after the WoE approach.
The fact that the modeling algorithms, informed by the other chemicals in the data set, predicted a value closer to the true value identified from the curated in vivo data set than to the possibly erroneous value that made its way into the training set shows that the size of the data set and the approaches used can overcome outlier values and that the predictions are robust.
The noted differences and inconsistencies in the collected experimental values required a deeper manual curation effort to remove the outlier entries for each chemical that could be erroneous. The resulting highly curated multientry data with an acceptable degree of concordance were used as an additional evaluation set (Supplemental Material 8).
Assessment of the consensus and WoE predictions vs. curated in vivo data. The curated in vivo data set was a subset of chemicals with multiple (at least two unique) LD50 experimental data entries that was used as an external set to assess the accuracy of the binary, multiclass, and LD50 value CATMoS predictions. The resulting classification and regression statistical parameters are summarized in Tables 18-21. As for Tables 9-12 above, the entries under the main heading "Step 1: End point" represent the evaluation of the consensus model developed for the specific end point, whereas entries under the main heading "Step 2: WoE" represent the evaluation of the consensus model for that end point following the WoE integration of all consensus models across the five end points. The statistical parameters in all four tables show high accuracy for all five end points, and these metrics are in fact higher than the results on the evaluation set. A similar observation was reported during the evaluation of the previously mentioned collaborative projects, CERAPP and CoMPARA (Mansouri et al. 2016a, 2020), which showed an increased agreement between the predictions of the consensus models and the evaluation data as the number of concordant sources increased. This finding can also be interpreted as an indication of the lower quality and noise in the single-source data points, which cause a decrease in performance, especially between training and evaluation sets, as noted earlier (Tables 9-12).
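For reference, the classification parameters reported in these tables (Sn, Sp, and BA) follow the standard confusion-matrix definitions:

```python
import numpy as np

def binary_stats(y_true, y_pred):
    """Sensitivity (Sn), specificity (Sp), and balanced accuracy (BA)
    from binary calls (1 = toxic, 0 = nontoxic)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    return sn, sp, (sn + sp) / 2

# Hypothetical calls for six chemicals
sn, sp, ba = binary_stats([1, 1, 0, 0, 0, 1], [1, 0, 0, 0, 1, 1])
print(round(sn, 2), round(sp, 2), round(ba, 2))  # -> 0.67 0.67 0.67
```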
There was a slight decrease in the statistical parameters for LD50 predictions after the application of the WoE approach (Table 18). This could be because some of the newly assigned LD50 values were based on semi-arbitrary calculations within the winning bin. The adjustment was intended to place the LD50 value within the correct category, but the adjusted value can end up farther from the experimental value than the initially predicted LD50 on the other side of the category threshold. Table 19 showed similar statistics before and after the application of the WoE approach, implying that the binary end point consensus predictions did not require substantive adjustment. However, higher predictive performances for the multiclass models were noted after the application of the WoE adjustments (Tables 20-21). This was particularly clear for the categories representing the most toxic compounds and indicated that the WoE consensus predictions were more conservative than the initial multiclass predictions. The GHS WoE (Table 21) showed more balanced predictivity/sensitivity than the U.S. EPA WoE (Table 20), which showed a drop in sensitivity for U.S. EPA Category IV similar to that observed in the initial evaluation (Table 11). This was mostly due to the lower number of chemicals tested in U.S. EPA Category IV (>5,000 mg/kg) in comparison with GHS Category 5 (>2,000 mg/kg).
In addition to the demonstrated high performances, this evaluation showed the utility of CATMoS both for providing accurate predictions for new chemicals and for revisiting and filtering existing data for additional curation. Regulatory agencies could benefit both from applying the model to obtain new predictions with associated confidence intervals and from revisiting previous decisions for additional assessment of the data and the predictions. Both of these applications are currently being considered by members of ICCVAM and industry stakeholders.

Limitations
The predictive ability of CATMoS is limited by the quality of the data used to train and evaluate the contributing models. In particular, the lack of metadata for the collected training and evaluation sets makes it difficult to delineate study differences or sources of variability in the in vivo assays. Although the predictions were able to identify specific erroneous data points, uncertainty remains regarding the overall reliability of in vivo data, which is still not well characterized. As discussed, the skewness of the in vivo data limited the predictivity of the single models and the initial consensus for the highly toxic chemicals. However, the data still contained more than 400 chemicals in the very toxic class (<50 mg/kg), and as demonstrated in Tables 11-12, these limitations at the lower end of the multiclass prediction models were largely remediated by the final WoE approach.
Similar to most QSAR/QSPR models, CATMoS can be applied only to single organic compounds. Mixtures of organic chemicals should be studied separately. However, to accommodate mixtures of multiple compounds, the GHS provides an additivity rule to help classify mixtures that can use CATMoS predictions as input (http://www.unece.org/fileadmin/DAM/trans/danger/publi/ghs/GHS_presentations/English/health_env_e.pdf).
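For illustration, the GHS additivity rule estimates a mixture's acute toxicity estimate (ATE) from its ingredients via 100/ATE_mix = Σ(C_i/ATE_i), where C_i is the concentration (%) of relevant ingredient i; CATMoS-predicted LD50 values could serve as the ATE_i inputs. The mixture composition below is hypothetical:

```python
def ate_mix(components):
    """GHS additivity estimate for a mixture:
    100 / ATE_mix = sum(C_i / ATE_i), where C_i is the concentration (%)
    of relevant ingredient i and ATE_i its acute toxicity estimate (mg/kg),
    e.g., a CATMoS-predicted LD50."""
    return 100.0 / sum(c / ate for c, ate in components)

# Hypothetical mixture: 60% of a 500 mg/kg ingredient + 40% of a 2,000 mg/kg one
print(round(ate_mix([(60.0, 500.0), (40.0, 2000.0)])))  # -> 714
```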
Additionally, as most molecular descriptors are developed for small- and medium-sized molecules, CATMoS and other QSAR models cannot process large biomolecules, long polymeric chains, or nanomaterials. To help identify and use the most adequate chemical structures for predictions and to avoid unpredictable substances, OPERA users wishing to generate CATMoS predictions can use either of two input options:
• Provide a text file with a list of chemical identifiers (CASRN, DTXSID, InChI) and let OPERA pull the correct QSAR-ready structure from its database of 830,000 highly curated DSSTox chemicals; or
• Provide their own structures and apply the embedded standardization workflow to generate QSAR-ready structures prior to prediction.
A possible limiting aspect of the OPERA standalone application is that it must be installed locally, requiring access to the user's computer operating system and hardware to perform calculations. Users can, however, avoid this process and access predictions by visiting the Integrated Chemical Environment (ICE) dashboard of the NTP (https://ice.ntp.niehs.nih.gov/) and querying its internal database of predictions for the DSSTox chemicals. These predictions will also be made available on the U.S. EPA CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard/).

Conclusion
This project was organized by the ICCVAM ATWG as an implementation of the strategic roadmap for the development and validation of new alternative methods for acute oral toxicity testing. The resulting CATMoS models provide consensus predictions for 48,137 chemicals of interest to regulatory agencies and stakeholders. CATMoS combined contributions from 35 internationally renowned groups in the field of in silico modeling. This is the third such collaborative project for most of the consortium members, following the endocrine disruption modeling efforts CERAPP and CoMPARA. However, the uniqueness of CATMoS comes from its fit-for-purpose end points, which resulted from the upfront participation of regulatory agencies. Thus, the needs of the ICCVAM ATWG member agencies and international partners were considered when defining the end points for predictive modeling, ensuring regulatory interest in potentially using the predicted acute oral toxicity models and supporting the agencies' interest in alternative methods.
This early-stage involvement of regulators not only identified the five modeled end points representing the different uses of the data but also facilitated open stakeholder dialog during the project's workshop held at the National Institutes of Health in Bethesda, Maryland (Kleinstreuer et al. 2018). At the workshop, participating groups presented their approaches to overcoming the different challenges of the project, such as the skewed data distribution or tackling particular chemistries. The consensus modeling and the implementation and uses of the final CATMoS model were also discussed among the modelers and the stakeholders. Currently, consensus predictions on specific chemicals of interest are being assessed by different regulatory agencies for potential use as alternative sources of data. For example, a list of more than 100 chemicals and corresponding LD50 values derived from existing regulatory studies was identified by the U.S. EPA and is being checked, curated, and compared with CATMoS consensus predictions. The preliminary analysis shows that in 96% of cases, the CATMoS predictions either overlap with or are more conservative than the existing LD50 values. In the few cases where the disagreement is highest, a closer look showed potential issues with the considered sources of the in vivo studies and, in some cases, disagreement with the in vivo LD50 values used to train the CATMoS models.
The details of the CATMoS model and predictions are available in a QSAR Model Report Format (QMRF) that was submitted to the European Commission's JRC for review and publication on their QMRF Inventory for easy access by the international community (European Commission 2013; JRC 2017).
In addition to the initial prediction set, CATMoS was implemented in OPERA and used to screen the list of 837,000 chemical substances in the U.S. EPA's DSSTox database (underpinning the Dashboard application). These predictions are made available via the Integrated Chemical Environment dashboard (https://ice.ntp.niehs.nih.gov/) of the NTP and, in the future, via the U.S. EPA's CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard/).
CATMoS is an example of how toxicological problems can be solved collaboratively using computational approaches. In fact, the resulting consensus models leverage the strengths and overcome the limitations of any single approach, proving to be as good as or better than animal data. Such successful collaborative projects support international collaboration, a legacy of free and open-source code and workflows, and increasing consideration and adoption by regulators interested in implementing NAMs. Finally, it is worth noting that the international aspect of these collaborations can also help with harmonizing global regulatory processes toward a universal system. We now have a solid foundation for future collaborations to establish globally accepted alternative methods for assessing acute toxicity end points.

Table 18. Evaluation parameters for the LD50 point estimate consensus predictions of the curated in vivo data.
Table 19. Evaluation parameters for the VT and NT binary end point predictions for the curated in vivo data set.

Table 20. Evaluation parameters for the U.S. EPA category predictions for the curated in vivo data set.