Structure-activity relations: maximizing the usefulness of mutagenicity and carcinogenicity databases.

The most important criteria for the development and analysis of databases for elucidating the structural bases of toxicological activity are the integrity of the databases, with respect to uniformity of the experimental protocol and of the interpretation of the test results, and the inclusion of chemicals representing different chemical classes and differing mechanisms of action. Within these criteria, it is demonstrated that when the chemicals are chosen at random, the larger the database, the better the predictivity for chemicals not included in the learning set. It is shown, however, that when chemicals are selected on the basis of structural features, a learning set of approximately 180 chemicals is as informative as a database consisting of 800 chemicals chosen at random.


Introduction
Whereas most current studies, including those reported in this symposium, deal with the classification of information, our approach is to rationalize the available data and thus permit extrapolation to molecules that have as yet not been tested. The eventual outcome of such an approach is to help optimize ongoing studies such that they, in turn, will provide a maximal amount of information of a mechanistic and predictive nature.
A characteristic of the available databases of direct concern to us is that they are composed of a wide variety of chemicals with diverse structures, e.g., polycyclic aromatic hydrocarbons, nitrosamines, halogenated hydrocarbons, and dyes. They are thus not amenable to traditional quantitative structure-activity relationship (QSAR)-type studies, which require congeneric databases. To overcome this obstacle, we have been investigating how knowledge-based systems could be useful.
In this survey we discuss our experience with using available databases of carcinogenic, mutagenic, and related biological end points to establish such a resource, undertaken for mechanistic as well as predictive purposes. We stress that our analysis is, of course, influenced by the structure-activity relationship (SAR) methodology we employ, namely, CASE (1,2). However, CASE, as an artificial intelligence/expert system, is probably one of the most advanced SAR methods available and a harbinger of the developments that are to come to this methodology in general. In this connection, it is germane to list some of the unique features that were incorporated into this program:
1. In order to make the system truly effective, it had to be independent of operator-formulated questions. These, by definition, are finite and simplistic and usually based upon previous knowledge (i.e., they may be biased). Thus, we required the system to generate all of the possible structural "descriptors" automatically without operator input.
2. Because we felt that the biological activity is dependent on molecular subunits usually larger and more complex than those considered by chemists as simple functionalities, we required that our system be capable of handling and identifying relatively large molecular moieties.
3. We realized that most SAR systems are based on the analysis of congeneric databases, e.g., nitroarenes, aromatic amines, halogenated hydrocarbons, etc. This, however, involves the arbitrary assignment of chemicals to certain classes and, when dealing with databases containing various chemical classes, might a) introduce bias in the selection process and b) make the emerging classes too small for adequate structural analysis. Moreover, if biological activity is derived inherently from structural features, by "artificially" separating chemicals into restricted classes, we might lose informational content. Accordingly, we needed a system that was able to handle, in a single database, molecules of very different chemical types.
4. We also wanted to include in our system the capability of updating the "descriptors" as information on new molecules became available, i.e., the system had to be self-learning.
5. We required the resulting analyses not only to be predictive but also to provide clues as to possible mechanisms of biological activity.
6. Finally, we wanted the system to handle biological end points that may result from a single mechanism, a combination of concerted or sequential actions, or two or more independent mechanisms.
The program we constructed to meet these requirements is called CASE and, in its latest and greatly improved version, MULTICASE.
*Department of Chemistry, Case Western Reserve University, Cleveland, OH 44106.
†Department of Environmental and Occupational Health, Graduate School of

Nature of the Database
What are some of our conclusions relating to the requirements of the composition of a database to be useful for SAR studies? Because of the growing sophistication of computer-based SAR systems and the investment of time as well as CPU cost, it is important to begin with carefully evaluated databases. Some of the requirements for such databases are self-evident: The database must be obtained using standard protocols for which the quality of the data must be monitored. This is borne out by experience with some of the cytogenetic end points that are contained in the Gene-Tox Program database as compared to those in the National Toxicology Program (NTP) compilations. Thus, even using the peer-review process inherent in the Gene-Tox Program, the SAR methodology could not be applied with great success to some of the Gene-Tox databases (unpublished results), whereas the NTP databases allowed thorough analyses of the structural features of the cytogenetic activities (3). Our analysis of the data led us to conclude that the difference came from the quality control, which was assured in the NTP protocol but could not be controlled in the Gene-Tox Program. This is due to the fact that the latter relied on published data, and, moreover, the chemical purity (or rather impurity) was not generally known. Nevertheless, CASE can handle some "fuzziness" in the data because it is based on the statistical evaluation of the importance of substructure rather than on a quantitative relation. The second, also perhaps self-evident, point relates to the interpretation of the data. Both the biological and statistical standards used for interpreting the test results must be adhered to rigidly and must, of course, be spelled out initially. This is especially important in the manner in which results are expressed when a continuous activity scale is used, e.g., revertants per nanomole or milligrams per kilogram per day. 
Although there are computer routines to scale and model cutoffs, this judgment should not be delegated to a computer. In fact, we find that the human expert is essential to determine the boundary between inactive and marginally active chemicals and between marginally active and active chemicals. The scaling can then be done by the computer following these initial boundary settings.
In addition to establishing accepted definitions of the end points with respect to databases, the purpose of the analysis must also be defined a priori. Thus, for different mechanistic purposes we might use different mutagenicity databases or different collections of data derived from these compilations. For example, a) mutagenicity in a specific Salmonella tester strain in the absence of S9 might be used. The purpose of such an analysis would be, perhaps, to study the structural basis of the mutagenicity of nitrated polycyclic aromatic hydrocarbons, which is maximally expressed in strain TA98 in the absence of exogenous metabolic activation (1,4). b) The activity might be in Salmonella TA100 or in TA98 in the presence of S9. Thus, for the aromatic amines, this would allow the determination of specific combinations that would be informative. In fact, using a database of only TA100 in the presence of S9 as against one of only TA98 in the presence of S9, the effect of mutagenic specificity by aromatic amines at a single guanine-cytosine (G-C) base pair (as in TA100) as opposed to the specificity at a series of alternating G-C pairs (as in TA98) was amenable to analysis (5,6). c) On the other hand, if we wish to study and evaluate how Salmonella mutagenicity compares in its predictivity to the results of rodent carcinogenicity bioassays, then for the Salmonella data we might define as a positive response a response in any one of the tester strains obtained either in the presence or absence of S9 (2,7). Under these conditions, of course, before a chemical is designated as negative, it must be ascertained that it has indeed been tested in the complete panel of tester strains both in the presence and absence of S9. 
Obviously, this approach, which has also been used to assess the predictivity of the Salmonella assay for carcinogenicity in rodents (8-10), assumes that the structural determinants responsible for activity in different strains are identical and equally related to carcinogenicity. In fact, we know that these are oversimplifications.

Nature of the Chemicals in the Database
The nature of the chemicals represented in the database is probably the most difficult to address. In view of the fact that CASE can handle noncongeneric databases, some of the requirements are easy to state but difficult to implement: A database (i.e., learning set) should contain representatives of various chemical classes and chemicals that cover the spectrum of mechanisms that induce a particular biological end point (11). To satisfy such a requirement, it might seem a truism to say that a variety of chemical classes need to be represented in the database. However, this is more easily said than done. Looking at the problem from a chemical point of view, using, for example, the experience of the Gene-Tox Program, initially, in excess of 60 chemical classes were defined (12). As a result, there were so few representatives per chemical class that SAR analyses were not feasible. Subsequently, the number of chemical classes was reduced to 30 (13). Still, this left many chemical classes underrepresented for the purpose of SAR analysis. Moreover, such separations imply that we know those structural features that are necessary for biological activity. This, of course, involves a selection bias. On the other hand, a system like CASE selects its own descriptors, and these in turn allow it to bypass the traditional chemical classes (Table 1).
It is of interest that by and large the CASE program usually identifies not more than approximately 12 to 15 significant biophores in a noncongeneric database (Table 2). This means that there are usually a sufficient number of representative chemicals in each of the biophore classes that are selected. Additionally, CASE also identifies biophobes, i.e., functionalities that contribute to a lack of activity. These might be considered as "non-alerts" in the scheme of Ashby (14).
In creating databases, if one has unlimited funds, one might decide to test as many chemicals as possible and enter the results into the database. Under such conditions, one need not be concerned about redundancy of chemical structures. Thus, one would not be worried that there might be too many nitrofurans in a database, as the logic of the CASE program sees to it that the biophore associated with nitrofurans will achieve significance when a certain preset probability value has been reached (e.g., p < 0.05), at which point that biophore will be flagged and identified. Having more representative molecules containing that biophore in the pool will not contribute overwhelmingly to the predictivity. Similarly, by the same reasoning, if the database contains too great a prevalence of active molecules, this will not affect the identification and performance of the predictive biophores, provided that there are a sufficient number of molecules in the database.
Table 1. Molecules sharing a common biophore.a
Benzo(e)pyrene
1,8,9-Trihydroxyanthracene
1-Naphthylisothiocyanate
2-Anthramine
7,9-Dimethylbenz(c)acridine
N-(1-Naphthyl)ethylenediam
7-Bromomethyl-12-methylbenz(a)anthracene
a Note that polycyclic aromatic hydrocarbons, aromatic amines, nitroarenes, and others contain this biophore, which has been identified as significant with respect to mutagenicity in Salmonella (p = 0.0001).
... tive, and marginally active molecules. This distribution is used to predict the likelihood that the presence (or absence) of the fragment contributes to carcinogenicity. Also listed are the probability values associated with the fragments.
b C. indicates a carbon atom common to two rings. C" indicates the carbon is attached by a double bond to an outside substituent.
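The flagging behaviour described above can be illustrated with a small significance check. This is a minimal sketch, not the CASE algorithm: the text does not say which statistical test CASE applies, so a one-sided binomial tail test is assumed here, and the function names and fragment counts are hypothetical. The base rate of 0.365 is the prevalence of mutagens quoted later in the text for the NTP Salmonella database.

```python
from math import comb

def binomial_tail(n_with_fragment: int, n_active: int, base_rate: float) -> float:
    """P(X >= n_active) for X ~ Binomial(n_with_fragment, base_rate).

    Assumed model: if a fragment were irrelevant to activity, the molecules
    containing it would be active at the overall base rate of the database.
    """
    return sum(
        comb(n_with_fragment, k)
        * base_rate ** k
        * (1 - base_rate) ** (n_with_fragment - k)
        for k in range(n_active, n_with_fragment + 1)
    )

def is_flagged(n_with_fragment: int, n_active: int, base_rate: float,
               alpha: float = 0.05) -> bool:
    """Flag the fragment as a candidate biophore once the tail probability
    drops below the preset threshold (p < 0.05 in the text)."""
    return binomial_tail(n_with_fragment, n_active, base_rate) < alpha

# Hypothetical counts: a fragment present in 10 molecules, 9 of them active.
p_small = binomial_tail(10, 9, 0.365)
# Four times as many molecules with the same proportion of actives.
p_large = binomial_tail(40, 36, 0.365)
print(is_flagged(10, 9, 0.365), is_flagged(40, 36, 0.365))
```

Under this assumed test, the fragment is already flagged with 10 representatives; adding 30 more with the same proportion of actives lowers the probability value further but does not change the outcome, which is the sense in which redundant structures add little.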

Congeneric versus Noncongeneric Databases
Heretofore the majority of SAR methods were designed for the study and prediction of congeneric databases such as those containing polycyclic aromatic hydrocarbons, nitroarenes, aromatic amines, etc. One of the breakthroughs provided by CASE is its ability to analyze noncongeneric databases (i.e., mixed databases). The question then facing us is: Everything else being equal, how do the two types of databases compare with respect to predictivity?
Let us take the NTP Carcinogen Database as an example. In one of our analyses we had approximately 250 chemicals, of which 53 were aromatic amines (15,16) (Table 3). As an exercise, we then selected the 53 aromatic amines and used them to construct a congeneric database. We then used the two databases, the noncongeneric and the congeneric one, to study the predictivity of each of them for aromatic amines. The results indicate that both databases were highly predictive of the carcinogenicity of aromatic amines (Table 4). However, further analysis indicated that the noncongeneric database was significantly more predictive than the congeneric one. This appears to be derived from the fact that the biophores selected by CASE may cut across chemical species and could, for example, have been derived in the case of the noncongeneric database not only from aromatic amines but possibly from related nitroarenes. In the case of the congeneric database consisting only of aromatic amines, such biophores would not necessarily be identified (Table 5). Thus, the noncongeneric database may indeed be superior even when a sufficient number of congeneric database chemicals are present therein because CASE can learn from related molecules which, for example, may yield the same metabolic intermediates (e.g., N-arylhydroxylamines), which are derived from both arylamines and nitroarenes by oxidative and reductive pathways, respectively.
b CASE identified these aromatic amine-derived fragments as associated with an increased probability of carcinogenicity.
R is not a hydrogen. This biophore was identified in the total database consisting of 253 chemicals. It was not found to have significance in the database consisting only of aromatic amines.
b Classifications of Ashby and Tennant (9). A, induces tumors in rats and mice; B, carcinogenic to only one species but induces cancers at two or more sites; D, carcinogenic at a single site in a single species; NC, noncarcinogenic.
How Many Chemicals Are Needed in a Database?
Obviously, we do not possess unlimited resources, and the number of chemicals that can be tested is by necessity limited. Therefore, we might ask the question, how many chemicals are needed in the learning set when it consists of noncongeneric molecules? This is a question that needs to be decided when testing programs are set up and the data used for SAR analyses.
In Table 6 we show the predictivity of using different numbers of chemicals in the learning set. The chemicals in the learning set were selected at random from the NTP Salmonella Mutagenicity Database. The biophores identified using the different databases were then used to predict the mutagenicity of a panel of 100 chemicals, not in the learning set, but for which the test results were available ("diagnostic tester set").
The results of the analysis clearly show that the predictive performance of CASE improves with an increase in the number of chemicals in the learning set (Table 6). However, generating databases consisting of such large numbers of chemicals is costly, even if we restrict the testing to the Salmonella mutagenicity assay only. Obviously, if we use experimental systems that are more labor intensive or which involve large numbers of animals, the cost will rapidly become prohibitive. Thus, it is not surprising that very few rodent cancer bioassays have been repeated given a cost which may exceed $1 million per assay.
In search of a method to decrease the number of chemicals that require testing (i.e., to limit the size of the database), we explored a number of possibilities. We devised an efficient procedure, which is dependent on another feature of CASE. Indeed, CASE predictions can take a number of forms: a) a chemical can be predicted to be active because it contains a biophore (Fig. 1); b) a chemical can be predicted to be inactive because it contains a biophobe (Fig. 2); c) a molecule can be predicted to be inactive because it lacks a recognizable biophore and/or biophobe. This latter prediction is due to the fact that fragmentation of the molecule yields fragments that had been seen before but had been judged to be trivial with respect to biological activity (Fig. 3); d) an additional possibility is that the CASE program has identified an unrecognized functionality (i.e., unknown). This refers to the presence of a fragment that has not been documented among the collection of fragments generated (Figs. 4-6). The presence of such a fragment introduces a note of inconclusiveness into the prediction, which appears to be the major reason for decreased predictive performance (17). This uncertainty might, in fact, be a direct function of the number of chemicals in the learning set. That is, the more chemicals there are in the learning set, the less the chance that this message will appear (Table 7).
All of the subsets of chemicals are contained within the set of 820. The chemicals were chosen at random from among the 820 molecules in the database except for the 243 chemicals, also a subset of the 820, that represents chemicals on which rodent cancer bioassays were performed (9,10).
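The four forms of prediction listed above can be written as a small decision routine. The following is a sketch under stated assumptions, not the CASE implementation: molecules are represented as collections of fragment strings, the names `predict` and `Call` are hypothetical, and the rule that a biophore takes precedence over a biophobe when both are present is an assumption that the text does not address.

```python
from enum import Enum

class Call(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"

def predict(fragments, biophores, biophobes, trivial):
    """Return (call, reason, uncertain) for one molecule.

    fragments : fragment strings produced from the molecule
    biophores : fragments associated with activity in the learning set
    biophobes : fragments associated with lack of activity
    trivial   : fragments seen before but judged irrelevant to activity
    """
    frags = set(fragments)
    known = biophores | biophobes | trivial
    # Form d): any fragment never seen in the learning set makes the
    # prediction inconclusive, whatever the call turns out to be.
    uncertain = bool(frags - known)
    if frags & biophores:  # form a); precedence over biophobes is assumed
        return Call.ACTIVE, "contains a biophore", uncertain
    if frags & biophobes:  # form b)
        return Call.INACTIVE, "contains a biophobe", uncertain
    # form c): only trivial (or unknown) fragments were found
    return Call.INACTIVE, "no biophore or biophobe", uncertain

# Hypothetical fragment labels, for illustration only.
BIOPHORES = {"frag_phore"}
BIOPHOBES = {"frag_phobe"}
TRIVIAL = {"frag_trivial"}
print(predict(["frag_phobe", "frag_new"], BIOPHORES, BIOPHOBES, TRIVIAL))
```

The `uncertain` flag is kept separate from the call because, as Figs. 4-6 indicate, the unknown-fragment message can accompany any of the other three outcomes.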
Can SAR Concepts Reduce the Number of Chemicals That Need to Be Tested to Generate a Useful Database?
As shown above, the larger the database, the better the predictive performance. However, the question remains whether we are really interested in generating, at great cost, databases that allow 85, 90, 99, or even 99.9% concordance between experimental results and predictions, especially if the actual experimental data are only 85% reproducible, as appears to be the situation with the NTP Salmonella Mutagenicity Database. In fact, a number of analyses have indicated that the limit of the predictivity of the CASE program for mutagenicity in Salmonella is approximately 80% (17). Moreover, the reproducibility of the rodent carcinogenicity bioassay is largely unknown since so few of the chemicals have been tested more than once. We ought to aim for economy and reliability.
[Figure: CASE output for 1,3-dimethyl-4-nitrobenzene. 93% chance of being ACTIVE due to substructure (conf. level = 100%): NO2-C=CH-CH=. 80% chance of being ACTIVE due to substructure (conf. level = 87%): NO2-C=C-CH=. Overall, the probability of being a Salmonella mutagen is 98.2%.]
CASE recognized a biophore that leads to an 87% probability of mutagenicity. However, this conclusion must be moderated by the fact that this molecule contains a fragment that has not been seen before and that could be a biophobe.
FIGURE 5. CASE prediction of the lack of mutagenicity of cysteine due to the presence of a biophobe. However, this conclusion is moderated by the fact that cysteine contains a fragment not seen before, which might be a biophore.
FIGURE 6. A CASE prediction of lack of mutagenicity due to the fact that fragmentation of the molecule does not lead to the generation of either a biophore or a biophobe. However, this prediction must be moderated by the fact that a fragment, heretofore not seen, has been recognized, and it could be a biophore.
a The diagnostic tester set consisted of 100 molecules that were not present in the learning sets.
b Number of predictions in the diagnostic tester set that contain fragments that had not been seen before (see Figs. 4-6) and which therefore may be inconclusive.
Our analysis of the performance of CASE suggests that perhaps we could use the uncertainty factor (see above) in the design of a database that will contain the fewest chemicals that need to be tested. To test this hypothesis, the following protocol was devised and tested:
a) a list of chemicals including different chemical classes, different uses, and different levels of production (18) is selected;
b) from this list, we select at random, say, 100 chemicals that will be tested (or that may already have been tested); this forms the original learning set (set 1);
c) this original learning set is analyzed by CASE, and the biophores and biophobes are identified;
d) another set of 50 randomly selected chemicals (excluding those in set 1), as yet untested, is run against this original learning set;
e) those chemicals that, in step d, yield a prediction that includes the uncertainty message are then selected for testing as mutagens in Salmonella; the results of the tests are then included in a new learning set (set 2, which includes the chemicals in set 1);
f) this procedure is performed iteratively.
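The protocol can be sketched as a loop. This is an illustrative simplification rather than the CASE procedure: each chemical is reduced to a set of fragment strings, the uncertainty message is modelled as "the chemical contains a fragment absent from the current learning set", and the function names and the default number of rounds are hypothetical.

```python
import random

def has_unseen_fragment(fragments, seen):
    """True if the chemical would trigger the uncertainty message,
    modelled here as containing any fragment not in the learning set."""
    return not set(fragments) <= seen

def build_learning_set(candidates, initial_size=100, batch_size=50,
                       rounds=3, seed=0):
    """candidates: dict mapping chemical name -> set of fragment strings.

    Returns the chemical names in the final learning set, in selection order.
    """
    rng = random.Random(seed)
    pool = sorted(candidates)
    rng.shuffle(pool)
    # Step b: a random original learning set (set 1).
    learning, pool = pool[:initial_size], pool[initial_size:]
    # Step c (simplified): record every fragment the learning set has seen.
    seen = set().union(*(candidates[name] for name in learning))
    for _ in range(rounds):
        # Step d: challenge the learning set with a fresh random batch.
        batch, pool = pool[:batch_size], pool[batch_size:]
        if not batch:
            break
        # Step e: keep only chemicals whose prediction would be uncertain.
        # These are the ones sent for testing and added to the learning set.
        selected = [name for name in batch
                    if has_unseen_fragment(candidates[name], seen)]
        learning.extend(selected)
        for name in selected:
            seen |= candidates[name]
        # Step f: repeat with the enlarged learning set.
    return learning
```

Chemicals in a batch whose fragments are all already represented are never tested, which is where the saving comes from.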
To evaluate the effectiveness of the selection procedure at each step, the predictivity of the database is evaluated by challenging it with the 100 chemicals not included in the learning sets but for which test results are available ("diagnostic tester set"). The results of these analyses clearly indicate (Table 8) that by careful selection of chemicals for testing, the number of chemicals that need to be tested to generate a learning set adequate for SAR predictions can be greatly reduced. This is accompanied by a corresponding decrease in cost. In fact, a comparison of Tables 6 and 8 clearly indicates that a database of approximately 180 carefully selected chemicals is as predictive as a database consisting of in excess of 800 chemicals that have not been selected by these criteria. Similar results were obtained (19) using the Gene-Tox Salmonella database, which not only contains a different collection of chemicals but also has a higher prevalence of mutagens (78.5% versus 36.5%).
The first 100 chemicals are presumed to represent a random assortment of molecules. They were not selected by the procedure described here. Each subsequent set contains an increment of molecules selected from sets of 50. Thus, set 2 contains the previous 100 molecules (set 1) plus 19 selected from among a set of 50. Similarly, set 3 contains the previous 119 molecules (set 2) plus 23 molecules selected from among another set of 50. This procedure is used iteratively. The selection of molecules is described in the text.

Quantitative versus Binary Databases
There are two major ways of expressing the results of SAR analyses: a) as active, marginally active, and inactive. This, a priori, involves a prejudgment involving the human expert who then sets the boundaries as to what is considered a positive, marginal, and negative result. Summarizing the results in this manner and applying the CASE program leads to the generation of fragments that are involved in the probability of a certain biological activity (e.g., mutagenicity, carcinogenicity) and therefore provides a possible procedure for risk identification for testing prioritization. Indeed, such probabilities appear to be related to the biological properties. Thus, the degree of the carcinogenicity of a chemical as defined by Ashby and Tennant (9) appears to be related to the probability of carcinogenicity: an overall high probability is associated with chemicals that cause cancer in both rats and mice at multiple sites of both sexes (Table 9). b) When, however, results are expressed in a continuous scale (e.g., TD50 in milligrams per kilogram per day or mutagenicity in revertants per nanomole), this permits two independent analyses to be carried out: a probability of carcinogenicity, identical to the procedure described above, and a QSAR analysis, which leads to the identification of biophores and biophobes associated with potency (Table 10). The latter analysis leads to a second prediction, that of potency (Fig. 7).
Thus, from the fragments associated with the mutagenic potency of nitroarenes (Table 10), we can calculate the projected activity of a chemical, where

$$\text{CASE activity} = 9.208 + \sum nF + 1.22 \log P$$

where n is the number of times a fragment occurs in the molecule, F is the CASE activity associated with that fragment (Table 10), the sum runs over the fragments present in the molecule, and P is the n-octanol/water partition coefficient. Accordingly, for 1,6-dinitropyrene (Fig. 7), biophore B (which is present twice) is the same as biophore 7 (Table 10), which is associated with a mutagenic potency of 21.576 units, and, moreover, 1,6-dinitropyrene also contains two copies of biophore 3 (17.093 units) (Table 10). Another feature of CASE is the estimation of the log P, which for 1,6-dinitropyrene is 4.7968. Accordingly, the CASE activity of 1,6-dinitropyrene is

$$9.208 + 2(17.093) + 2(21.576) + 1.22(4.7968) \approx 92$$

CASE activity is expressed in a log scale; an activity of 92 is equivalent to 200,000 revertants/nmole. This, in fact, goes beyond mere risk identification, which is based on the probability of activity, because the prediction also involves a measure of potency that can be used in a quantitative risk assessment. Obviously, before using such potency values, they must be compared against the experimentally obtained values (Fig. 8).
a The classification of Ashby and Tennant (9) was adopted. In that scheme, chemicals in group A induce tumors in rats and mice. Group B includes chemicals that are carcinogenic to only one species but which induce cancers at two or more sites. Group C consists of chemicals that are carcinogenic at only a single site in both sexes of a single species. Group D contains chemicals carcinogenic at a single site in a single species. Group E is adequately studied chemicals for which only equivocal evidence of carcinogenicity was observed. NC, noncarcinogens.
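The arithmetic for 1,6-dinitropyrene can be checked with a few lines of code. This is a minimal sketch that applies the stated equation, including the 1.22 coefficient on log P; the function and parameter names are hypothetical, and the fragment values are those quoted in the text for biophores 3 and 7 of Table 10.

```python
def case_activity(fragment_terms, log_p, intercept=9.208, log_p_coeff=1.22):
    """Projected CASE activity: intercept + sum(n * F) + coeff * log P.

    fragment_terms : iterable of (n, F) pairs, where n is the number of
                     occurrences of a fragment and F its CASE activity.
    """
    return intercept + sum(n * f for n, f in fragment_terms) + log_p_coeff * log_p

# 1,6-Dinitropyrene: two copies of biophore 3 (17.093 units) and
# two copies of biophore 7 (21.576 units); estimated log P = 4.7968.
activity = case_activity([(2, 17.093), (2, 21.576)], log_p=4.7968)
print(round(activity, 1))  # 92.4
```

The result, about 92.4, rounds to the value of 92 given in the text.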
In addition to allowing an estimation of the potency of an unknown chemical, the identification of the biophores associated with potency permits an assessment of the relative role of different biophores in activity. Thus, as a result of the logarithmic nature of the activity scale, it can be calculated that biophore 7 (Table 10) contributes 88% of the mutagenicity of 1,6-dinitropyrene; i.e., it is the major contributor to potency. Such analyses have obvious mechanistic implications (20). Thus, studies seeking to investigate the basis of the potent mutagenicity of this chemical should concentrate on that portion of the molecule.

Comparison of Some Carcinogenicity Databases
We have performed analyses of a number of different rodent carcinogenicity databases including the NTP rodent bioassay (9,10) and the compilation of Gold and associates (21-23), which includes TD50 values. Analyzing these databases allowed us to identify the structural features responsible for the probability of carcinogenicity (i.e., the qualitative aspect of the analysis). Moreover, by analyzing carcinogenicity for only the mouse or only the rat, we were able to identify biophores characteristic for each of these activities as well as biophores common to both the rat and the mouse. These have mechanistic implications that will not be described here (see below).
Applying the QSAR CASE analysis to the database assembled by Gold et al. (21-23), which includes TD50 values, indicated that the data could be used to generate QSAR relationships for individual databases (i.e., mouse and rat separately) which, in turn, can be used to project carcinogenic potencies based upon the QSAR contribution of individual biophores (Table 11). [The TD50 value is defined as the lifetime dose (milligrams per kilogram per day) that reduces by one-half the lifetime chance of remaining tumor-free (24).] Thus, carcinogenic potency can be projected in a manner similar to that described for the mutagenicity of 1,6-dinitropyrene (see above).

Validation
Before using the biophores and biophobes for predictive and mechanistic studies, a number of controls need to be performed. Routinely, from the available database, a set of randomly selected chemicals is removed before the CASE analysis. These chemicals are then used as a tester set to test the predictivity of the data set. Subsequently, of course, these chemicals can be added back to the learning set and the CASE analysis performed again.
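The hold-out control described above can be expressed compactly. This is a generic sketch, not code from the CASE system: the function names are hypothetical, and concordance is taken to mean the fraction of tester-set chemicals for which the predicted and experimental calls agree.

```python
import random

def split_tester_set(chemicals, tester_size, seed=0):
    """Randomly remove `tester_size` chemicals before analysis.

    Returns (learning_set, tester_set) as two disjoint lists.
    """
    rng = random.Random(seed)
    shuffled = list(chemicals)
    rng.shuffle(shuffled)
    return shuffled[tester_size:], shuffled[:tester_size]

def concordance(predicted, observed):
    """Fraction of tester-set chemicals whose predicted call matches
    the experimental result."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("predicted and observed must be non-empty and equal in length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)

learning, tester = split_tester_set(range(20), tester_size=5)
print(len(learning), len(tester))
```

After the predictivity has been assessed, the two lists can simply be concatenated to restore the full learning set for the final analysis.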
Additionally, when analyzing biological activities such as mutagenicity, cytogenotoxicity, and carcinogenicity, we also assembled a database of naturally occurring physiological chemicals (amino acids, sugars, lipids, purines, pyrimidines, vitamins, etc.). These chemicals are expected to be negative. However, on occasion, we have found that some databases led to predictions that a significant fraction of physiological chemicals induced some end points, e.g., sister chromatid exchange. Such findings cast doubt on the relevance of such assays as predictors of carcinogenicity (25).

Data Management
Finally, a data management system must be in place to keep track of the various predictions that are made in the course of these analyses. Additionally, this will enable testing correlations of predictions. Thus, such a data management system permitted us to determine how often carcinogens are predicted to be mutagens or how often Salmonella mutagens are predicted also to induce chromosomal aberrations (Table 12).
Such a database can also be useful for other purposes. For example, it could be used to determine the effect of different prevalences of carcinogens on the results of short-term tests. This, in turn, will influence the testing strategy used to detect carcinogens. Such an analysis shows (Table 13) that a considerable proportion of false positive results is to be expected, as evidenced by the fact that when the prevalence of carcinogens is 0%, we can expect 14 and 47% of the chemicals to respond positively in the Salmonella mutagenicity and sister chromatid exchange assays, respectively (Table 13). Moreover, an unacceptably high rate of false negatives is to be expected as well, for a population of only carcinogens (100% prevalence) is expected to yield only a 44% rate of positive responses in the Salmonella mutagenicity assay.
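The effect of prevalence on the yield of positive responses can be worked through for intermediate cases. This is a sketch under an assumption not stated in the text: that the positive rate varies linearly between the two end points quoted for the Salmonella assay (14% positives at 0% prevalence, 44% at 100% prevalence). The function names are hypothetical, and the positive predictive value is a standard Bayes calculation added for illustration.

```python
def positive_rate(prevalence, sensitivity=0.44, false_positive_rate=0.14):
    """Expected fraction of chemicals responding positively in the assay,
    assuming a linear mix of carcinogens and noncarcinogens."""
    return prevalence * sensitivity + (1 - prevalence) * false_positive_rate

def positive_predictive_value(prevalence, sensitivity=0.44,
                              false_positive_rate=0.14):
    """Fraction of assay-positive chemicals that are carcinogens."""
    total = positive_rate(prevalence, sensitivity, false_positive_rate)
    return prevalence * sensitivity / total if total else 0.0

for prev in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"prevalence {prev:.2f}: "
          f"positives {positive_rate(prev):.3f}, "
          f"PPV {positive_predictive_value(prev):.3f}")
```

Under these assumptions, the lower the prevalence of carcinogens among the chemicals screened, the larger the share of positive responses that are false, which is why prevalence bears on the choice of testing strategy.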

Conclusion
Powerful computer-based expert systems to study SAR are now available. However, as illustrated here, these methodologies are greatly dependent on the nature and organization of databases. Moreover, it has been demonstrated that databases need not be extensive for SAR analysis. Thus, the present studies indicate that predictive toxicology is possible.