Predicting toxicity through a computer automated structure evaluation program.

The computer automated structure evaluation program (CASE) has been extended to perform automatic quantitative structure-activity relationships (QSAR). Applications include the carcinogenicity of polycyclic aromatic hydrocarbons and of N-nitrosamines. Agreement with experiment is satisfactory.

by Gilles Klopman*
The development and wide utilization of short-term assays (1) has led to the realization that a large number of chemicals present in the environment are potent bacterial mutagens and potential carcinogenic agents. Among these, polycyclic aromatic hydrocarbons (PAHs) (2), their nitro derivatives (3,4), a number of aromatic amines (5), and halogenated (6,7) as well as N-nitroso derivatives (8,9) are ubiquitous in the environment. Obviously, the presence of such biologically potent molecules in the air and in foods is cause for concern and it is not surprising to find that a major effort is underway to try to identify and eliminate these chemicals from the environment. However, while these chemicals are suspect as a class, only a small fraction of them are really active and many show little or no activity. Thus, it is important to determine whether a more precise relationship between their structure and activity can be found.
Theoretical methods capable of providing such relationships are particularly needed because it is inconceivable that every suspect chemical structure could be synthesized and tested in the laboratory.
Numerous methods, e.g., QSAR, pattern recognition, etc., aim at this goal, and most of them are discussed elsewhere in this volume. We will not review them here; instead, we will discuss our recently developed computer automated structure evaluation program (CASE) (10) which seems ideally suited for such a task and provides a pattern for the a priori recognition of potentially toxic molecules.

*Chemistry Department, Case Western Reserve University, Cleveland, OH 44106

The Computer Automated Structure Evaluation Program (CASE)
One of the major difficulties in structure-activity and quantitative structure-activity relationship (QSAR) studies is the selection of appropriate molecular properties to be used as descriptors. Indeed, the success of a QSAR study often depends on the outcome of an intelligent and properly discriminant analysis of the factors that may be "determinant" in the observed activities.
The computer automated structure evaluation technique (10) addresses this problem by automatically selecting the substructural units that best discriminate between active and inactive molecules. The method was described previously and consists of tabulating, for each molecule of a training set, the fragments that can be formed by breaking up the molecule into linear subunits of 3 to 12 nonhydrogen atoms, together with the hydrogens attached to them. Each fragment generated by the breakup of a toxic molecule is labeled toxic, and each fragment generated by the breakup of a nontoxic molecule is labeled nontoxic. The fragments obtained from the full training set are collected and analyzed. Each type of fragment generated by the data base is evaluated on the premise that, if it is not related to toxic activity, it would be encountered randomly in toxic and nontoxic molecules. Any significant deviation from a random distribution, at the 95% confidence level, is taken as an indication that the fragment is relevant to the observed toxicities. The method thus generates a finite set of substructural units presumed to be responsible for the observed toxicity of the molecules of the data set.
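The fragment tabulation and significance test described above can be illustrated with a minimal Python sketch. It is not the CASE implementation: the molecule representation (a list of atom labels plus a list of bonds), the function names, the neglect of bond orders, the counting of each fragment once per molecule, and the use of a two-sided binomial test against the overall fraction of toxic molecules are all assumptions made here for illustration.

```python
from math import comb


def linear_fragments(atoms, bonds, min_len=3, max_len=12):
    """Return the set of linear fragments (simple paths) found in one molecule.

    atoms: list of labels such as "CH3" or "N" (heavy atom plus attached hydrogens)
    bonds: list of (i, j) index pairs; bond orders are ignored in this sketch
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    found = set()

    def extend(path):
        if len(path) >= min_len:
            labels = tuple(atoms[k] for k in path)
            # A path and its reverse are the same fragment; keep one canonical form.
            found.add(min(labels, labels[::-1]))
        if len(path) == max_len:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return found


def binomial_two_sided_p(k, n, p):
    """Probability of an outcome at least as unlikely as k successes in n trials."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9)))


def significant_fragments(training_set, alpha=0.05):
    """training_set: list of (atoms, bonds, is_toxic) tuples.

    Returns {fragment: (toxic_count, total_count, p_value)} for fragments whose
    distribution deviates from random at the 1 - alpha confidence level.
    """
    base_rate = sum(1 for _, _, toxic in training_set if toxic) / len(training_set)
    counts = {}
    for atoms, bonds, toxic in training_set:
        # Each fragment is counted once per molecule (an assumption of this sketch).
        for frag in linear_fragments(atoms, bonds):
            k, n = counts.get(frag, (0, 0))
            counts[frag] = (k + int(toxic), n + 1)
    result = {}
    for frag, (k, n) in counts.items():
        p_value = binomial_two_sided_p(k, n, base_rate)
        if p_value < alpha:
            result[frag] = (k, n, p_value)
    return result
```

In this sketch a fragment found mostly in toxic molecules and a fragment found mostly in nontoxic molecules are both retained; the sign of the deviation is what would distinguish an activating fragment from a deactivating one.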
The choice of chemical substructural units as descriptors for activity is natural to chemists, who are used to relating chemical properties to functionalities, and it is not surprising to find that much interest has centered around methodologies based on such descriptors (11)(12)(13). In most previous studies, though, only a finite number of preselected keys (11,12), each consisting of a well defined substructural unit, were considered. The problem with such an approach is that the selection of the keys was left to the imagination of the authors and to a large degree reflected their own bias. Furthermore, there was no guarantee that the selected keys were appropriate to handle the specific problem. In more recent developments, though, attempts were made at automating the selection of keys. The works of Chu (14) and Hodes (15,16) are examples of such developments and their programs are clearly related to our own methodology. The problem with these programs was that no causal relationship was established between the substructural descriptors and the activity of the molecules in the data base. Indeed, the keys were either too small or too restricted to be associated with the possibly complex entities that give rise to biological functionality. Thus, the chemical analogy was lost because the activity, or lack of it, was related in a complex manner to the presence of a large number of "statistical" rather than "biological" keys.
This problem is addressed in our methodology. Indeed, we extended the number and size of potential keys but restricted them to biologically relevant ones by discriminant analysis. This resulted in the selection of only a few, but most appropriate, substructural descriptors (i.e., biophores) (17). To a degree, the program has intelligence, since it performs the painstaking task of selecting the appropriate descriptors that is usually performed by the researcher. Most importantly, though, it establishes causality and, in so doing, lays the groundwork for an understanding of the mechanism of action of the toxic materials.
All functions of the program are performed automatically. The input consists merely of the KLN code (18) of the molecules of the training set and an evaluation on a scale of 1 to 9 of their activity. If, however, a question mark is entered in response to the query for the activity of a new molecule, the program enters its predictive mode and uses all the information previously fed to it to project the percentage chance that the new molecule is active. New data can be entered at any time and are immediately incorporated in the analysis of the problem. Thus, the program has learning capabilities as well.
We have already applied this methodology to a number of data bases and obtained good qualitative correlations for the carcinogenicity of polycyclic aromatic hydrocarbons and N-nitrosamines (10) as well as the genetic toxicity of nitroaromatic derivatives (17).
In our initial implementation of the method, biophores were identified on the grounds that they are capable of bestowing toxicity on a molecule; we now propose to carry the program one step further by evaluating the potency of each of the biophores. Such a technique was pioneered by Free and Wilson (19) and is currently implemented in a number of computer packages. Our approach consists of tabulating the presence/absence of the appropriate keys in a training set of molecules and relating their presence via linear regression to the known quantitative activity of the molecules in which they are present.
Here, again, our implementation is totally automatic. Indeed, once the biophores have been identified, the program proceeds to tabulate the presence/absence of these descriptors in the various molecules of the data base. It then proceeds to evaluate each of the descriptors in the context of a linear regression analysis and selects the most relevant one in a forward selection procedure.
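A minimal Python sketch of such a presence/absence regression with forward selection is given below. It is an illustration under stated assumptions rather than the CASE code: the function names are hypothetical, the regression is solved through the normal equations in plain Python, and the stopping rule (a minimum relative reduction of the residual sum of squares, `min_gain`) is an assumption, since the selection criterion is not specified here.

```python
def solve_least_squares(columns, y):
    """Ordinary least squares with an intercept, via the normal equations.

    columns: list of 0/1 indicator columns (one per descriptor)
    Returns (coefficients, rss), with coefficients[0] the intercept,
    or None if the system is singular.
    """
    n = len(y)
    X = [[1.0] + [float(col[i]) for col in columns] for i in range(n)]
    m = len(X[0])
    # Augmented normal equations [X^T X | X^T y].
    A = [[sum(X[r][a] * X[r][b] for r in range(n)) for b in range(m)]
         + [sum(X[r][a] * y[r] for r in range(n))] for a in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(m):
        pivot = max(range(c, m), key=lambda r: abs(A[r][c]))
        if abs(A[pivot][c]) < 1e-12:
            return None
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(m):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [x - f * z for x, z in zip(A[r], A[c])]
    beta = [A[i][m] / A[i][i] for i in range(m)]
    rss = sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2 for r in range(n))
    return beta, rss


def forward_select(presence, y, min_gain=0.05):
    """presence: {descriptor_name: [0/1 per molecule]}; y: activities.

    Greedily adds the descriptor that most reduces the residual sum of squares
    and stops when the relative reduction falls below min_gain (assumed rule).
    """
    chosen = []
    beta, rss = solve_least_squares([], y)
    while len(chosen) < len(presence):
        best = None
        for name in presence:
            if name in chosen:
                continue
            fit = solve_least_squares([presence[k] for k in chosen + [name]], y)
            if fit is not None and (best is None or fit[1] < best[2]):
                best = (name, fit[0], fit[1])
        if best is None or rss - best[2] < min_gain * max(rss, 1e-12):
            break
        chosen.append(best[0])
        beta, rss = best[1], best[2]
    return chosen, beta, rss
```

The coefficients returned for the selected descriptors play the role of potencies: a positive coefficient corresponds to a descriptor that raises the calculated activity, a negative one to a descriptor that lowers it.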
In order to achieve the most accurate description, a new procedure for handling inactive molecules has been implemented. In this procedure, the inactive molecules are assigned a floating activity index whose value cannot exceed the value generally assigned to inactive molecules, but can vary downward as needed to produce the best linear plot. This permits us to include the inactive molecules in the regression analysis without the need of defining an "inactivity" ranking.
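One possible reading of this floating-index procedure is an alternating fit, sketched below in Python for a single descriptor. The names `fit_line` and `fit_with_floating_inactives`, the single-descriptor simplification, and the iteration scheme are assumptions made for illustration; the text does not state how the floating values are actually optimized.

```python
def fit_line(x, y):
    """Closed-form least squares for y = a + b*x (x must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def fit_with_floating_inactives(x, activity, inactive, cap, max_iter=1000, tol=1e-9):
    """Fit a line while letting the targets of inactive molecules float at or below cap.

    x: descriptor value per molecule
    activity: measured activity (ignored for inactive molecules)
    inactive: list of bools marking the inactive molecules
    cap: the value generally assigned to inactive molecules
    """
    # Start every inactive molecule at the cap.
    target = [cap if flag else a for a, flag in zip(activity, inactive)]
    for _ in range(max_iter):
        a, b = fit_line(x, target)
        # An inactive target moves to the fitted value, but never above the cap.
        new_target = [
            min(cap, a + b * xi) if flag else t
            for xi, t, flag in zip(x, target, inactive)
        ]
        shift = max(abs(u - v) for u, v in zip(new_target, target))
        target = new_target
        if shift < tol:
            break
    return a, b, target
```

In this sketch an inactive molecule contributes no residual when the fitted line predicts a value at or below the cap, and is held at the cap otherwise, so no separate "inactivity" ranking has to be supplied. The same loop would apply unchanged to a multiple regression over several descriptors.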
The result of this procedure is that the minimum number of descriptors, necessary to calculate the toxicity of the entire data set, is identified. The coefficients of the descriptors, which are seen as their potency, are automatically interfaced with the other data of the CASE program so that when a new compound not present in the initial data base is entered, in the predictive mode, a qualitative projection of the expected activity of the compound can also be made.
The program can clearly be used to study congeneric molecules. The interesting possibility exists that it can also address a diverse data base. Indeed, as long as the endpoint is well defined and the mechanism essentially constant, nothing seems to prevent enough different biophores from being identified to accommodate the diversity of test compounds. Such a capability would present extraordinary opportunities to study a vast number of molecules. As yet, though, we have not built a sufficient data base to prove that this is indeed the case. Truly diverse data bases are rare; most of the time, they consist of combinations of data bases of congeneric molecules.
We describe below the application of the CASE program to two congeneric toxic data bases: the carcinogenicity of polycyclic aromatic hydrocarbons and that of N-nitrosamines. Then, in an initial effort to build a diverse data base, we have combined the two data bases and added a series of additional paraffinic molecules to see if a deterioration of the results takes place. To a degree, though, this is not a fair test, because the data bases are largely orthogonal. Our objective is to continue to incorporate additional molecules as we acquire appropriate data, and eventually build a truly diverse data base to be used to predict genetic toxicity of unknown molecules.

Results and Discussion
Polycyclic Aromatic Hydrocarbons
Polycyclic aromatic hydrocarbons (PAHs) are produced by combustion processes and are ubiquitous in the environment. Their carcinogenicity has been known for a long time and many attempts have been made at correlating their perceived genetic toxicity with their structure (5,20,21). It is now believed that their mechanism of action involves a "bay region" oxidation followed by alkylation of some genetic material (22). We have previously (10) been able to obtain a good qualitative correlation between the carcinogenicity of these hydrocarbons, as reported by Dipple (23), and a few structural descriptors. We have now extended the treatment to a quantitative evaluation and have found the results shown in Table 1.
The data consisted of the 43 unsubstituted PAHs reported by Dipple as having no, marginal, moderate, or high carcinogenicity when painted on or injected in mice and rats. The data was reviewed and evaluated in a previous paper (10). As shown in Table 1, five descriptors were automatically selected by the program as being relevant to activity. Three of them are biophores, i.e., cause activity, while descriptors 4 and 5 are biophobes, i.e., prevent activity. While biophore 1, which describes essentially a "bay" region, is seen to be the most prevalent (i.e., it appears first in the list), biophore 3 appears to be the most potent (coefficient = 1.7). The latter represents the substructural unit (I).

[Structures (I) and (II)]
The structure is clearly part of a benzo(a)pyrene entity. Coupled with biophore 1, the bay region, it clearly bestows high activity on a molecule. Biophobe 4 defines an L region, clearly unfavorable, while biophobe 5 defines a kind of K region (II), which is a rather surprising result. These effects, though, can be rationalized by considering that the two biophobes identify highly reactive regions that, if present, compete with the bay region for initial metabolism.
The correlation is satisfactory; R = 0.832 while the standard deviation is 0.813, i.e., less than one unit of activity. All inactive compounds were correctly assigned, but seven of the active compounds were not identified as carcinogens even though they were found experimentally to be moderately active.

N-Nitrosamines
Here also, we had previous experience with the data (25), having used part of it to evaluate qualitatively the toxicity of the cyclic compounds. We have now extended the data base to include 69 N-nitrosamines, both cyclic and acyclic. The results appear in Table 2. The program selected the four descriptors shown below Table 2.
All except one are biophores, reflecting the large proportion of N-nitrosamines that showed activity. The only biophobe, CH-C=O, indicates that the presence of a carbonyl group in any of the molecules prevented activity. Biophores 2 and 3 indicate that, to be carcinogenic, the carbon atoms alpha to the N-nitroso group must bear at least two hydrogen atoms, a fact that is compatible with the preferred mechanism of action of these molecules (26). Biophore 3 is the most potent (coefficient = 1.7), indicating that methylation is probably the most toxic of the possible alkylation events.
The program recognized all but one active molecule; however, it predicted eight of the inactive ones to be active. This may indicate that some additional biophobe has not been properly identified. An alternative explanation, though, could be that some additional descriptor, for example the partition coefficient, plays an important role in the activity of the molecules.

Miscellaneous Data Base
In this data base, we included both the data for the PAHs and the N-nitrosamines. We also added 20 miscellaneous paraffins, believed to be inert, and submitted the total 127 compounds to CASE analysis. To a degree, the three data bases that constitute this miscellaneous group are not consistent, since the endpoint had been evaluated differently in each of them. Nevertheless, the property they represent is sufficiently well defined to warrant an attempt at normalization. The program picked up five descriptors of which four were biophores and one a biophobe. The results appear in Table 3.
As can be seen, all descriptors were already represented in the individual QSARs. Thus the data base (Table 4) is not truly diverse, since the structures of the three categories of compounds are largely orthogonal. Nevertheless, as a first step towards the development of a QSAR for diverse compounds, the results are of some significance.
The index of correlation R was found to be 0.800 while the standard deviation of residuals is now close to one activity unit, i.e., 0.969. The number of incorrect predictions is 17, i.e., about 15% of the total number of molecules in the data base. This is probably the lower limit of what can be expected from predictions derived from such calculations. Close examination of the data reveals that the errors in the global QSAR strongly parallel those found in the individual ones. This again is an indication of the orthogonality of the congeneric data bases that were used to make it up.

Conclusion
The method works well and appears useful for correlating, and possibly predicting, the activity of congeneric molecules. Further work is needed to assess the potential of this method for general application to diverse data sets.
The process is totally automatic and the QSARs are obtained merely by requesting that such an evaluation be made. The nature of the descriptors is clearly related to the activity of the molecule, thus providing the impetus to initiate other types of QSAR applications, based, for example, on the quantum mechanical indices of the atoms of the substructures that are selected as being relevant to activity. We plan to investigate these avenues in the future.
The author wishes to thank Professor H. Rosenkranz for stimulating this work, for helping with numerous suggestions, and for providing much of the needed biological data. The author is also grateful to the Environmental Protection Agency for its financial support for the study of structure-activity relationships in potential carcinogens (Grant RS100623). The data base for nitrosamines was provided by Dr. W. Lijinsky, Chemical Carcinogenesis Program, Frederick Cancer Research Center, Frederick, MD 21701 (USA) (1978).