Analysis and comparison of information and data recorded in carcinogenicity and genotoxicity databases.

The Interlab Project is a university-industry joint project recently funded by the Italian government as part of the improvement of the Italian research infrastructure; among its short-term goals are the implementation of data banks of biomedical interest and the spread of informatics tools for biomedical research. Results of both long-term assays of carcinogenicity in rodents and short-term in vitro and in vivo tests of genotoxicity are relevant for a wide body of users, ranging from carcinogenesis research laboratories to industries and governmental agencies. To evaluate the most appropriate ways of spreading information on these experiments, a detailed analysis of the information recorded in available databases has been carried out. Furthermore, the contents of the best-known databases have been compared, with respect to a specific compound, to evaluate both the overall reliability of these systems, compared to longer and more complex assessments carried out manually starting from bibliographic searches, and the level of concordance among them.


Introduction
Telematics, the synergic use of informatics and communication systems to improve the transmission and sharing of data among computers at an international level, has had an enormous impact on the research environment. In recent years, a number of networks have been set up, and many on-line data banks have been created. Toxicity research, in particular that relating to carcinogenicity and genotoxicity, has become involved in, and to some extent benefited from, these initiatives. Moreover, the dramatic improvement of electronic technologies, which has led to the design of high-performance, low-cost computers, and the sharpening of software methodologies, which in turn has led to the development of standardized database management systems, have given rise to the establishment of many databases on the same topics.
Thus, at the moment, a number of different sources for carcinogenicity and genotoxicity data are available. If the former are relatively few, this is not the case for the latter, since many new results are continuously being published. Availability of these tools to the end users has been guaranteed by means of a number of systems, ranging from literature reports to personal computer software, on-line data banks, and CD-ROM (compact disk read-only memory).

Interlab Project
The Interlab Project is a university-industry joint project (1). The promoting institutions for the project are the National Institute for Cancer Research (IST) of Genoa; the InterUniversity Center for Cancer Research (CIRC), grouping five Italian universities; and Ansaldo SpA, an Italian leader in information systems. The project was funded on a 2-year scheme in June 1989 by the Italian Ministry for University and Scientific and Technological Research, with the goal of improving the Italian research infrastructure.
The main objectives of Interlab are to improve existing collaborative links among biomedical research centers operating in kindred fields and to spread the use of computer tools devoted to this research area in Italy, where background in informatics and awareness of its relevance are still lacking. A communication network has been set up to allow for easy and steady communication among institutes, whose researchers can exchange messages by means of an electronic mail system. It is planned that, in the future, a bulletin board system will also be available, as well as a custom service for bibliographic searches.
Furthermore, centralized, on-line factual data banks on availability of biological material in Italian laboratories have been implemented to allow for quick, exhaustive, and easy retrievals of constantly updated information in research areas in which few data were available. Personal computer versions of these databases are being created to help researchers maintain their collections of biological materials and guarantee a steady flow of up-to-date information.
Among the main design criteria, two are particularly worth mentioning because they give the system its specificity. These criteria are the use of the relational approach in defining data structures and the particular care in designing a friendly user interface. The relational approach has been adopted instead of the more traditional information retrieval approach in consideration of the nature of the information to be recorded.
Apart from its theoretical basis, the most evident characteristic of the data bank is that almost all data are coded and recorded in structured fields. This leads to the creation of complex data structures by means of which "normal data formats," i.e., formats in which data are recorded without any redundancies, can be obtained. One of the advantages of this approach is that searches can only be carried out in specific contexts, i.e., with respect to a specific item of information, thus both avoiding confusion arising from coincidental correspondence of terms and guaranteeing their exhaustiveness. Furthermore, searches are executed in a very efficient way and can thus be performed on almost all kinds of computers, independent of their speed and capacity. An automatic validation of data being inserted can be carried out by the system as well.
Moreover, since the relational approach leads to specific data structures, ad hoc queries, specifying which terms can be searched and in which context, must be defined, and ad hoc application software, aimed at creating the user interface, must be developed.
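The relational design described above can be illustrated with a minimal sketch. It is not the Interlab implementation, which relies on Oracle: the example substitutes Python's built-in sqlite3 module, and the table names, column names, mnemonic codes, and cell line names are all hypothetical. It shows a controlled vocabulary stored only once, a search restricted to one context (species), and validation of data at insertion.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a controlled vocabulary (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE species (
            code TEXT PRIMARY KEY,          -- mnemonic code, e.g. 'HS'
            name TEXT NOT NULL UNIQUE       -- complete term, stored once
        );
        CREATE TABLE cell_line (
            id           INTEGER PRIMARY KEY,
            name         TEXT NOT NULL UNIQUE,
            species_code TEXT NOT NULL REFERENCES species(code)
        );
        """
    )
    conn.executemany(
        "INSERT INTO species (code, name) VALUES (?, ?)",
        [("HS", "human"), ("MM", "mouse")],
    )
    conn.executemany(
        "INSERT INTO cell_line (name, species_code) VALUES (?, ?)",
        [("LINE-A", "HS"), ("LINE-B", "MM"), ("MOUSE-LIKE", "HS")],
    )
    conn.commit()
    return conn


def lines_by_species(conn, species_name):
    """Search in the species context only; the cell line name is never matched."""
    rows = conn.execute(
        """
        SELECT c.name
        FROM cell_line AS c
        JOIN species AS s ON s.code = c.species_code
        WHERE s.name = ?
        ORDER BY c.name
        """,
        (species_name,),
    ).fetchall()
    return [row[0] for row in rows]


def try_insert(conn, name, species_code):
    """Insert a cell line; return False if the database rejects the data."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO cell_line (name, species_code) VALUES (?, ?)",
                (name, species_code),
            )
        return True
    except sqlite3.IntegrityError:
        # Unknown species code or duplicate name: validation at insertion time.
        return False


if __name__ == "__main__":
    db = build_demo_db()
    print(lines_by_species(db, "mouse"))
    print(try_insert(db, "LINE-C", "XX"))
```

In this sketch, a search for the species "mouse" does not return the human line whose name happens to contain the word, which is the kind of coincidental correspondence of terms that a free-text search could produce.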
The databases have been implemented on a Unix-based microcomputer, and the relational database management system Oracle, a commercial software package available worldwide, has been adopted. Essentially, Oracle has been chosen for its wide spectrum of versions, ranging from personal computers to mainframes, and for its good modularity and portability, which make the creation of database versions for other computers and of new releases easier and quicker.
The user interface is always a fundamental part of the system because it determines the real accessibility of the data. In our case, it also had to be easy to use and as clear as possible for people without specific skills in informatics. It has been carefully designed for general features, which are valid for all the applications, and for specific features, which are valid only in specific contexts, such as insertion and query.
The Unix operating system and SQL (Structured Query Language, the standard query language for relational systems) have been hidden from the end users, and extensive use has been made of menus and of contextual help. Furthermore, to simplify the interaction between the user and the applications, a limited number of function keys is used and, when possible, words taken from informatics jargon, like field, record, and block, have been replaced with more widely used terms.
Among the specific features of the user interface, particularly relevant are those devoted to optimizing the use of controlled vocabularies, such as the extensive use of mnemonic codes instead of the complete terms and an automatic display of the list of the available items. Moreover, data are validated during insertion, queries are defined according to high-level macroinformation, and data are presented in coherent subsets.
Until now, three databases have been implemented and are available on-line. The first one relates to cell lines (CLDB), the second to HLA-typed B-lymphoblastoid cell lines (BLDB), and the third to oligonucleotides (MPDB). CLDB contains data on 720 cell lines available in Italian laboratories. More than 60% of these lines are original, that is, not described in any other commercial or scientific catalog. In fact, CLDB data collection highlighted the presence of many well-characterized, small collections of cell lines. The CLDB data structure is quite complex and is based on two substructures. The first relates to information that univocally identifies the cell line, the second to information that is specific for a laboratory in which the cell line is collected. Among the former are the name, the origin (species, strain, sex, etc.), and possible transformations; among the latter are culture conditions and validation assays performed. Controlled vocabularies have been defined for most information. Among them are species and relative strains, morphologies, tumors, transforming agents, applications, and functions.
Searches can be carried out using three different approaches: by name, by origin, and by function. Using the first approach, the search can be conditioned on the basis of the name, the presence in a given catalog, and/or the identification code in a catalog. The query by origin can be used to retrieve cell lines having a given species, strain, tissue, tumor, and pathology. Finally, the query by function relates to cell line applications and specific functions. A new approach, based on a query related to the transforming agent, will be added in the near future. Following the retrieval of the desired cell lines, information can be displayed according to coherent subsets, which are identification, origin, specification, ownership, and culture data. Both detailed and synthetic reports can be generated with reference to one single cell line or many cell lines having some common characteristics. BLDB and MPDB have been designed using the same criteria. BLDB contains data on approximately 750 B-lymphoblastoid lines available from the laboratories of the European Collection for Biomedical Research (Essen, Germany and Genoa, Italy). Data from two other European collections are being added. At the moment, the prototype of MPDB contains data on oligonucleotides produced by the internal service of the National Institute for Cancer Research of Genoa. It will be flanked by a service for the production of custom oligonucleotides.

Carcinogenicity and Genotoxicity Databases
There are many carcinogenicity and/or genotoxicity databases that are available to the end users. Some of them are hosted by computers of large information companies and can be searched on line. Others are not available on line, but their complete, up-to-date dump can be obtained from database administrators on floppy disks or tapes, at times with some ad hoc software that allows for their management. Finally, some are available to the end users only in a printed format.
Often, when biologists want to search these databases, they are not fully aware of the main goals and specificities of each of them. The databases included in this analysis have been chosen on the basis of their availability and relevance, the latter measured in terms of quantity of data, promoting institution, involvement of international agencies, and geographic origin, and have been analyzed from the point of view of these inexperienced biologists.
In regard to on-line databases, Registry of Toxic Effects of Chemical Substances (RTECS) (2), Chemical Carcinogenesis Research Information System (CCRIS) (3), and Environmental Chemical Data and Information Network (ECDIN) (3) have been considered. RTECS is a factual, nonbibliographic data bank, built and maintained by the National Institute for Occupational Safety and Health (NIOSH) and hosted by a number of well-known host computers in the United States, Europe, and Australia; it contains mutagenesis studies on nearly 10,000 substances and carcinogenesis studies on about 3,400 substances. CCRIS is hosted by Chemical Information Systems, Inc. (CIS) and the National Library of Medicine (NLM); it contains carcinogenicity, co-carcinogenicity, and mutagenicity data on 1,269 substances. ECDIN is a factual data bank, created by the Joint Research Centre (JRC) of the Commission of the European Communities (CEC) at Ispra, Italy; it is hosted by Datacentralen.
Even if these data banks have different objectives, they are all comprehensive in the sense that they present both carcinogenicity and genotoxicity data. In addition to these, another comprehensive database has been included in the analysis, the Biological Database (BL-DB) (4), although it is not available on line. It is a factual database, containing data on mutagenicity and carcinogenicity.
Databases that are specific for carcinogenicity or genotoxicity have also been included in the analysis. They are, respectively, the Carcinogenic Potency Database (CPDB) (5) and the Gene-Tox Carcinogen Database (GTCDB) (6) for carcinogenicity and the Genetic Activity Profile Database (GAP) (7,8) and the GEN Database (GEN) (9) for genotoxicity. CPDB contains standardized data on 4000 animal experiments with about 1000 chemical compounds. GTCDB contains data on more than 500 selected chemicals. GAP provides activity profiles and corresponding listings of data and references for each chemical analyzed. Two data sets are included in the GAP software: one related to the International Agency for Research on Cancer (IARC) (277 agents) and one related to the U.S. Environmental Protection Agency (EPA) (167 agents).
Selection of data for the analysis has been carried out manually for CPDB and GTCDB and on the basis of ad hoc management software distributed with the databases for GAP and GEN.

Objectives and Results
The research was mainly aimed at a) comparing the types of information recorded in each database, b) determining a common, basic data set, c) evaluating data overlapping between databases, d) evaluating general agreement/disagreement among results reported by different databases, and e) evaluating general reliability of databases, as tools able to provide the basis for rapid and efficient synthetic evaluations on a given chemical or groups of chemicals.
A detailed analysis of the information recorded in available carcinogenicity databases has been carried out. Data have been subdivided into two groups, devoted, respectively, to the description of the compound and of the experiments, and each of these into many coherent subgroups (Tables 1 and 2).
As far as the description of the compound (Table 1) is concerned, although identification data seem adequate for all the databases, the other subgroups of information are lacking. In particular, other physicochemical parameters that could be relevant for carcinogenicity, such as hydrophobicity and various types of structural alerts, and pharmacokinetics data are not reported at all. Furthermore, recognized evaluations, such as IARC and National Cancer Institute/National Toxicology Program (NCI/NTP) classifications, are generally neither listed nor sufficiently highlighted. Supplementary information, which could be relevant for accessing compound data on the basis of the chemical class or use, is reported very rarely. Finally, links to other databases are almost completely absent.
In regard to experiment description (Table 2), the set of information taken into account is almost the same for all the databases, but, with the exception of species, strain, and sex of the animals and route of administration of the compound, these items are described in a number of different, nonstandardized ways. This is particularly evident for information related to experimental design and results. Other information on results, such as tumor latency, which is relevant for risk assessment, is normally absent. Quantitative evaluations are rarely present and bibliographic references are not standardized.
Data overlapping has been evaluated by comparing citations and single long-term animal experiments reported by each carcinogenicity database. To this end, a specific compound (benzene) has been chosen and all related information has been selected from the databases and analyzed (Tables 3-5). Every carcinogenicity study individually identifiable on the basis of bibliographic reference, species, strain, sex, and route of administration has been considered as a separate experiment. Data show that NCI/NTP experiments are normally listed, even if not all the experiments that were carried out in this context are reported (Table 3). Some misunderstanding can arise in regard to IARC monographs, which, being surveys, do not list any original experiment: in two cases a monograph was reported as an original reference for the experiment, while, in the others, references to original experiments reported also in an IARC monograph were given. Apart from NCI/NTP technical reports and IARC monographs, databases reported, in most cases, a great number of original experiments (ranging from 7 to 12), but the overlap was poor: only 5 out of 37 experiments were reported in more than one database. Total experiment redundancy (still excluding those of the NCI/NTP and IARC), corresponding to the percentage of experiments reported more than once, is thus about 14%.
A similar situation can be shown by analyzing citations only (Table 4). In this case, since ambiguities possibly arising from different interpretation of results shown in papers are absent, data are more readable. Total redundancy, corresponding to the percentage of references cited in more than one database, is ca. 22%. Citations and citing databases are reported in Table 5.
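The redundancy figures above are the share of distinct items (experiments or citations) that appear in more than one database. A minimal sketch of that calculation follows; the function name and the synthetic identifiers are assumptions made for illustration and do not reproduce the contents of Tables 3-5. For experiments, each item could be a tuple of bibliographic reference, species, strain, sex, and route of administration, which is the identification key used in the text.

```python
from collections import Counter


def redundancy(items_by_database):
    """Percentage of distinct items that are reported by more than one database.

    items_by_database maps a database name to an iterable of hashable items
    (hypothetical identifiers for experiments or citations).
    """
    counts = Counter()
    for items in items_by_database.values():
        # set() ensures an item listed twice in one database counts once.
        counts.update(set(items))
    total = len(counts)
    if total == 0:
        return 0.0
    shared = sum(1 for n in counts.values() if n > 1)
    return 100.0 * shared / total


if __name__ == "__main__":
    # Synthetic example: 37 distinct items, 5 of them present in both databases.
    synthetic = {
        "DB1": list(range(0, 15)),
        "DB2": list(range(0, 5)) + list(range(15, 37)),
    }
    print(f"{redundancy(synthetic):.1f}%")
```

With 5 shared items out of 37 distinct ones, the function returns 500/37, which rounds to the "about 14%" quoted for experiments.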
The low redundancy that has been found can be explained on the basis of many different reasons. One possible explanation is that more than one paper can present and discuss the same original experiment, possibly with some marginal updating or deeper analysis. In this case, the same data could be inserted in
b Free-text description of species and strain.
c Age, weight.
d Free-text description of nonstandardized doses and duration.
e Free-text description of duration and either lowest dosage inducing a significant increase in tumor incidence or dosage inducing a significant increase in tumor incidence.
f Separate description of single dose, total dose, and comment on dose.
g Separate description of frequency and duration of administration.
h Free-text description of target organ and tumor.
i Free-text description of target organ, tumor, and effects.
j Tabular format and short text description.
k Author's opinion, if stated.
Though these explanations can help to account for the differences that we have pointed out, it should nonetheless be taken into account that the main goal of factual databases is to give end users sufficient data without their having to read the original papers. From this point of view, the current situation could be misleading: it could give the impression that there are more data than in reality and could lead to overestimating experiments reported more than once.
In regard to the evaluation of agreement among results reported by different databases, genotoxicity databases have been compared, still with respect to benzene; this compound shows a somewhat puzzling behavior (Tables 6-9). Even if, considering the whole set of experiments reported by the IARC data set of the GAP database, the compound should be considered as prevailingly negative because it has only a 25% positive result rate (Table 6), a clear positiveness is shown for in vivo tests, here represented only by chromosomal damage assays. More specifically, relatively few in vitro DNA damage short-term experiments show a clearly negative pattern, with positive results constantly lower than 27%. Conversely, chromosomal damage experiments show a clear negative behavior for in vitro tests, both with and without metabolic activation, and a clearly positive one for in vivo tests; the percentage of positive results ranges from 20 to 30% in the former case and attains 85% in the second, with a general mean of 44%. Finally, mutation experiments show a clearly negative behavior, with percentages of positive results ranging from 14 to 19%. These results are substantially confirmed by comparison with other databases (Tables 7 and 8), although fewer data are available.
Table 6. No DNA damage and chromosomal damage data were available on CCRIS. RTECS was not considered because it does not report effects on single genotoxicity tests. Abbreviations: GAP, Genetic Activity Profile database; GEN, GEN database.
a For each end point, total negative, inconclusive (I), and positive figures are reported. Each test system has been considered only once and if the same test was performed in more than one experiment, an overall evaluation is reported. For GAP data, the test has been considered negative when less than 25% of results were positive and otherwise positive. GEN data are already listed as overall evaluations in the database.
Although RTECS cannot be compared because it does not report results of single experiments, it is possible for ECDIN and CCRIS, and data available confirm benzene behavior for both mutation and chromosomal damage assays.
To compare GAP and GEN data, the former must be re-examined. In fact, instead of listing the results of all published experiments, GEN reports an overall evaluation of all experiments related to one specific assay. Re-examination of GAP data has been carried out considering a test negative when more than 75% of the results were negative, inconclusive when less than 50% of the results were positive, and positive otherwise.
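The re-examination rule stated above can be written as a small function. This is only a sketch under stated assumptions: the function name is hypothetical, the checks are applied in the order given in the text (negative first, then inconclusive, then positive), and percentages are taken over all results for the assay, including any individually inconclusive ones. Integer arithmetic is used so that the 75% and 50% boundaries are handled exactly.

```python
def overall_evaluation(n_positive, n_negative, n_inconclusive=0):
    """Collapse the individual results of one assay into a single call.

    Rule taken from the text: negative when more than 75% of results are
    negative, inconclusive when less than 50% are positive, positive otherwise.
    The order of the checks and the denominator are assumptions.
    """
    if min(n_positive, n_negative, n_inconclusive) < 0:
        raise ValueError("counts must be non-negative")
    total = n_positive + n_negative + n_inconclusive
    if total == 0:
        raise ValueError("at least one result is required")
    if 4 * n_negative > 3 * total:      # more than 75% negative
        return "negative"
    if 2 * n_positive < total:          # less than 50% positive
        return "inconclusive"
    return "positive"


if __name__ == "__main__":
    for pos, neg in [(1, 9), (4, 6), (6, 4)]:
        print(pos, neg, overall_evaluation(pos, neg))
```

An assay with exactly 75% negative results is not called negative under this reading, because the text requires more than 75%.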
Even if far fewer data are available after this re-examination (Table 8), the previously shown behavior of benzene is substantially maintained and confirmed by both databases.
This substantial consistency of experiments on benzene reported in many different databases demonstrates that, although some of them are lacking for particular types of tests, end users can trust databases to simplify, improve, and speed up their work. The great differences existing among databases, nevertheless, indicate the necessity of considering some databases as more reliable than others.
The benzene activity profile in short-term genotoxicity assays, as provided by databases, has also been compared to the results of an extensive continuous bibliographic search (Table 9) that is being carried out by researchers of the Institute of Oncology of the University of Bologna (S. Grilli, personal communication). The result of the bibliographic search (BS) and analysis lists 217 experiments on benzene versus the 157 reported in GAP, i.e., approximately 38% more. The distribution of these assays (Table 9) highlights that these are not extra references not yet included in GAP, but that the two sets of data are different. Indeed, for example, BS reports 94 in vivo chromosomal damage experiments and 32 in vitro chromosomal damage experiments, while GAP reports only 27 and 58, respectively. The overall ratio of positive experiments is higher for BS than for GAP (62% against 31%). This is only partly due to the presence of 10 clearly positive in vivo DNA damage experiments. In fact, there is a constantly higher ratio of positive results in BS than in GAP for all groups of experiments. This does not modify existing differences between in vitro and in vivo assays and between mutation and chromosomal damage assays. This comparison, however, seems to suggest that experience and knowledge present in some institutions should not be missed and that a multinode data input scheme, in which every node is in charge of the insertion of data relative to its main expertise, would be preferable if common criteria and standards of quality could be achieved.

Conclusions
Types of information recorded in six different carcinogenicity databases, three of which are already available on-line, have been compared to verify whether a common format was used, thus allowing data interchange, and to identify a basic data set. This comparison showed an extremely diversified situation in which the physicochemical characterization of chemical compounds, as well as overall evaluations and inter-database references, are poor. Furthermore, the descriptions of the experiments were extremely variable and nonstandardized. Suggestions on information not yet taken into account, but relevant in view of a unified data set for carcinogenicity, have been given. The comparison of experiments on benzene and of the respective bibliographic references reported by carcinogenicity databases showed that each of them lists many original works not reported by any other database, thus producing a low redundancy of references. Explanations for this unexpected result have been proposed, though this diversified reality is actually what appears to end users.
Results reported on benzene by four different genotoxicity databases have been compared to verify the general agreement among them and their overall reliability. Results showed the same, well-known, global activity profile for all the databases, though only GAP seemed to present data on all different test systems.
Finally, GAP genotoxicity results for benzene have been compared with data obtained by means of a continuous bibliographic search. This comparison showed, on one hand, that GAP substantially presents a true image of reality and, on the other, that even for a good database, a great percentage of short-term experiments can still be missed.
In conclusion, this work shows that a) databases can be extremely useful for researchers in the fields of carcinogenicity and genotoxicity because they can unambiguously represent reality and prevent, at the same time, long and expensive surveys of the original data, b) a common basic data set for carcinogenicity and genotoxicity does not yet exist and data cannot be exchanged easily among databases, c) the availability of many databases does not help the end users; instead, it can create misunderstanding, overestimation, or confusion, and d) an effort should be made to define a common reference format and to identify and support the best databases, even by multiplying the input nodes, to achieve database exhaustiveness.
This work has been partially funded by the Italian Ministry of University and of Scientific and Technological Research within the sphere of the Interlab Project. The authors thank Professor S. Grilli and Dr. A. M. Colacci of the Institute of Oncology of the University of Bologna, Italy, for their kindness in providing data, which they are continuously collecting from published literature, on short-term tests on benzene. The authors also thank Dr. M. Evangelisti and Dr. A. Bogliolo of the Scientific Information and Documentation Service of the Library of the National Institute for Cancer Research of Genoa for support in carrying out searches on on-line databases.