Laying a Community-Based Foundation for Data-Driven Semantic Standards in Environmental Health Sciences
Despite increasing availability of environmental health science (EHS) data, development and implementation of relevant semantic standards, such as ontologies or hierarchical vocabularies, have lagged. Consequently, integration and analysis of the information needed to better model environmental influences on human health remain a significant challenge.
We aimed to identify a committed community and mechanisms needed to develop EHS semantic standards that will advance understanding about the impacts of environmental exposures on human disease.
The National Institute of Environmental Health Sciences sponsored the “Workshop for the Development of a Framework for Environmental Health Science Language” hosted at North Carolina State University on 15–16 September 2014. Through the assembly of data generators, users, publishers, and funders, we aimed to develop a foundation for enabling the development of community-based and data-driven standards that will ultimately improve standardization, sharing, and interoperability of EHS information.
Creating and maintaining an EHS common language is a continuous and iterative process, requiring community building around research interests and needs, enabling integration and reuse of existing data, and providing a low barrier of access for researchers needing to use or extend such a resource.
Recommendations included developing a community-supported web-based toolkit that would enable a) collaborative development of EHS research questions and use cases, b) construction of user-friendly tools for searching and extending existing semantic resources, c) education and guidance about standards and their implementation, and d) creation of a plan for governance and sustainability.
Mattingly CJ, Boyles R, Lawler CP, Haugen AC, Dearry A, Haendel M. 2016. Laying a community-based foundation for data-driven semantic standards in environmental health sciences. Environ Health Perspect 124:1136–1140; http://dx.doi.org/10.1289/ehp.1510438
This review is derived from a workshop held at North Carolina State University, Raleigh, North Carolina, USA, on 15–16 September 2014. Sharing, analysis, and integration of environmental health science (EHS) data are limited by a lack of data standards, in particular, common language standards. Language standards are shared vocabularies that are used for data annotation and common data elements specification to aid interoperability. They may be as complex as an ontology, whereby the terms and the relations between them are defined using logic and are expressed in computable languages such as the Web Ontology Language (OWL 2016), or they may be as simple as a hierarchical vocabulary. This workshop aimed to a) articulate research areas that would be advanced by EHS language standards and data interoperability, b) identify a community to initiate the creation and champion the extension of EHS language standards, and c) develop guidelines for the development of EHS standards.
Exposure to environmental factors significantly impacts human health. The environment, broadly defined, can range from everyday products (e.g., toothpaste) to hazardous materials (e.g., open pit mining sites) and socioeconomic stressors. Consideration of this spectrum is needed to better understand how, when, and to whom exposures pose health risks. There is an enormous amount of available data that, if structured and integrated, could be leveraged to inform mechanistic hypotheses, therapeutic approaches, and policy making. However, a lack of semantic standards has been a major barrier to data sharing and integration (van Panhuis et al. 2014). This need for semantic standards is being recognized in many areas of biomedical research. For example, the National Research Council’s report titled “Toward Precision Medicine” called for clinical and research advancements based upon systems that would be enabled by a new language standard (NRC 2011). The authors of this report (the Committee on a Framework for Developing a New Taxonomy of Disease, Board on Life Sciences, and Division on Earth and Life Studies) determined that “The rise of data-intensive biology, advances in information technology, and changes in the way health care is delivered have created a compelling opportunity to improve the diagnosis and treatment of disease by developing a Knowledge Network, and associated New Taxonomy, that would integrate biological, patient, and outcomes data on a scale hitherto beyond our reach” (NRC 2011).
Development of semantic standards, such as logically constructed ontologies, for EHS data, and integration of this effort within the broader biomedical context through crosscutting research programs, such as the Exposome (Wild 2005) and Big Data to Knowledge (BD2K) (Margolis et al. 2014), will enhance the capacity to inform disease research with environmental data while also improving understanding of environmental impacts on human disease. The lack of language standards and their consistent implementation not only affects the capacity to analyze across diverse data sets but also hinders the ability to identify available data sets, limiting the value of potentially important scientific findings. A query of microbiome samples using PubMed from the National Center for Biotechnology Information (NCBI 2016) illustrates the variability in results that stems from a lack of harmonized language standards and annotation of data using such standards (Table 1). Standardization has the potential to benefit many areas of biomedical science by augmenting discovery and reuse (Richesson and Nadkarni 2011; Tenopir et al. 2015; Zimmerman 2008).
|Query|No. of results|
|---|---|
|Stool NOT faeces|21,798|
|Stool NOT feces|18,314|
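The divergent counts in Table 1 arise because spelling variants are treated as distinct search terms. The following is a minimal sketch, assuming a hypothetical synonym table and invented sample records (neither is drawn from PubMed or any existing standard), of how mapping variants to one preferred term lets different queries return the same records.

```python
# Hypothetical synonym table: each spelling variant maps to one preferred term.
# The choice of "feces" as the preferred label is an assumption for illustration.
SYNONYMS = {
    "stool": "feces",
    "faeces": "feces",
    "feces": "feces",
}


def normalize(term: str) -> str:
    """Return the preferred term for a label; unknown labels are only lowercased."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)


def search(records: dict, query: str) -> list:
    """Return sorted IDs of records whose label matches the query after normalization."""
    target = normalize(query)
    return sorted(rid for rid, label in records.items() if normalize(label) == target)


# Invented records: sample ID -> free-text sample source label.
records = {"S1": "Stool", "S2": "faeces", "S3": "feces", "S4": "saliva"}

# All three spellings retrieve the same set of samples.
results_by_query = {q: search(records, q) for q in ("stool", "faeces", "feces")}
```

In this sketch the harmonization happens both when records are indexed and when the query is interpreted, which is why annotation of data with the standard matters as much as the standard itself.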
A few projects have specifically demonstrated the potential of adopting standards to advance EHS data integration, research, and discovery. For example, the Oceans and Human Health program [supported by the National Institute of Environmental Health Sciences (NIEHS) and the National Science Foundation] links oceanographic and metagenomics data sets (NCBI’s Sequence Read Archive, Metagenomic Rapid Annotations using Subsystems Technology) (NCBI-SRA 2015; Youngblood et al. 2014), and custom public health databases (Antibiotic Resistance Database, Computer Access to Research on Dietary Supplements Database) (Liu and Pop 2009; NIH 2015) using ontologies to provide an innovative, health-based framing for oceanographic observatories (microbial diversity and antibiotic resistance of ocean ecosystems) (Port et al. 2012, 2014). The Comparative Toxicogenomics Database (CTD) (Davis et al. 2015) provides integrated information about chemicals, genes and proteins, phenotypes, diseases, and exposures to provide mechanistic insights into the effects of the environment on human health (Davis et al. 2015). Data are annotated and integrated using public ontologies describing chemicals (MeSH) (NLM 2015b), genes and proteins (Entrez Gene) (NLM 2015a), diseases (MeSH), and interactions (CTD interaction ontology) (Davis et al. 2015). Consequently, users may query cross-species mechanistic data for specific or broad classes of chemicals and identify associated diseases or disease models. Broader development and adoption of EHS standards will be necessary to ensure access, re-use, innovative integration, and ongoing re-analysis of data that describe the complex interactions between the environment and human health.
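Querying for specific or broad classes of chemicals depends on a vocabulary that records which terms are narrower than others. The sketch below assumes a small hypothetical hierarchy and placeholder disease labels; it does not reproduce the MeSH tree or any CTD data, and only illustrates how annotating data with a specific term makes it retrievable under a broader one.

```python
# Hypothetical child -> parent hierarchy; not the actual MeSH tree.
PARENT = {
    "arsenic": "metals and metalloids",
    "lead": "metals and metalloids",
    "metals and metalloids": "chemicals",
    "bisphenol A": "phenols",
    "phenols": "chemicals",
}


def ancestors(term):
    """Yield the term itself, then each broader term up to the root."""
    while term is not None:
        yield term
        term = PARENT.get(term)


def query(annotations, broad_term):
    """Return (chemical, disease) pairs whose chemical falls under broad_term."""
    return [(chem, dis) for chem, dis in annotations if broad_term in ancestors(chem)]


# Placeholder annotations; the disease labels are deliberately generic.
annotations = [
    ("arsenic", "disease A"),
    ("lead", "disease B"),
    ("bisphenol A", "disease C"),
]

metal_hits = query(annotations, "metals and metalloids")
all_hits = query(annotations, "chemicals")
lead_hits = query(annotations, "lead")
```

A simple parent map like this corresponds to the hierarchical vocabulary end of the spectrum; an ontology would add logically defined relations beyond the single broader-than link assumed here.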
Gaps in EHS Semantic Standards
The data standardization needs within EHS are diverse and include genomics, metabolomics, chemistry, toxicology, epidemiology, exposure science, phenotypes, geospatial data, and clinical health records, among others. While some of these components are better standardized than others (e.g., genomics) and not necessarily specific to EHS, it is the need for integration across these diverse entities in order to better model the complexity of environmental health interactions that is unique. In apparent contradiction, there is a large number of existing standards (Tenenbaum et al. 2014), yet often the needed content is missing, occurs redundantly in more than one context, or cannot be found. Although there are several public resources that have centralized some publicly available semantic vocabulary standards and ontologies (OBO Foundry, NCBO BioPortal, Biosharing.org, Ontobee) (Biosharing 2015; NCBO 2015; Smith et al. 2007; Xiang et al. 2011), there is still limited capacity for the community to identify the concepts they need across the spectrum, contribute in such a way that reduces redundancy and enhances existing standards, and easily compare the content between selected standards. In addition, few of these resources are associated with the data that are annotated using the ontologies or vocabularies. This disconnect leads to semantic standards that are not necessarily built fit-for-purpose and lack examples that would help users determine which standards would be most appropriate for their needs. There is a need for a tool in this space to inform decision making about whether to incorporate an existing standard, extend such resources, or create and coordinate new standards. Critical to this decision making is the need to link to existing data sets in which semantic standards have been applied and to understand the impacts of standards use and evolution on downstream data analyses.
Further, EHS needs to incorporate emerging biomedical concepts (e.g., the exposome) that are not adequately represented among existing vocabulary resources. Consequently, there is a need for tools that allow community-based development of new standards, such as in cases of emerging research areas.
A critical component of development and adoption of semantic standards is community agreement on the meaning of terms and their use in different contexts. Gaining agreement is often difficult and imperfect, and consideration should be given to achieving agreement where there is a natural propensity, whether at a specific level of detail or around specific concepts. Semantic disagreements can be due to community diversity, overspecification of terms, or changes in the meaning of terms over time. In cases where agreement cannot be achieved, community-specific synonyms must be incorporated to avoid limiting the utility of the standard or stalling future development. Furthermore, once a standard is available, its value is largely determined by the datasets and projects that adopt it. Wide adoption of standards is best achieved when diverse constituencies, such as data generators, data users, standards developers, publishers, and government agencies are involved and incentivized to participate in community education, participation, and tool building. New tools are needed to cultivate a greater degree of collaborative development.
The Gene Ontology (GO) (Ashburner et al. 2000) is often referenced as a gold standard for ontology-based initiatives by virtue of its global community participation and implementation, development of tools to browse and access content, and its impact on data integration and analysis; however, it had humble beginnings, and there is much to be learned from its early roots and subsequent path. Developed with input from an international consortium to represent how genes encode biological functions at the molecular, cellular, and tissue system levels across diverse species, GO now describes more than 40,000 biological concepts (GO 2015). GO annotations are incorporated into countless biological resources and it has been cited in over 100,000 peer-reviewed articles (GO 2015). GO has enabled integrative analyses that are now common in genomic experiments, such as gene set enrichment. Drawing upon GO, the following successful features of a process for developing semantic standards were identified:
Start with simple and practical initiatives.
Utilize a modular building block approach to allow for flexibility and reuse.
Leverage and interoperate with existing standards where possible.
Develop language standards to work with scientific uncertainty.
Find balance between logic engineering and easier-to-use vocabulary editing.
Develop standards in close contact with the data and specific scientific need.
Focus on capturing scientific findings (i.e., durable facts).
Facilitate community-based collaborative curation of term definition and annotation.
Provide stable unique identifiers.
Incorporate significant time for community engagement and debate.
Provide accessible user interfaces for ongoing development.
In order to ensure buy-in and use of EHS standards, we provide the following eight recommendations for establishing a community willing to participate in the development of an EHS ontology and the resources needed to accomplish this development.
These guiding principles should be operationalized to serve as a resource for the EHS research community. A web-based toolkit could enable navigation of relevant standards from existing sources and serve as a collaborative infrastructure for community-based participatory research. Such a resource could include navigation not only of existing standards, but also the data within resources that leverage those standards. This connection would facilitate crowdsourcing approaches and tool development such as trackers, forum pages for the community to contribute use cases, and success stories. The intention of such a toolkit would be to complement and work synergistically to achieve an environmental health slice of existing standards efforts and technologies. For example, a project investigating the microbiome population and its response to different dietary and environmental exposures would need to standardize a) the microbiome species, b) the source from which the microbiome sample was taken (e.g., stool, mouth), c) a set of key nutrients, d) environmental contaminants, and e) disease and phenotypic characteristics at the time of sampling. The EHS toolkit could potentially go to the Human Microbiome Project (NIH HMP Working Group et al. 2009) to uniquely identify microbial strains, collect anatomical terms from the Uberon anatomy ontology (Haendel et al. 2014), foods from Wikipedia (2015), target chemicals from MeSH (NLM 2015b), diseases from the Disease Ontology (Schriml and Mitraka 2015), and phenotypes from the Human Phenotype Ontology project (Köhler et al. 2014). In choosing the terms, the user would want to see what data were already associated. For example, which phenotypes had been associated with the candidate disease? Which toxicants were found in the groundwater near certain population(s)?
The output would be a logically constructed collection of vocabulary terms that could be used in the project, edited, and contributed back to the source resources, while maintaining provenance.
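One way to picture such an output is as a list of terms that each carry a record of where they came from. The sketch below is an assumption-laden illustration: the `Term` structure, the `EXAMPLE:` identifiers, and the chosen labels are all hypothetical placeholders and are not real accessions from Uberon, MeSH, the Disease Ontology, or the Human Phenotype Ontology.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    term_id: str  # placeholder identifier, not a real accession
    label: str    # human-readable name used in the project
    source: str   # resource the term was taken from (its provenance)


def group_by_source(terms):
    """Group term labels by originating resource so edits can be routed back."""
    grouped = defaultdict(list)
    for term in terms:
        grouped[term.source].append(term.label)
    return dict(grouped)


# Illustrative project slice; identifiers and labels are invented for the example.
project_slice = [
    Term("EXAMPLE:0001", "mouth", "Uberon"),
    Term("EXAMPLE:0002", "stool", "Uberon"),
    Term("EXAMPLE:0003", "arsenic", "MeSH"),
    Term("EXAMPLE:0004", "asthma", "Disease Ontology"),
]

by_source = group_by_source(project_slice)
```

Keeping the source on every term is what would allow a project-specific edit to be contributed back to the right resource rather than forking a private copy.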
Development of an EHS toolkit would require expertise in technical standards development processes, such as software engineering that leverages the Web Ontology Language (OWL 2016). It would also require close collaboration with the various sources of vocabulary standards, to support interoperability and coordination of community contributions, and with developers of environmental health-related data resources. Finally, tools such as WebProtégé (WebProtege 2015) or Semantic MediaWiki (SMW 2015), if enhanced with functionality to meet the above needs, may potentially be utilized as web-based locations for collaboratively editing, reviewing, and sharing the slices of the vocabulary standards.
Phased Approach to EHS Semantic Standards Development
There are several current challenges to development and broad adoption of EHS semantic standards including identification of an invested community, accessibility of semantic standards and development resources, and availability of funding to ensure ongoing support and sustainability. A major accomplishment of this workshop was identification of a community, composed of the workshop participants, who are committed to initiating and participating in a collaborative effort to develop EHS semantic standards. This community strongly recommended a) federal funding to ensure augmentation and adoption of these standards and b) interdependent and iterative phases of development described below.
To facilitate participation, data entry and automated validation tools for quality control assessment were recommended as part of the toolkit. One example of a validation tool is the Annotation Sufficiency Meter (Phenoday 2014) provided by the Monarch Initiative (Monarch 2015), which leverages diverse large-scale semantically integrated data. This validation tool allows clinicians or model organism researchers to enter phenotypic data at the point of care or in the lab, and then get back quality assurance metrics on their phenotype ontology annotations. It will be critical for those experienced in developing such resources to help develop tools that leverage language standards and data stores together. This integration will ensure that researchers benefit during the process of data creation, analysis, and publication from the use of language standards while simultaneously contributing to them.
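As a rough illustration of automated validation at the point of data entry, the sketch below checks submitted annotations against a known vocabulary and reports a simple coverage score. It is a hypothetical stand-in and does not implement the Annotation Sufficiency Meter's actual method; the function name, identifiers, and metric are assumptions.

```python
def validate_annotations(annotations, vocabulary):
    """Split annotations into known and unknown terms and report coverage.

    Coverage is the fraction of submitted annotations found in the vocabulary
    (0.0 when nothing was submitted). This metric is an illustrative assumption.
    """
    known = [a for a in annotations if a in vocabulary]
    unknown = [a for a in annotations if a not in vocabulary]
    coverage = len(known) / len(annotations) if annotations else 0.0
    return {"known": known, "unknown": unknown, "coverage": coverage}


# Placeholder identifiers standing in for terms of a real standard.
vocabulary = {"T:1", "T:2", "T:3"}
submitted = ["T:1", "T:9", "T:3", "T:7"]

report = validate_annotations(submitted, vocabulary)
```

Returning the unrecognized terms, rather than only a score, gives the researcher something actionable: each one is either a typo to correct or a candidate term to propose to the standard.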
A common problem for resource development projects such as databases or ontologies is the lack of dedicated and sustainable funding mechanisms. A paradigm shift by funding agencies and reviewers is needed such that development of data resources is not evaluated through the same lens of traditional hypothesis-driven research projects. Effective and broadly used semantic standards require a high level of scholarship and community involvement, result in major capacity-building impacts on research, and are increasingly recognized for their integral role in data analysis and integration; yet there are virtually no dedicated funding mechanisms for their development or sustainability. Dedicated funding mechanisms are needed as stand-alone or as ongoing research programs. For either mechanism, funding agencies should consider in advance how developed resources will be sustained long term and integrated into other ongoing research projects. To justify continued funding, metrics that reflect scientific value must be incorporated to track use (e.g., numbers of citations where semantic resources were used). Although seemingly straightforward, such metrics are challenging to compile because infrastructure and standards are generally not well cited, web-based tracking is not uniformly defined and can be wildly misleading, and new metrics are needed to properly credit infrastructure developers and collaborative teams that are not based solely on publications (NIH 2014). Many of these issues are not unique to EHS; however, the lack of semantic standards for EHS-specific areas (e.g., exposure-related contexts, chemicals) and the need for improved integration within the broader biomedical research landscape will only be rectified by the EHS community and associated funding.
It is an opportune time for the EHS community to help catalyze the development of standards given the increasing quantity and diversity of data that are poised to advance our understanding of environmental impacts on human health. Lessons from previous language standard development efforts emphasize the long-term nature of such endeavors, and that persistence and endurance are critical characteristics of successful efforts. Toward this end, sustaining community engagement is critical, and a phased approach is recommended: a) develop EHS research questions and use cases; b) identify existing language resources and build navigational tools to encourage adoption and extension; and c) determine a plan for governance and sustainability.
Clearly such advances will require dedication of resources, must address real needs, remain close to the data, and follow a sustained, but phased approach. In the coming months, NIEHS will pursue an engagement and outreach strategy providing a listserv for discussion and dissemination of materials, a research question and use case template, and a sample semantic standard inventory to be used in a community forum to give shape to the recommendations that have been described in this report. To contribute to this community, please register with the listserv at [email protected]
The authors, who comprised the core planning committee and user community, thank the James B. Hunt Library at North Carolina State University for webcast support; the Institute for Emerging Issues for the venue and associated support; J. Solomon, B. Anderson, J. Collins, K. Moran, W. Freberg, and L. Skalla [National Institute of Environmental Health Sciences (NIEHS) Standards and Recommended Practices (SarPS) contract (GS-00F-0001S)] for meeting support, such as logistics, AV support, note-taking, editing, and travel reimbursements to workshop participants; J. McMurry for figure content; and the Environmental Health Science Language Workshop Working Group members (Y. Cui, S. Holmgren, L. Chadwick, and K. Thigpen-Tart) for help planning the workshop. Finally, the authors would like to acknowledge the dedication and enthusiasm of the attendees who collectively helped to clarify needs, stimulated discussion, and expressed their commitment to participating in future efforts to develop environmental health semantic standards.
This work was supported by the National Institutes of Health, the NIEHS, and the Office of the Associate Director for Data Science.
The authors declare they have no actual or potential competing financial interests.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM. 2000. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25–29, PMID: 10802651.
Biosharing. 2015. A Curated, Searchable Portal of Inter-related Data Standards, Databases, and Policies in the Life, Environmental and Biomedical Sciences. Available: https://www.biosharing.org [accessed 1 February 2016].
Davis AP, Grondin CJ, Lennon-Hopkins K, Saraceni-Richards C, Sciaky D, King BL, et al. 2015. The Comparative Toxicogenomics Database’s 10th year anniversary: update 2015. Nucleic Acids Res 43(database issue):D914–920, PMID: 25326323.
GitHub. 2015. Homepage. Available: https://github.com [accessed 10 October 2015].
GO (Gene Ontology Consortium). 2015. About: The Gene Ontology Project. Available: http://geneontology.org/page/about [accessed 10 October 2015].
Haendel MA, Balhoff JP, Bastian FB, Blackburn DC, Blake JA, Bradford Y. 2014. Unification of multi-species vertebrate anatomy ontologies for comparative biology in Uberon. J Biomed Semantics 5:21, doi: 10.1186/2041-1480-5-21, PMID: 25009735.
Köhler S, Doelken SC, Mungall CJ, Bauer S, Firth HV, Bailleul-Forestier I, et al. 2014. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 42(database issue):D966–D974, PMID: 24217912.
Liu B, Pop M. 2009. ARDB-Antibiotic Resistance Genes Database. Nucleic Acids Res 37(database issue):D443–D447. Available: http://ardb.cbcb.umd.edu [accessed 10 October 2015].
Margolis R, Derr L, Dunn M, Huerta M, Larkin J, Sheehan J. 2014. The National Institutes of Health’s Big Data to Knowledge (BD2K) initiative: capitalizing on biomedical big data. J Am Med Inform Assoc 21:957–958, PMID: 25008006.
Monarch. 2015. The Monarch Initiative. Available: http://monarchinitiative.org [accessed 10 October 2015].
NIH (National Institutes of Health). 2014. Software Discovery Index Meeting Report – Request for Comments. Available: https://nciphub.org/resources/889/download/Software_Discovery_Index_Workshop_Report.pd [accessed 1 February 2016].
NIH. 2015. Computer Access to Research on Dietary Supplements (CARDS) Database. Available: http://ods.od.nih.gov/Research/CARDS_Database.aspx [accessed 10 October 2015].
NIH. 2016. Children’s Health Exposure Analysis Resource (CHEAR). Available: http://www.niehs.nih.gov/research/supported/exposure/chear/ [accessed 1 February 2016].
NIH HMP Working Group, Peterson J, Garges S, Giovanni M, McInnes P, Wang L, et al. 2009. The NIH Human Microbiome Project. Genome Res 19:2317–2323, PMID: 19819907.
NCBI (National Center for Biotechnology Information). 2016. Homepage. Available: http://www.ncbi.nlm.nih.gov [accessed 1 February 2016].
NCBI-SRA (National Center for Biotechnology Information Sequence Read Archive). 2015. Homepage. Available: http://www.ncbi.nlm.nih.gov/sra [accessed 10 October 2015].
NCBO (National Center for Biomedical Ontology). 2015. BioPortal Homepage. Available: http://bioportal.bioontology.org [accessed 10 October 2015].
NLM (U.S. National Library of Medicine). 2015a. Gene. Available: http://www.ncbi.nlm.nih.gov/gene/ [accessed 10 October 2015].
NLM. 2015b. Medical Subject Headings. Available: http://www.nlm.nih.gov/mesh [accessed 10 October 2015].
NRC (National Research Council). 2011. Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease (2011). Washington, DC: National Academies Press.
NSF (National Science Foundation). 2015. Research Coordination Networks (RCN) 2015. Available: http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=11691 [accessed 10 October 2015].
ORCID. 2015. Homepage. Available: http://orcid.org [accessed 10 October 2015].
OWL (Web Ontology Language). 2016. OWL 2 Web Ontology Language, Document Overview (Second Edition). Available: http://www.w3.org/TR/owl2-overview/ [accessed 1 February 2016].
Phenoday (Phenotype Day). 2014. Phenotype Day @ ISMB 2014. Joint Bio-Ontologies and BioLINK SIGs Session. 12 July, 2014 - Boston, US. Description. Available: http://phenoday2014.bio-lark.org [accessed 1 February 2016].
Port JA, Cullen AC, Wallace JC, Smith MN, Faustman EM. 2014. Metagenomic frameworks for monitoring antibiotic resistance in aquatic environments. Environ Health Perspect 122:222–228, doi: 10.1289/ehp.1307009, PMID: 24334622.
Port JA, Wallace JC, Griffith WC, Faustman EM. 2012. Metagenomic profiling of microbial composition and antibiotic resistance determinants in Puget Sound. PLoS One 7:e48000, doi: 10.1371/journal.pone.0048000, PMID: 23144718.
Richesson RL, Nadkarni P. 2011. Data standards for clinical research data collection forms: current status and challenges. J Am Med Inform Assoc 18:341–346, PMID: 21486890.
Schriml LM, Mitraka E. 2015. The Disease Ontology: fostering interoperability between biological and clinical human disease-related data. Mamm Genome 26:584–589, PMID: 26093607.
Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W. 2007. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25:1251–1255, PMID: 17989687.
SMW (Semantic MediaWiki). 2015. Homepage. Available: https://semantic-mediawiki.org [accessed 1 February 2016].
Tenenbaum JD, Sansone SA, Haendel M. 2014. A sea of standards for omics data: sink or swim? J Am Med Inform Assoc 21:200–203, PMID: 24076747.
Tenopir C, Dalton ED, Allard S, Frame M, Pjesivac I, Birch B. 2015. Changes in data sharing and data reuse practices and perceptions among scientists worldwide. PLoS One 10:e0134826, doi: 10.1371/journal.pone.0134826, PMID: 26308551.
van Panhuis WG, Paul P, Emerson C, Grefenstette J, Wilder R, Herbst AJ. 2014. A systematic review of barriers to data sharing in public health. BMC Public Health 14:1144, doi: 10.1186/1471-2458-14-1144, PMID: 25377061.
WebProtege. 2015. Protégé Homepage. Available: http://protege.stanford.edu [accessed 1 February 2016].
Wikipedia. 2015. Welcome to Wikipedia. Available: https://en.wikipedia.org/wiki/Main_Page [accessed 1 February 2016].
Wild CP. 2005. Complementing the genome with an “exposome”: the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol Biomarkers Prev 14:1847–1850, PMID: 16103423.
Xiang Z, Mungall C, Ruttenberg A, He Y. 2011. Ontobee: A Linked Data Server and Browser for Ontology Terms. Proceedings of the 2nd International Conference on Biomedical Ontologies (ICBO), 28–30 July 2011, Buffalo, New York, 279–281. Available: http://ceur-ws.org/Vol-833/paper48.pdf [accessed 1 February 2016].
Youngblood J, Wallace J, Port J, Cullen A, Faustman E. 2014. Metagenomic applications for environmental health surveillance: a one health case study from the Pacific Northwest ecosystem. [email protected] 2(4):281–284. Available: https://planet-risk.org/index.php/pr/article/viewFile/106/221 [accessed 1 February 2016].
Zimmerman AS. 2008. New knowledge from old data. The role of standards in the sharing and reuse of ecological data. Science, Technology, & Human Values 33:631–652.