Into the Black Box: What Can Machine Learning Offer Environmental Health Research?

Figure showing three different training data sets in the first column, followed by representations of how well 10 different machine-learning algorithms interpret each data set. The algorithms are nearest neighbors (93–97% accuracy, depending on the data set), linear support-vector machine (40–93% accuracy), radial basis function kernel support-vector machine (88–97% accuracy), Gaussian process (90–97% accuracy), decision tree (80–95% accuracy), random forest (80–95% accuracy), neural network (88–95% accuracy), AdaBoost (82–95% accuracy), naïve Bayes (70–95% accuracy), and quadratic discriminant analysis (72–93% accuracy).


A Machine-Learning Tutorial
The driving force behind AI is machine learning, which refers to how computer algorithms improve at performing assigned tasks with increasing experience. 9 One way they do that is by learning to recognize patterns in data. Training in pattern recognition can be either supervised (coached by humans) or unsupervised, meaning the algorithms are turned loose on data to identify patterns on their own.
Supervised algorithms must first be trained with labeled data sets that show them how to recognize, for instance, a cat in a digital photo, a gene in a DNA sequence, or the likely price of a home in a given neighborhood. Depending on the underlying nature of the data, the algorithms' predictions arrive in one of two categories: either a discrete classification (such as "cat" or "gene") or a regression category in which the prediction describes a measured value within a continuous variable (such as "price").
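The two kinds of supervised prediction can be illustrated with a minimal sketch. It uses a nearest-neighbors approach, one of the methods shown in the figure, written in plain Python; the data values, feature choices, and function names are hypothetical and chosen only for illustration.

```python
import math

def nearest(train, query, k=3):
    """Return the k labeled training rows closest to `query` (Euclidean distance)."""
    return sorted(train, key=lambda row: math.dist(row[0], query))[:k]

def classify(train, query, k=3):
    """Classification: predict the majority label among the k nearest neighbors."""
    labels = [label for _, label in nearest(train, query, k)]
    return max(set(labels), key=labels.count)

def regress(train, query, k=3):
    """Regression: predict the mean target value of the k nearest neighbors."""
    values = [value for _, value in nearest(train, query, k)]
    return sum(values) / len(values)

# Hypothetical labeled training data: (features, answer supplied by a human).
# Features here are (weight in kg, height in cm); labels are discrete classes.
animals = [((4.0, 30.0), "cat"), ((4.5, 28.0), "cat"), ((5.0, 32.0), "cat"),
           ((20.0, 60.0), "dog"), ((25.0, 65.0), "dog"), ((30.0, 70.0), "dog")]

# Feature is floor area in square meters; the target is a continuous price
# (in thousands, made-up numbers).
homes = [((80.0,), 200.0), ((100.0,), 250.0), ((120.0,), 300.0),
         ((200.0,), 500.0), ((220.0,), 550.0), ((240.0,), 600.0)]

print(classify(animals, (4.2, 29.0)))  # a discrete class
print(regress(homes, (110.0,)))        # a value on a continuous scale
```

The same training-then-prediction pattern serves both tasks; only the type of answer attached to each training example differs.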
Unsupervised algorithms self-organize data without any such guidance. With a common technique called cluster analysis, for instance, these algorithms automatically sort data with similar features into groups. Because scientists might not know to look for those data groupings in advance, cluster analysis can lead to new and unanticipated discoveries.
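As a rough sketch of how cluster analysis can sort unlabeled data, the following uses k-means, one common clustering algorithm (the article does not name a specific one). The points and the starting centers are hypothetical, and the number of groups is assumed to be chosen in advance.

```python
import math

def assign(points, centers):
    """Put each point in the group of its nearest center."""
    groups = [[] for _ in centers]
    for p in points:
        idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        groups[idx].append(p)
    return groups

def kmeans(points, centers, iterations=10):
    """Alternate between grouping points and moving each center to its group's mean."""
    for _ in range(iterations):
        groups = assign(points, centers)
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, assign(points, centers)

# Hypothetical unlabeled measurements; no one tells the algorithm which group is which.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]

# Starting centers are simply two of the points (an assumption made for repeatability).
centers, groups = kmeans(points, [points[0], points[3]])
for center, group in zip(centers, groups):
    print(center, group)
```

No labels are supplied at any step: the groups emerge only from how close the points are to one another.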
An even more powerful subclass of machine learning, called deep learning, relies on layers of algorithms that are arranged to mimic the architecture of the human brain. 10 Convolutional neural networks (CNNs), for instance, are deep-learning models inspired by the arrangement and functioning of the human visual system. CNNs are at the core of most computer vision applications today, such as Facebook's automated photo-tagging system or the interpretation of remote sensing data.
But there are many other types of deep-learning models. A recurrent neural network is a model that's particularly good at finding patterns in time-series data, meaning data sets that change over time (think stock market prices or fluctuations in ozone concentrations during the day). Yet another kind of deep-learning model, called an autoencoder, is used for unsupervised machine learning and can be applied to reconstruct complete digital images and other data representations from minimal sets of key information. In some cases, autoencoders are used to filter out extraneous "noise," which is useful for sharpening digital images.
Selecting the right model for the job is crucially important, though it is not always obvious which one to pick. "One of the most common questions I get is, 'What sort of model should I use with my data?' " says Marianthi-Anna Kioumourtzoglou, an assistant professor at Columbia University's Mailman School of Public Health who uses AI in health studies of chemical mixtures. The answer, she says, is that researchers should start by clearly framing the question they want to answer.
Environmental Health Perspectives 022001-1 128(2) February 2020
This figure shows three data sets (rows) with two variables (x- and y-axes) and two outcomes (blue or red). Each of the 10 machine-learning methods (columns) attempts to classify dots as blue or red by building mathematical functions of the x and y variables. The shades of color reflect the confidence of the model in classifying each dot as blue or red. The numbers represent the classification accuracy, or the proportion of dots correctly assigned a blue or red outcome by the model. Each method detects different patterns and performs differently on each data set. One of the challenges of machine learning is knowing which method is the best choice for a specific data set. Image: 3-Clause BSD License. Figure created using the Scikit-learn library. 18
David Carlson, an assistant professor of civil and environmental engineering at Duke University, says the risk to avoid is "overfitting," or the tendency of a poorly chosen model to capture noise in the data instead of real information. In these cases, the model will generate unreliable predictions, whereas well-chosen models, he explains, will be generalizable. In other words, a well-chosen model will be capable of adapting properly to new data that it has never seen before. Scientists can apply several statistical tests to validate their models so they can be more confident in the models' generalizability.
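One simple way to check generalizability is to score a model on held-out data that it never saw during training. The sketch below is an illustration under stated assumptions rather than a method described by Carlson: it generates synthetic two-variable data with deliberately noisy labels and compares a nearest-neighbors classifier's accuracy on its own training set with its accuracy on a held-out set. All names and settings are hypothetical.

```python
import math
import random

def knn_predict(train, query, k):
    """Predict 0 or 1 by majority vote among the k nearest training points."""
    neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if votes * 2 > k else 0

def accuracy(train, data, k):
    """Proportion of rows in `data` whose label the model predicts correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)

def make_data(n, rng, noise=0.2):
    """Synthetic points: the true label is 1 when x + y > 1, and each label
    is then flipped with probability `noise` to imitate measurement error."""
    rows = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        label = 1 if x + y > 1 else 0
        if rng.random() < noise:
            label = 1 - label
        rows.append(((x, y), label))
    return rows

rng = random.Random(0)  # fixed seed so the run is repeatable
train, held_out = make_data(200, rng), make_data(200, rng)

for k in (1, 15):
    print(f"k={k}: training accuracy {accuracy(train, train, k):.2f}, "
          f"held-out accuracy {accuracy(train, held_out, k):.2f}")
```

With k=1, every training point is its own nearest neighbor, so the model reproduces the training labels perfectly, noise included. The gap between that perfect training score and the lower held-out score is the signature of overfitting. A larger k averages over more neighbors and will typically, though not always, narrow the gap.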
Although model selection involves specialized skills in statistics and computer science, some researchers are also turning to a growing number of open-source software packages that will fit models to their data automatically. At a 2019 conference devoted to AI for environmental health, 11 UPenn's Moore described one such package called PennAI, which was developed by his own research team. "You just load your data set, push a button, and the AI takes over and launches what it thinks is the best model to run," he said. According to Moore, PennAI is able to do this because it creates a knowledge base of which models tend to work with different kinds of data, similar to the systems that commercial entities such as Amazon use to suggest items you might want to buy based on your shopping history.
In other words, Carlson explains, PennAI and many other packages are intended to make AI tools more accessible to a broad audience. But, he adds, while such tools are more accessible than ever, "I personally do not think that such systems are there yet, and you need a lot of expertise and understanding to use and correctly interpret the output of such a system."

The Artificial Intelligence Landscape in Environmental Health Today
As AI moves into environmental health research, near-term opportunities for the technology are arising on several fronts. Text analytics (also known as text mining) uses machine-learning algorithms to extract useful information from papers and reports. This "is a big area of interest for us," says Jerry Blancato, director of the Office of Science and Information Management at the U.S. Environmental Protection Agency (EPA). Blancato says text analytics will ideally allow for better ways to manage, query, and categorize data from different sources.
According to Paul Whaley, a research fellow at Lancaster University in the United Kingdom and the Evidence-Based Toxicology Collaboration in the United States, advances in text analytics will go a long way toward boosting the efficiency of systematic review, a highly methodical process by which scientists collect evidence from multiple sources that can help answer certain questions. As it stands now, systematic reviews rely heavily on research associates who have to read through hundreds or even thousands of documents. Whaley says the EPA and the NIEHS have both begun to automate these initial screenings with machine-learning algorithms that classify the documents according to keywords in titles or abstracts.
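The article does not say which algorithms the EPA or the NIEHS use for this screening. As a hedged illustration of the general idea, the sketch below trains a small naïve Bayes classifier (one of the methods in the figure above) on the words in a handful of hand-labeled titles, then uses it to sort new titles. The titles, labels, and function names are all invented for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Split a title into lowercase words."""
    return text.lower().split()

def train_naive_bayes(labeled_titles):
    """Count how often each word appears in each class of title."""
    word_counts, doc_counts = {}, Counter()
    for title, label in labeled_titles:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(title))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def screen(title, model):
    """Return the class with the highest log-probability for this title."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(title):
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing keeps unseen word/class pairs from zeroing the score.
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical titles labeled by hand for a toxicology review.
labeled = [
    ("acute oral toxicity of pesticide in rats", "relevant"),
    ("rodent toxicity study of industrial solvent", "relevant"),
    ("chemical exposure and liver toxicity in mice", "relevant"),
    ("survey of hospital staffing levels", "irrelevant"),
    ("economic analysis of hospital budgets", "irrelevant"),
    ("staffing survey of rural clinics", "irrelevant"),
]
model = train_naive_bayes(labeled)

print(screen("toxicity of solvent exposure in rats", model))
print(screen("hospital staffing budgets", model))
```

A real screening system would be trained on thousands of labeled abstracts and validated against human reviewers. The sketch only shows how keyword frequencies can drive a classification.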
More complex text analytics may eventually allow algorithms to read and comprehend entire sentences, although these programs do not yet have the necessary rich, granular understanding of language. "That's the sort of capability we're really looking for," Whaley says. "Classifications are helpful, but more than that, we need machine-learning systems that can read through the reports and extract relevant information for us. That way, instead of manually extracting data from, say, twenty-five reports, you could automatically pull it from thousands of potentially useful documents and wind up with much larger, richer data sets than can be assembled manually." Whaley adds that an important step in that direction would be to assemble a "full-text corpus" of annotated studies that could be used to train algorithms to read technical language more effectively. A full-text corpus is a set of documents within which the important information has been highlighted or tagged by hand. According to Whaley, algorithms trained on such a knowledge base will learn to identify and extract similar information when they are exposed to it in other documents later.
At the NTP, researchers are using analogous methods with an eye toward developing computerized systems for predicting chemical toxicity. Toward that end, Kleinstreuer's group and researchers at Oak Ridge National Laboratory are jointly developing algorithms that, as a first step, will identify high-quality papers in the toxicology literature. During this initial process, reviewers have to read the studies and then extract information on, for instance, the protocols, types of chemicals tested, and observed effects. The aim is to use the information in those papers as source material for databases that relate chemical structures to toxic end points such as mortality, endocrine disruption, and protein reactivity, among others. In turn, these databases can be used to train machine-learning models used by other teams investigating chemical safety.
Assembling the databases requires that NTP researchers put the published information into machine-readable formats that computer algorithms can work with. "A lot of what we're doing now is brute force curation to digitize studies that are not computationally accessible," Kleinstreuer says. She adds that NTP researchers recently curated a database of rodent LD50 values (which describe the dose that kills 50% of a group of exposed animals) associated with approximately 15,000 chemical structures. Kleinstreuer says that as model development evolves, the entire process (from selecting papers, to curating databases, to finally developing algorithms that predict toxic effects from exposure to untested chemicals) could in time be accomplished with AI.
The open-source PennAI software aims to simplify machine learning for the user. On the launch screen (top), users choose from available data sets to perform analyses. The "Best Results" box shows which algorithm has performed most accurately on each data set. The user can also browse all the results for each data set by clicking on the "experiments completed" box. From here, the user has the option of toggling to the "AI" option to let the software automatically choose an appropriate machine-learning algorithm and parameters. Or, on the "Build New Experiment" screen (bottom), the user can manually select algorithms and parameter settings. Image: Courtesy Jason Moore.
Applying machine learning to field- and satellite-based remote sensing data is yet another emerging development. At the EPA, scientists are using the technology to map floodplains and mosquito habitats and to develop predictive models that warn of toxic algal blooms. Elsewhere, other researchers are using it to estimate air pollution levels. One of these scientists is Scott Weichenthal, an epidemiologist at McGill University. During a recent project, Weichenthal's team found that when applied to satellite imagery, CNNs predicted concentrations of fine particulate matter (PM2.5) with nearly the same accuracy as a model used by the World Health Organization (WHO) to assess air quality for its Global Burden of Disease study. 12 The WHO's model, which is called the Data Integration Model for Air Quality, relies on many different inputs, such as chemical transport features and pollution measurements gathered from sensors on the ground. Weichenthal and his colleagues trained their model by pairing ground-level sensor data from approximately 6,000 sites in 98 countries with corresponding satellite data for each sensor location. Once trained, the model could predict variation in PM2.5 levels solely based on land-based features, "and all you need to run it is the satellite picture," Weichenthal says.
Building on this approach, Francesca Dominici, a biostatistician and codirector of the Data Science Initiative at Harvard University, has related machine learning-derived estimates of airborne PM2.5 concentrations to changes in mortality among older Americans. 13 For that effort, she and her colleagues relied on a model 14 that combined ground- and satellite-based measures and applied machine-learning algorithms to the data to estimate pollution levels at the square-kilometer level throughout the United States. They paired the predicted values with data from millions of Medicare claims, gathered from each zone between 2000 and 2012. Their analysis indicated that increases of 10 μg/m³ in PM2.5 and 10 ppb in ozone were associated with increases in all-cause mortality of 7.3% and 1.1%, respectively. 13

Issues of Trust
Still, Dominici describes the modeled PM2.5 predictions as guesses, adding, "We're not there yet in terms of quantifying how good the guesses from machine learning are." That is especially true when the predictions come from black boxes that, as she says, breed uncertainties "that we cannot afford to ignore when we're estimating health effects." Weichenthal agrees that the technology isn't without its shortcomings. He acknowledges that the estimates in his work became increasingly unreliable outside the areas where the model was initially trained. Moreover, given that the model's internal calculations are somewhat opaque, the specific features of the built environment that drive its predictions are not known.
In an especially egregious circumstance of poor guesswork that came to light during the 2018 California wildfires, Google used a proprietary black-box machine-learning algorithm from another company to power its search page weather widget. The widget claimed air pollution levels were safe, 15 even as people in the area were watching the ash build up on their cars. 8
According to Carlson, computer scientists are currently experimenting with ways to open up deep neural networks and other black boxes to expose their internal calculations or to produce interpretable models with comparable accuracy. Meanwhile, any model's accuracy depends, in large part, on the quantity and quality of the data to which it is exposed and differences between training data and real-world data. Carlson wrote in 2019 that these accuracy-altering differences "can cause significant problems for machine-learning methods." 8 Carlson claimed that "modifying a single pixel can completely alter an algorithm's understanding of an image" and a small decal stuck to a stop sign "can fool even a modern industrial computer vision system for self-driving vehicles." 8
Making sure that machine-learning algorithms used in environmental health have sufficient access to high-quality data is now a priority for the field. "Nothing in AI is going to work if you do not pay attention to data quality," says Woychik, adding that the NIEHS is highly focused on developing sustainable systems for generating data that can be easily shared with researchers around the world. Fundamental to that goal, he says, is that data production abides by the FAIR Guiding Principles, which were first published in 2016. 16 Those principles state that data, and associated data objects such as code, should be findable, accessible, interoperable, and reusable by humans and machines alike.
Toward that end, the NIEHS is currently overhauling its cyberinfrastructure to better prepare for AI uses. The institute has hired new staff tasked with assembling a plan for cyberinfrastructure management, including better ways to collect, annotate, and archive data for ongoing and future use. 17 Once those systems are in place, "we can think about ways to do more complex experiments with AI," Woychik says, "but without overpromising on the potential when so much is still speculation." Similarly, officials at the EPA recently established a formal steering committee that's become a gathering point for people interested in AI who want to provide training, advice, or consultation. "We have many people with deep expertise, and we're looking to share the wealth and build up collaboratives," says the EPA's Blancato.
Investigators including Francesca Dominici developed a machine-learning model to predict PM2.5 concentrations across the United States. The model incorporates remotely sensed data, estimates of ground-level PM2.5 and total atmospheric aerosols, meteorological data, land use data, and more. The training set (top) was based on monitoring data from the U.S. Environmental Protection Agency's Air Quality System. The model produced an image (bottom) that closely mirrors the ground-truthed data but offers a finer spatial scale. Image: Courtesy Benjamin M. Sabath.
Adams at RTI agrees that most of the current environmental health focus is still on preparing data for use by machine-learning algorithms. "Facebook and other companies are successful [in] doing this because they are working with terabytes of data," he says. "The rest of us doing science are still investing resources to label data and make it available for people to use. And what we can do with the technology [depends on] how well we integrate the data we collect."