Response to “Comment on ‘An Informatics Approach to Evaluating Combined Chemical Exposures from Consumer Products: A Case Study of Asthma-Associated Chemicals and Potential Endocrine Disruptors’”

Neither informatics (e.g., Goldsmith et al. 2014) nor gas chromatography–mass spectrometry (GC/MS) (e.g., Steinemann et al. 2011, Dodson et al. 2012, Steinemann 2015) can provide a complete picture of the ingredients in consumer products. Our article demonstrates the complementary nature of these approaches. The larger product sample of the informatics approach gives greater coverage of the formulation space of consumer products. Consequently, it detected the target chemicals in product categories that were missed by GC/MS. However, the informatics approach is limited by the incompleteness of product labels, particularly with respect to fragrance and flavor mixtures. GC/MS can detect chemicals that are not disclosed in product labels and chemicals that are not even part of the product formulation (e.g., chemicals leached from product packaging or degradation byproducts). However, the small sample sizes, typically only a handful of products in a given category, do not reflect the diversity of product formulations.

Excel File Table S1. Two-way chemical combinations
Excel File Table S2. Three-way chemical combinations
Excel File Table S3. Four-way chemical combinations
Excel File Table S4. Five-way chemical combinations
Excel File Table S5. Six-way chemical combinations

Verify that Scraping is Allowed
After confirming that data collection was consistent with the retailer's terms of use and that robotic scraping was not prohibited, consumer product data were collected from Drugstore.com. Drugstore.com's terms of use state: "You agree that your use of robots, spiders, crawlers, wanderers, Web agents and other such automated processes on the Site will be Standard for Robot Exclusion (SRE) compliant robots ("robots") and when connecting to the Site, prior to downloading or indexing any pages on the Site, such robots will immediately visit http://www.drugstore.com/robots.txt ("the robots.txt file"). You understand that the robots.txt file is the only means by which robots are authorized to access the Site. … You agree not to reproduce, duplicate, copy, sell, resell or exploit for any commercial purposes, any portion of the Site…"
Scraping is therefore allowed as long as robots comply with the rules in the site's robots.txt file and scraped data are not redistributed or used for commercial purposes. The robots.txt file provides a sitemap to help robot scrapers navigate the site, a list of disallowed branches where scrapers should not go, and a minimum crawl delay to avoid overwhelming the server with HTTP requests:
Disallow: /templates/events/circular.asp
Disallow: /4213/edh
User-agent: adidxbot
Crawl-Delay: 1
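A robots.txt check of this kind can be automated before any page is requested. The following is a minimal sketch in Python using the standard-library urllib.robotparser module; it is not the scraper used for this project. The sample file is a hypothetical arrangement of the directives quoted above under a wildcard user agent, and the user-agent name "research-bot" and the product URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt assembled from the directives quoted in the text.
# The real file groups its rules under specific user agents.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /templates/events/circular.asp
Disallow: /4213/edh
Crawl-delay: 1
"""


def check_access(robots_txt, user_agent, url):
    """Return (allowed, crawl_delay) for one URL under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


if __name__ == "__main__":
    for path in ("/4213/edh", "/example-product-page"):
        url = "http://www.drugstore.com" + path
        print(path, check_access(SAMPLE_ROBOTS_TXT, "research-bot", url))
```

In a live scraper the rules would be read from the site itself (for example with the parser's set_url and read methods) rather than from a string.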

Robotic Scraper
The robotic scraper used for this project consists of approximately 130 lines of Java code. It uses the XPath extensions to traverse a retailer's published sitemap, and the Apache HttpClient (version 3.1) to request product webpages. Note that HttpClient is no longer supported. Its functionality has been incorporated into Apache HttpComponents so new development should use this package or some other supported HTTP client (e.g., Jsoup, BeautifulSoup, cURL).
Drugstore.com was scraped in April 2014. Scraping was done on an HP SL390G7 server with two 2.66GHz Intel Xeon X5650 processors and 96GB memory. The operating system was Scientific Linux 6.1 (Linux 2.6.32 kernel). Scraping is network-limited rather than compute- or memory-limited, so a powerful server with specialized hardware is not necessary. A reliable network connection and sufficient disk space are more important. Scraping Drugstore.com took approximately three days at a two-second crawl delay. The site's robots.txt file specified a one-second crawl delay, but this was doubled to put less strain on its servers.
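The crawl delay amounts to a pause between consecutive requests. The sketch below illustrates the idea in Python; it is not the Java scraper described above. The fetch argument is a hypothetical caller-supplied function that downloads one page, and the sleep argument exists only so the pause can be replaced in tests.

```python
import time


def crawl(urls, fetch, crawl_delay=2.0, sleep=time.sleep):
    """Fetch each URL in order, pausing crawl_delay seconds between requests.

    fetch is a caller-supplied function (hypothetical here) that takes a URL
    and returns the page content. Pages that raise an error are stored as None
    so that one failed request does not stop a multi-day crawl.
    """
    pages = {}
    for position, url in enumerate(urls):
        if position > 0:
            sleep(crawl_delay)  # wait before every request except the first
        try:
            pages[url] = fetch(url)
        except Exception:
            pages[url] = None
    return pages
```

Because the loop spends nearly all of its time waiting, total run time is governed by the number of pages and the delay rather than by processor speed.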

Extract the Requisite Information from the Raw HTML
Brand and product names, ingredient list, and product category are needed for this analysis. This information is available on most Drugstore.com product pages and can be extracted from the raw HTML retrieved by the robot scraper. This is done by finding tags that consistently mark the desired information across a given retail site. For example, the "TblProdForkIngredients" tag indicates the location of the product ingredient list in Drugstore.com product pages.
The first occurrences of the "s.prop5" and "<title>" tags indicate the brand and product names, respectively, and the "home<" tag indicates the retail hierarchy for product categorization (e.g., home → personal care → oral care → mouthwash). These tags vary by retailer but once identified are consistent and reliable across a given retailer's product pages. Frequent spot checks of random samples are used to refine each stage of data processing.
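Marker-based extraction of this kind can be sketched with regular expressions. The example below is illustrative only: the exact markup around the "s.prop5" and "<title>" markers is not given here, so the patterns assume a script assignment of the form s.prop5="Brand" and a plain title element, and the sample page is invented.

```python
import re

# Assumed markup (hypothetical): <title>Product name</title> and s.prop5="Brand"
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
BRAND_RE = re.compile(r'''s\.prop5\s*=\s*["'](.*?)["']''')


def extract_names(html):
    """Return the first product name and first brand found in a raw HTML page."""
    title = TITLE_RE.search(html)   # search() returns the first occurrence
    brand = BRAND_RE.search(html)
    return {
        "product": " ".join(title.group(1).split()) if title else None,
        "brand": brand.group(1).strip() if brand else None,
    }


if __name__ == "__main__":
    sample = ('<html><head><title> Example Gel </title></head>'
              '<script>s.prop5="ExampleBrand";</script></html>')
    print(extract_names(sample))
```

Returning None for a missing marker makes pages that lack the expected tags easy to flag during spot checks.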
Validation of brand and product names was performed by manual inspection of 100 randomly selected products to confirm that the necessary data were correctly extracted from the raw HTML. Accuracy was 100% (i.e., every brand and product name in the sample was correct).
Category assignments were similarly validated using a random sample of 100 products. Accuracy was high (96%). Of the four incorrectly categorized products, one was due to an error in the retail hierarchy; specifically, an eyeliner product was incorrectly placed in the lip liner branch of the sitemap. The rest were due to ambiguities in category mapping. For example, one of the incorrect assignments was a topical medication in a relatively sparse branch of the retail hierarchy: medicine & health → pain & fever relief → shop by active ingredient → natural ingredients. The most specific level of the retail hierarchy that maps to one of our product categories is "pain & fever relief", so it was used to make the assignment, as stated in the article. In our categorization scheme, "pain & fever relief" maps to oral medications because most products in this category are oral medications.
A combination of Python (version 2.7.3), regular expressions, grep, and the html2text utility was used to process the raw HTML product pages. Extracting the brand names, product names, and product categories was straightforward, but extracting the ingredients required more finesse because there is no standard format for ingredient lists. Most product labels provide a simple, comma-delimited list of ingredients. However, some lists contain non-ingredient text, active concentrations, and parenthetical information that may or may not be useful, e.g.:
active ingredients: avobenzone -2 % (sunscreen), homosalate (15%), octisalate (5%) (sunscreen), oxybenzone -4 % (sunscreen) inactive ingredients: alcohol denat, acrylates, octylacrylamide, glycerin, aloe barbadensis leaf extract, tocopherol (vitamin e), cocos nucifera oil (coconut), mineral oil, fragrance
Simply processing this string as a comma-delimited list will result in noisy ingredient names that are more difficult to match to chemicals. However, patterns in such strings inform a multistep text processing algorithm that yields a clean list of ingredients for most product label formats.
Step 1: Remove "active ingredients:˽" (the ˽ symbol denotes a single space) and replace "˽inactive ingredients:˽" with a comma.
Step 4: Extract active concentrations from the ingredient strings using the regular expression below. Note that active concentrations are specified in percentages, milligrams, or units. Active concentrations are not used in the present analysis but they are retained for future use.
Step 5: Extract parenthetical text using the regular expression below. Parenthetical text often contains information that can help identify chemical ingredients (e.g., "vitamin e" in this example) so it is retained. Any leftover trailing punctuation is also removed in this step to yield a final, clean list of ingredient names.
The ingredient string processing algorithm was validated by randomly selecting 100 products for manual inspection. Parsed ingredient lists were compared to the raw ingredient strings to confirm that ingredient names and accompanying parenthetical text are correctly extracted. Of the 1587 ingredients in this sample, 1547 (97%) were correctly extracted. Of the 40 incorrectly extracted ingredients, 24 were slash-delimited polymers, fatty acids, or mixtures (e.g., styrene/acrylates copolymer, acrylates/c10 30 alkyl acrylate crosspolymer, cetyl peg/ppg-10/1 dimethicone, caprylic/capric triglyceride, pvm/ma copolymer). The ingredient string processing algorithm was not modified to handle these types of ingredients because they are not the focus of the present analysis and because it is unclear how they should be parsed. Missing commas in the ingredient list caused the remaining 16 incorrectly parsed ingredients.
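The steps that survive in this description (removing the active/inactive labels, pulling out concentrations, and pulling out parenthetical text) can be sketched as follows. This is an illustration under stated assumptions, not the algorithm used in the analysis: the two regular expressions are written for this example and are not the original expressions, and the function name parse_ingredients is hypothetical. The inactive-ingredient label is handled first because the text "active ingredients:" is a substring of "inactive ingredients:".

```python
import re

# Illustrative patterns only; the regular expressions used in the analysis are
# not reproduced here. Concentrations may be given in %, mg, or units.
CONCENTRATION_RE = re.compile(
    r"\(?\s*-?\s*(\d+(?:\.\d+)?)\s*(%|mg\b|units?\b)\s*\)?", re.IGNORECASE)
PARENTHETICAL_RE = re.compile(r"\(([^()]*)\)")


def parse_ingredients(raw):
    """Split a label string into (name, concentration, parenthetical notes)."""
    text = " ".join(raw.lower().split())
    # Replace the inactive label first: it contains the active label as a substring.
    text = text.replace(" inactive ingredients: ", ", ")
    if text.startswith("active ingredients: "):
        text = text[len("active ingredients: "):]

    parsed = []
    for item in text.split(","):
        concentration = None
        match = CONCENTRATION_RE.search(item)
        if match:
            concentration = match.group(1) + match.group(2)
            item = item[:match.start()] + " " + item[match.end():]
        notes = [note.strip() for note in PARENTHETICAL_RE.findall(item)]
        item = PARENTHETICAL_RE.sub(" ", item)
        # Collapse whitespace and drop leftover leading/trailing punctuation.
        name = " ".join(item.split()).strip(" .;:-")
        if name:
            parsed.append((name, concentration, notes))
    return parsed


if __name__ == "__main__":
    label = ("active ingredients: avobenzone -2 % (sunscreen), homosalate (15%) "
             "inactive ingredients: alcohol denat, tocopherol (vitamin e), fragrance")
    for row in parse_ingredients(label):
        print(row)
```

Like the algorithm described above, this sketch splits on commas, so a missing comma or a comma inside an ingredient name will still produce an incorrect entry.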

Remove Duplicate Products
Duplicate products can appear in the database for several reasons. The same product can appear in different branches of a retail sitemap. The same product may be sold in different sizes. In the future, as more retail sites are scraped and added to the database, product inventories may overlap, leading to duplicate entries. Pruning duplicates is necessary to get accurate counts of products and ingredients, but identifying duplicate products is not always as straightforward as matching product names under the same brand because typographical errors and differences in punctuation can mask duplicates, e.g.:
Unfortunately, digital text contains typographical errors just like printed text. If these two products have identical brands and ingredient lists, they are likely the same product scraped from different locations. Alternative word orders in product names can also mask duplicate products, e.g.:
It is harder to identify these two products as duplicates because the words, word order, and punctuation are different. However, if they also have identical brands and ingredient lists, they are likely different representations of the same product. Applying a spelling checker to fix typographical errors, removing punctuation, and doing string matching on the product names will find many duplicate products, but it will not find duplicates when the word order of the product names differs. Dice's coefficient (Dice 1945) is a better way to compare product names in this case:
If two product names have a high Dice coefficient, they are not necessarily the same product because formulations change. Their ingredient lists must still be compared. Labeling regulations dictate that ingredients be listed in descending order of predominance, so word order matters when comparing ingredient lists. Therefore, the Levenshtein ratio (Navarro 2001) is a better way to measure ingredient list similarity. It is computed as follows:
Edit distance (Navarro 2001) was computed using the edit_distance function in the Natural Language Toolkit. The algorithm to find duplicate products is as follows:
The algorithm was tuned and validated using a manually curated sample. A random sample is unlikely to contain duplicate products, so ten brands with ten products each were selected and manually analyzed for duplicates. The sample contained 89 distinct and 11 duplicate products. Dice coefficient and Levenshtein ratio thresholds of 0.85 and 0.9, respectively, gave the best results, correctly identifying 9 out of 11 duplicates with no false positives.
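The two similarity measures and the duplicate search can be sketched as below. Several details are assumptions made for illustration and may differ from the original analysis: the Dice coefficient is computed over sets of word tokens, the Levenshtein ratio is normalized as one minus the edit distance divided by the longer length (other normalizations exist), ingredient lists are compared as comma-joined strings, and the thresholds are applied as "greater than or equal to". A small pure-Python edit distance stands in for the Natural Language Toolkit's edit_distance function so that the example has no dependencies.

```python
import re


def dice_coefficient(name_a, name_b):
    """Dice coefficient 2|A & B| / (|A| + |B|) over sets of word tokens (assumed)."""
    a = set(re.findall(r"[a-z0-9]+", name_a.lower()))
    b = set(re.findall(r"[a-z0-9]+", name_b.lower()))
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def edit_distance(s, t):
    """Levenshtein distance with unit costs (stand-in for nltk's edit_distance)."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(previous[j] + 1,               # deletion
                               current[j - 1] + 1,            # insertion
                               previous[j - 1] + (cs != ct)))  # substitution
        previous = current
    return previous[-1]


def levenshtein_ratio(s, t):
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest


def find_duplicates(products, name_threshold=0.85, ingredient_threshold=0.9):
    """Return index pairs of likely duplicates.

    products is a list of (brand, product_name, ingredient_list) tuples.
    """
    pairs = []
    for i in range(len(products)):
        for j in range(i + 1, len(products)):
            brand_i, name_i, ingredients_i = products[i]
            brand_j, name_j, ingredients_j = products[j]
            if brand_i != brand_j:
                continue
            if dice_coefficient(name_i, name_j) < name_threshold:
                continue
            ratio = levenshtein_ratio(", ".join(ingredients_i), ", ".join(ingredients_j))
            if ratio >= ingredient_threshold:
                pairs.append((i, j))
    return pairs
```

Comparing names with a set-based measure ignores word order, while comparing ingredient lists with an edit-based measure preserves it, which mirrors the reasoning given above.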

Load the Product Data into a Structured Database
The final processed data are loaded into a structured database, in this case Oracle Database 11g Enterprise Edition (release 11.2.0.1, 64-bit production build). The following screenshot shows an example product (Biotene Oral Balance, Dry Mouth Moisturizing Gel) as it appears in the database: Each product is assigned a unique ID that is the primary key to access any data related to the product. The product data for the present analysis reside in two tables: one for the product details (CPDB_PRODUCT) and the other for ingredients (CPDB_PRODUCT_INGREDIENT). Other tables hold active concentrations, parenthetical information from the ingredient lists, size and price information, and textual information pertaining to the product, but they are not used in the present analysis. Note that the retrieval date (the DATERETRIEVED field) of each product is stored to help track reformulations of the same product. The order of ingredients on the product label (the INGREDIENTRANK field) is also stored because it can indicate relative predominance in the formulation. Finally, multiword ingredients (hydroxyethyl cellulose and sodium hydroxide in the example) are split into separate records (see the TERMID field) to facilitate matching with the chemical dictionaries.
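The splitting of multiword ingredients into term records can be illustrated with a short sketch. Only the INGREDIENTRANK and TERMID field names come from the description above; the PRODUCTID and TERM keys, the function name, and the convention that both counters start at 1 are assumptions made for this example.

```python
def ingredient_term_records(product_id, ingredients):
    """Build one record per term, keeping label order and term position.

    PRODUCTID and TERM are hypothetical field names; INGREDIENTRANK and
    TERMID follow the fields named in the text.
    """
    records = []
    for rank, ingredient in enumerate(ingredients, start=1):
        for term_id, term in enumerate(ingredient.split(), start=1):
            records.append({
                "PRODUCTID": product_id,
                "INGREDIENTRANK": rank,  # position on the product label
                "TERMID": term_id,       # position of the term within the name
                "TERM": term,
            })
    return records


if __name__ == "__main__":
    for record in ingredient_term_records(1, ["hydroxyethyl cellulose", "sodium hydroxide"]):
        print(record)
```

Storing one term per record lets an ingredient be compared with dictionary synonyms of the same length, one term at a time.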
The UMLS (Humphreys and Lindberg 1993; Humphreys et al. 1998) comprises three components: the SPECIALIST lexicon, a semantic network, and a metathesaurus that aligns the content of 170 independently maintained controlled vocabularies: https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/notes.html. The terms in these vocabularies are mapped to Concept Unique Identifiers (CUI). The UMLS can also be downloaded from the National Library of Medicine: http://www.nlm.nih.gov/research/umls. Terms in the UMLS were preprocessed using a process similar to .
The following screenshot shows the first few PubChem synonyms of sodium hydroxide as they appear in the database: The UMLS tables have a similar structure except the unique identifier is a CUI instead of a CID. Each synonym is identified by a CID and a PHRASEID. Multiword synonyms (e.g., sodium hydroxide) are split into individual terms that are given a TERMID. If sodium hydroxide appears in a product ingredient label, it will be mapped to CID 14798 whether it appears as sodium hydroxide, caustic soda, or soda lye.

Match Ingredient Names to PubChem and UMLS Synonyms
Ingredient names are matched to PubChem and UMLS synonyms using exact term-by-term matching. One-term ingredient names (e.g., glycerin) are simply compared to one-term PubChem synonyms and one-term UMLS concepts, two-term ingredients (e.g., sodium hydroxide) are compared to two-term synonyms/concepts, etc. If a match is found the ingredient is mapped to the CID and/or CUI. Exact matching was used for three reasons. First, as noted above, systematic names are rare in product ingredient labels so complex matching schemes are generally unnecessary. Trivial names are easily parsed into terms that can be matched exactly. Second, PubChem and UMLS entries often have dozens, sometimes hundreds, of synonyms, so a trivial name appearing in a product ingredient list is likely to be among those synonyms. Third, string matching techniques that use Dice's coefficient, edit distance, and Levenshtein ratio are prone to false positives and false negatives when dealing with chemical names. For example, "vitamin a" and "vitamin e" have a high Levenshtein ratio but are different chemicals (false positive), whereas "dimethyl ether" and "methoxymethane" have a low Levenshtein ratio but are the same chemical (false negative). A dictionary-based approach using exact term-by-term matching is therefore the best method to map an ingredient name to a chemical identifier.
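Exact term-by-term matching amounts to a dictionary lookup keyed on the sequence of terms. The sketch below is a simplified illustration, not the database queries used in the analysis; the function names are hypothetical, and the only synonym data shown are the sodium hydroxide synonyms and CID mentioned earlier.

```python
def build_synonym_index(synonyms):
    """Map each synonym, as a tuple of lowercase terms, to its identifiers.

    synonyms is an iterable of (identifier, phrase) pairs, where the
    identifier is a CID or CUI.
    """
    index = {}
    for identifier, phrase in synonyms:
        key = tuple(phrase.lower().split())
        index.setdefault(key, set()).add(identifier)
    return index


def match_ingredient(name, index):
    """Return identifiers whose synonym matches the name term for term."""
    # Tuples are equal only if they have the same number of terms and every
    # term matches, so n-term names are compared only with n-term synonyms.
    return index.get(tuple(name.lower().split()), set())


if __name__ == "__main__":
    index = build_synonym_index([
        (14798, "sodium hydroxide"),
        (14798, "caustic soda"),
        (14798, "soda lye"),
    ])
    print(match_ingredient("Caustic Soda", index))
    print(match_ingredient("sodium", index))
```

A lookup of this kind returns nothing for near misses, which trades some recall for the absence of the false positives discussed above.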