The Sister Study Cohort: Baseline Methods and Participant Characteristics

Background: The Sister Study was designed to address gaps in the study of environment and breast cancer by taking advantage of more frequent breast cancer diagnoses among women with a sister history of breast cancer and the presumed enrichment of shared environmental and genetic exposures. Objective: The Sister Study sought a large cohort of women never diagnosed with breast cancer but who had a sister (full or half) diagnosed with breast cancer. Methods: A multifaceted national effort employed novel strategies to recruit a diverse cohort, and collected biological and environmental samples and extensive data on potential breast cancer risk factors. Results: The Sister Study enrolled 50,884 U.S. and Puerto Rican women 35–74 y of age (median 56 y). Although the majority were non-Hispanic white, well educated, and economically well off, substantial numbers of harder-to-recruit women also enrolled (race/ethnicity other than non-Hispanic white: 16%; no college degree: 35%; household income <$50,000: 26%). Although all had a biologic sister with breast cancer, 16.5% had average or lower risk of breast cancer according to the Breast Cancer Risk Assessment Tool (Gail score). Most were postmenopausal (66%), parous with a first full-term pregnancy <30 y of age (79%), never-smokers (56%) with body mass indexes (BMIs) of <29.9 kg/m² (70%). Few (5%) reported any cancer prior to enrollment. Conclusions: The Sister Study is a unique cohort designed to efficiently study environmental and genetic risk factors for breast cancer. Extensive exposure data over the life-course and baseline specimens provide important opportunities for studying breast cancer and other health outcomes in women. Collaborations are welcome. https://doi.org/10.1289/EHP1923


Introduction
Breast cancer is the leading (non-skin) cancer in U.S. women, with over 240,000 diagnoses of invasive breast cancer and 40,000 deaths estimated to have occurred in 2016 (SEER-NCI 2016). As the U.S. population ages, and more women enter the decades with the highest breast cancer incidence [median age at diagnosis = 62 y of age (SEER-NCI 2016)], these numbers are expected to rise. Both invasive and in situ breast cancer can lead to significant morbidity and health care resource utilization (Feiten et al. 2014;Fontes et al. 2016;Scott et al. 2016;Tian et al. 2013). Known risk factors explain little of the variation in breast cancer risk, and heritability is modest (Ford et al. 1995;Mucci et al. 2016).
Responding to public concerns, we proposed a novel approach to the study of environment and breast cancer. At that time, there were already many large U.S. cohort studies of women's health generally or breast cancer specifically (Belanger et al. 1978;Colditz and Hankinson 2005;Hays et al. 2003;Russell et al. 2001;Women's Health Initiative Study Group 1998). Although these studies collected breast cancer incidence data, many did not collect biological samples from the full cohort and few focused on non-lifestyle environmental factors. Large population-based case-control studies (Gammon et al. 2002) did focus on environmental exposures, collecting both environmental data and biological samples, but retrospective studies are subject to bias in assessing exposures or biologic measurements that might change following breast cancer diagnosis or treatment. Thus, we saw the need for a large prospective study focused on environmental and genetic drivers of breast cancer risk, a necessity for studying a disease with complex etiology and potentially long time course between relevant exposures and clinical disease (Swerdlow et al. 2011;Weinberg et al. 2007). The Sister Study, a prospective study of 50,884 U.S. women who have had at least one sister diagnosed with breast cancer but had no personal history of breast cancer at enrollment, was designed to fill this gap. The study was not designed around one particular a priori hypothesis. The primary objective was to create a resource from which to study current and emerging hypotheses regarding environmental and genetic risk factors for breast cancer. By collecting data on a wide range of potential risk factors, including commonly studied and novel exposures, along with environmental and biological samples, we hoped to create a framework from which to think more broadly about environmental causes and gene-environment interactions. 
The approach was premised on the general paradigm that by studying genetic and environmental factors in a cohort of women with enhanced risk we would be much more likely than in previous studies to identify preventable risk factors for breast cancer (Weinberg et al. 2007). There were a number of specific and broadly defined environmental factors of interest at study initiation, including vitamin D, light at night, hormone replacement therapies, diet, pesticides, solvents, air pollution, personal care products that may contain endocrine disruptors, environmental tobacco smoke, organochlorines, and exposure to medical hyperstimulation of the ovaries (see Table S1). The prospective design allows us to assess these and other exposures before disease onset, avoiding biases inherent in case-control studies. In addition, it was recognized that hypotheses of interest at the time the study was initiated might no longer be of interest by the time sufficient cases accrued. Thus, the prospective design with ongoing data collection also creates a framework for addressing future hypotheses as science advances over the follow-up period (see Table S1), and for studying health outcomes other than breast cancer.
Assembling and following a cohort of women who have had a sister diagnosed with breast cancer provided two advantages. First, the sisters of women with breast cancer are at approximately twice the risk of breast cancer as the general female population (Collaborative Group on Hormonal Factors in Breast Cancer 2001), allowing faster case accrual than in similarly sized cohorts that are not enriched by family history. The increased risk in sisters could be due to enhanced genetic susceptibility, shared environmental risk factors, or some combination of the two. The prevalence of multiple gene variants related to breast cancer is expected to be higher in a cohort recruited based on sister history (Weinberg et al. 2007). This may confer increased power for finding environmental factors that interact with genetic factors, as demonstrated mathematically by Weinberg et al. (2007) and illustrated in a recent analysis of polycyclic aromatic hydrocarbon (PAH) exposure, familial risk, and breast cancer (Shen et al. 2017). The prevalence of relevant environmental exposures may also be increased to the extent that sisters share similar experiences, further enhancing statistical power (Weinberg et al. 2007).
Second, in-depth information on exposures over the life course collected at baseline, along with high follow-up rates over time, are critical to the success of any long-term study. Sisters of women diagnosed with breast cancer potentially provide a very motivated and engaged cohort, enhancing retention and allowing us to collect data on commonly studied factors such as reproductive history, hormone use, and diet as well as less well-studied occupational and environmental exposures.
The Sister Study cohort has now matured to the point where there are sufficient cancer and non-cancer outcomes for etiologic studies. The purpose of this paper is to describe the study methods, which may be useful to others planning new cohorts, and to describe the baseline characteristics of Sister Study participants. Future publications will compare cohort participant characteristics to those of women in the National Health and Nutrition Examination Survey (NHANES), a nationally representative sample of the United States population, as well as provide additional details on the Sister Study biorepository.

Methods
The Sister Study is a long-term prospective cohort of women residing in the United States (including Puerto Rico) who have had a sister diagnosed with breast cancer but did not have breast cancer themselves at enrollment. Interest in risk factors for breast cancer drove design decisions, but the cohort is also appropriate for studies of other cancer and noncancer health outcomes and, through extended follow-up of all participants, for studies of cancer survivors.
The institutional review board (IRB) of the National Institute of Environmental Health Sciences and the Copernicus Group IRB approved the study. All participants provided written consent. Data included in this report come from Sister Study Data Release 5.0.1 (August 2015), unless otherwise noted.

Eligibility
Women residing in the United States, including Puerto Rico, were eligible for the Sister Study if they were 35-74 y of age, had a sister (full or half) diagnosed with primary breast cancer, and had not themselves ever had a diagnosis of ductal carcinoma in situ (DCIS) or invasive breast cancer. A history of cancer other than breast was not considered a basis for exclusion. Women with a prophylactic mastectomy were considered eligible given that they are still at risk for breast cancer, albeit very low risk, and preventive surgeries were documented. Special efforts were made to maximize inclusion of typically underrepresented women including nonwhite women, older women, and women of lower socioeconomic status.
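As an illustration only, the eligibility criteria above can be written as a short Python check. The record layout and field names are hypothetical and are not taken from the study's screening instruments; the sketch simply encodes the rules stated in this section, including that a history of other cancers or a prophylactic mastectomy did not exclude a woman.

```python
from dataclasses import dataclass


@dataclass
class Screener:
    """Hypothetical screening record; field names are illustrative only."""
    age_years: int
    lives_in_us_or_pr: bool            # United States, including Puerto Rico
    sister_with_breast_cancer: bool    # full or half sister
    own_dcis_or_invasive_bc: bool      # any personal diagnosis to date
    other_cancer_history: bool = False     # recorded, not a basis for exclusion
    prophylactic_mastectomy: bool = False  # documented, not a basis for exclusion


def is_eligible(s: Screener) -> bool:
    """Apply the eligibility criteria described in the text to one record."""
    return (
        s.lives_in_us_or_pr
        and 35 <= s.age_years <= 74
        and s.sister_with_breast_cancer
        and not s.own_dcis_or_invasive_bc
    )
```

Under these assumptions, a 74-year-old woman with a prior non-breast cancer and a sister with breast cancer would be eligible, whereas a 34-year-old woman would not.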

Recruitment
Women were recruited in a "Vanguard" run-in phase in selected cities (Phoenix, AZ, Providence, RI, Tampa, FL, and St. Louis, MO) beginning in July 2003. In 2004, recruitment was expanded from the four pilot cities to the four states in which each city was located. Then, following a national press release in October 2004, recruitment was extended to all 50 U.S. states and Puerto Rico. Enrollment of women in most demographic groups ended 1 April 2008, but continued through March 2009 for underrepresented groups (i.e., African American, Latina, Asian American, Native American, less than a college degree, ≥65 y of age). To enroll a diverse group of women with different educational levels, job exposures, and ages, recruitment was multifaceted, ranging from community-based local efforts to nationally endorsed campaigns, as described in detail in Appendix A in the Supplemental Material.
Because there are no lists of women with a sister with breast cancer, recruitment targeted women more broadly, including the general population and breast cancer survivors who could lead us to eligible women. The approaches most often used were a) word-of-mouth and flyer distribution through breast cancer support and advocacy groups, women's volunteer organizations, enrolled Sister Study participants and contacts made at local and national women's events; b) outreach through hospitals, mammography centers, churches, unions, and trade organizations; and c) direct mail, mass emails, and media outlets (television, web, radio, newspapers, and magazines).
Recruitment materials included brochures and flyers in English and Spanish, some with tailored messaging for women in trades, older women, and women of various races and ethnicities. A brief video featuring early study participants and their sisters was also distributed. Giveaways such as fans, pins, notepads, and logo magnets with contact information also helped promote the study. These materials were provided to recruitment staff, volunteers, and organizations, along with study talking points and sample newspaper articles for local media.
A direct mail and email campaign also targeted minorities, seniors, and women in trades. For example, postcards were mailed to African-American women using a purchased list of confirmed addresses, and emails were sent to women receiving Essence, People en español, Blacks N LA, and Las Comadres, and to a commercial list of Asian-American women.
We worked with a wide range of breast cancer and minority advocacy organizations to promote the study (see Appendix A in the Supplemental Material). The Sister Study principal investigator (PI) promoted the study in presentations to women's groups, minority health organizations, and other groups, seeking input on design and research questions. Sister Study recruiters distributed study materials at local and national conferences and trade shows (e.g., the National Hair Show in Atlanta, Georgia) and made materials available to local volunteers, partner organizations, hospitals, mammography centers, minority-focused groups, and breast cancer support groups. The Dr. Susan Love Research Foundation selected the Sister Study as the first study for which the Love/Avon Army of Women (https://www.armyofwomen.org/) would help recruit participants.
We used free and paid media as well as celebrity endorsements. We distributed press releases and media kits including talking points, ads, and newsletter text. Brief mentions and feature articles appeared in magazines such as Woman's Day, Ladies Home Journal, Essence, and People. Articles in AARP magazines and bulletins (English and Spanish) reached older women. The study PI and participant volunteers appeared live and via remote satellite on local and national morning news programs and conducted radio interviews and media tours. Radio public service announcements in English and Spanish also were distributed. Radio campaigns featured on-air mentions by radio station personalities. Novel approaches included billboards and bus ads in selected cities and a media campaign with Reach Media's Tom Joyner Morning Show, a popular national radio show aimed at the African-American community, which included live on-air mentions by Tom Joyner, on-air interviews, and a web campaign on his Black America Web site.

Enrollment
Interested women completed a brief eligibility questionnaire via website or telephone call. Women found to be eligible through website screening were asked to confirm interest by making a telephone call to the study. Those who agreed to enroll were mailed study kits containing self-administered questionnaires, consent documents, support materials for the telephone interview and home visit, and supplies and instructions for collecting urine, toenail, and house-dust samples. A follow-up interview confirmed kit receipt, reviewed materials, and explained what the participant could expect at the home visit. Eligible women had to complete a two-part computer-assisted telephone interview (CATI1 and CATI2) and a home visit to be considered enrolled in the cohort. Women who signed up for the study but did not complete required baseline activities are being followed for mortality and cause of death via the National Death Index (NDI), provided they completed at least one telephone interview.
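The enrollment rule described above can be sketched as a small Python function. This is a minimal illustration under stated assumptions: the function name, argument names, and status labels are hypothetical, and the sketch covers only the three required baseline activities named in the text.

```python
def cohort_status(cati1_done: bool, cati2_done: bool, home_visit_done: bool) -> str:
    """Classify a woman who signed up, following the rule in the text.

    Status labels are illustrative, not the study's own coding.
    """
    # Full enrollment requires both parts of the telephone interview
    # and the home visit.
    if cati1_done and cati2_done and home_visit_done:
        return "enrolled"
    # Women with at least one telephone interview are followed for
    # mortality and cause of death via the National Death Index.
    if cati1_done or cati2_done:
        return "mortality follow-up only"
    return "not followed"
```

For example, a woman who completed CATI1 and the home visit but not CATI2 would fall into the mortality follow-up group rather than the enrolled cohort.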
Data collection. An overview of baseline Sister Study data collection is provided in Table 1. This table includes data collection components and their corresponding details.
Questionnaires. Trained interviewers administered the two-part telephone interview (CATI1 and CATI2) in either English or Spanish. The interview, which took about 2 h to complete overall, collected information on breast cancer risk factors, residential history, medical history, lifetime occupational history, reproductive history, socioeconomic status, and other information, including sister history of breast cancer (https://sisterstudy.niehs.nih.gov/english/baseline.htm and Table S1). The questionnaires were longer than those in other cohort studies to allow for collection of information on commonly studied known and potential risk factors as well as to collect data on occupational and environmental exposures that were not being collected in most other prospective studies.
Environmental and occupational exposures of interest included but were not limited to chemicals previously identified as mammary carcinogens or endocrine disruptors (Bennett and Davis 2002;Rudel et al. 2007) and shift work; we asked about history of working in industries and occupations where exposure to these factors was possible as well as exposures at home, such as pesticides, paints, or hobby materials, and gardening. In addition to the time of enrollment, questions focused on periods that may be relevant to breast cancer risk, including in utero and childhood exposures, particularly around menarche. Addresses for current, longest adult, and longest childhood residence have been geocoded for linkage with various GIS databases for environmental exposures, such as air pollution, and census data for socioeconomic and neighborhood factors.

Table 1 (excerpt). Baseline data collection components: self-administered questionnaires.
Questionnaire covering the 24 h before the home visit. This questionnaire was included in the mailed kit to be filled out for the 24 h just before the examiner visit; the completed questionnaire was given to the home visit examiner (phlebotomist). Data collected: medications, smoking, alcohol, and chemical exposures in the 24 h before blood collection; environmental exposures in the weeks preceding blood collection.
Other questionnaires (self-administered; paper): family history, food frequency (Block 98), and personal care products. Included in the study kit mailed to participants; completed questionnaires usually given to the home visit examiner but could be returned later by the participant. Data collected:
• Family history questionnaire: participants' birth characteristics and mother's pregnancy experiences (in utero exposures); cancer history of first-degree relatives and others; noncancer medical conditions in biological family members.
• Dietary questionnaire: frequency and amount of foods consumed in the last 12 months; meal patterns; complementary and alternative medicines.
• Personal care products questionnaire: current and childhood (10-13 y of age) use of products such as makeup, moisturizers, other creams and lotions, acne-related products, skin lighteners or tanners, wrinkle-reducing products, talc, douches, hair care products and dyes, nail care products, mouthwash, deodorant, antiperspirant.
Participants completed self-administered questionnaires on diet, personal care products, family history of cancer, and early-life exposures, including the participant's mother's exposures during her pregnancy with the participant. The food frequency questionnaire (Block 98) (Boucher et al. 2006) was supplemented with questions about cooking practices, dietary intake of phytoestrogens, childhood diet, vitamin supplements, and complementary and alternative medicines and practices.
Home visit. During a 45-min visit to participants' homes (or in rare instances another site such as a doctor's office), female examiners from a national in-home phlebotomy service [Examination Management Services, Inc. (EMSI)] collected fasting blood samples, anthropometry data (height, weight, and waist and hip circumference), and blood pressure following standardized study protocols. Ahead of the home visit, participants were sent detailed instructions and materials for all specimen collections: first morning void urine (∼60 mL; phthalate-free cup), toenail clippings (all; polish-free), and house dust (six alcohol wipes and plastic bag; two wipes each for the tops of three door frames in different rooms). Written consent was obtained prior to collecting biological samples. Participants were asked to record the date of their last menstrual period, the use of medications and hormones, smoking, and alcohol in the 24 h prior to the home visit, and information about the self-collected urine, dust, and toenail samples. Examiners retrieved self-administered questionnaires and participant-collected biological and environmental samples.
Examiner-drawn blood samples totaled ∼45 mL and were collected in six Becton Dickinson Vacutainer® tubes. These included two EDTA tubes (one purple top, one metal-free tan top), two red-top serum tubes, and two yellow-top ACD-B tubes. Red-top tubes were centrifuged in the field and serum and clot separated prior to shipping. Serum was transferred to an amber tube to protect from sunlight. Using custom-designed multi-compartment Styrofoam packaging, urine and serum were shipped cold, and whole blood and clot were shipped at ambient temperature to a central laboratory (Social & Scientific Systems, Inc., Durham, NC) for processing and storage. In the event of unsuccessful blood collection, saliva was collected for DNA analyses (Oragene DNA self-collection saliva kit; DNA Genotek, Ottawa, ON, Canada). All samples were barcoded with participant ID prior to shipping.
Specimen processing. Upon receipt at the central laboratory, any evident adverse conditions and examiner errors were documented. Kit contents were scanned and inventoried, and daily reports on examiner performance were fed back to EMSI; 92.6% of kits were received at the central laboratory within 24 h of collection.
Serum was stored in 0.5-mL CryoBioSystem™ (CBS™) straws in liquid nitrogen (LN) vapor phase. Blood clots were stored in −80 °C freezers and LN vapor phase. EDTA whole blood was stored in a cryovial, and spotted (60 µL per spot) and stored on two types of dry blood storage cards: a card chemically impregnated to lyse cells and stabilize DNA (Whatman FTA Classic Card) and an untreated card (Whatman 903 Protein Saver Card). Remaining EDTA whole blood was centrifuged and the plasma was stored in 0.5-mL CBS™ straws in LN vapor phase. A 3.0-mL EDTA BD Vacutainer® tube (tan top, metal free) was stored untouched at −20 °C for future analysis of metals, trace elements, and environmental contaminants. One ACD-B whole blood Vacutainer® tube (yellow top) was aliquoted and cryopreserved with 10% DMSO (dimethylsulfoxide) using a freezer that reduces the temperature to −80 °C in preprogrammed steps to improve cell viability. Twelve percent of the time, this ACD-B tube was selected for lymphocyte isolation under an alternative protocol, with selection based on an algorithm that oversampled women from a high-risk group based on age of enrollment and the affected sister's age at diagnosis. The buffy coat (lymphocytes) was isolated from the whole blood, washed, resuspended, and stored in LN vapor phase. For urine, a basic chemistry urinalysis (Multistix Pro 10LS reagent strips) was performed immediately upon receipt to measure protein, creatinine, blood, leukocytes, nitrite, glucose, ketone, pH, and specific gravity (Bayer Clinitek 500). Urine was aliquoted into twenty 0.5-mL CBS™ straws, five 1.0-mL vials, and one 3.6-mL vial. The straws were stored in LN vapor phase and vials in −80 °C mechanical freezers. Toenails were stored in envelopes at ambient temperature, with the large toenails kept separate from all other toenails. Participants collected dust from three locations in their home using prepackaged alcohol wipes, which were stored at −20 °C after receipt.
Samples from a single individual were stored across multiple freezers, and extensive quality assurance measures were put in place to track and document conditions for each sample. See https://sisterstudy.niehs.nih.gov/English/specimen.htm for further details.

Follow-up
Participants are contacted each year for either a short (∼two pages) annual update questionnaire or a detailed follow-up questionnaire (approximately every 2-3 y; two to three booklets of 20-30 pages each). Questionnaires are typically offered on the web first, then on paper, followed by telephone contact and a CATI questionnaire. To ensure maximum retention and response rates, there is a comprehensive prompting protocol for nonresponders at each stage, which can include email, postal mail, and/or telephone, as appropriate to the participant's contact history. Women who are more vulnerable to nonresponse (e.g., women with a history of slow response) are assigned a personal study advocate who regularly reaches out to them to encourage completion of study activities and help with prioritization of tasks. Follow-up questionnaires include updates on menopausal status and health, including incident breast cancer, updates on exposures, and new exposures of interest. Women reporting breast cancer are asked to provide additional details and permission to retrieve medical records and paraffin-embedded tumor tissue blocks. Pathology reports or authorization to retrieve pathology reports are requested following report of other cancers. In 2014, a second home visit (including a second blood draw) was completed for a subset of 2,461 Sister Study participants (breast cancer cases and a random sample of the cohort).

Response
As shown in Figure 1 and Table S2, nearly 90,000 people completed an eligibility questionnaire through the website (62.1%) or by telephone (37.9%). Very few women who completed a web screener were found to be ineligible for the study during their subsequent enrollment call, whereas 16.0% of those who only telephoned were ineligible. In all, 62,813 women provided verbal consent to join the study. Of the women who signed up, 81.0% (n = 50,884) completed required baseline activities and were enrolled; an additional 3,066 completed some, but not all, of the required baseline activities before the enrollment end date and are being followed through mortality linkage (and possibly cancer registry linkage) as a "passive cohort." Most commonly, participants completed at least one telephone interview prior to the home exam; however, 18.7% of participants had their home visit prior to completing CATI1, and 38.0% completed it after CATI1 and before CATI2. The median time between CATI1 and home visit blood draw for all women was 22 d (interquartile range, 10-44 d).
The vast majority of participants also completed all four self-administered questionnaires and provided all biological samples, along with written consent to use their samples (see Table S3). Women in the passive cohort most often completed only CATI1 [although some also completed CATI2 (41%) or provided biospecimens], completing home exams prior to CATI completion. Biological samples from these women were anonymized for pilot studies.
Among participants, 40.2% reported their primary referral source was a sister with breast cancer or some other family member or friend (see Table S4). Print materials such as magazines and newspapers were the next most cited (32.3%). All direct mail and email efforts yielded enrollees; however, success rates varied. Emails sent through a trusted source such as Essence or the Susan Love Army of Women were much more successful than an unendorsed email to women from a purchased list (data not shown). Endorsements were also successful, particularly those of TV newswoman Robin Roberts and Luisa Gándara, wife of the Governor of Puerto Rico, resulting in hundreds of African-American enrollees and over a thousand Latina enrollees (data not shown).

Baseline Characteristics
The median age of Sister Study participants was 55.6 y (range, 35.0-76.5 y) at completion of all required enrollment activities (Table 2). They were predominantly non-Hispanic white (83.7%) and married or living as married (74.7%). The vast majority had some college education (84.7%), with just over 50% having a bachelor's degree or higher. Educational attainment was high across all race/ethnicity groups. The fraction that were nonwhite decreased with older age (see Figure S1). Participants were generally well off; approximately two-thirds reported having a total annual household income of between $50,000 and $200,000. Household size was generally small and only a quarter (26.5%) reported children (<18 y of age) in the household at time of enrollment. Residential distribution across the United States and Puerto Rico is shown in Figure 2; women from all 50 U.S. states and the District of Columbia participated.
As seen in Table 2, women in the passive cohort were more likely than full participants to be younger, of a race/ethnicity other than non-Hispanic white, unmarried, have less than a bachelor's degree, and have a household income of less than $50,000; they were also more likely to have children in the household.
Over half of participants never smoked and only 8.2% were current smokers (Table 3). Light/moderate alcohol consumption was common, with only 19.1% reporting no current alcohol consumption. Approximately two-thirds of participants had a BMI in the normal or overweight range (37.1% and 31.8%, respectively). Although never having been diagnosed with DCIS or breast cancer was a study requirement, 59 women were diagnosed sometime before completing their final enrollment activity and were retained in the cohort. Approximately 5% of Sister Study participants reported having been diagnosed with other cancers prior to enrollment. As can be seen in Table 4, most participants had menarche at 12-13 y of age, most were parous (median number of births, 2), completed their first full-term pregnancy in their 20s (median age, 24 y), and had used hormonal birth control (85.2%). Most were postmenopausal (65.5%) with 15.6% reporting surgical, medical, or other (nonnatural) types of menopause. Nearly half had used some form of hormone therapy. Prevalence of health screenings was high; virtually all participants had had at least one mammogram and 80.5% reported having had one within the previous year.
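For readers less familiar with the BMI categories mentioned above, the following Python sketch shows how BMI is computed from the height and weight measured at the home visit and how a value might be categorized. The cut points used (18.5, 25.0, and 30.0 kg/m²) are the conventional World Health Organization adult categories and are an assumption here; the paper's tables may group values differently. Function names are hypothetical.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index in kg/m^2 from weight (kg) and height (cm)."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Categorize BMI using conventional WHO adult cut points (assumed)."""
    if value < 18.5:
        return "underweight"
    if value < 25.0:
        return "normal"
    if value < 30.0:
        return "overweight"
    return "obese"
```

For instance, a participant weighing 70 kg at a height of 170 cm has a BMI of about 24.2 kg/m² and would fall in the normal range under these cut points.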
There were no restrictions on the number of sisters within a family that could join the Sister Study. We identified 4,318 sibships with more than one sister in the cohort through linkage using birth dates and other familial details, comprising 18.8% of the cohort.
Although the vast majority of participants (95.8%) had at least one full biologic sister who had been diagnosed with breast cancer as of enrollment (Table 5), women with half-sister(s) with breast cancer were also eligible. At enrollment, most participants had a single sister with breast cancer (89.8%); 18.7% had a mother with breast cancer. Over half of participants (57%) had a first-degree female relative (full sister, mother, or daughter) with young onset (<50 y of age at diagnosis) disease. Those families were targeted for a family-based "Two Sister Study" (Fei et al. 2012). Just under 4% had a first-degree family history of ovarian cancer.
Although most women had a Gail score in the high-risk range at enrollment, 16.5% had a 5-y risk score below 1.67%, the National Cancer Institute cutoff for defining high risk. Approximately two-thirds had a lifetime Gail score of <20%, another cut point used to indicate high risk (American Cancer Society 2016; Graubard et al. 2010). We did not ask directly about Ashkenazi Jewish heritage, but 22% reported Eastern European ancestry (data not shown). Few reported testing for BRCA1 or BRCA2 (3.1%). Of these, 17.3% (n = 256) reported being told they had a mutation in a known breast cancer gene.
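The two cut points cited above can be applied to precomputed Gail scores as in the following sketch. It does not compute the Gail score itself, which requires the Breast Cancer Risk Assessment Tool; the function name and dictionary keys are hypothetical, and treating a value exactly at the cut point as high risk is an assumption consistent with the text's description of scores "below 1.67%" and "<20%" as not high risk.

```python
FIVE_YEAR_CUTOFF_PCT = 1.67   # 5-y risk cut point cited in the text
LIFETIME_CUTOFF_PCT = 20.0    # lifetime risk cut point cited in the text


def gail_risk_flags(five_year_pct: float, lifetime_pct: float) -> dict:
    """Flag high risk on each scale from precomputed Gail scores (in %)."""
    return {
        "high_5y": five_year_pct >= FIVE_YEAR_CUTOFF_PCT,
        "high_lifetime": lifetime_pct >= LIFETIME_CUTOFF_PCT,
    }
```

A participant with a 5-y score of 1.8% and a lifetime score of 15% would be flagged as high risk on the 5-y scale only, which illustrates how the two definitions can disagree for the same woman.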

Cohort Retention and Response Rates
Response rates (n of responses during field period / n assumed alive at start of field period, where n of responses = n of completed questionnaires + n of deceased during field period) for the first three short updates (i.e., annual update questionnaires) were 96.3%, 95.6%, and 94.0%, respectively. Response rates for the first three detailed follow-ups were only slightly lower at 94.9%, 92.1%, and 91.0% (completed August 2016) despite the significantly longer questionnaires (see the Sister Study website for follow-up questionnaires: https://sisterstudy.niehs.nih.gov/English/fu-data.htm). As of July 2017, 1,643 (3.2%) participants are known to be deceased and 1,716 (3.4%) of the 50,884 women enrolled in the Sister Study have withdrawn (i.e., requested no further study contact, including just 2 participants who requested that their data not be included in any new analyses). Thus at least 47,525 (93.4%) women are still actively participating in the cohort. To date, medical records have been obtained for 81.1% of those reporting an incident breast cancer diagnosis included in Data Release 5 (https://sisterstudy.niehs.nih.gov/English/brca-validation.htm). Using data from an earlier data release (Data Release 4; 82.0% with medical records), we evaluated the validity of a self-reported diagnosis (D'Aloisio et al. 2017). Among those with medical records, the positive predictive value (PPV) was better than 99% for total and invasive cancer. The PPV was also high for self-report of breast cancer subtypes such as ductal cancer (98.8%) and estrogen-receptor positive breast cancer (99.3%) (https://sisterstudy.niehs.nih.gov/English/brca-validation.htm and D'Aloisio et al. 2017). Tumor tissue blocks have been obtained for 1,683 women with incident breast cancer as of Data Release 5.
Table footnotes: c Participant is postmenopausal if no menstrual period in the last 12 mos or had any qualifying medical intervention that caused menstrual periods to cease. Qualifying interventions include both ovaries removed; chemo/radiation that stopped periods; hysterectomy, ablation, or embolization and ≥55 y of age; ovarian suppressing drugs or contraception that eliminated menstrual flow and ≥55 y of age. d Hormonal birth control includes birth control pills or patches, Norplant implants, Depo-Provera injections, IUD containing hormones. e Includes participants who have had a surgical biopsy other than a needle biopsy (e.g., excisional biopsy).
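The response-rate definition and the retention figures reported above can be checked with a short calculation. The following is a minimal sketch, not the study's own code: the function and variable names are illustrative assumptions, and the inputs to the `response_rate` example are hypothetical counts chosen only to show the formula (the paper reports the resulting percentages, not the underlying counts).

```python
def response_rate(completed: int, deceased_in_period: int, alive_at_start: int) -> float:
    """Responses (completed questionnaires + deaths during the field period)
    divided by the number assumed alive at the start of the field period."""
    if alive_at_start <= 0:
        raise ValueError("alive_at_start must be positive")
    return (completed + deceased_in_period) / alive_at_start


# Counts reported in the text (as of July 2017).
ENROLLED = 50_884
DECEASED = 1_643
WITHDRAWN = 1_716

# Women neither known to be deceased nor withdrawn.
active = ENROLLED - DECEASED - WITHDRAWN
active_pct = round(100 * active / ENROLLED, 1)
deceased_pct = round(100 * DECEASED / ENROLLED, 1)
withdrawn_pct = round(100 * WITHDRAWN / ENROLLED, 1)

print(f"active: {active} ({active_pct}%)")
print(f"deceased: {deceased_pct}%, withdrawn: {withdrawn_pct}%")

# Hypothetical counts, for illustration of the formula only.
example_rate = response_rate(completed=950, deceased_in_period=13, alive_at_start=1000)
print(f"example response rate: {example_rate:.1%}")
```

Counting women who died during the field period as responses keeps deaths from lowering the rate, so the figure reflects participation among women who were able to respond.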

Discussion
The Sister Study was designed to address concerns that not enough was being done to evaluate the impact of widespread environmental exposure to chemicals on trends in breast cancer incidence. In many ways, the Sister Study design anticipated the subsequent recommendations of commissioned review panels such as the Institute of Medicine (2012) and the Interagency Breast Cancer & Environmental Research Coordinating Committee (2013). We have created a resource to, as these panels suggested, prospectively address questions about genetic factors and environmental exposures during time periods of potentially higher sensitivity to breast cancer induction and/or progression, including gestation, early life, childhood, and the reproductive and perimenopausal years. Furthermore, the extensive and varied data and biospecimens collected are supporting a wide array of multidisciplinary research projects.
One of the main difficulties inherent in cohort studies focusing on breast cancer is the need for extensive data collection (reproductive data, known and suspected risk factors, potential confounders and effect measure modifiers as well as biological and environmental samples) while achieving and maintaining high participation and retention rates. The Sister Study approach was to recruit women already disposed to be engaged and motivated by their personal experience with family members' diagnoses to participate in a long-term, detailed data collection effort aimed at better understanding, and possibly preventing, breast cancer. Not coincidentally these same women are, on average, at modestly elevated risk of breast cancer themselves, making them good candidates for a well-powered study, especially given the need to evaluate gene-environment interactions.
Requirements for the Sister Study were substantial: the two-part baseline interview averaged 2 h, and the home visit required as much as another hour of participant time. The number of women contacting the Sister Study (∼89,000) is evidence of great interest, and the high proportion of women who signed up that completed all the required baseline activities (81.0%) attests to the level of participant engagement and dedication. Women who did not complete baseline activities were more likely to be nonwhite and younger, with relatively lower income and education levels, and more likely to either live alone or have larger households. Nonetheless, thousands of women with these characteristics did complete the baseline activities.
Great effort was devoted to enhancing the number of nonwhite women participating in the Sister Study; 16.3% of the cohort consider themselves a race/ethnicity other than non-Hispanic white. Although lower than the percentage in the United States, that fraction is higher than in other national cohorts (Hays et al. 2003; VanKim et al. 2017) with the obvious exceptions of the Black Women's Health Study (Russell et al. 2001) and the Multi-Ethnic Cohort (Kolonel et al. 2000), which were designed to target specific racial/ethnic groups. Regardless of the proportion of nonwhite women in the United States and Puerto Rico, as long as the relevant data are collected, these cohort participants can and do serve as the basis for valid conclusions regarding potential risk and modifying factors within and among differing race/ethnicity groups.

Design Considerations
The Sister Study is a risk-based prospective cohort study (Weinberg et al. 2007). One advantage of this design is the more rapid accrual of incident cases than non-risk-based designs, markedly enhancing power to detect etiologic factors and gene-environment interactions. The increased power is driven, in part, by higher prevalence of potentially relevant environmental risk factors and by increased prevalence of multiple relatively uncommon genetic susceptibility factors, rather than by rare high penetrance genes (Weinberg et al. 2007). Modest enrichment of even unidentified genetic risk factors will also enhance the ability to detect any environmental factors with which they interact (Shen et al. 2017; Weinberg et al. 2007). Based on age-specific incidence rates of invasive breast cancer from SEER (the National Cancer Institute's Surveillance, Epidemiology, and End Results program), proposed age ranges for recruiting, and an on-average 2-fold increased risk among women with a sister with breast cancer, we estimated that approximately 1,500 cases of invasive breast cancer would accrue in the first 5 y of the Sister Study (300 cases per year). As of Data Release 5 there were 2,163 cases of invasive breast cancer with an average follow-up time in the full cohort of 7.5 y (average number of cases per year = 288). As might be expected in a cohort of women with a sister history of breast cancer, breast screening is common (95% report a mammogram within the 2 y prior to baseline) and most of the invasive cancers (91%) are early stage (Stage 1, 66%; Stage 2, 25%).
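The comparison of projected and observed case accrual in the preceding paragraph is simple arithmetic, sketched below using only the figures reported in the text. The variable names are illustrative assumptions. Dividing cases by mean follow-up gives the paper's summary of average cases per year; it is not a person-time incidence rate.

```python
# Figures reported in the text.
PROJECTED_CASES = 1_500        # invasive cases expected in the first 5 y
PROJECTION_YEARS = 5
OBSERVED_CASES = 2_163         # invasive cases as of Data Release 5
MEAN_FOLLOW_UP_YEARS = 7.5     # average follow-up in the full cohort
STAGE_1_PCT = 66
STAGE_2_PCT = 25

projected_per_year = PROJECTED_CASES / PROJECTION_YEARS
observed_per_year = OBSERVED_CASES / MEAN_FOLLOW_UP_YEARS

# Observed accrual as a fraction of the design-stage projection.
observed_to_projected = observed_per_year / projected_per_year

# Early-stage share is the sum of the Stage 1 and Stage 2 percentages.
early_stage_pct = STAGE_1_PCT + STAGE_2_PCT

print(f"projected cases per year: {projected_per_year:.0f}")
print(f"observed cases per year:  {observed_per_year:.1f}")
print(f"observed / projected:     {observed_to_projected:.2f}")
print(f"early-stage invasive cancers: {early_stage_pct}%")
```

On these figures, observed accrual is within a few percent of the design-stage projection.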
At study initiation, there was no consensus on whether lobular carcinoma in situ (LCIS) should be considered breast cancer or a risk factor for breast cancer. Consequently, at that time some women diagnosed with LCIS were told they had breast cancer and others were told they did not. In order to avoid confusion to potential participants, LCIS was not considered an exclusion criterion for the Sister Study. Women with LCIS diagnosed prior to enrollment can be handled analytically in a number of ways, depending on study question.
As with other volunteer cohorts, participants in the Sister Study are not entirely representative of the U.S. population, although they do reside in all 50 U.S. states, the District of Columbia, and Puerto Rico. In addition to having a sister with breast cancer, they are generally older and more educated and more likely to be white, relatively economically well off, and healthy. However, population representativeness may not be an appropriate or necessary goal (Rothman et al. 2013). Rather, for a scientific study, as opposed to a study measuring population attributes (e.g., population prevalence of disease or exposures), the goal should be to achieve internal validity and scientific generalizability, not population representativeness (Rothman et al. 2013), something with which it is occasionally conflated. One would expect that whereas exposure prevalence may vary among specific study subgroups, and perhaps influence power to detect a substantive relative risk, the exposure would exert its effect in a mechanistically similar fashion. For the Sister Study, scientific generalizability, as opposed to representativeness, means the ability to draw conclusions about circumstances and mechanisms of disease etiology relevant to women in specific subgroups, as well as (potentially) to women in general. Nonetheless, as with other U.S. cohorts, the degree to which the Sister Study cohort is similar (or dissimilar) to the U.S. population is of interest and may affect interpretation of any findings. Consequently, a manuscript comparing Sister Study participants to women in NHANES, a nationally representative sample of the U.S. population, in terms of classic breast cancer risk factors, lifestyle factors, morbidity, and other factors, is in preparation.
Some might worry that women with a sister history of breast cancer would be overly vulnerable to highly penetrant gene mutations, such as those in BRCA1 or BRCA2. However, such genes are unlikely to dominate breast cancer etiology in the Sister Study cohort (Weinberg et al. 2007). Based on a 2% prevalence of BRCA-positive status in breast cancer cases, the prevalence in sisters of breast cancer cases would be about 1%. If the odds ratio is 10 for carriers, one would expect fewer than 8% of the invasive breast cancer cases in the first 5 y of the study to be BRCA-positive (Weinberg et al. 2007).
Although any cohort including an element of family history in the selection criteria would be expected to have an elevated risk relative to the general population, we believe it would be a misnomer to characterize the Sister Study as high-risk, a specific term typically defined by cut points. Rather, the Sister Study should be characterized as a cohort with modestly elevated average risk, composed of participants with a wide range of absolute risks. Given the modestly increased risk conferred by having a sister with breast cancer (approximately 2-fold) and the heterogeneity in risk among those with different family histories (Collaborative Group on Hormonal Factors in Breast Cancer 2001), as well as variation across individuals in established reproductive, lifestyle, and environmental factors, the Sister Study cohort includes participants with a wide range of risk levels. The notable breast cancer risk heterogeneity in the Sister Study is apparent in the distribution of Gail scores. At the lower end of the risk spectrum, 16.5% of the cohort had 5-y Gail scores indicating average or lower risk relative to U.S. women, whereas at the upper end, 32.5% were in a high-risk group (lifetime risk of breast cancer of at least 0.20) (American Cancer Society 2016; Graubard et al. 2010). Only 5.4% had two or more first-degree relatives with young-onset (<50 y of age) breast cancer, whereas three quarters of the Sister Study participants have only a half-sister or single first-degree relative with breast cancer (3.4% and 71.5%, respectively). This argues that the various biologic mechanisms underpinning breast cancer development are well represented in the Sister Study cohort.

Challenges and priorities
One of the first challenges with risk-based sampling is to identify those at elevated risk but without disease. Identifying and connecting with women who have a sister with breast cancer is not straightforward. There are no lists of such women. We considered strategies based on identifying women with breast cancer and recruiting their sisters. Cancer registries proved inefficient and impractical because of the large number of separate cancer registries that would need to be approached, each with its own application and IRB requirements, and the necessity of contacting the potential participants only through the sister with breast cancer rather than directly. Survival bias would also have become a serious issue. Recruiting through cases participating in existing case-control studies was also considered, but few investigators could share needed contact information.
Some studies have restricted recruitment to states with SEER registries to simplify breast cancer identification and case confirmation. However, many of our strategies involved nationwide approaches; restricting enrollment to selected states would have been a disincentive for our many partner organizations. Consequently, we prioritized geographic diversity over the ease of cancer validation.
By focusing on sisters of women with breast cancer, women without sisters are excluded. In fact, women from larger families were likely oversampled because the chance of having a sister with breast cancer rises with the number of sisters in a family.
Because the risk enhancement for women with many sisters but only one with breast cancer will be less than that for women with a single sister who has breast cancer, it is possible that sibship size could become a confounder in this cohort if relevant risk factors (e.g., parity, age at first birth) are associated with family size. Average family size in the United States was at its highest during the baby boom years (1946–1964), peaking at 3.7 children per woman in 1957, the year that a 50-y-old Sister Study participant enrolled in 2007 would have been born (CDC 1999).
The universe of eligible women is not known, nor do we know how many women would have known about the opportunity to enroll, but response to recruitment overall was good, with 74.8% of identified eligible women signing up for the study after talking to study staff and learning what would be expected of them. Of note, among women who did the eligibility screener by telephone, 92.1% of eligible women signed up for the study, as opposed to the 65.9% of web-screened eligible women who signed up. Had women been allowed to sign up for the Sister Study online after doing the web screener, rather than needing to subsequently call the Sister Study, it seems likely that a much higher proportion of web-screened women would have enrolled; however, subsequent retention might have been lower.
We received funding for additional outreach to women underrepresented in breast cancer research (nonwhite women, women with lower income and education, and older women). For example, women in the southern region of the United States were heavily recruited and response was good. The proportion of the final cohort from the South Atlantic states (21.6%) was similar to U.S. Census. The recruiting push in southern states helped increase the numbers of nonwhite women (mainly African Americans) in the Sister Study. Although black women and Latina women are included in the cohort in sufficient numbers for some stratified analyses, Asian-American and Native American women are not. Because family sizes tend to be smaller for some Asian groups (Pew Research Center 2017), requiring a sister with breast cancer likely resulted in a reduced pool of eligible Asian-American women.

Collaborative opportunities
The Sister Study is a rich resource for collaborative research involving scientists inside and outside of the National Institutes of Health. Collaborations may involve use of existing data or samples, and proposals for add-on studies involving new data collection are considered. Because we also collect information about non-breast cancers, and non-cancerous conditions, the Sister Study also offers the opportunity to evaluate environmental exposures with respect to these outcomes. The prospective nature of the study allows for generation and investigation of new hypotheses for a range of outcomes, including studies of cancer survivors. Procedures for requesting access to study data or for proposing add-on or nested substudies can be found on the study website at https://sisterstudy.niehs.nih.gov.

Conclusions
The Sister Study is a unique cohort designed to efficiently study environmental and genetic risk factors for breast cancer. Its risk-based design affords enhanced statistical power to detect gene-environment interactions. Our goal was to create a resource for studying environmental and genetic contributors to breast cancer and other diseases in women. To date more than 90 papers have been published in the peer-reviewed literature using Sister Study data. The Sister Study has provided the platform for seven extramurally funded grants, including three led by extramural collaborators, and is the basis for collaborative research with the Centers for Disease Control and Prevention (CDC) on survivorship and the impact of a breast cancer diagnosis on family members. In addition, the Sister Study participates in many cohort consortia focused on gene discovery, gene-environment interactions, and lifestyle and environmental risk factors for breast cancer and rare outcomes that cannot be studied in a single cohort. High rates of enrollment and participation over time as well as extensive data collected about exposures over the life-course and baseline biospecimens from nearly all women in the cohort provide many opportunities for studying breast cancer and other health outcomes in women.