Understanding the Spatial Clustering of Severe Acute Respiratory Syndrome (SARS) in Hong Kong

We applied cartographic and geostatistical methods in analyzing the patterns of disease spread during the 2003 severe acute respiratory syndrome (SARS) outbreak in Hong Kong using geographic information system (GIS) technology. We analyzed an integrated database that contained clinical and personal details on all 1,755 patients confirmed to have SARS from 15 February to 22 June 2003. Elementary mapping of disease occurrences in space and time simultaneously revealed the geographic extent of spread throughout the territory. Statistical surfaces created by the kernel method confirmed that SARS cases were highly clustered and identified distinct disease “hot spots.” Contextual analysis of mean and standard deviation of different density classes indicated that the period from day 1 (18 February) through day 16 (6 March) was the prodrome of the epidemic, whereas days 86 (15 May) to 106 (4 June) marked the declining phase of the outbreak. Origin-and-destination plots showed the directional bias and radius of spread of superspreading events. Integration of GIS technology into routine field epidemiologic surveillance can offer a real-time quantitative method for identifying and tracking the geospatial spread of infectious diseases, as our experience with SARS has demonstrated.

Since the emergence and rapid spread of the etiologic agent of severe acute respiratory syndrome (SARS), the SARS coronavirus (SARS-CoV), in late 2002 and during the first 6 months of 2003, great progress has been made in understanding the biology, pathogenesis, and epidemiology of both the disease and the virus. Much remains to be done, however, including the development of effective therapeutic interventions and of diagnostic tools with high sensitivity and specificity soon after the onset of clinical symptoms. The evaluation of key epidemiologic parameters and of the impact of different public health interventions in the various settings that experienced minor or major epidemics is also needed (Affonso et al. 2004; Cui et al. 2003; Lau et al. 2004; Leung et al., in press). In terms of outbreak control on the population level, many questions about "superspreading events" (SSEs) remain to be investigated. One such SSE was responsible for > 300 cases (out of a total of 1,755) in the Amoy Garden Housing Estate (AMOY) in the Hong Kong epidemic. Moreover, Donnelly et al. (2003) have demonstrated that there were clear geographic concentrations of microclusters of SARS cases where the density of infection varied widely between different districts.
The application of geographic information system (GIS) methods in health and health care is a relatively new approach that started to gain acceptance a decade ago (Higgs and Gould 2001; Meade and Earickson 2000). In particular, a wide variety of cartographic methods have become available for the mapping and analysis of communicable disease data since the defining work of Cliff and Haggett (1988) and Haggett (1994). Advances in new technologies enable the application of GIS to examine spatially related problems from different perspectives. In addition to the descriptive mapping function, GIS possesses capabilities of data manipulation and geostatistical analysis.
In the present study, we applied GIS technology in mapping and visualizing the SARS outbreak in Hong Kong. In this article we focus on cartographic and geostatistical methods in representing and analyzing the patterns of disease spread during the 2003 outbreak. We also address the utility and limitations of GIS as a real-time disease surveillance tool.

Materials and Methods
Data sources. We used spatial and nonspatial data in this study. Spatial data are geographic in nature and have a physical dimension or location in the real world. These are represented as points, lines, or area symbols, and they form the map base upon which SARS occurrences are depicted. Data on SARS incidence were derived from case-contact interviews that are text based; associated residential address data were first cleaned, checked for completeness and accuracy (e.g., Chinese-English transliteration of building and street names), and then geo-referenced to enable mapping.
We analyzed the SARSID integrated database (coordinated by the Department of Community Medicine, University of Hong Kong, on behalf of the Health, Welfare and Food Bureau, and derived from the Hong Kong Hospital Authority eSARS system and the Department of Health's Master List), which contained details on all patients confirmed to have SARS and admitted to hospitals in Hong Kong throughout the entire epidemic, that is, from 15 February to 22 June 2003. The criteria for inclusion in the SARSID were radiographic evidence of infiltrates consistent with pneumonia, fever ≥ 38°C or history of such at any time in the past 2 days, and at least two of the following: a) history of chills in the past 2 days; b) cough (new or increased cough) or breathing difficulty; c) general malaise or myalgia; and d) known history of exposure. However, patients were excluded if an alternative diagnosis could fully explain their illness. Moreover, each case classified as confirmed SARS was verified by the Hong Kong Department of Health according to World Health Organization (WHO) guidelines on case definitions (WHO 2003). Eighty-two percent of the 1,755 cases listed as confirmed SARS had either reverse transcription-polymerase chain reaction results positive for SARS-CoV or a 4-fold increase in IgG antibodies in paired sera (at admission and 21 or 28 days after symptom onset). Two questionnaires (case questionnaire and case-contact survey) were administered, mostly through telephone interviews, to all SARS cases confirmed by the Department of Health, initially by four regional field offices and later by a central interviewing team of nurses, to record symptoms at presentation to the hospital and to identify contacts and events of probable significance to transmission.
A total of 1,709 confirmed cases (out of 1,755 total cases) were extracted for the analysis. Forty-six cases (i.e., 2.6% of the total) could not be pinpointed at an exact location because of inconsistencies in the address entries (So 2002).
Geostatistical analyses. We carried out three levels of analysis: a) an elementary analysis involving simple visual inspection of a geographic phenomenon; b) a cluster analysis attempting the identification of possible "hot spots"; and c) a contextual analysis aiming to explain relationships among geographic phenomena (Bailey and Gatrell 1995; Olson 1976).
At the elementary level, the spread of a disease in a community is revealed through the plotting of disease occurrences at residential addresses of the patients enabled with the address matching function in a GIS. Point by point is the simplest form of mapping disease occurrences without accounting for the magnitude at each location, but the sheer number and spread of points could have impeded effective reading of the event. A map of cumulative counts collapses the numerous observations into circles of varying sizes to signify differences in the magnitude of disease occurrences in the community. The circles are proportionally sized to reflect the number of occurrences at the sites, and geographic clustering of disease infection can then be clearly identified.
We also examined the spread of SARS over time on the basis of point patterns. Each disease occurrence was plotted spatially and the spread or dispersion of disease incidence was examined using nearest neighbor analysis based on the R scale. The nearest neighbor analysis is an accepted spatial statistical analysis used by environmental scientists to study species distribution (Krebs 1989) and by crime analysts to explain the levels of dispersion in crime and disorder data (Eck and Weisburd 1995). The R scale assumes that events will be randomly spaced unless something influences the distribution. Three different patterns are possible: clustered (0 ≤ R < 0.8), distributed randomly (0.8 ≤ R < 1.8), or with uniform spacing (1.8 ≤ R ≤ 2.149). A contagious process will give rise to a clustered pattern with near-zero R values.
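The R scale described above can be illustrated with a minimal sketch. It assumes that R is the usual nearest neighbor ratio, that is, the observed mean nearest neighbor distance divided by the distance expected under complete spatial randomness; the text does not spell out the formula, so this definition, the function names, and the toy coordinates are all illustrative assumptions. The class boundaries are the ones quoted in the text.

```python
import math


def nearest_neighbor_r(points, area):
    """Nearest neighbor ratio R (assumed definition, see lead-in).

    points: list of (x, y) coordinates in metres (hypothetical data).
    area:   size of the study region in square metres.
    """
    n = len(points)
    if n < 2 or area <= 0:
        raise ValueError("need at least two points and a positive area")
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        # Distance from point i to its closest other point.
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
    observed = total / n
    # Expected mean nearest neighbor distance for a random pattern.
    expected = 0.5 / math.sqrt(n / area)
    return observed / expected


def classify_r(r):
    """Apply the class boundaries quoted in the text."""
    if r < 0.8:
        return "clustered"
    if r < 1.8:
        return "random"
    return "uniform"


# Four cases packed into one corner of a 100 m x 100 m region.
tight = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Sixteen cases on a regular 10 m lattice inside a 40 m x 40 m region.
lattice = [(5.0 + 10.0 * i, 5.0 + 10.0 * j) for i in range(4) for j in range(4)]

print(classify_r(nearest_neighbor_r(tight, 100.0 * 100.0)))
print(classify_r(nearest_neighbor_r(lattice, 40.0 * 40.0)))
```

This brute-force version compares every pair of points, which is adequate for a few thousand cases; a spatial index would be preferable for larger data sets.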
Cluster analysis involves statistical mapping that generalizes the numerous observations into a statistical surface to highlight spatial variation. A 5-day incubation period, consistent with a previous gamma distribution parameter estimation exercise (Leung et al., in press), was used to restructure the data for a time-series study. A statistical surface was created by the kernel method (Bailey and Gatrell 1995) for each day to reveal daily changes of disease hot spots. A kernel size of 300 × 300 m² was used to reconstruct the territory of Hong Kong into a gridded surface of 208 columns and 151 rows, and disease occurrences within a bandwidth of 600 m from the kernel were summarized to yield density measures in terms of number of SARS cases per square meter. Each grid cell was then designated either as urban or suburban based upon land use classification, and its associated density measure was adjusted for the underlying variation in population density (i.e., kernel density × population density × grid cell size/1,000) to yield infection rates per 1,000 population. We adopted the approach by Kafadar (1996) but modified it to account for variation between urban or suburban population densities within a given district in Hong Kong (Table 1). Each urban or suburban grid cell was considered a homogeneous unit wherein its population density was apportioned according to the proportion of residents in the employed labor force.
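The kernel surface described above can be sketched as follows. The text gives the cell size (300 m) and bandwidth (600 m) but not the kernel function, so the quartic kernel used here is an assumption, as are the function name and the toy coordinates. The population adjustment is not reproduced, because it depends on the district figures in Table 1.

```python
import math


def kernel_density_grid(cases, x0, y0, cell, ncols, nrows, bandwidth):
    """Kernel density (cases per square metre) at each grid cell centre.

    cases:     list of (x, y) case locations in metres (hypothetical data).
    x0, y0:    lower-left corner of the grid.
    cell:      cell edge length in metres (300 in the text).
    bandwidth: search radius in metres (600 in the text).
    Uses a quartic kernel, which is an assumption; the source does not
    name the kernel function.
    """
    h2 = bandwidth ** 2
    scale = 3.0 / (math.pi * h2)  # normalizes the quartic kernel to unit volume
    grid = [[0.0] * ncols for _ in range(nrows)]
    for row in range(nrows):
        cy = y0 + (row + 0.5) * cell
        for col in range(ncols):
            cx = x0 + (col + 0.5) * cell
            density = 0.0
            for x, y in cases:
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                if d2 < h2:  # only cases inside the bandwidth contribute
                    density += scale * (1.0 - d2 / h2) ** 2
            grid[row][col] = density
    return grid


# One case at the centre of a 3 x 3 grid of 300 m cells.
surface = kernel_density_grid([(450.0, 450.0)], 0.0, 0.0, 300.0, 3, 3, 600.0)
for row in surface:
    print(["%.2e" % value for value in row])
```

The density peaks in the cell that contains the case and falls off smoothly to zero at the bandwidth, which is what turns scattered points into the continuous hot spot surface the maps display.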
Table 1 notes: (2002). a Sum of employed labor force. b Total urban areas within each district divided by district area. c Computed from urban-related occupation in employed labor force, defined as follows: rural-related occupation (includes agriculture and fishing); mining and quarrying; urban-related occupation (includes community, social, and personal services); construction; electricity, gas, and water; financing; insurance, real estates and business services; manufacturing; transport, storage, and communications; wholesale, retail, and import/export trades; restaurants and hotels; unclassified. d Marine data were not land based and thus were excluded from the study.
Environmental Health Perspectives • VOLUME 112 | NUMBER 15 | November 2004
We created 12 kernel maps adjusted for population at risk to characterize changes in disease hot spots on 12 prototypical days over 16 weeks in a chronologic sequence. The infection rates, which span a wide range, were collapsed into 15 classes to reduce the complexity of map representation. Each of the 15 classes was assigned a shade in proportion to the magnitudes, with darker shades representing higher densities of infection. Two kinds of indexes were employed to assess the extent of disease clustering: the R scale and Moran's I coefficient, the latter computed for the more highly connected queen's case, which considers a neighborhood of eight cells in a 3 × 3 matrix. Moran's I coefficient ranges between -1 and 1 and is interpreted in three bands: regionalization, or juxtaposition of similar values (0.6 ≤ I ≤ 1, indicating positive spatial autocorrelation); lack of autocorrelation, in which the actual arrangement of values is one that we would expect from a random distribution (-0.6 < I < 0.6, indicating no spatial correlation); and contrast, or a tendency for dissimilar values to cluster (-1 ≤ I ≤ -0.6, indicating negative spatial correlation).
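As a minimal sketch of the queen's case computation, the function below evaluates the standard global Moran's I formula with binary weights over the eight cells surrounding each grid cell. The function names and toy grids are illustrative assumptions, and the actual analysis was run in ArcGIS, whose weighting options may differ. The interpretation bands are the ones quoted in the text.

```python
def morans_i_queen(grid):
    """Global Moran's I on a rectangular grid with queen contiguity.

    grid: list of equal-length rows of numeric values (hypothetical data).
    Every cell is linked with weight 1 to each of its (up to) eight
    neighbors in the surrounding 3 x 3 window.
    """
    nrows, ncols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        raise ValueError("all values are identical; Moran's I is undefined")
    cross = 0.0   # sum of w_ij * (x_i - mean) * (x_j - mean)
    weights = 0   # sum of w_ij
    for r in range(nrows):
        for c in range(ncols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < nrows and 0 <= cc < ncols:
                        cross += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                        weights += 1
    return (n / weights) * (cross / denom)


def interpret_i(i):
    """Apply the interpretation bands quoted in the text."""
    if i >= 0.6:
        return "positive spatial autocorrelation"
    if i <= -0.6:
        return "negative spatial autocorrelation"
    return "no spatial correlation"


# High rates concentrated in the left half of a 4 x 4 grid.
halves = [[1, 1, 0, 0] for _ in range(4)]
# Alternating high and low rates.
checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]

print(round(morans_i_queen(halves), 3), interpret_i(morans_i_queen(halves)))
print(round(morans_i_queen(checker), 3), interpret_i(morans_i_queen(checker)))
```

On grids this small, edge effects keep the coefficient well inside the ±0.6 bands even for strongly patterned data, so the toy values only illustrate the sign of the statistic.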
Whereas the R scale is a global measure of the spread or dispersion of disease incidence for point data based on nearest neighbor distance (Eck and Weisburd 1995; Krebs 1989; Taylor 1977), Moran's coefficient measures local spatial autocorrelation for area data (Getis and Ord 1992; Sawada 2001). A comparison of the power of disease clustering tests has been described by Song and Kulldorff (2003). For contextual analysis, histograms of the kernel data for 12 prototypical days were drawn to highlight variation in infection rates. Also, we replaced the means and SDs of the classed density data with their natural logarithms to accentuate the effect of change between near-zero values; we then graphed the values.
We also established a breakdown of disease occurrences by recognized clusters (e.g., SSEs) for contextual analysis. Three disease clusters each with > 30 observations were extracted: AMOY, Prince of Wales Hospital (PWH), and Lower Ngau Tau Kok Housing Estate (NTKLOW). These data were used to derive origin-and-destination (OD) plots or flow diagrams. Lines were drawn to connect an origin location where the flow started (e.g., index source of infection) with related destinations where the flow ended (e.g., residences of secondary contacts). The OD plots are an established methodology employed by transport professionals and human geographers to examine the extent of spatial interaction and human settlement, as well as the modeling of commodity flows (Batten and Boyce 1986). The flow data themselves can be people, goods, telecommunications, and so on. The lines help to delimit the spatial coverage, revealing the extent or degree of spread. SD ellipses centered on the geometric mean of all locations were drawn to provide a summary trend of the dispersion and to examine whether a distribution has a directional bias. The major axis is the direction of maximum spread of the point events, and the minor axis is the direction of minimum spread. All analyses were carried out using ArcGIS software and its extension modules (Environmental Systems Research Institute, Redlands, CA, USA).
Significance notes (from a table or figure legend): *p < 0.01, which indicates a tendency toward clumping of disease incidence. **p < 0.001, which implies that spatial autocorrelation exists and that similar values on the map tend to cluster together.

Results
Elementary analysis. Figure 1 illustrates geographic locations of SARS infection by residential address in Hong Kong. The size of the circle corresponds to the density of cases in a particular location.
There was clear clustering of cases in certain districts of the Kowloon peninsula (Kwun Tong, in which AMOY is located) and the New Territories (including Shatin, Tai Po), but Hong Kong Island was relatively spared. Table 2 supports this observation: most affected buildings or apartment blocks had very few cases, whereas seven buildings had > 10 SARS-affected patients.
Cluster analysis. A series of 12 kernel maps based on date of symptom onset and accounting for a 5-day incubation period of SARS is presented in Figure 2. Each kernel map shows the density of SARS patients adjusted for underlying population density (i.e., SARS infection rate per 1,000 population) on a prototypical day over 16 weeks, with darker zones emphasizing disease hot spots [see also daily animated series by Lai and Chan (2004)]. A few disease hot spots were shown to be developing in the Kowloon peninsula and southeast New Territories (i.e., Ma On Shan and Shatin) by 10 March, which was followed later by a heavy concentration at the AMOY by 28 March. By early April, the AMOY case load began to dissipate and a new hot spot emerged in Tai Po (northeast New Territories). There is clear evidence of varying degrees of clustering as the epidemic progressed over time based on the low R values. The low R values signify substantial degrees of clustering (significant at 99% confidence level), with higher degrees of clustering occurring around the peak of the infection and relatively small divergences from random distribution at the beginning of the outbreak. High Moran's I coefficients of ≥ 0.6 indicate that similar values tend to cluster together, which confirms the geospatial clustering and thus infectious nature of the disease, based on rates that were adjusted for the underlying population density. Figure 3 summarizes SARS hot spots in Hong Kong considering cumulative disease occurrences from February through June 2003. The map shows that the urban population was at higher risk of contracting SARS (Moran's I = 0.78, p < 0.001), having already accounted for variation in population density.
Contextual analysis. Daily histograms of the number of observations by 15 classes of infection rates, primarily composed of inverse J-shaped curves, show an increased concentration of SARS occurrences toward the end of March (Figure 4). Figure 5 is a logarithmic plot of the mean and SD of the infection rates of the 12 prototypical days representing different stages of the epidemic; values for individual days are presented in Table 3. Pairwise comparisons between each of the prototypical days and day 1 (or the day of indifference) of the epidemic demonstrated no detectable difference between the mean infection rates throughout the epidemic. However, there were statistically significant differences, by the F-test at a 0.01 significance level, in the SDs of the middle 10 prototypical days compared to day 1, suggesting unequal population variances during much of the outbreak. Higher F-values indicate more unequal variance. Given that the SD is a measure of geographic dispersion, we can infer that a larger SD signifies a wider spread of the disease over the territory. The crossover points of the mean and SD curves in Figure 5 indicate, on the one end, the beginning of substantial disease spread across the territory, and on the other end, the subsidence of the epidemic. Therefore, the time from day 1 (18 February) through day 16 (6 March) was the prodrome of the epidemic, whereas days 86 (15 May) through 106 (4 June) marked the declining phase of the outbreak. OD plots of disease clusters were obtained by linking patients' places of residence with the likely or probable locations of index cases or environmental sources of infection as defined through contact tracing by public health authorities (Figure 6).
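The variance comparison and logarithmic summary used in the contextual analysis can be sketched as below. The sample values are invented for illustration, the use of the sample (n − 1) variance is an assumption, and the function names are hypothetical. The sketch stops at the F statistic and its degrees of freedom; the comparison against a 0.01 critical value would come from an F table or a statistics package.

```python
import math
import statistics


def variance_ratio(day_rates, baseline_rates):
    """F statistic comparing a day's variance with the baseline day's.

    Returns (F, df_numerator, df_denominator). Uses sample variances,
    which is an assumption; the source does not say which estimator
    was used.
    """
    f_value = statistics.variance(day_rates) / statistics.variance(baseline_rates)
    return f_value, len(day_rates) - 1, len(baseline_rates) - 1


def log_summary(rates):
    """Natural logarithms of the mean and SD, as plotted on a log scale."""
    return math.log(statistics.mean(rates)), math.log(statistics.stdev(rates))


# Hypothetical infection rates per 1,000 population for two days.
day_1 = [1, 2, 3, 4, 5]
day_40 = [2, 4, 6, 8, 10]

f_value, df_num, df_den = variance_ratio(day_40, day_1)
print(f_value, df_num, df_den)
print(log_summary(day_1))
```

A ratio well above 1 means the later day's rates are more dispersed than the baseline, which the text reads as wider geographic spread.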
PWH is a tertiary teaching hospital and the site of the first SSE and nosocomial cluster in the Hong Kong epidemic, whereas AMOY and NTKLOW were subsequent community SSE clusters that had a strong putative environmental etiology (viz., sewage pipes, building design, and poor environmental hygiene) in addition to human-to-human transmission [Hong Kong Special Administrative Region (HKSAR) 2003; Wong and Hui 2004]. As would be expected because of a large patient catchment area, the PWH cluster was more geographically widespread (as supported by the SD ellipses in Figure 6D) compared with the AMOY cluster (Figure 6B), the sample size of which was one-third larger. The SD ellipses of the PWH cluster (Figure 6D) reveal a northwest-southeast directional trend of disease spread that extends over most of Hong Kong. The AMOY cluster was comparatively more localized, and the map had to be enlarged to show the SD ellipses, which exhibit an almost east-west directional trend of disease transmission (Figure 6B). The NTKLOW cluster (Figure 6F) was the least geographically widespread of the three SSEs, where the very compact spatial distribution must be magnified to visualize details of the SD ellipses. Figure 7 and Table 4 show low R scores (a measure to inform the extent of disease spread) indicating a high degree of clustering for all three SSEs. The R values were significant at the 0.001 level, confirming that the point patterns exhibited a tendency toward clustering. Figure 7 also shows that block E of AMOY (the epicenter of the AMOY SSE), with a lower R score, exhibited a more compact geospatial arrangement in SARS infection than did other apartment blocks within AMOY.
Visitors of ward 8A (the epicenter of the PWH SSE where the index patient of the cluster stayed) of the PWH were found to spread the disease farthest from its source of all the three clusters examined here, as would be expected for such a nosocomial outbreak at a tertiary referral hospital where SARS patients were densely aggregated on the ward but visiting relatives and friends returned home situated in different parts of Hong Kong (and not necessarily from the immediate surrounding neighborhood, given that the hospital is one of only two tertiary referral centers in the territory with a very wide catchment area). The NTKLOW cluster recorded the lowest R score, substantiating earlier observations from Figure 6.
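The SD ellipses used in the OD analysis summarize a cluster by a center, a major and minor axis, and an orientation. A minimal sketch follows, under two stated assumptions: the center is taken as the arithmetic mean of the coordinates, and the axes come from the principal axes of the coordinate covariance matrix. The ArcGIS tool used in the study may scale the axes differently, and the function name and sample coordinates are illustrative.

```python
import math


def sd_ellipse(points):
    """Standard deviational ellipse of a point cluster.

    points: list of (x, y) coordinates in metres (hypothetical data).
    Returns (centre, major_sd, minor_sd, angle_deg), where angle_deg is
    the orientation of the major axis, counterclockwise from the x-axis
    (east). Population (divide-by-n) moments are used, an assumption.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Eigenvalues of the 2 x 2 covariance matrix give the variances
    # along the directions of maximum and minimum spread.
    half_trace = (sxx + syy) / 2.0
    radius = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    major_sd = math.sqrt(half_trace + radius)
    minor_sd = math.sqrt(max(0.0, half_trace - radius))
    angle_deg = math.degrees(0.5 * math.atan2(2.0 * sxy, sxx - syy))
    return (mx, my), major_sd, minor_sd, angle_deg


# Residences stretched along an east-west line.
east_west = [(-2.0, 0.0), (2.0, 0.0), (0.0, -1.0), (0.0, 1.0)]
# Residences stretched along a southwest-northeast diagonal.
diagonal = [(-1.0, -1.0), (1.0, 1.0), (-2.0, -2.0), (2.0, 2.0)]

print(sd_ellipse(east_west))
print(sd_ellipse(diagonal))
```

A large gap between the major and minor SD signals a directional bias, and the angle gives its bearing, which is the information the text reads off the ellipses for the PWH and AMOY clusters.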

Discussion
Our findings show that GIS methods can be usefully employed during an acute infectious disease outbreak to reveal new geospatial information in addition to standard field epidemiologic analyses. This mapping and cartographic technique can provide visual display of information in both space and time simultaneously. When applied in real time during the onset and evolution of an epidemic, it can monitor and enhance understanding of the transmission dynamics of an infectious agent, thereby facilitating the design, implementation, and evaluation of potential intervention strategies. GIS can offer quantitative and statistical measures along with visualization tools to examine patterns of disease spread with respect to disease clusters. Disease mapping is a first step toward understanding spatial aspects of health-related problems, as particular kinds of information are highlighted in maps. Various cartographic symbolizations (as points, lines, or areal patterns) can show the distribution of diseases. Disease clusters and other associations can then be deduced statistically and visually after examining the disease maps. In Chomsky's (1965) terms, analyses at the first two levels concern the surface structure of an event, whereas the third level seeks to extract deep structure information. Surface structure information is simple and immediately perceptible to a user, whereas deep structure information is content-specific knowledge needed for problem solving (Nyerges 1991).
In the case of SARS in Hong Kong, our study, first and foremost, demonstrates exceptional spatial clustering of the cases. The kernel method adjusted for population density provided a means of highlighting population at risk, whereas the use of R values and Moran's coefficients in conjunction with map displays enhanced the analytical context of the point pattern distributions. In fact, such geospatial intelligence gathered from examining statistical surfaces and disease clusters provided the basis for the formulation of our transmission dynamics model. More specifically, choice of a suitable framework was not straightforward in constructing the transmission dynamics model where a variety of approaches were possible, ranging from a simple deterministic compartmental approach to a spatially explicit, individual-based simulation. Given the data available for Hong Kong, we based our analyses on a stochastic metapopulation compartmental model. A metapopulation approach was appropriate because the incidence of SARS varied substantially by geographical district, as the GIS analyses have shown. Second, the simultaneous geospatial-temporal approach to modeling the SARS outbreak revealed complementary additional information that would otherwise not be available from the traditional epidemic curve method (a standard public health outbreak investigative approach) in identifying the mode of spread. The daily animated series of kernel maps clearly shows that SARS was a highly localized disease; thus, its route of transmission was unlikely to be through casual contact, as it is for influenza and measles, but more compatible with close contact via heavy respiratory droplets and fomites.
This confirms that SARS is only a moderately transmissible condition with a basic reproduction number of about 3, in contrast to measles and influenza, which have basic reproduction numbers of about 13 and 5, respectively (Anderson and May 1991; Ferguson et al. 2003). An alternative interpretation of the observed high degree of geospatial clustering would be that SARS was due to an environmental point source outbreak. Indeed, faulty sewage systems combined with the "chimney effect" is the leading hypothesis explaining the AMOY SSE (especially block E), although some have suggested roof rats as a vector (Ng 2004). Although it is difficult to gauge retrospectively, had the GIS we implemented in this report been available for near real-time analysis, it would likely have detected the highly unusual clustering of cases in SSEs such as the PWH and AMOY outbreaks much sooner, as they evolved. This in turn could have resulted in more rapid contact tracing and public health intervention, thus perhaps mitigating the extent of spread substantially in the case of person-to-person transmission events and preventing further large-scale environmental point source outbreaks in residential apartment blocks (although it would not have made a difference to AMOY itself given the temporally abrupt and short-lived environmental release of viral particles).

[Figure residue: axis label "Daily incident number of SARS cases"; legend entries "SD" and "Mean".]
Third, contextual analysis of mean and SD values of different density classes, particularly after logarithmic transformation to accentuate near-zero values on a graph, provided a geographic approach to estimating the beginning and subsidence of a large degree of spread of SARS in the community. This is a useful adjunct to the usual biomathematical modeling approach using reproductive numbers at different points in time, representing the average number of infections, excluding SSEs, caused by infected individuals in successive generations at time t throughout the SARS epidemic. Fourth, the SD ellipses from the OD analysis, coupled with complementary results from R and Moran's I values, yielded information on the direction of spread in a disease cluster that can be used to inform contact tracing and the design of quarantine measures. In the case of SARS in China, where entire residential districts were cordoned off for weeks at the height of the outbreak, the selection of such districts for quarantine could have been better informed by these ellipses indicating directional bias and associated physical distance in disease transmission.
There are, however, limitations and caveats to the GIS technique in infectious disease epidemiology and outbreak investigation. Howe (1963) argued that mapping of diseases tended to expose the "where" but not "why there" of the outbreak. Nevertheless, elementary descriptive analysis as an output of disease mapping can be a source of new leads for further exploratory analyses. Map patterns can provide stimuli for generating hypotheses of disease causation (Lloyd and Yu 1994;McKee et al. 2000). Moreover, newer developments that complement traditional mapping functions such as cluster and contextual analyses can be very useful adjunct investigative tools in outbreak control, as our example on SARS in Hong Kong has highlighted.
The completeness and availability of necessary data are another area of potential concern where conventional field epidemiologic data collection forms rarely contain the full range of variables that are required in a GIS analysis. Data consistency and, in particular, the nonstandardization of patient address formats is one such example. Field epidemiologists often relegate certain personal particulars such as residential and work addresses to a lower priority in their data collection procedures, or at least enter the information in a haphazard fashion, rendering GIS analysis very difficult by diminishing the proportion of usable cases for analyses. Similar generic problems that plague the establishment of all information systems must be resolved to enable real-time disease monitoring and surveillance. They include lack of standardization for data capture documents, procedures and protocols for information management, delays in transferring and updating information, and a lack of rapid analysis and audit of databases. The SARS epidemic is a clear signal that Hong Kong needs much greater and sustained investment in health informatics, that is, public health information systems, the skills to use them, and networks to share them.
In summary, integration of GIS technology into routine field epidemiologic surveillance can offer a scientifically rigorous and quantitative method for identification of unusual disease patterns in real time, as our example of SARS has shown. Its potential can be synergistically maximized when linked with clinical databases collecting data at the point of care across the whole population as well as environmental data sources (e.g., meteorologic, transportation, topographical information) to rapidly recognize, locate, and monitor disease outbreaks.