AFTER ITS INCEPTION IN the early part of the last century1 and its early development by Jakob Moreno,2 social network analysis was focused on the small group—how persons connect to each other and how dyadic interactions are influenced by the surrounding social context. An extensive sociological and theoretical base3 developed in parallel with considerable mathematical development in graph theory.4–6 A basic question for the field was perhaps best stated by Granovetter in 1973: “… A fundamental weakness of current sociological theory is that it does not relate micro-level interactions to macro-level patterns in any convincing way … how interactions in small groups aggregate to form large-scale patterns eludes us … ”.7 He suggested that an understanding of social networks provided the conduit between small- and large-scale patterns.
Thus, the field of social network analysis was primed to explore several extraordinary events that unfolded in the 1980s. The advent of widespread infection with human immunodeficiency virus (HIV) occurred at virtually the same moment as the introduction of widespread personal computing. The availability of technological tools has supported the theoretical, simulation, and empirical efforts to understand the transmission dynamics of infection with HIV, as well as other sexually transmitted diseases (STDs) and blood-borne infections (BBIs). By the mid-1990s, an arcane technical tool whose origin dated to the 1960s (the ARPAnet)8 became available for general use. The result has been the development of social interaction on a previously undreamed-of scale (the World Wide Web is currently estimated to have in excess of 2 billion web pages and to contain approximately 170 terabytes [170 × 10^12 bytes] of information).9 The challenge of understanding the dynamics of network growth and the properties of huge networks has stimulated considerable recent theoretical development, and a reconsideration of tools for network analysis. Several concepts were established well before the World Wide Web, and all have been illuminated by the voluminous data now available (see Ref. 10 for a recent summary): the distribution of the number of contacts per network node (degree distribution); the length of the shortest path between nodes (small world phenomena); the mechanisms for network growth (preferential attachment; component formation); assortative mixing (the extent to which like connects to like); and transitivity (the extent to which 2 people or nodes that connect to a third also connect to each other).
How insights about large networks apply to the smaller networks in which disease transmission takes place is a potentially fertile area, both theoretically and empirically, since there may or may not be cognates between large and small. Morris11 has postulated that relatively simple rules for personal behavior will, when taken in the aggregate, lead to particular network configurations that can be related to disease transmission. In particular, she cites the importance of mixing patterns and partner sequencing. The former refers to the choice people make with regard to the age, sex, ethnicity, and other characteristics of their partners, and in particular whether they choose partners that they resemble (assortative mixing—for example, highly sexually active people choose others who are highly active) or partners who are significantly different (disassortative mixing—for example, highly active CSW and their paying partners). Sequencing speaks to the timing of sexual partnerships. Those who partner without temporal overlap are termed serially monogamous. Those who have sex with 2 or more people in the same time interval have concurrent partnerships. Many of the theoretical approaches that deal with the large networks use a “top-down” approach: the simulated network is constructed based on axioms and algorithms for connection.12,13 In contrast, Morris’ “bottom-up” approach hypothesizes that network structure emerges predictably from the choices people make. Newer statistical methods provide tools for empirically based simulation that can model transmission in networks formed by such choices.14 In this initial descriptive study, we combine data from studies of disease transmission to examine differences and similarities in network structure and to offer several hypotheses about what local rules may be associated with which global properties.
We requested lead investigators (DB, SF, AJ, CL, JJP, RT, The Network Study Consortium) on 15 completed network studies to contribute data, documentation, and study expertise to the process of joint analysis (Table 1). These studies were the product of individual investigator initiatives, without a priori planning to provide a uniform data set or a coordinated design to evaluate sampling methods or subject selection. Nonetheless, though individual survey questions and definitions varied, there was sufficient commonality in the investigators’ approaches to permit the construction of a uniform data set.
The basic unit of analysis was the dyadic observation—information about a respondent, a contact, and their interaction. Each observation was identified by the study that contributed it, the unique identification number of the respondent, the unique identification number of the contact, the interview sequence number (for those studies that reinterviewed respondents), and interview date. Considerable effort was invested in deduplication of persons in each study, either during the study itself, or ex post facto using a computer algorithm. Thus, in the final data set, all individuals had a unique identifier, which they retained whenever they appeared as either respondent or contact within a particular dyad. The basic information set included the age, sex, race, ethnicity, behavior, sexual orientation (that is, MSM or not), and education level of the respondent and of the contact in the dyad. In addition, diagnostic information (different for different studies) with dates was available for the respondent and, in some instances, for the contact as well. Finally, the types of interaction within the dyad were enumerated (sexual, drug using, needle sharing, and social). Where available, the dates of earliest and most recent sexual or drug encounter were also included.
The critical step in sociometric analysis is deduplication, that is, unique identification of individuals, whatever their nodal position. Without unique identification, structure and dynamics in networks are difficult, if not impossible to assess. In the majority of the networks included in this analysis, deduplication was performed “on the fly” by field staff and supervisors. Techniques included case-by-case evaluation, data exchange by field staff, development of name lists, post hoc confirmation based on subsequent interviews, special field work, and, in some instances, some simple matching based on several variables. In one instance (Baltimore), the construction of the study was dictated by limits on data collection placed by the local IRB. For purposes of deduplication, a complex algorithm developed by one of us (SQM) was applied, using all available personally identifying information including first and last name, nickname(s), date of birth or age, race/ethnicity, gender, address (or location), and phone numbers. This computer-assisted matching was followed by manual confirmation of potential matches (sometimes with benefit of prior program knowledge), and final cross-checking of results (details provided on request).
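As a minimal sketch of the dyadic observation described above, the record below holds the identifying fields and the interaction types, and a helper counts distinct partners per deduplicated person. The class name `Dyad`, its field names, and the helper `degree_by_person` are illustrative assumptions; they are not taken from the studies' actual data dictionaries.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Dyad:
    """One respondent-contact observation (hypothetical field names)."""
    study: str
    respondent_id: str           # unique within a study after deduplication
    contact_id: str              # same identifier space as respondent_id
    interview_seq: int           # 1 for the first interview, 2 for a reinterview, ...
    interview_date: Optional[date] = None
    sexual: bool = False
    drug_using: bool = False
    needle_sharing: bool = False
    social: bool = False


def degree_by_person(dyads):
    """Count distinct partners per (study, person).

    A pair reported more than once (by both members, or at a reinterview)
    is counted once, which is only valid if identifiers are deduplicated.
    """
    partners = defaultdict(set)
    for d in dyads:
        partners[(d.study, d.respondent_id)].add(d.contact_id)
        partners[(d.study, d.contact_id)].add(d.respondent_id)
    return {person: len(contacts) for person, contacts in partners.items()}
```

Keying on the pair (study, person) reflects the statement that identifiers are unique within a particular study; matching persons across studies would need a separate step.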
Calculation of Network Properties
We focused on network properties that have been applied in recent years to the large networks (e.g., the World Wide Web) and databases (e.g., scientific collaboration networks) that are now amenable to computer analysis. Specifically, we examined degree distribution, component structure, small world characteristics, concurrency, assortativity, and transitivity. These were calculated using either each entire data set or the largest connected component from each data set, as appropriate.
The degree distribution is a plot of the frequency of the number of persons who have degree k (that is, who have k contacts), where k varies from 1 to the maximum number in the group. We examined the log–log plot of the cumulative probability distribution for k: the log of the cumulative probability of k on the y-axis against log k on the x-axis. This procedure permitted both visual and statistical assessment of whether the degree distribution had a long tail to the right, and whether it was a good fit (a straight line) to a power curve ($P(x) = a x^{-(b-1)}$). We used the method suggested by Newman15 to extract the power law coefficient from the data ($b = 1 + n\left[\sum_i \ln(x_i/x_{\min})\right]^{-1}$, where b is the coefficient, n is the number of observations with degree at least $x_{\min}$, $x_i$ is the degree of observation i, and $x_{\min}$ is the minimum degree included in the fit). In accordance with his discussion, we used data in each case from the linear-appearing tail of the degree distribution so that $x_{\min}$ was similar but not identical for each of the studies. We examined degree distributions for the subset of dyads in which both persons were interviewed and for all respondent-contact pairs, using the interviewing interval particular to each study (usually less than 1 year).
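The estimator described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the analysis: it applies the continuous form of Newman's maximum likelihood estimator, with the natural logarithm inside the sum, and the function names `power_law_exponent` and `cumulative_distribution` are hypothetical.

```python
import math


def power_law_exponent(degrees, x_min):
    """Estimate b = 1 + n / sum(ln(x_i / x_min)) from degrees >= x_min."""
    tail = [x for x in degrees if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("all tail observations equal x_min; exponent undefined")
    return 1.0 + len(tail) / log_sum


def cumulative_distribution(degrees):
    """Return (k, P(degree >= k)) pairs, suitable for a log-log plot."""
    n = len(degrees)
    return [(k, sum(1 for d in degrees if d >= k) / n)
            for k in sorted(set(degrees))]
```

Because the estimate depends on which observations are admitted to the tail, it is worth recomputing it over a few candidate values of `x_min` before settling on one.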
We calculated component structure and small world characteristics using Ucinet-6.16 We computed the number and size of connected components and the proportion of nodes in the largest component, and compared the size of the largest and second largest components. We assessed small world phenomena from the mean geodesic (a geodesic being the shortest distance between 2 nodes in a connected component where all nodes are reachable) and the diameter of the network (the longest shortest distance between 2 nodes) for the largest and second largest connected component in each network. To correct for the effect of network size on the shortest distance between 2 nodes, we computed a corrected mean geodesic by dividing the mean geodesic by the $\log_{10}$ of the network size. Though this number has no physical meaning (the geodesic is the actual distance), it provides some comparability for assessment since it “prorates” the mean geodesic by the order of magnitude of the network (all things being equal, larger networks can be expected to produce greater distances between people). As evidence of small-world effect, we compared the relationship of the (uncorrected) mean geodesic for each study with its sample size and with the log of its sample size.
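For readers without access to Ucinet, the component and geodesic measures above can be approximated with a breadth-first search. The sketch below is not the Ucinet procedure; it assumes an undirected, unweighted network, takes "network size" to be the number of nodes in the component being measured, and uses hypothetical function names.

```python
import math
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def components(adj):
    """Return connected components as sets of nodes, largest first."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        comps.append(comp)
    return sorted(comps, key=len, reverse=True)


def geodesic_summary(adj, comp):
    """Return (mean geodesic, diameter, mean geodesic / log10(size))."""
    if len(comp) < 2:
        raise ValueError("component needs at least 2 nodes")
    total, pairs, diameter = 0, 0, 0
    for source in comp:
        # Breadth-first search gives shortest path lengths from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for target, d in dist.items():
            if target != source:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    mean = total / pairs
    return mean, diameter, mean / math.log10(len(comp))
```

Running a search from every node is quadratic in component size, which is acceptable for networks of the size considered here but would need a sampled estimate for very large graphs.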
Concurrency, as described by Watts and May17 and developed by Morris and Kretzschmar,18–20 is a person’s practice of having more than 1 partnership (usually sexual) during a given time period. It is generally contrasted with serial monogamy (having successive partners who do not overlap in time) and may be thought of as an open triad. The Morris–Kretzschmar κ is a measure of concurrency that can be calculated from the mean $\mu$ and variance $\sigma^2$ of degree in a network ($\kappa = \sigma^2/\mu + \mu - 1$). It is formally defined as the number of concurrent partnerships per partnership.
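The calculation of κ from a list of degrees can be sketched as below. The text does not say whether the population or the sample variance was used, so the population variance here is an assumption, as is the function name `concurrency_kappa`.

```python
def concurrency_kappa(degrees):
    """Morris-Kretzschmar kappa = variance / mean + mean - 1.

    Uses the population variance (divide by n); this choice is an
    assumption, since the source does not specify it.
    """
    n = len(degrees)
    if n == 0:
        raise ValueError("no degrees supplied")
    mu = sum(degrees) / n
    if mu == 0:
        raise ValueError("mean degree is zero; kappa undefined")
    variance = sum((d - mu) ** 2 for d in degrees) / n
    return variance / mu + mu - 1
```

As a check on the definition, a network made up only of isolated couples (every degree equal to 1) gives κ = 0, that is, no concurrent partnerships per partnership.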
Assortativity is the propensity of persons with similar characteristics to associate with each other (disassortative mixing implies the association of persons with dissimilar characteristics). A common example of assortative mixing is the preferential association of persons of the same ethnicity. Disassortative mixing is demonstrated by the common phenomenon of young MSM associating with older MSM. We used Newman’s measure of assortativity, $A = (\mathrm{Tr}(e) - \lVert e^2 \rVert)/(1 - \lVert e^2 \rVert)$, where e is the network mixing matrix with entries $e_{ij}$, Tr(e) is the trace of the matrix, and $\lVert e^2 \rVert$ is the sum of all entries of the matrix $e^2$.21 The subscripts refer to categories of the variable being assessed (e.g., age groups) and the entries are the fraction of connections in each cell. To assess degree assortativity we compared only those dyads for which we had contact information on both members, and used degree categories of 1–2, 3–6, 7–13, and 14–66. For purposes of stratification by degree (that is, to assess assortativity for age and sex within degree categories) we used: degree = 1, 2, 3–4, 5–9, 10–66. For age assortativity, we used 5-year age groups from 15 to 49. For ethnicity, we classified persons as white, black, Hispanic, or other. Finally, we compared the assortativity results using information from egocentric interviews only with information obtained from sociometric analysis (that is, incorporating the additional connections identified from the network sociograms). We did not assess assortative mixing by sex for sexual contact since it is wholly determined by the mix of sexual orientations within the network. An alternative approach, suggested by Morris,22 that provides a log-linear framework for assessing the statistical significance of overall homophily or variable-specific homophily, was not attempted here.
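A minimal sketch of the assortativity coefficient follows. It reads ∥e²∥ as the sum of all entries of the matrix product e·e, as in Newman's formulation, which equals the sum over categories of (row total × column total). The function name `assortativity` and the input layout (a square table of dyad counts by category) are assumptions for illustration.

```python
def assortativity(counts):
    """Newman's coefficient (Tr(e) - ||e^2||) / (1 - ||e^2||).

    counts[i][j] is the number of dyads joining category i to category j;
    the table is rescaled so that its entries sum to 1.
    """
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("empty mixing table")
    n = len(counts)
    e = [[c / total for c in row] for row in counts]
    trace = sum(e[i][i] for i in range(n))
    row_totals = [sum(e[i]) for i in range(n)]
    col_totals = [sum(e[i][j] for i in range(n)) for j in range(n)]
    # Sum of all entries of e @ e, written as sum_i (row_i * col_i).
    e2_norm = sum(a * b for a, b in zip(row_totals, col_totals))
    if e2_norm == 1:
        raise ValueError("all dyads fall in one category; coefficient undefined")
    return (trace - e2_norm) / (1 - e2_norm)
```

The coefficient is 1 when every dyad joins two persons in the same category, 0 when partners are chosen without regard to category, and negative when unlike categories pair preferentially.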
Transitivity, also known as clustering, is a measure of the proportion of “closed” triads in a network. That is, if A relates to both B and C, a closed or transitive triad would be present if B and C relate to each other as well. We used the procedure “Transitivity” available in Ucinet-6 (Ref. 16) to calculate the proportion of closed triads in a network. Transitivity was calculated for the largest and second largest component in each study, separately for the entire group, for sexual contacts, and for needle-sharing contacts.
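The proportion of closed triads can be sketched as below. This is one common definition (closed connected triples divided by all connected triples) and may differ in detail from the Ucinet “Transitivity” procedure, which offers several variants; the function name and the adjacency-map input are assumptions.

```python
from itertools import combinations


def transitivity(adj):
    """Fraction of connected triples (B - A - C) in which B and C are also tied.

    adj maps each node to the set of its neighbors in an undirected network.
    """
    closed = 0
    triples = 0
    for center, neighbors in adj.items():
        for u, v in combinations(neighbors, 2):
            triples += 1
            if v in adj[u]:
                closed += 1
    return closed / triples if triples else 0.0
```

In a network where every tie joins members of two distinct groups (for example, exclusively heterosexual contact), no triple can close, so this function returns 0.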
In all 15 studies, the visual display of degree distribution suggested a long tail to the right. For the subset in which both partners in the dyad were interviewed, 9 of 14 had power law exponents between 2.0 and 3.0 (Table 2, column labeled “Interviewees only”). Two of the exceptions (Bushwick, 1.967 and Chlamydia, 1.973) were borderline. The other 3 (Flagstaff, 4.744; Houston, 3.160; Manitoba, 1.626) were well outside the “scale-free” range. Using the degree distribution for all contacts (Table 2, column labeled “All contacts”), 14 of the 15 studies had exponents between 2.0 and 3.0. The only exception, Flagstaff, had an exponent of 4.744. When dyads from all 15 studies were combined, considerable smoothing became apparent (Fig. 1), and the power law curve for the straight-line portion of the distribution had a coefficient of 2.077.
The predominant component structure consisted of a single, large component with multiple smaller ones (Table 3), but this structure varied with the underlying sampling scheme. Flagstaff and Atlanta used a chain link sampling design that produced a single connected component. The scheme was replicated once in Flagstaff and in 3 separate neighborhoods in Atlanta. The latter, therefore, had only 3 components of approximately equal size. The Urban2 project in Atlanta used a snowball design that produced a single connected component. For the studies that used contact tracing procedures to identify contacts, the component structure reflected the extent of interrelatedness of the subjects. The Rockdale study uncovered a single, highly connected group. The PPNG study performed contact tracing on a highly interconnected group, producing a large connected component that contained 90% of the subjects. Manitoba, Chlamydia (Colorado Springs), GC1981 (Colorado Springs), HIV (Colorado Springs), and Syphilis (Atlanta) used contact tracing procedures on more diffuse groups, and though each had a large connected component, it constituted a smaller proportion of the total group. Bushwick, Houston, and Project 90 (Colorado Springs) all used a purposive chain-linked design and each produced a dominant connected component, substantially larger than the second largest component. The Antiviral (Atlanta) and Baltimore projects recruited individuals, used egocentric interviews, and attempted some post hoc sociometric analysis. Their largest connected components were relatively small, and constituted a small proportion of the total population.
All of the networks examined exhibited small world characteristics (Table 4). Visual inspection of the data makes it clear that larger networks have larger mean geodesic paths, a phenomenon readily explained by the constraint on path length imposed by small network size. The corrected mean geodesic reduces the length of all the geodesics, but proportionally more for those in the larger networks, and demonstrates that the corrected mean geodesics for all 15 studies were similar (mean, 2.45, 95% C.I. 0.85–4.05, range 1.3–4.3). Thus, though there appears to be considerable variability in the mean length of geodesics in these 15 studies, with several that are greater than 9.0, such variability may be a distortion induced by variable network size. If mean geodesic is plotted against network size, the $R^2$ is 0.15. When plotted against the log of network size, the $R^2$ is 0.60.
Respondents in most of these networks reported concurrent relationships for both sexual activity and, in those networks with injection drug use, for needle-sharing as well (Table 5). In general, those networks ascertained through contact tracing (for example, GC1981 and Manitoba, which used unmodified STD program approaches to finding contacts) tended to have lower levels of concurrency. Those obtained through purposive designs (for example, Project 90, Houston) had higher levels. There were several important exceptions, however. Most notably, the highest level of sexual concurrency was observed as a result of contact tracing among the 98 persons who constituted the Rockdale network, a group of teenagers who were involved in frequent parties with individual and group sex.23 Paradoxically, the highest level of needle-sharing concurrency was observed in the Project 90 network, which was drawn from an area with low HIV prevalence.
On average, networks demonstrated lower levels of assortative mixing by degree (32%) and by age (28%) than by ethnicity (45%). These average values hide considerable diversity among the networks, however (Table 6). Stratification by degree revealed no systematic differences in assortativity by age and ethnicity, suggesting that a person’s level of activity may be determined independently of his or her choice of partner (data not shown). Sociometric and egocentric assessments of assortativity produced almost identical results (variations of less than 1%), lending credence to the use of egocentric information for estimating assortativity (data not shown).
Unlike large networks such as the World Wide Web, wherein transitivity is highly prevalent, transitivity is much less common in sexual networks, in part because it cannot exist, by definition, in heterosexual relationships. Transitivity can be observed when social or drug-using (noninjecting) relationships are included (Table 6, “All contacts,” largest component), but its presence in sexual networks implies some degree of same-sex activity, and such transitivity was present in only 5 of the 15 networks examined here. On the other hand, needle sharing among 3 persons is fairly common, and transitivity was as high as 26% in 1 network (Table 6, “Houston,” second largest component).
Perhaps the clearest feature to emerge from this consideration of 15 completed network studies is their variability. Because their primary research questions varied as well, it is difficult to associate specific network properties with specific network outcomes, such as the incidence or prevalence of HIV or other STDs and BBIs. Some general hypotheses about network formation can be extracted from these data, however, most notably a sense of “fixed” factors and variable factors.
The term “fixed” is used advisedly in this context because results are not uniform and invariant, but certain patterns emerge nonetheless. With only a few exceptions, the long right tail of the degree distribution in these studies may be fitted by a power law curve with an exponent between 2.0 and 3.0, the region for which the distribution is scale-free. If all the dyads from these studies are combined and the low degree persons removed, there is a clear straight-line relationship between the log (cumulative probability of degree) and the log (degree), with an exponent of 2.077, also in the scale-free distribution range (Fig. 1). On the other hand, several of the distributions (Flagstaff, Manitoba) are well outside this range, and the left side of many of these distributions (the portion not included in the calculation of the exponent) tended to be bumpy and irregular. It is likely that small network size permits substantial variation from what may be the basic underlying pattern.
The scale-free characteristic is of importance because it describes a portion of the curve for which the variance is theoretically infinite and produces a network with no epidemic threshold.24 Several large scale studies have demonstrated such skewness. Morris used it to account for the discrepancy between mean partnerships reported by men and women but did not assess the characteristics of the distribution per se.25 Liljeros et al. examined the results of a Swedish national sex survey and were able to fit a scale-free distribution with an exponent of 2.54 for women and 2.31 for men. Similarly, Schneeberger examined 4 populations from large studies in Africa and England and found a similar scale-free distribution.26 Jones and Handcock24 challenged this approach for a number of statistical and operational reasons: sequential correlation of data points; heteroscedasticity along the power curve; exclusion of small values in order to make the curve fit; sensitivity to misreporting in the high-degree portion of the curve. They applied a maximum likelihood approach to similar data and found that all but 1 of 6 distributions fit a power law curve with the exponent in the 2–3 range. Liljeros et al.27 point out that because these measures apply to finite populations, the issue of infinite variance is moot, and the ratio of variance to mean may be more important in determining real world epidemic thresholds. At the very least, a “fat tail” to the right appears to be a constant and important feature of these small transmitting networks, and a direct correspondence with the phenomena observed in large networks may be of lesser importance.
However these concerns are ultimately resolved, the power law fit to degree distribution (or at least, the long tail to the right) is a critical element in the development of a large component within a network. In the giant social networks, the mathematical basis for the development of a large component from a random graph (Poisson) distribution has been explored.28 It is of considerable interest that most of the networks in the compilation have a single large component, and many smaller fragments. Study design clearly modifies the relative size of the largest component (from comparatively small in routine contact tracing data conducted in isolated communities [Manitoba] to a single giant component in an outbreak investigation of a closely knit group [Rockdale]). But the existence of a large component in most cases suggests that the underlying degree distribution has a predictable connection to the formation of such a group. In addition, a right-skewed degree distribution will produce nodes of high centrality that act as “short circuits” within a network and produce small world phenomena. In these data, the shortest paths and the diameters of the network are both evidence of small world characteristics, particularly when prorated for the overall size of the connected component in which they are being measured.
The right-skewed degree distribution, large component, and small world features may be thought of as the infrastructure for disease transmission. In keeping with Morris’ concept of local choices, individuals will decide on the number of sex partners they wish to have, and this number will have a distribution. In community situations of relatively monogamous (or serially monogamous) behavior, persons with high degree will be underrepresented or perhaps even absent, and the predominant network picture will be one of dyads, with occasional triads or longer dendritic connections. In situations such as those that predominate among groups in which significant transmission has been detected, the degree distribution will include a long tail to the right, providing opportunities for the formation of large components and short circuits. Thus, the personal (“local”) choices on level of sexual activity within a community can lead to the network substrate for transmission. Specific behavioral acts, with differing transmission probabilities for HIV, STDs or BBIs, may also be thought of as personal choices that are part of the infrastructure for transmission. Though not considered here because of data limitations, we have proposed elsewhere that geographic distance is another of the local choices that may determine disease propagation.29
The actual amount of transmission may then be determined by other local choices, and the variability of such choices is reflected in these data. Concurrency, assortativity, and transitivity—which encompass the notions of partner sequencing and mixing that Morris describes11—are individual choices that, when taken in the aggregate, “fill in” the network structure and can provide the epidemiologic basis for transmission intensity. Concurrency and transitivity are obvious mechanisms for the amplification of transmission, as are higher-order structures in a network. Assortativeness will play a major role in determining the disease prevalence that a particular group may face. Other choices may be available as well, but those described here may be especially important as the conduits from micro- to macro-social phenomena.7 Taken together, fixed and variable factors are rooted in local choices and produce a global picture.
Perhaps the primary public health impact, at this point, is the ability to recognize networks in which substantial transmission can take place. Techniques for rapid network assessment have not been a prime focus of network research, but they can easily be extracted from the long experience with partner notification that is embedded in STD/HIV control programs. Rapid identification is a forerunner, in turn, to the use of networks as an element for disease surveillance, both to monitor known sites of transmission and detect new ones. Such applications will also serve as an empirical basis for detection of other network configurations that support transmission.
The data from these studies do not provide direct answers to the relationship of these network phenomena to disease transmission, but they do speak to the dynamics, and provide research agendas for further empirical, theoretical, and blended investigation. Blending is probably the critical approach, since empirical studies have certain important limitations. Interestingly, as this group of investigations demonstrates, one of the limitations is not our ability to get information from individuals about highly personal activity. Rather, the number of replications and the quantity of data required for a comprehensive, planned approach is daunting. Similarly, theoretical development in the absence of empirical verification is problematic. Blending these approaches is the current creative challenge in network research.
1. Freeman LC. Some antecedents of social network analysis. Connections 1996; 19:39–42.
2. Moreno JL. Who Shall Survive? Foundations of Sociometry, Group Psychotherapy, and Sociodrama. Washington, DC: Nervous and Mental Disease Publishing; 1934.
3. Pool IDS, Kochen M. Contacts and influence. Soc Networks 1978; 1:5–51.
4. Erdos P, Renyi A. On the evolution of random graphs. Publications of the mathematical institute of the Hungarian Academy of Sciences 1960; 5:17–61.
5. Frank O. A survey of statistical methods for graph analysis. In: Leinhardt SL, ed. Sociological Methodology. San Francisco, CA: Jossey-Bass; 1981.
6. Rapoport A. A probabilistic approach to networks. Soc Networks 1979; 2:1–18.
7. Granovetter MS. The strength of weak ties. Am J Sociol 1973; 78:1360–1380.
8. Moschovitis CJP, Poole H, Schuyler T, et al. History of the Internet. The Moschovitis Group. 1999. Available at: http://www.historyoftheinternet.com/index.html
9. Lyman P, Varian HR. How much information 2003. School of Information and System Management, University of California at Berkeley. Available at: http://www.sims.berkeley.edu/research/projects/how-much-info-2003/. Accessed October 27, 2003.
10. Newman MEJ. The structure and function of complex networks. SIAM Rev 2003; 45:167–256.
11. Morris M. Local rules and global properties: Modeling the emergence of network structure. In: Breiger R, Carley K, Pattison P, eds. Dynamic Social Network Modeling and Analysis. Washington, DC: National Academy Press; 2003.
12. Barabasi AL, Albert R. Emergence of scaling in random networks. Science 1999; 286:509–512.
13. Pastor-Satorras R, Vespignani A. Epidemic spreading in scale-free networks. Phys Rev Lett 2002; 86:3200–3203.
14. Snijders TAB, Pattison P, Robins G, et al. New specifications for exponential random graph models. Center for Statistics in the Social Sciences, University of Washington, Seattle, WA. Available at: http://www.csss.washington.edu/Papers/wp42.pdf. Accessed April 23, 2004.
15. Newman MEJ. Power laws, Pareto distributions and Zipf’s law. arXiv:cond-mat/0412004v2. Available at: http://www.arXiv.org. Accessed January 9, 2005.
16. Borgatti SP, Everett MG, Freeman LC. UCINET for Windows: Software for Social Network Analysis. Natick, MA: Analytic Technologies; 2002.
17. Watts CH, May RM. The influence of concurrent partnerships on the dynamics of HIV/AIDS. Math Biosci 1992; 108:89–104.
18. Kretzschmar M, Morris M. Measures of concurrency in networks and the spread of infectious disease. Math Biosci 1996; 133:165–195.
19. Morris M, Kretzschmar M. Concurrent partnerships and transmission dynamics in networks. Soc Networks 1995; 17:299–318.
20. Morris M, Kretzschmar M. Concurrent partnerships and the spread of HIV. AIDS 1997; 11:641–648.
21. Newman MEJ. Assortative mixing in networks. arXiv:cond-mat/0205405 v1. Available at: http://www.arXiv.org. Accessed May 20, 2002.
22. Morris M. A log-linear modeling framework for selective mixing. Math Biosci 1991; 107:349–377.
23. Rothenberg RB, Sterk C, Toomey KE, et al. Using social network and ethnographic tools to evaluate syphilis transmission. Sex Transm Dis 1998; 25:154–160.
24. Jones JH, Handcock MS. An assessment of preferential attachment as a mechanism for human sexual network formation. Proc R Soc London B 2002; 270:1123–1128.
25. Morris M. Telling tails explain the discrepancy in sexual partner reports. Nature 1994; 365:437–440.
26. Schneeberger A. Scale-free Networks and Sexually Transmitted Disease: A Description of Observed Patterns of Sexual Contacts in Britain and Zimbabwe. IMA Workshop 3: Networks and the Population Dynamics of Disease Transmission November 17–21, 2003. Minneapolis, MN; 2003. Available at: http://www.ima.umn.edu/complex/abstracts/11-17abs.html#schneeberger
27. Liljeros F, Edling CR, Stanley HE, et al. Distributions of number of sexual partnerships have power law decaying tails and finite variance. arXiv.org e-Print archive. 2003. Available at: http://www.arxiv.org/abs/cond-mat/0305528
28. Newman MEJ, Strogatz SH, Watts DJ. Random graphs with arbitrary degree distributions and their applications. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics 2001; 64:026118.
29. Rothenberg RB. Transmission of HIV and STDs: An hypothesis for the interaction of risk, network configuration and geography. Sex Transm Infect, in press.