
Evolution. Author manuscript; available in PMC 2007 Jan 1.

Published in final edited form as:

Evolution. 2006 Jan; 60(1): 1–12.

PMCID: PMC1482677

NIHMSID: NIHMS8481

PMID: 16568626

DIFFERENTIATION AMONG POPULATIONS WITH MIGRATION, MUTATION, AND DRIFT: IMPLICATIONS FOR GENETIC INFERENCE

Seongho Song,1,2 Dipak K. Dey,1 and Kent E. Holsinger3,4

1Department of Statistics, U-4120, University of Connecticut, Storrs, CT 06269-4120

3Department of Ecology & Evolutionary Biology, U-3043, University of Connecticut, Storrs, CT 06269-3043

2current address: Department of Mathematical Sciences, ML:0025, University of Cincinnati, Cincinnati, OH 45221-0025

4Address for correspondence: E-mail: kent@darwin.eeb.uconn.edu

The publisher's final edited version of this article is available free at Evolution

Abstract

Populations may become differentiated from one another as a result of genetic drift. The amounts and patterns of differentiation at neutral loci are determined by local population sizes, migration rates among populations, and mutation rates. We provide exact analytical expressions for the mean, variance, and covariance of a stochastic model for hierarchically structured populations subject to migration, mutation, and drift. In addition to the expected correlation in allele frequencies among populations in the same geographical region, we demonstrate that there is a substantial correlation in allele frequencies among regions at the top level of the hierarchy. We propose a hierarchical Bayesian model for inference of Wright's F-statistics in a two-level hierarchy in which we estimate the among-region correlation in allele frequencies by substituting replication across loci for replication across time. We illustrate the approach through an analysis of human microsatellite data, we show that approaches ignoring the among region correlation in allele frequencies underestimate the amount of genetic differentiation among major geographical population groups by approximately 30%, and we discuss the implications of these results for the use and interpretation of F-statistics in evolutionary studies.

Keywords: F-statistics, genetic drift, migration, mutation, population structure

Sewall Wright pointed out more than 70 years ago that isolated populations tend to diverge from one another as a result of genetic drift and that the amount and pattern of divergence among populations reflect the extent of migration between them. In particular, individuals belonging to the same population are expected to be more similar to one another than those belonging to different populations, and individuals from different populations within the same geographical region are expected to be more similar to one another than are individuals from different geographical regions. As a result, evolutionary biologists have commonly used statistics that describe patterns of population differentiation as an indication of the extent to which populations are genetically isolated from one another. Wright's F-statistics, and statistics related to them like Nei's G-statistics, have been the most widely used descriptors of population differentiation for more than fifty years.

While F-statistics are intended to measure the amount of differentiation among populations, they are implicitly based on Wright's island model, a pseudo-spatial model in which all populations are equally likely to exchange migrants. Much work in theoretical population genetics has been directed towards understanding the patterns of genetic differentiation that arise when more realistic patterns of migration are incorporated into the models (e.g., Kimura and Weiss 1964; Nei and Feldman 1972; Felsenstein 1975; Nagylaki 1976). Many investigators now calculate pairwise F-statistics to provide an indicator of the amount of differentiation between pairs of populations.

Another aspect of Wright's model has been less widely appreciated. His model implies that allele frequencies among populations at any given point in time are uncorrelated when the number of populations exchanging genes is very large. As a result, the variance in allele frequency among contemporaneous populations is equivalent to the variance in allele frequency within any one population over time. Surprisingly, this result no longer holds when a small to moderately large number of populations are exchanging genes.

Previous work shows that substantial among-population correlations in allele frequency are expected even for mutation rates as large as $10^{-3}$ unless more than 100 populations are exchanging genes. These results re-emphasize an earlier observation: when a finite number of populations is involved in gene exchange, the entire set of populations “drifts” together. The mean allele frequency changes from generation to generation, and the resulting correlation in allele frequencies among populations may be substantial unless the number of populations exchanging genes is very large. As a result, the variance in allele frequency among contemporaneous populations may be substantially smaller than the variance in allele frequency within any one population over time.

We explore two questions raised by these results: (1) How is the magnitude of among-population correlation affected by hierarchical structure in the migration process? How does the magnitude of the within-region correlation compare with the magnitude of the among-region correlation? By hierarchical structure we mean that there is a higher rate of migration among populations within a region than among populations in different regions. (2) Given that the number of populations actually exchanging genes is larger than the number of populations from which samples are available and is generally unknown, can an approximate estimator that does not depend on the number of populations provide a reasonable estimate of the expected amount of differentiation among populations? After exploring these questions, we illustrate the importance of accounting for among-population correlations in estimates of population genetic structure through analysis of a human population data set with a hierarchical Bayesian model, and we discuss the implications of our results for genetic analyses of population structure.

PROCESS MODEL RESULTS

Analysis of population genetic data must incorporate two aspects of sampling: the usual statistical sampling associated with the sampling of alleles or genotypes within populations, and the genetic sampling associated with sampling population allele or genotype frequencies from an underlying stochastic evolutionary process. We use a previously introduced modeling framework for the stochastic evolutionary process.

Briefly, we focus on a single locus with $A$ allele types, $b_1, b_2, \dots, b_A$, and assume that there are $P$ populations indexed by $i$ (refer to Table 1 for a summary of notation used in the process model). Let $\mathbf{V}_{A \times A}$ be a general mutation matrix with elements $v_{rs}$, the probability of mutation from allele type $b_r$ to allele type $b_s$. Let $\mathbf{M}_{P \times P}$ be a general (backward) migration matrix, i.e., $M_{ij} = \overleftarrow{m}_{ij}$ is the probability that the allele in population $i$ came from population $j$. Backward migration rates are generally used for analysis of migration in population genetic models (see, for example, Nagylaki 1982).

Table 1

Notation used in the single locus process model.

Parameter | Definition
--- | ---
$A$ | Number of alleles
$S$ | Number of populations in each geographical region
$k$ | Number of geographical regions
$P$ | Total number of populations, $P = kS$
$\mathbf{V}$ | An $A \times A$ matrix of mutation rates. $v_{rs}$ is the mutation rate from the $r$th allele to the $s$th allele
$\mathbf{M}$ | A $P \times P$ matrix of backward migration rates. $m_{ij}$ is the probability that an allele in population $i$ came from population $j$ in the preceding generation
$\mathbf{B}$ | The Kronecker product of $\mathbf{M}$ and $\mathbf{V}'$
$N_{ij}$ | Number of individuals in population $j$ of geographical region $i$
$\mathbf{p}_{ij}(t)$ | The vector of allele frequencies in population $j$ of geographical region $i$ at time $t$. $p_{ij,r}(t)$ is the frequency of the $r$th allele
Properties when the process has reached stationarity |
$\mathbf{u}_{ij}$ | The mean vector of allele frequencies in population $j$ of geographical region $i$
$\boldsymbol{\Sigma}_{ij,ij}$ | The variance-covariance of allele frequencies within population $j$ of geographical region $i$
$\boldsymbol{\Sigma}_{ij,i'j'}$ | The covariance of allele frequencies between population $j$ of geographical region $i$ and population $j'$ of geographical region $i'$
Finite island model |
$m_0$ | Probability that an allele came from the same population in the preceding generation
$m_1$ | Probability that an allele came from a different population in the same geographical region in the preceding generation
$m_2$ | Probability that an allele came from a different geographical region in the preceding generation
F-statistics |
$F_{ST} = \sigma_p^2 / [\mu_p(1 - \mu_p)]$ | Wright's $F_{ST}$, where $\sigma_p^2$ is the variance of allele frequencies and $\mu_p$ is the mean allele frequency
$\theta$ | An interpretation of $F_{ST}$ that arises when $\mu_p$ is taken as the mean allele frequency and $\sigma_p^2$ is taken as the allele frequency variance in a single population over time
$\theta(p_{11}(t), \dots, p_{kS}(t))$ | An interpretation of $F_{ST}$ that arises when $\mu_p$ is taken as the mean allele frequency and $\sigma_p^2$ is taken as the allele frequency variance across populations at a single time

A multi-level hierarchy arises naturally when migration occurs more frequently among populations within the same geographical region than among populations in different geographical regions. Consider a 2-level hierarchy in which there are $S$ populations nested within each of $k$ geographical regions ($kS = P$). Assume that population $j$ in geographical region $i$ is of size $N_{ij}$, and let $\mathbf{p}_{ij}(t)$ be the $A \times 1$ vector of allele frequencies in that population at generation $t$. We concatenate the population allele frequency vectors within geographical region $i$ to an $SA \times 1$ vector $\mathbf{p}_i(t)$ and concatenate the $k$ resulting cluster vectors to a $kSA \times 1$ vector $\mathbf{p}(t)$. We define

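$$\mathbf{p}^*(t) = (\mathbf{M} \otimes \mathbf{V}')\,\mathbf{p}(t), \tag{1}$$

where $\mathbf{M} \otimes \mathbf{V}'$ denotes the Kronecker product of $\mathbf{M}$ and $\mathbf{V}'$. Given $N_{ij}$, we assume that the population is diploid (so that the number of allele copies is $2N_{ij}$) and that the $\mathbf{p}_{ij}(t+1)$ are conditionally independent given $\mathbf{p}^*(t)$. Thus,

$$2N_{ij}\,\mathbf{p}_{ij}(t+1) \sim \mathcal{M}\left(2N_{ij},\, \mathbf{p}^*_{ij}(t)\right), \tag{2}$$

where $\mathcal{M}$ denotes a multinomial distribution. Through (1) and (2), we pass from $\mathbf{p}(t) \to \mathbf{p}^*(t) \to \mathbf{p}(t+1)$.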
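As an illustration only, one generation of the process defined by the migration-mutation step and the multinomial sampling step might be simulated as in the following sketch. The function names `migrate_mutate` and `next_generation`, the array layout (one row per population), the toy parameter values, and the use of NumPy are assumptions of this example, not part of the original model description.

```python
import numpy as np

def migrate_mutate(p, M, V):
    """Migration-mutation step: p* = (M kron V') p.

    p is a (P, A) array whose row i holds the allele frequencies of
    population i; M is the (P, P) backward migration matrix and V the
    (A, A) mutation matrix with V[r, s] = Pr(allele r mutates to allele s).
    With this row layout, (M kron V') applied to the stacked frequency
    vector is the same as the matrix product M @ p @ V.
    """
    return M @ p @ V

def next_generation(p, M, V, N, rng):
    """Sampling step: draw 2N allele copies per population from p*."""
    p_star = migrate_mutate(p, M, V)
    two_n = 2 * np.asarray(N)
    counts = np.array([rng.multinomial(n, q) for n, q in zip(two_n, p_star)])
    return counts / two_n[:, None]

# Hypothetical toy setting: 3 populations, 2 alleles.
rng = np.random.default_rng(0)
M = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])
V = np.array([[0.99, 0.01],
              [0.02, 0.98]])
p = np.array([[0.2, 0.8], [0.5, 0.5], [0.7, 0.3]])
p_next = next_generation(p, M, V, N=[50, 50, 50], rng=rng)
```

Writing the Kronecker product as `M @ p @ V` avoids building the full $PA \times PA$ matrix; the two forms agree when the frequencies are stacked population by population.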

Stationary equations for means and covariances

The Markov chain defined through (1) and (2) has a finite state space. If all entries of $\mathbf{V}^t$ are nonzero for some $t \ge 1$, then this chain has no absorbing states. In fact, it is aperiodic and irreducible and thus has a unique stationary distribution. The stationary mean vector is identical in all populations and is given by the unique left eigenvector of $\mathbf{V}$ corresponding to an eigenvalue of 1 (normalized so that its elements sum to 1). The stationary variance-covariance matrix is given by the solution of the following set of equations:

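$$\begin{aligned}
\boldsymbol{\Sigma}_{ij,ij} &= \left(1 - \frac{1}{2N_{ij}}\right)(\mathbf{B}\boldsymbol{\Sigma}\mathbf{B}')_{ij,ij} + \frac{1}{2N_{ij}}\left(\mathrm{Diag}(\mathbf{u}_{ij}) - \mathbf{u}_{ij}\mathbf{u}_{ij}'\right), \\
\boldsymbol{\Sigma}_{ij,i'j'} &= (\mathbf{B}\boldsymbol{\Sigma}\mathbf{B}')_{ij,i'j'}, \qquad (i,j) \neq (i',j'),
\end{aligned} \tag{3}$$

where $\mathbf{B} = \mathbf{M} \otimes \mathbf{V}'$.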
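A brief sketch of why the within-population equation takes this form, assuming only the multinomial sampling step, the law of total covariance, and stationarity (so that $E[\mathbf{p}^*_{ij}(t)] = \mathbf{u}_{ij}$ and $\mathrm{Cov}(\mathbf{p}^*_{ij}(t)) = (\mathbf{B}\boldsymbol{\Sigma}\mathbf{B}')_{ij,ij}$). The intermediate steps are our own illustration, not the authors' derivation:

```latex
\begin{aligned}
\mathrm{Cov}\big(\mathbf{p}_{ij}(t+1)\big)
  &= E\big[\mathrm{Cov}\big(\mathbf{p}_{ij}(t+1) \mid \mathbf{p}^*_{ij}(t)\big)\big]
   + \mathrm{Cov}\big(E\big[\mathbf{p}_{ij}(t+1) \mid \mathbf{p}^*_{ij}(t)\big]\big) \\
  &= \frac{1}{2N_{ij}}\,E\big[\mathrm{Diag}(\mathbf{p}^*_{ij}) - \mathbf{p}^*_{ij}\mathbf{p}^{*\prime}_{ij}\big]
   + \mathrm{Cov}(\mathbf{p}^*_{ij}) \\
  &= \frac{1}{2N_{ij}}\big(\mathrm{Diag}(\mathbf{u}_{ij}) - \mathbf{u}_{ij}\mathbf{u}_{ij}' - \mathrm{Cov}(\mathbf{p}^*_{ij})\big)
   + \mathrm{Cov}(\mathbf{p}^*_{ij}) \\
  &= \left(1 - \frac{1}{2N_{ij}}\right)(\mathbf{B}\boldsymbol{\Sigma}\mathbf{B}')_{ij,ij}
   + \frac{1}{2N_{ij}}\big(\mathrm{Diag}(\mathbf{u}_{ij}) - \mathbf{u}_{ij}\mathbf{u}_{ij}'\big).
\end{aligned}
```

For two different populations the sampling noise is conditionally independent, so only the second term of the total-covariance decomposition survives, which gives the between-population equation.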

Results for a Finite Island Model

To make further analytical progress we consider a special case of the general structured migration model presented in the preceding section. The finite island model studied previously assumed that a single backward migration rate applied to all populations. A natural generalization is to consider a hierarchical model in which migration occurs among populations within a single geographical region at one rate but among populations in different geographical regions at a different (and smaller) rate. For notational simplicity, we consider a 2-level hierarchy in which there are $S$ populations within each of $k$ geographical regions. We specify the migration matrix, $\mathbf{M}$, as follows:

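$$\mathbf{M} = \begin{pmatrix}
\mathbf{M}_1 & \mathbf{M}_2 & \cdots & \mathbf{M}_2 \\
\mathbf{M}_2 & \mathbf{M}_1 & \cdots & \mathbf{M}_2 \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{M}_2 & \mathbf{M}_2 & \cdots & \mathbf{M}_1
\end{pmatrix}, \tag{4}$$

where

$$\mathbf{M}_1 = \begin{pmatrix}
m_0 & \frac{m_1}{S-1} & \cdots & \frac{m_1}{S-1} \\
\frac{m_1}{S-1} & m_0 & \cdots & \frac{m_1}{S-1} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{m_1}{S-1} & \frac{m_1}{S-1} & \cdots & m_0
\end{pmatrix} \tag{5}$$

and

$$\mathbf{M}_2 = \begin{pmatrix}
\frac{m_2}{S(k-1)} & \frac{m_2}{S(k-1)} & \cdots & \frac{m_2}{S(k-1)} \\
\frac{m_2}{S(k-1)} & \frac{m_2}{S(k-1)} & \cdots & \frac{m_2}{S(k-1)} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{m_2}{S(k-1)} & \frac{m_2}{S(k-1)} & \cdots & \frac{m_2}{S(k-1)}
\end{pmatrix}, \tag{6}$$

with $m_0 = 1 - m_1 - m_2$ and $m_1 > m_2$. The remainder of the formulation is unchanged.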
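The block structure of the hierarchical migration matrix can be assembled numerically as in the sketch below. The function name `hierarchical_migration_matrix` and the example parameter values are assumptions made for illustration; the construction itself only restates the definitions of the within-region block, the among-region block, and their arrangement.

```python
import numpy as np

def hierarchical_migration_matrix(k, S, m1, m2):
    """Backward migration matrix for k regions of S populations each.

    Requires k >= 2 and S >= 2 so that the divisions below are defined.
    """
    m0 = 1.0 - m1 - m2
    # Within-region block: m0 on the diagonal, m1 / (S - 1) elsewhere.
    M1 = np.full((S, S), m1 / (S - 1))
    np.fill_diagonal(M1, m0)
    # Among-region block: every entry equals m2 / (S (k - 1)).
    M2 = np.full((S, S), m2 / (S * (k - 1)))
    # M1 on the block diagonal, M2 in every off-diagonal block.
    eye_k = np.eye(k)
    return np.kron(eye_k, M1) + np.kron(np.ones((k, k)) - eye_k, M2)

# Hypothetical example: 3 regions, 4 populations per region.
M = hierarchical_migration_matrix(k=3, S=4, m1=0.04, m2=0.01)
```

Each row of the result sums to $m_0 + m_1 + m_2 = 1$, as a backward migration matrix must.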

Lengthy calculations provide exact expressions for the stationary mean vector and for the stationary variances and covariances. To illustrate the properties of this specialization we consider an example with two alleles and asymmetric mutation.

Example

With two alleles at a locus, the mutation matrix is given by

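$$\mathbf{V} = \begin{pmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \end{pmatrix},$$

where $v_{11} = 1 - v_{12}$ and $v_{22} = 1 - v_{21}$. This mutation matrix leads to the stationary mean vector $\mathbf{u} = (\mu_p, 1 - \mu_p)$, where $\mu_p = v_{21}/(v_{12} + v_{21})$. In addition,

$$\sigma^2 = \frac{\frac{1}{2N}\,\mu_p(1 - \mu_p)}{1 - \left(1 - \frac{1}{2N}\right)\delta_v^2\left[(1 - r_{1,S} - r_{2,k} + r_{3,k,S}) + (r_{1,S} - r_{3,k,S})\rho_1 + r_{2,k}\rho_2\right]},$$

where $\delta_v = v_{11} - v_{21} = 1 - (v_{12} + v_{21})$, $\sigma^2$ is the variance in allele frequency within any one population over time, $\rho_1$ is the allele frequency correlation for populations in the same geographical region, and $\rho_2$ is the allele frequency correlation for populations in different geographical regions. $\rho_1$ and $\rho_2$ are given by

$$\begin{aligned}
\rho_1 &= \frac{\delta_v^2\,(r_{1,S} - r_{3,k,S})\left[1 - \delta_v^2\left(1 - \frac{r_{2,k}}{k-1}\right)\right]}{c\,(S-1)} + \frac{\delta_v^4\, r_{2,k}^2}{c\,S(k-1)}, \\
\rho_2 &= \frac{r_{2,k}\,\delta_v^2\left[1 - \delta_v^2\left(1 - \frac{r_{1,S} - r_{3,k,S}}{S-1} - r_{2,k}\right)\right]}{c\,S(k-1)} + \frac{(r_{1,S} - r_{3,k,S})\, r_{2,k}\,\delta_v^4}{c\,S(k-1)},
\end{aligned}$$

where

$$\begin{aligned}
c &= \left[1 - \delta_v^2\left(1 - \frac{r_{2,k}}{k-1}\right)\right]\left[1 - \delta_v^2\left(1 - \frac{r_{1,S} - r_{3,k,S}}{S-1} - r_{2,k}\right)\right] - \frac{(S-1)\,\delta_v^4}{S(k-1)}\, r_{2,k}^2, \\
r_{1,S} &= 2m_1 - \frac{S}{S-1}\, m_1^2, \\
r_{2,k} &= 2m_2 - \frac{k}{k-1}\, m_2^2, \\
r_{3,k,S} &= 2m_1 m_2 - \frac{S-1}{S(k-1)}\, m_2^2.
\end{aligned}$$

In the case where each geographical region contains only a single population, i.e., $S = 1$, these results reduce to those presented previously for the finite island model.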
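The closed-form expressions for the two-allele example are straightforward to evaluate numerically. The following is a minimal sketch; the function names `hierarchical_correlations` and `scaled_variance` and the example parameter values are assumptions of this illustration, and equal population sizes $N$ are assumed throughout, as in the expression for $\sigma^2$.

```python
def _r_terms(k, S, m1, m2):
    """The three migration summaries r_{1,S}, r_{2,k}, r_{3,k,S}."""
    r1 = 2 * m1 - S / (S - 1) * m1 ** 2
    r2 = 2 * m2 - k / (k - 1) * m2 ** 2
    r3 = 2 * m1 * m2 - (S - 1) / (S * (k - 1)) * m2 ** 2
    return r1, r2, r3

def hierarchical_correlations(k, S, m1, m2, v12, v21):
    """Return (rho1, rho2): within-region and among-region correlations.

    Requires k >= 2 and S >= 2.
    """
    dv2 = (1.0 - (v12 + v21)) ** 2  # delta_v squared
    r1, r2, r3 = _r_terms(k, S, m1, m2)
    a = 1 - dv2 * (1 - r2 / (k - 1))
    b = 1 - dv2 * (1 - (r1 - r3) / (S - 1) - r2)
    c = a * b - (S - 1) * dv2 ** 2 * r2 ** 2 / (S * (k - 1))
    rho1 = (dv2 * (r1 - r3) * a / (c * (S - 1))
            + dv2 ** 2 * r2 ** 2 / (c * S * (k - 1)))
    rho2 = (r2 * dv2 * b + (r1 - r3) * r2 * dv2 ** 2) / (c * S * (k - 1))
    return rho1, rho2

def scaled_variance(two_n, k, S, m1, m2, v12, v21):
    """Return sigma^2 / (mu_p (1 - mu_p)) for populations of 2N = two_n allele copies."""
    dv2 = (1.0 - (v12 + v21)) ** 2
    r1, r2, r3 = _r_terms(k, S, m1, m2)
    rho1, rho2 = hierarchical_correlations(k, S, m1, m2, v12, v21)
    bracket = (1 - r1 - r2 + r3) + (r1 - r3) * rho1 + r2 * rho2
    return (1 / two_n) / (1 - (1 - 1 / two_n) * dv2 * bracket)

# Hypothetical example: 10 regions of 10 populations, common migration.
rho1, rho2 = hierarchical_correlations(k=10, S=10, m1=0.04, m2=0.01,
                                       v12=5e-7, v21=5e-7)
```

Dividing $\sigma^2$ by $\mu_p(1 - \mu_p)$ in `scaled_variance` removes the dependence on the stationary mean, so only the symmetric combination $v_{12} + v_{21}$ enters through $\delta_v$.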

Figure 1 illustrates the magnitude of $\rho_1$ and $\rho_2$ for several parameter combinations. Notice that with mutation rates that may be typical of protein-coding loci ($v = 5 \times 10^{-7}$), both the correlation of allele frequencies among populations within a geographical region, $\rho_1$, and the correlation among geographical regions, $\rho_2$, are very high unless either the number of populations within each region, $S$, or the number of regions, $k$, is on the order of 200 or more. Moreover, the correlation decays more slowly as a function of increasing numbers of populations and regions when migration is common ($m_1 = 0.04$, $m_2 = 0.01$) than when it is rare ($m_1 = 0.008$, $m_2 = 0.002$). Notice also that the within-region correlation is relatively insensitive to the number of regions. It depends largely on the number of populations within each region. Similarly, the among-region correlation depends primarily on the number of regions and less on the number of populations within regions. Finally, with mutation rates that may be more typical of microsatellites ($v = 0.005$), the within-region correlation remains substantial with as many as 40 populations per region, and the among-region correlation is small only when migration is rare ($m_1 = 0.008$, $m_2 = 0.002$).

Figure 1

Plots of the correlation among populations within the same geographical region, ρ1, and among geographical regions, ρ2, as a function of the number of geographical regions, k, and the number of populations within each region, S, for several choices of migration rate with two mutation rates in the two-allele model.

INTERPRETATION FOR F-STATISTICS

In general, the expressions for the stationary variance depend on the stationary mean. We can, however, calculate a scaled variance,

$$F_{ST} = \frac{\sigma^2_{p(t)}}{\mu_{p(t)}\left(1 - \mu_{p(t)}\right)}, \tag{7}$$

that removes this dependence. In this case $F_{ST}$ can be regarded as an intraclass correlation coefficient. In fact, $F_{ST}$ has been defined as “the correlation between random gametes within populations, relative to gametes of the total set of populations” (p. 294). Since it was first introduced, $F_{ST}$ has been the most widely used statistic for describing hierarchical structure in genetic data.

For a finite set of populations evolving according to (1) and (2) there are two quantities that might correspond with (7). (Refer to Table 2 for a summary of notation used in relation to F-statistics.) One of these quantities, which we will denote $\theta$, takes $\sigma^2_{p(t)}$ to be the allele frequency variance in a single population across time (or equivalently, the allele frequency variance in a single population associated with different realizations of the stochastic evolutionary process). The second takes $\sigma^2_{p(t)}$ to be the allele frequency variance across populations at a single point in time, i.e.,

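$$\theta(p_{11}(t), \dots, p_{kS}(t)) = \frac{\sum_{i=1}^{k}\sum_{j=1}^{S}\left(p_{ij}(t) - \mu_{p(t)}\right)^2 / kS}{\mu_{p(t)}\left(1 - \mu_{p(t)}\right)}, \tag{8}$$

where $\mu_{p(t)} = \sum_{i=1}^{k}\sum_{j=1}^{S} p_{ij}(t) / kS$.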
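For a single snapshot of allele frequencies, the across-population quantity just defined can be computed directly. The sketch below assumes the frequencies of one allele are held in a `k` by `S` array (rows are regions, columns are populations within a region); the function name `theta_across_populations` and the toy data are hypothetical.

```python
import numpy as np

def theta_across_populations(p):
    """Scaled allele frequency variance across all kS populations at one time.

    p is a (k, S) array holding the frequency of one allele in each population.
    """
    p = np.asarray(p, dtype=float)
    mu = p.mean()                      # mean frequency over all kS populations
    variance = ((p - mu) ** 2).mean()  # sum of squared deviations divided by kS
    return variance / (mu * (1.0 - mu))

# Hypothetical snapshot: 2 regions, 2 populations per region.
snapshot = [[0.2, 0.4],
            [0.6, 0.8]]
theta_snapshot = theta_across_populations(snapshot)
```

Note that the variance uses the divisor $kS$, as in the definition, rather than the unbiased divisor $kS - 1$.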

Table 2

Notation used in the two-allele, multiple locus hierarchical beta model.

Parameter | Definition
--- | ---
$\theta^{(I)}$ | An estimate of $\theta$
$\theta^{(II)}$, $\theta^{(III)}$ | Alternative estimates of $\theta(p_{11}(t), \dots, p_{kS}(t))$
$I$ | Number of loci (two alleles per locus)
$K$ | Number of geographical regions
$S_k$ | Number of populations in geographical region $k$
$p_{iks}$ | Frequency of allele $A_1$ at locus $i$ in population $s$ of geographical region $k$
$\pi_{ik}$ | Mean allele frequency at locus $i$ in geographical region $k$
$\pi_i$ | Mean allele frequency at locus $i$
$\pi$ | Mean allele frequency across loci
$\pi_{ik}(1 - \pi_{ik})\theta_x$ | Variance in allele frequency at locus $i$ within geographical regions
$\pi_i(1 - \pi_i)\theta_y$ | Variance in allele frequency at locus $i$ among geographical regions
$\pi(1 - \pi)\theta_v$ | Variance in allele frequency across loci
$\rho_1$ | Allele frequency correlation among populations within geographical regions
$\rho_2$ | Allele frequency correlation among geographical regions

We use $\theta$ in our notation to emphasize that both quantities treat populations as random effects, as in Cockerham's random-effects model for θ-statistics. In particular, $\theta$ in our formulation has precisely the same interpretation as $\theta$ in the Cockerham random-effects model (Weir, personal communication). Moreover, $\theta$ is the parameter most directly related to features of the allele frequency distribution associated with the process model described above, as well as to allele frequency distributions arising from the pure isolation model underlying evolutionary interpretations of Weir and Cockerham's θ-statistics. Thus, we regard $\theta$ as the quantity that corresponds to Wright's $F_{ST}$. Notice that $\theta(p_{11}(t), \dots, p_{kS}(t))$ corresponds to a random-effects interpretation of $G_{ST}$ (Nei 1973).

Estimation of $\theta$ is straightforward. Given a sample of genetic data from some set of populations, we estimate $\mu_{p(t)}$ and $\sigma^2_{p(t)}$ from the data, and the estimate corresponding to $\theta$ is $\theta^{(I)} = \hat{\sigma}^2_{p(t)} / [\hat{\mu}_{p(t)}(1 - \hat{\mu}_{p(t)})]$. Notice that if the number of populations and population clusters is very large, then the allele frequency correlations illustrated in Figure 1 are negligible, and $\theta \approx \theta(p_{11}(t), \dots, p_{kS}(t))$. Recalling that $k$ and $S$ in (8) refer to the number of population clusters and populations among which migration is occurring, not the number of populations included in the sample, $\theta^{(I)}$ might also be regarded as an estimate of $\theta(p_{11}(t), \dots, p_{kS}(t))$ when $\rho_1$ and $\rho_2$ are small.

Estimation of $\theta(p_{11}(t), \dots, p_{kS}(t))$ is less straightforward. Let Num denote the numerator on the right-hand side of (8), and Denom the corresponding denominator. $\theta^{(III)} = E[\mathrm{Num}/\mathrm{Denom}]$ would correspond precisely with the definition of $\theta(p_{11}(t), \dots, p_{kS}(t))$. To estimate $\theta^{(III)}$, however, would require knowledge of $k$ and $S$, and they are generally unknown. An alternative is to estimate $\theta(p_{11}(t), \dots, p_{kS}(t))$ with $\theta^{(II)} = E[\mathrm{Num}]/E[\mathrm{Denom}]$, which is relatively insensitive to $k$ and $S$ for moderate values. In fact, we can show that $\theta^{(II)} \to \theta^{(III)}$ in probability as $k$ and $S$ tend to infinity. Furthermore, by using $\theta^{(II)}$ to estimate $\theta(p_{11}(t), \dots, p_{kS}(t))$, we can provide an approximate expression for $\theta^{(II)}$ that does not depend on $k$ or $S$.
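The distinction between a ratio of expectations and an expectation of ratios can be made concrete with replicate realizations of the process. In the sketch below, sample averages over replicates stand in for the expectations; the function name `theta_ratio_estimates`, the `(R, k, S)` array layout, and the toy replicates are assumptions of this illustration rather than the procedure used for the tables.

```python
import numpy as np

def theta_ratio_estimates(replicates):
    """Monte Carlo stand-ins for E[Num]/E[Denom] and E[Num/Denom].

    replicates is an (R, k, S) array: R independent realizations of the
    frequency of one allele in each of the kS populations.
    """
    p = np.asarray(replicates, dtype=float)
    mu = p.mean(axis=(1, 2))                                # mean frequency per replicate
    num = ((p - mu[:, None, None]) ** 2).mean(axis=(1, 2))  # numerator per replicate
    denom = mu * (1.0 - mu)                                 # denominator per replicate
    ratio_of_means = num.mean() / denom.mean()              # plays the role of theta(II)
    mean_of_ratios = (num / denom).mean()                   # plays the role of theta(III)
    return ratio_of_means, mean_of_ratios

# Two hypothetical replicates of a 2-region, 2-population system.
reps = np.array([[[0.2, 0.4], [0.6, 0.8]],
                 [[0.1, 0.1], [0.3, 0.3]]])
theta_2, theta_3 = theta_ratio_estimates(reps)
```

The two summaries differ whenever the per-replicate denominator varies, which is why they need to be distinguished when the number of populations exchanging genes is small.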

Tables 3 and 4 provide comparisons among $\theta^{(I)}$, $\theta^{(II)}$, and $\theta^{(III)}$ for $2N = \{100, 1000\}$, $v = v_{12} = v_{21} = \{5 \times 10^{-3}, 5 \times 10^{-6}\}$, and several choices of $m_1$ and $m_2$ with $m_1 + m_2 = 0.05$. We calculate the expectations for $\theta^{(III)}$ from simulations using the process model in (1) and (2) conditional on 0
