Ecologists and conservation biologists often need to quantify the diversity of some community of animals and plants. In the tropics this can be hard to do, because there are so many rare species that a random sample from the population will miss many of them. My friends Phil DeVries, Tom Walla, and Harold Greeney were still finding additional species even after five years spent systematically sampling the butterfly diversity of a single research site in the Amazon basin of Ecuador. This is typical of tropical habitats. So biologists need the help of mathematicians and statisticians to estimate the true diversity based on incomplete samples.
The person who has done the most to advance this aspect of statistical theory is Anne Chao, whom I am lucky to have as a friend and coauthor. She has figured out how to make use of information in the sample to tell us how incomplete it is, even when we don’t know how many species there are in the actual population. I have written about this subject before, when she and I figured out how to compare the diversities of two or more ecosystems based on incomplete samples from each of them (with a version in Spanish here). This is a common task, but biologists had always been doing it wrong. That post was about estimating the number of species in each community. However, there are many reasons why that number is not the best measure of biological diversity or complexity. Not only is it the hardest number to estimate accurately (because there are always unseen species), it also ignores the relative abundances of species, and those abundances have an important effect on the ecosystem’s complexity. An ecosystem with one very abundant species of birds and nine very rare species is less complex than an otherwise-similar ecosystem with ten equally common species of birds. An individual bird in the first ecosystem can be quite sure that the next bird it meets will be a member of the very abundant species. A bird in the second ecosystem will be completely uncertain about which of the ten species it will encounter next.
The amount of uncertainty in the species identity of the next individual encountered is actually something that can be quantified exactly if we know the relative abundances of each species in the ecosystem. In 1948 Claude Shannon, the inventor of information theory, showed that this uncertainty is equal to the entropy function that had been derived by the physicists Boltzmann and Gibbs in the late 1800s. Shannon’s discovery led to the wide use of entropy in other disciplines, including ecology. Shannon’s entropy measure became the most commonly used abundance-sensitive measure of biodiversity. The entropy is a simple function of the relative abundances of each of the species present in the population. Ten years ago I showed that this entropy function needs to be transformed by taking its exponential before it can be interpreted as diversity (Jost 2006, 2007, 2010). But whether we are interested in entropy or its exponential, in biology we still have to estimate these quantities from incomplete samples, so we don’t know the true relative abundances of the species in the population, and we don’t even know exactly how many species there are in the population.
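To make the bird example concrete, here is a minimal Python sketch that computes Shannon entropy and its exponential for two ten-species communities. The abundances of the uneven community (0.91 for the common species, 0.01 for each rare one) are numbers I made up for illustration, and the function names are my own.

```python
import math

def shannon_entropy(p):
    """Shannon entropy (natural log) of relative abundances p, which sum to 1."""
    return -sum(x * math.log(x) for x in p if x > 0)

def effective_species(p):
    """Exponential of Shannon entropy, interpreted as a diversity."""
    return math.exp(shannon_entropy(p))

# Ten equally common species.
even = [0.1] * 10
# One very abundant species and nine rare ones (assumed abundances).
uneven = [0.91] + [0.01] * 9

print("even:  ", shannon_entropy(even), effective_species(even))
print("uneven:", shannon_entropy(uneven), effective_species(uneven))
```

The even community has a diversity of ten, the same as its species count. The uneven community, with the same ten species, has a diversity of well under two, which matches the intuition that a bird there meets almost nothing but the common species.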
Anne had been thinking about this problem for more than thirty years, and recently came up with a beautifully elegant solution for estimating entropy from small samples. The solution makes clever use of the information contained in the sample about the unseen species in the population. I wrote briefly about it here. This month she was asked by the journal Methods in Ecology and Evolution to write a blog post about her solution. You can read her full blog post here, complete with essential links to more information. Here are some excerpts, with my notes in brackets:
Estimating Entropy from Sampling Data
In practice, the true number of species and their relative abundances are almost always unknown, so the true value of Shannon entropy must be estimated from sampling data. The estimation of this seemingly simple function is surprisingly difficult, especially when there are undetected species in the sample. It’s been proven that an unbiased estimator for Shannon entropy doesn’t exist for samples of fixed sizes.
The observed entropy of a sample or the ‘plug in’ estimator, which uses a sample fraction [the abundance of the species in the sample, divided by the size of the sample] in place of the [true] relative abundance of species [in the population], underestimates the entropy’s true value. The magnitude of this negative bias can be substantial.
For incomplete samples, the main source of the bias comes from the undetected species, which are ignored in the plug-in estimator. An enormous number of methods/approaches have been proposed in various disciplines to obtain a reliable entropy estimator with less bias than that of the plug-in estimator. The diversity of the approaches reflects the wide range of applications and the importance of bias-reduction.
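[My note: the negative bias is easy to see in a small Python sketch. The community below (50 equally common species) and the sample size of 30 are assumptions chosen for illustration. A sample of 30 individuals can contain at most 30 species, so the plug-in value can never exceed log(30), while the true entropy is log(50).]

```python
import math
import random
from collections import Counter

def plugin_entropy(counts):
    """Plug-in (observed) entropy: sample fractions stand in for true abundances."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

rng = random.Random(1)           # fixed seed so the run is repeatable
S = 50                           # hypothetical community: 50 equally common species
true_entropy = math.log(S)

n = 30                           # sample size smaller than the species count
sample = [rng.randrange(S) for _ in range(n)]
counts = list(Counter(sample).values())

estimate = plugin_entropy(counts)
print("true entropy:   ", true_entropy)
print("plug-in entropy:", estimate)
print("species seen:   ", len(counts), "of", S)
```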
My Introduction to Alan Turing’s Statistical Work
Around 1975 (when I was a graduate student in the Department of Statistics, University of Wisconsin-Madison) my advisor at the time, Bernard Harris, suggested that an “attractive and absorbing” (his original description) PhD thesis topic would be to develop an ‘optimal’ entropy estimator based on sampling data. He thought Alan Turing’s statistical work might prove to be useful and hoped that I could tackle this estimation problem.
However, at that time I didn’t even know who Alan Turing was! Although I started to read two background papers by I. J. Good (links below) about Turing’s statistical work, I couldn’t fully digest the material in the short time available. So, I didn’t work on the entropy estimation problem for my PhD thesis; instead, I derived some lower bounds for a variety of diversity measures. Ever since then, however, entropy estimation has fascinated me and has been in my mind/thoughts, and I regarded it as my ‘unfinished thesis’ topic.
The Building Blocks of My Entropy Estimators
According to I. J. Good (Turing’s statistical assistant during World War II), Turing never published his wartime statistical work, but permitted Good to publish it after the war. The two influential papers by Good (1953) and Good and Toulmin (1956) presented Turing’s wartime statistical work related to his famous cryptanalysis to crack German ciphers. After graduation, I read these two papers many times and searched for more literature. It took me a long time to fully understand these two papers, especially Turing’s statistical approach to estimating the true frequencies of rare code elements (including still-undetected code elements), based on frequencies in intercepted ‘samples’ of code.
The frequency formula is now referred to as the Good-Turing frequency formula. Turing and Good discovered a surprisingly simple and remarkably effective formula that is contrary to most people’s intuition. The formula proved to be very useful in my development of entropy estimators.
One important idea derived from the Good-Turing frequency formula is the concept of ‘sample coverage’. Sample coverage is an objective measure of the degree of completeness of the intercepted ‘samples’ of code elements. The ‘sample coverage’ of a sample quantifies the proportion of the total individuals in the assemblage that belong to sampled species. Therefore, the ‘coverage deficit’ (the complement to sample coverage) is the probability of discovering new species, i.e. the probability that a new, previously-unsampled species would be found if the sample were enlarged by one individual. Good and Turing showed that for a given sample, the sample coverage and its deficit can be accurately estimated from the sample data itself. Their estimator of coverage deficit is simply the proportion of singletons (in this case species with only one individual) in the observed sample. This concept and its estimator play essential roles in inferring entropy.
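[My note: the singleton-based estimator described above is short enough to write out. This Python sketch uses made-up species counts, and the function names are mine.]

```python
def coverage_deficit(counts):
    """Good-Turing estimate of coverage deficit: proportion of singletons, f1 / n."""
    n = sum(counts)
    if n == 0:
        raise ValueError("empty sample")
    f1 = sum(1 for c in counts if c == 1)   # number of species seen exactly once
    return f1 / n

def sample_coverage(counts):
    """Estimated sample coverage, the complement of the coverage deficit."""
    return 1.0 - coverage_deficit(counts)

# Hypothetical sample: 7 species, 25 individuals, 3 singletons.
counts = [12, 5, 3, 2, 1, 1, 1]
print("coverage deficit:", coverage_deficit(counts))
print("sample coverage: ", sample_coverage(counts))
```

For this sample the estimated chance that the next individual belongs to a new species is 3/25, so the sample is estimated to cover 88% of the individuals in the assemblage.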
A Novel Entropy Estimator
A species accumulation curve (SAC) shows the cumulative number of species as a function of sample size. In the figure below we see the expected curve when individuals are sequentially selected from a community with a given number of species, with relative abundances.
The first breakthrough in my search for an estimator of Shannon entropy was the realization that entropy can be expressed as a simple function of the successive slopes of the SAC. [Anne is probably the only person in the world who would have noticed this!] The curve’s successive slopes show the rates at which new species are detected in the sampling process. I had found a novel way to estimate entropy via discovery rates of new species in a SAC and these rates or slopes are exactly Turing’s coverage deficits for varying sample sizes!
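[My note: the link between entropy and the slopes can be checked numerically. If p_i is the relative abundance of species i, the expected slope of the SAC between sample sizes k and k+1 is the sum over species of p_i(1 - p_i)^k, and the series -log(p) = sum over k ≥ 1 of (1 - p)^k / k turns entropy into the sum of these slopes, each divided by k. The notation, the truncation point of the series, and the abundances below are my own choices for illustration.]

```python
import math

def expected_slope(p, k):
    """Expected SAC slope between sample sizes k and k+1: sum_i p_i * (1 - p_i)**k."""
    return sum(x * (1 - x) ** k for x in p)

def entropy_from_slopes(p, max_k=5000):
    """Entropy as sum over k >= 1 of slope(k) / k, truncated at max_k terms."""
    return sum(expected_slope(p, k) / k for k in range(1, max_k + 1))

# Assumed community: one abundant species and nine rare ones.
p = [0.91] + [0.01] * 9

direct = -sum(x * math.log(x) for x in p)
via_slopes = entropy_from_slopes(p)
print("direct entropy:     ", direct)
print("entropy from slopes:", via_slopes)
```

The two numbers agree to many decimal places. The rarer the species, the more terms of the series are needed, which is another way of seeing why rare species make entropy hard to estimate.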
The statistical problem was then to estimate the expected slopes or coverage deficits for any hypothetical sample size. Good and Turing’s approach provided the coverage deficit estimator for the expected slope of the sample that has been taken. All of the expected slopes for smaller sample sizes can be estimated without bias from statistical inference theory. However, there is no unbiased estimator for the expected slopes for sample sizes greater than the sample taken. These slopes are usually dominated by rare undetected species whose effect on entropy cannot be ignored. So, the burden of entropy estimation is shifted onto the estimation of the expected slopes for sizes greater than our sample.
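[My note: for slopes at sample sizes smaller than the sample taken, an unbiased estimator can be written with binomial coefficients. The Python sketch below is my own rendering of that idea, with made-up counts. Each species with X individuals in a sample of n contributes (X/n)·C(n-X, k)/C(n-1, k) to the estimated slope at k. At k = n-1 only singletons contribute, and the estimate reduces to the singleton proportion f1/n.]

```python
from math import comb

def slope_estimate(counts, k):
    """Unbiased estimate of the expected slope sum_i p_i * (1 - p_i)**k, for 0 <= k <= n-1."""
    n = sum(counts)
    if not 0 <= k <= n - 1:
        raise ValueError("k must lie between 0 and n - 1")
    denom = comb(n - 1, k)
    # comb(n - x, k) is zero whenever k > n - x, so abundant species drop out at large k.
    return sum((x / n) * comb(n - x, k) / denom for x in counts if x >= 1)

# Hypothetical sample: 7 species, 25 individuals, 3 singletons.
counts = [12, 5, 3, 2, 1, 1, 1]
n = sum(counts)
for k in (0, 1, 5, 10, n - 1):
    print(k, slope_estimate(counts, k))
```

No such formula exists for k ≥ n, which is exactly the part of the series that the singleton and doubleton counts have to fill in.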
The second breakthrough in solving this problem can also be attributed to the wisdom of Turing and Good, who showed that the number of singletons carries much information about the number of undetected rare species. I slightly modified their idea to use both singletons and doubletons to estimate the hard-to-estimate slopes by my modified Good-Turing frequency formula.
With the collaboration of Lou Jost and the simulation/programming work of Y.T. Wang, we published in 2013 the novel entropy estimator based on the derived slope estimators [open access to full text]. Our extensive simulations from theoretical models and real surveys generally showed that the new estimator outperformed all the existing estimators. It took me over 35 years to derive the optimal estimator for my ‘unfinished thesis’, so I have been calling it my entropy ‘pearl’. (The novel entropy estimator along with other related estimators can be calculated online.)
Doing Research is like Carving Jade
…As the old saying goes: “doing research is like carving jade, we are never satisfied with what we have until it is perfect”. This is also my advice to anyone starting their career in academia. The topic of entropy estimation has attracted and absorbed me for more than 35 years, and hopefully the novel estimator did yield an ‘optimal’ solution, even if it’s still not perfect.
In her blog post Anne also explains how this method generalizes to the estimation of a wider class of diversity measures based on generalized entropy, a problem she and I had been working on for ten years. See her original blog post for details.
Anne’s contributions to the mathematics of biology are one of the reasons why Sebastian Vieira and I recently named a new orchid after her. Thanks, Anne, for your very fruitful collaborations over the years!
Chao, A. and Jost, L. (2011) Diversity measures. Sourcebook in Theoretical Ecology (eds. A. Hastings and L. Gross). Berkeley: University of California Press.
Chao, A. and Jost, L. (2012) Coverage-based rarefaction and extrapolation: standardizing samples by completeness rather than size. Ecology 93: 2533–2547.
Chao, A. and Jost, L. (2015) Estimating diversity and entropy profiles via discovery rates of new species. Methods in Ecology and Evolution 6: 873–882.
Chao, A., Wang, Y.T. and Jost, L. (2013) Entropy and the species accumulation curve: a novel entropy estimator via discovery rates of new species. Methods in Ecology and Evolution 4: 1091–1100.
Jost, L. (2006) Entropy and diversity. Oikos 113: 363–375.
Jost, L. (2007) Partitioning diversity into independent alpha and beta components. Ecology 88: 2427–2439.
Jost, L. (2009) Mismeasuring biological diversity: Response to Hoffmann and Hoffmann (2008). Ecological Economics 68: 925–928.