Distribution fitting
Methods of fitting data to abundance models are contentious, with many protocols having been advocated (Matthews & Whittaker, 2014). As mentioned, all of the models considered here seek to explain SADs sensu stricto, which are vectors that record the number of species each sharing a given count of individuals (Fisher et al., 1943).
SADs are not equivalent to rank abundance distributions (RADs). RADs are useful for depicting counts (e.g., Motomura, 1932; MacArthur, 1957) and are still used by some contemporary workers to fit distribution models (e.g., Nekola et al., 2008; Ulrich et al., 2010, 2015). There are at least four major reasons not to fit data to RADs: (1) key theoretical models directly predict SAD shapes, not RADs; (2) most models that do directly predict RADs, such as the geometric series (Motomura, 1932), are no longer considered to be viable descriptors of real-world ecological communities (Alroy, 2015; Baldridge et al., 2016); (3) maximum likelihood methods have been developed to fit models to SADs (e.g., Connolly et al., 2005; Connolly & Thibaut, 2012) and are generally advocated over the many alternatives (Gray et al., 2006; Matthews & Whittaker, 2014; Antão et al., 2021), but RADs are generally fit using frequentist methods; and (4) it is difficult to model the error in species ranks because any variation in the count of a species could also change its rank, so the x- and y-axes in an RAD are not statistically independent.
A third approach is also worth mentioning: to bin the counts into classes on a log scale, equivalent to a histogram where the boxes show the number of species in classes 1, 2, 3–4, 5–8, 9–16, etc. (Preston, 1948). This strategy is still used (e.g., Matthews et al., 2014), but it has rightly been rejected because it loses too much information and can introduce artefacts (Gray et al., 2006; Nekola et al., 2008).
Here I use a fast and reliable maximum likelihood computation for fitting. It is the most obvious approach: define the likelihood by multiplying the probabilities of the individual counts based on the SAD. Specifically, if there are $s_1$ singletons and $s_2$ doubletons out of $S$ species and if the hypothesised PMF is $p_1, p_2, p_3, \ldots$, then the joint likelihood is $p_1^{s_1} \times p_2^{s_2} \times \cdots$.
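As a minimal sketch of this computation, the following Python code evaluates the log of the product $p_1^{s_1} \times p_2^{s_2} \times \cdots$ and maximises it over a single parameter. The log series (Fisher et al., 1943) is used purely as an illustrative one-parameter PMF, and the golden-section search is an assumed optimiser; neither is claimed to be the model set or the routine used in this study. All function names are hypothetical.

```python
import math
from collections import Counter

def log_series_logpmf(n, x):
    """Log probability that a species has n individuals under a log series
    with parameter 0 < x < 1 (illustrative PMF, an assumption)."""
    return n * math.log(x) - math.log(n) - math.log(-math.log1p(-x))

def log_likelihood(sad, x):
    """Log of p_1^s_1 * p_2^s_2 * ..., i.e. the sum of s_n * ln(p_n).
    `sad` maps each count n to the number of species s_n with that count."""
    return sum(s * log_series_logpmf(n, x) for n, s in sad.items())

def fit_one_parameter(sad, lo=1e-9, hi=1 - 1e-9, tol=1e-10):
    """Maximise the log-likelihood over x by golden-section search
    (assumes a single interior maximum on the interval)."""
    invphi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - invphi * (b - a)
        d = a + invphi * (b - a)
        if log_likelihood(sad, c) > log_likelihood(sad, d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2

# Made-up counts of individuals per species, for illustration only.
counts = [1, 1, 1, 1, 2, 2, 3, 5, 8, 20]
sad = Counter(counts)          # s_1 = 4 singletons, s_2 = 2 doubletons, ...
x_hat = fit_one_parameter(sad)
```

Working on the log scale avoids numerical underflow when the product runs over many species, and it leaves the location of the maximum unchanged.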
This calculation works as well in practice as any other I have investigated, surpassing rivals in a suite of tests that I do not have space to detail. It has a very interesting property: exactly the same solution is always found by fitting a given set of counts to a multinomial model. The reason is that the combinatorial terms distinguishing multinomial distributions from simple products of probabilities are constant across all possible parameter values, so they wash out of any optimisation.
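The equivalence can be written out in one line. In the sketch below, $\theta$ stands for the model's free parameter and $s_n$ for the number of species with $n$ individuals; $\theta$ and the subscript "mult" are notation introduced here for illustration.

```latex
\[
\ln L_{\mathrm{mult}}(\theta)
  = \underbrace{\ln S! - \sum_{n} \ln s_n!}_{\text{does not depend on } \theta}
  + \sum_{n} s_n \ln p_n(\theta),
\qquad\text{so}\qquad
\arg\max_{\theta} L_{\mathrm{mult}}(\theta)
  = \arg\max_{\theta} \prod_{n} p_n(\theta)^{s_n}.
\]
```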
All models considered here use just one free parameter. However, many of the remaining models in the literature have two parameters (such as the PLN). For comparison across models in general, the corrected Akaike information criterion (Hurvich & Tsai, 1993) is recommended. It has been used previously in this context (Antão et al., 2021).
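For reference, the corrected criterion can be computed from a maximised log-likelihood as sketched below. The sketch uses the standard small-sample form, AICc = AIC + 2k(k + 1)/(n − k − 1). Taking the sample size n to be the number of species is an assumption made here for illustration, and the function name is hypothetical.

```python
def aicc(log_lik, k, n):
    """Corrected Akaike information criterion.

    log_lik: maximised log-likelihood of the fitted model
    k:       number of free parameters
    n:       sample size (assumed here to be the number of species)
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    aic = 2 * k - 2 * log_lik
    return aic + (2 * k * (k + 1)) / (n - k - 1)

# With equal log-likelihoods, the two-parameter model is penalised more.
one_param = aicc(-100.0, k=1, n=50)
two_param = aicc(-100.0, k=2, n=50)
```

Lower values indicate stronger support, so a two-parameter model must improve the log-likelihood by enough to offset its larger penalty.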
In practice, analysing a large data set requires examining a limited set of classes. Here, the computational limit is treated as $2^{14}$ = 16,384. Imposing this cutoff makes hardly any difference because just 45 out of 82,870 species counts in the database (0.05%) exceed it.
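The effect of such a limit can be illustrated with a short tabulation routine. The text does not say how counts above the limit are handled, so dropping them, as done here, is only one possible convention and should be read as an assumption. The function name and example counts are hypothetical.

```python
from collections import Counter

MAX_COUNT = 2 ** 14  # 16,384, the computational limit stated in the text

def tabulate_sad(counts, max_count=MAX_COUNT):
    """Return (sad, n_excluded): sad maps each count n to the number of
    species with n individuals; counts above max_count are dropped
    (an assumed convention, not one confirmed by the source)."""
    kept = [c for c in counts if 1 <= c <= max_count]
    n_excluded = sum(1 for c in counts if c > max_count)
    return Counter(kept), n_excluded

# Made-up counts: one species exceeds the limit and is excluded.
sad, n_excluded = tabulate_sad([1, 1, 2, 5, 20000])

# Share of species counts above the limit, from the figures in the text.
percent_above = 100 * 45 / 82870
```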