Department of Endocrinology, Ghent University Hospital, Gent, Belgium, MedCalc Software bvba, Mariakerke, Belgium, email@example.com (Schoonjans)
Department of Public Health, Ghent University, Gent, Belgium (De Bacquer)
Department of Transfusion Medicine, Clinical Center, National Institutes of Health, Bethesda, MD (Schmid)
Supported by The National Institutes of Health, Bethesda, MD through a fellowship (to P.S.).
Disclaimer: The views expressed do not necessarily represent the view of the National Institutes of Health, the Department of Health and Human Services, or the U.S. Federal Government.
Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article (www.epidemn.com).
To the Editor:
Percentiles play an important part in descriptive statistics of continuous data, and their use is recommended for reference interval estimation.1 We have selected various methods for the calculation of percentiles based on recommendations in the literature or use in popular software, and evaluated the accuracy of the percentile calculated in the sample as an estimate of the true population percentile,2 using Monte-Carlo techniques.
All selected methods calculate a rank or an index that points to a number in the sorted array of sample data, and linear interpolation is applied when the index does not correspond to an integer value. One method (method A1,3) calculates a rank or an index p(n + 1) with p representing the centile (which is the percentile divided by 100) and n the sample size. Method B4 calculates an index 0.5 + pn. Method C5 (commonly used in spreadsheets) uses p(n − 1) + 1 and method D6 uses p(n + 1/3) + 1/3. Details of the use of these 4 methods are given in the eAppendix (http://links.lww.com/EDE/A488).
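The four rank formulas above can be sketched in a few lines of Python. This is a minimal illustration and not the code used for the letter: the function name `percentile`, the `RANK_FORMULAS` table, and the clamping of ranks that fall outside 1 to n are assumptions made here, and the eAppendix may handle the extremes differently.

```python
import math

# Rank (1-based position in the sorted sample) for each method in the letter.
# p is the centile (percentile / 100) and n is the sample size.
RANK_FORMULAS = {
    "A": lambda p, n: p * (n + 1),
    "B": lambda p, n: 0.5 + p * n,
    "C": lambda p, n: p * (n - 1) + 1,
    "D": lambda p, n: p * (n + 1 / 3) + 1 / 3,
}


def percentile(sample, pct, method="B"):
    """Estimate the pct-th percentile by linear interpolation at the method's rank."""
    data = sorted(sample)
    n = len(data)
    rank = RANK_FORMULAS[method](pct / 100, n)
    # Assumption: ranks below 1 or above n are clamped to the sample extremes.
    rank = min(max(rank, 1.0), float(n))
    lower = math.floor(rank)      # rank of the observation at or below the target
    frac = rank - lower           # fractional part used for interpolation
    if lower >= n:
        return data[-1]
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])


if __name__ == "__main__":
    sample = list(range(1, 21))   # 1, 2, ..., 20, so each value equals its rank
    for name in "ABCD":
        print(name, percentile(sample, 95, name))
```

Because the illustrative sample is 1 to 20, each printed estimate equals the rank computed by the corresponding formula, which makes the differences between the methods easy to see.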
Experimental population data were obtained using a normal distribution pseudorandom number generator, programmed to generate a data set of 10^6 numbers with mean 0 and standard deviation 1.
From our population data, 6 sets of 100,000 random samples each were drawn using a pseudo-random number generator with uniform distribution. Each of these sets consisted of 100,000 random samples with sample size 20, 120, 500, and 1000. The averages of the 5th and 95th percentiles obtained with the 4 methods in these sample sets were calculated and compared with the population values. For each sample, the relative difference with the population value was expressed as a percentage, and the mean and standard deviation of these percentages were calculated.
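The sampling experiment can be sketched as follows, on a much smaller scale than in the letter so that it runs in a few seconds. Several details are assumptions made for illustration: the function names `percentile_b` and `simulate`, the fixed seeds, sampling without replacement, and taking the population percentile with the same method B formula. The letter does not state how these points were handled.

```python
import random
import statistics


def percentile_b(sample, p):
    """Method B: rank 0.5 + p*n with linear interpolation (ranks clamped to 1..n)."""
    data = sorted(sample)
    n = len(data)
    rank = min(max(0.5 + p * n, 1.0), float(n))
    lower = int(rank)             # rank >= 1, so int() acts as floor
    if lower >= n:
        return data[-1]
    return data[lower - 1] + (rank - lower) * (data[lower] - data[lower - 1])


def simulate(population, true_value, n, n_samples, p=0.95, seed=0):
    """Mean and SD of the relative difference (%) between sample and population percentile."""
    rng = random.Random(seed)
    rel_diffs = []
    for _ in range(n_samples):
        # Assumption: each sample is drawn uniformly without replacement.
        sample = rng.sample(population, n)
        estimate = percentile_b(sample, p)
        rel_diffs.append(100 * (estimate - true_value) / true_value)
    return statistics.mean(rel_diffs), statistics.stdev(rel_diffs)


# Reduced population and number of samples, chosen here only to keep the run short.
rng = random.Random(1)
population = [rng.gauss(0, 1) for _ in range(100_000)]
true_p95 = percentile_b(population, 0.95)

results = {}
for n in (20, 120, 500, 1000):
    results[n] = simulate(population, true_p95, n, n_samples=300)
    print(n, round(results[n][0], 2), round(results[n][1], 2))
```

Swapping in the other rank formulas, or an exponentially transformed population, would follow the same pattern.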
Next, the population data were transformed exponentially (base 10) to obtain a log-normal distribution and the experiments as described earlier were repeated.
The results for the 95th percentile in the normally distributed data are represented in the Table (more comprehensive tables with figures are available in the eAppendix, http://links.lww.com/EDE/A488). The results for the 5th percentile were symmetrical to the results for the 95th percentile and are not shown. Method B presents the highest accuracy, followed by methods D, A, and C.
The results for the 5th and 95th percentiles in the log-normally distributed data are represented in the Table. For the 5th percentile, method A has a higher accuracy than methods D, B, and C, especially in small sample sizes, whereas for the 95th percentile method C presents the highest accuracy, followed by methods B, D, and A.
We find that, for the calculation of percentiles, it may still be advantageous to transform log-normally distributed data. For example, the 95th percentile in the log-normal data should be about 44.16 (10^1.65). With method B and n = 20, we find an average of 114.69. But if we first log-transform the data we find, on average, 1.64, which back-transforms to 10^1.64 or 43.65, which is much closer to the true population value of 44.16. The effect of the log-transformation may be explained by the fact that linear interpolation is applied in the calculations of percentiles, and the transformation changes the distribution model within the interpolated interval.
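The back-transformation arithmetic and the interpolation effect can be checked with a short sketch. It takes 1.645 as the 95th percentile of the standard normal distribution, which is an assumption about the rounding behind the quoted exponent. The helper names and the pair of neighbouring values (20 and 200) are hypothetical and chosen only for illustration.

```python
import math

# Population 95th percentile on the original scale, and the back-transformed
# average obtained after log-transforming the data (values quoted in the letter).
population_p95 = 10 ** 1.645
back_transformed = 10 ** 1.64
print(round(population_p95, 2), round(back_transformed, 2))


def interpolate_raw(lo, hi, frac):
    """Linear interpolation between two order statistics on the original scale."""
    return lo + frac * (hi - lo)


def interpolate_log10(lo, hi, frac):
    """Linear interpolation on the log10 scale, then back-transformation."""
    return 10 ** (math.log10(lo) + frac * (math.log10(hi) - math.log10(lo)))


# Hypothetical neighbouring order statistics in the right tail of a skewed sample.
lo, hi, frac = 20.0, 200.0, 0.5
print(interpolate_raw(lo, hi, frac), interpolate_log10(lo, hi, frac))
```

Interpolating on the log scale amounts to a weighted geometric mean of the two neighbouring values, which never exceeds the weighted arithmetic mean used on the original scale. This is consistent with the lower, and here more accurate, estimate obtained after transformation.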
We conclude that method B is the preferred method in general for continuous data, taking into account the recommendation to transform the data to a normal distribution if necessary.
Finally, the large standard deviations of the observed differences illustrate the large statistical uncertainty associated with the estimated percentiles in small sample sizes. Therefore, we stress the importance of reporting percentiles with their 95% confidence interval.
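The letter does not specify how such a confidence interval should be obtained. One common distribution-free approach, sketched here as an assumption and not as the authors' method, uses the binomial distribution of the number of observations below the population percentile to choose two order statistics as limits. The function name `percentile_ci_ranks` is hypothetical.

```python
from math import comb


def percentile_ci_ranks(n, p, conf=0.95):
    """Return 1-based ranks (lower, upper) of the order statistics that bracket
    the p-th quantile with at least `conf` coverage for continuous data.
    A side is None when no order statistic achieves the required coverage."""
    alpha = 1 - conf
    # B = number of observations at or below the true quantile, B ~ Binomial(n, p).
    pmf = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    cdf, total = [], 0.0
    for value in pmf:
        total += value
        cdf.append(total)         # cdf[k] = P(B <= k)
    # Lower limit: largest rank l with P(X_(l) > quantile) = P(B <= l - 1) <= alpha / 2.
    lower = max((l for l in range(1, n + 1) if cdf[l - 1] <= alpha / 2), default=None)
    # Upper limit: smallest rank u with P(X_(u) <= quantile) = P(B >= u) <= alpha / 2.
    upper = min((u for u in range(1, n + 1) if 1 - cdf[u - 1] <= alpha / 2), default=None)
    return lower, upper


for n in (20, 120):
    print(n, percentile_ci_ranks(n, 0.95))
```

With n = 20 and the 95th percentile, this approach yields no upper limit at all, because even the largest observation lies below the population percentile too often. This is another view of the uncertainty in small samples.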
Department of Endocrinology
Ghent University Hospital
MedCalc Software bvba
Dirk De Bacquer
Department of Public Health
Department of Transfusion Medicine
National Institutes of Health
1. CLSI. Defining, Establishing, and Verifying Reference Intervals in the Clinical Laboratory: Approved Guideline. 3rd ed. Wayne, PA: Clinical and Laboratory Standards Institute; 2008. CLSI Document C28-A3.
2. Walter S. Problems with percentiles. Int J Epidemiol
3. Altman DG. Practical Statistics for Medical Research. London: Chapman and Hall; 1991.
4. Armitage P, Berry G, Matthews JN. Statistical Methods in Medical Research. 4th ed. Oxford: Blackwell Science; 2002.
5. Gumbel EJ. La Probabilité des Hypothèses. Comptes Rendus de l'Académie des Sciences (Paris)
6. Hyndman RJ, Fan Y. Sample quantiles in statistical packages. Am Stat