Comparing different genotypic interpretation systems is of interest to improve knowledge of the relationship between resistance mutations and virological response under a specific drug. The motivation for choosing a specific system should be based on comparison using appropriate statistical methods. There is some confusion as to the correct use of the linear regression model with its R2 measure in this setting, especially when models using a single discrete variable are compared with a model using many variables.
INSERM, UMR S 720, Paris, F-75013, and Université Pierre et Marie Curie-Paris 6, UMR S 720, Paris, F-75013, France.
Received 14 July, 2006
Accepted 9 August, 2006
Drug resistance testing has proved useful in guiding treatment decisions in HIV-1-infected patients. In particular, genotyping has been widely investigated to help in the choice of subsequent treatment for experienced HIV-infected patients. Different genotypic drug resistance interpretation systems, or algorithms, have been developed in recent years [1–3]. The main idea is to propose a genotypic mutations score allowing the classification of patients as resistant, potentially resistant, or sensitive for each approved drug. Different statistical methods can be employed to develop such scores, but the final score is mainly obtained as the sum of several mutations involved in the virological response to the corresponding drug. The resulting score of these algorithms is therefore a discrete variable taking only a limited number of values.
Obviously there is great interest in comparing these different algorithms. More recently some apparently more complicated statistical methods, including artificial neural networks (ANN), support vector machines and random forest, have been suggested to predict virological response. The upholders of such ‘computational modelling’ have argued that there is no standard interpretation system and a lack of consensus, but mainly that those algorithms predict poorly compared, for example, with ANN. The basis of the comparison was the linear measure of prediction, the so-called R2 measure, defined as the ratio of the sum of squares (SS) caused by the regression to the total SS. The basis of the comparison, however, is not valid.
Our demonstration is based on only 27 patients experiencing a treatment change episode. The information consists of viral load reduction (VLR), and genotypic sensitivity scores (GSS) from three rule-based algorithms [Stanford, Agence Nationale de Recherches sur le Sida (ANRS), and Rega]. Figure 1 shows the data and the result of the three linear regression models. In comparison, the ANN model exhibits an r2 of 0.75, suggesting a clear superiority of this method.
In the original presentation, a trick was to put the VLR on the x-axis (abscissa) and the GSS on the y-axis (ordinate), whereas the independent variable (GSS) is usually on the x-axis and the dependent variable (VLR) on the y-axis. The GSS information is typically represented by a discrete variable as shown in Fig. 1, patients with the same combination of mutations having the same value of GSS. The linear model is perfectly and specifically adapted to the case of two continuous variables. The use of this model is fully linked to the concept of prediction, and considering a single variable X, one objective is to predict the continuous dependent variable Y from a continuous independent variable X. In the situation of a discrete variable (GSS) and a continuous variable (VLR) the linear model is much less appropriate. This well-known situation, called ‘repeat runs on X’, is described in all textbooks about regression analysis, see for example Draper and Smith [7], pp. 33–42. Repeat runs, however, have a major impact on R2. The presence of repeat runs on the X variable allows a breakup of the residual SS into a lack of fit SS and a pure error SS. This situation can be explained in words. For each given value of GSS there are several values of VLR representing the natural variability. Therefore it is simply impossible for all of these points to lie on the regression line, and the computation of R2 should take this feature into account. For example, in Fig. 1a there are four patients with a GSS of 0.5 with a VLR varying from approximately −1.35 log10 copies/ml to −2.15 log10 copies/ml. The residual SS decomposition allows each of these four values to be compared with the corresponding mean of these values, which provides the pure error SS. The contribution of this group to the lack of fit SS is the squared difference between the VLR predicted for a GSS of 0.5 and the mean of the four values, weighted by the number of patients in the group.
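The decomposition described in words above can be written compactly. The notation below is an assumption introduced here for illustration and is not taken from the original letter: m is the number of distinct GSS values, n_j the number of patients sharing the j-th value, y_{ju} the VLR of the u-th of those patients, \bar{y}_j their mean VLR, and \hat{y}_j the VLR predicted by the fitted line at that GSS value.

```latex
% Residual SS split into pure error and lack of fit (repeat runs on X).
% The cross term vanishes because \sum_u (y_{ju} - \bar{y}_j) = 0 in each group.
\[
\underbrace{\sum_{j=1}^{m}\sum_{u=1}^{n_j}\bigl(y_{ju}-\hat{y}_j\bigr)^2}_{\text{residual SS}}
=
\underbrace{\sum_{j=1}^{m}\sum_{u=1}^{n_j}\bigl(y_{ju}-\bar{y}_j\bigr)^2}_{\text{pure error SS}}
+
\underbrace{\sum_{j=1}^{m} n_j\bigl(\bar{y}_j-\hat{y}_j\bigr)^2}_{\text{lack of fit SS}}
\]
```

The pure error term does not depend on the fitted line at all, which is why no model using GSS alone can remove it.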
In addition, the ratio of the mean square of lack of fit to the mean square of pure error provides a test of model adequacy. Such a decomposition also allows the maximum R2 attainable to be computed with this kind of data as follows: max R2 = (total SS − pure error SS)/total SS. The max r2 was 0.4397, 0.4063, and 0.2294 for Stanford, ANRS and Rega GSS, respectively. The r2 for the Stanford GSS was 0.2079 or 20.79%. In other words, we have explained 0.2079/0.4397 = 0.4728 or approximately 47% of the amount that can be explained. For the ANRS and Rega GSS the same calculation leads to 69 and 61%, respectively. These figures, although not perfect, look much better. Such a calculation gives a better sense of what the model is actually achieving in terms of what can be achieved.
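A minimal sketch of this calculation is given below, assuming a simple least-squares line of VLR on GSS. The function name `lack_of_fit_summary` and the twelve data points are hypothetical and invented for illustration; they are not the 27 patients analysed in this letter. Only the final ratio uses figures quoted in the text.

```python
from collections import defaultdict


def lack_of_fit_summary(x, y):
    """Fit y = a + b*x by least squares and split the residual SS
    into pure error and lack of fit (requires repeat runs on x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    total_ss = sum((yi - mean_y) ** 2 for yi in y)
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2
                      for xi, yi in zip(x, y))

    # Group the responses by distinct x value (the repeat runs).
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)

    # Pure error: spread of each response around its own group mean.
    pure_error_ss = 0.0
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        pure_error_ss += sum((yi - group_mean) ** 2 for yi in ys)

    lack_of_fit_ss = residual_ss - pure_error_ss
    m = len(groups)  # number of distinct x values
    f_ratio = (lack_of_fit_ss / (m - 2)) / (pure_error_ss / (n - m))

    return {
        "r2": 1.0 - residual_ss / total_ss,
        "max_r2": (total_ss - pure_error_ss) / total_ss,
        "residual_ss": residual_ss,
        "pure_error_ss": pure_error_ss,
        "lack_of_fit_ss": lack_of_fit_ss,
        "f_ratio": f_ratio,  # compare with F(m - 2, n - m)
    }


# Hypothetical data: GSS takes five distinct values, VLR in log10 copies/ml.
gss = [0, 0, 0.5, 0.5, 0.5, 0.5, 1, 1, 1.5, 1.5, 2, 2]
vlr = [-0.3, -0.6, -1.35, -2.15, -1.6, -1.9,
       -1.2, -2.0, -1.8, -2.6, -2.2, -2.9]

summary = lack_of_fit_summary(gss, vlr)
share_explained = summary["r2"] / summary["max_r2"]

# Ratio quoted in the text for the Stanford GSS.
stanford_share = 0.2079 / 0.4397
```

Here `share_explained` is the proportion of the attainable variation captured by the line for the invented data, and `stanford_share` reproduces the approximately 47% reported above.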
The final findings now look a little bit different. First, the three adequacy tests were not statistically significant, indicating that at least the linear model can be employed. It appears that the ANN model including four variables (resistance mutations, drugs in new regimen, baseline viral load and time to follow-up) explained 75% of the variation, whereas, for example, the ANRS algorithm including only information on resistance mutations explained 69% of the variation that can be explained. The two other systems, which also include only information on resistance mutations, explained 47 and 61% of what can be explained. A well-known statistical property of the R2 measure is that its estimated value cannot decrease with the inclusion of new variables in the model. It is then clear that the inclusion of other variables in the three algorithm systems will automatically increase the r2, probably leading to a value close to or even greater than 75%.
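The property that R2 cannot decrease when a variable is added can be checked numerically. The sketch below is an illustration under stated assumptions: the helper `ols_r2` is hypothetical, it solves the normal equations by Gaussian elimination in plain Python, and the GSS, baseline viral load and VLR values are invented and do not come from the data discussed in this letter.

```python
def ols_r2(columns, y):
    """R2 of the least-squares fit of y on an intercept plus the
    predictor columns, via the normal equations."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])

    # Augmented normal equations [X'X | X'y].
    aug = [[sum(design[i][r] * design[i][s] for i in range(n))
            for s in range(p)]
           + [sum(design[i][r] * y[i] for i in range(n))]
           for r in range(p)]

    # Gaussian elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(aug[r][k]))
        aug[k], aug[pivot] = aug[pivot], aug[k]
        for r in range(k + 1, p):
            factor = aug[r][k] / aug[k][k]
            for s in range(k, p + 1):
                aug[r][s] -= factor * aug[k][s]

    # Back substitution for the coefficients.
    coef = [0.0] * p
    for k in reversed(range(p)):
        tail = sum(aug[k][s] * coef[s] for s in range(k + 1, p))
        coef[k] = (aug[k][p] - tail) / aug[k][k]

    fitted = [sum(design[i][s] * coef[s] for s in range(p))
              for i in range(n)]
    mean_y = sum(y) / n
    total_ss = sum((yi - mean_y) ** 2 for yi in y)
    residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1.0 - residual_ss / total_ss


# Hypothetical data for twelve patients.
gss = [0, 0, 0.5, 0.5, 0.5, 0.5, 1, 1, 1.5, 1.5, 2, 2]
baseline_vl = [5.1, 4.8, 4.9, 5.3, 4.5, 5.0,
               4.7, 5.2, 4.4, 4.9, 4.6, 5.0]
vlr = [-0.3, -0.6, -1.35, -2.15, -1.6, -1.9,
       -1.2, -2.0, -1.8, -2.6, -2.2, -2.9]

r2_gss_only = ols_r2([gss], vlr)
r2_gss_plus_baseline = ols_r2([gss, baseline_vl], vlr)
```

Because the one-variable model is nested in the two-variable model, `r2_gss_plus_baseline` is at least as large as `r2_gss_only` whatever the added variable is, which is why comparing the r2 of a four-variable model with that of a one-variable score is not like for like.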
We think that this note emphasizes the use of inappropriate statistical methods in analysing such data. The linear regression model has become very popular, maybe too popular, in the last few years, and the wide availability of statistical software has markedly increased its use in inappropriate settings.
1. Brun-Vezinet F, Descamps D, Ruffault A, Masquelier B, Calvez V, Peytavin G, et al., and the Narval (ANRS 088) Study Group. Clinically relevant interpretation of genotype for resistance to abacavir. AIDS 2003; 17:1795–1802.
2. Marcelin AG, Flandre P, Pavie J, Shmidely N, Wirden M, Lada O, et al. Clinically relevant genotype interpretation for resistance to didanosine. Antimicrob Agents Chemother 2005; 49:1739–1744.
3. Vora S, Marcelin AG, Günthard HF, Flandre P, Hirsch H, Masquelier B, et al., the Swiss HIV Cohort Study. Clinical validation of atazanavir/ritonavir genotypic resistance score in protease inhibitor-experienced patients. AIDS 2006; 20:35–40.
4. Brun-Vezinet F, Costagliola D, Ait Khaled M, Calvez V, Clavel F, Clotet B, et al. Clinically validated genotype analysis: guiding principles and statistical concerns. Antiviral Ther 2004; 9:465–478.
5. Larder BA, Wang D, De Wolf F, Lange J, Revell A, Wegner S, et al. Accurate prediction of virological response to HAART using three computational modelling techniques. Antiviral Ther 2006; 11:S177.
6. Larder BA, Revell A, Wang D, Harrigan R, Montaner J, Wegner S, Lane C. Neural networks are more accurate predictors of virological response to HAART than rules-based genotype interpretation systems. Poster presentation at: 10th European AIDS Conference/EACS. Dublin, Ireland, 17–20 November 2005.
7. Draper N, Smith H. Applied regression analysis. 2nd ed. New York: John Wiley and Sons; 1981.