With the availability of computers and easy-to-use software, the use of statistical methods in the medical literature is increasing. There is also a tendency toward more advanced methods, which menu-driven statistical software now offers out of the box. A simple MEDLINE search yielded 7,202 hits for “logistic regression” among original papers published in 2006, compared with only 1,514 in 1995, a 4.7-fold increase. Over the same period the total number of papers published increased by 67%. Clearly not all papers using logistic regression can be identified by a MEDLINE search, and logistic regression is likely to be mentioned even less often in indexed information as its use becomes more common.

Several papers have assessed its application and reporting in different fields of medicine.^{1–7} In obstetrics and gynecology, one analysis was published in 1999 and evaluated the years 1985, 1990, and 1995.^{2} In 1995, 6.5% of all papers published in four leading obstetrics and gynecology journals used logistic regression. Khan et al^{2} also noted several deficiencies in the quality of reporting for logistic regression, a finding consistent with results in other fields of medicine.^{3–5} Ten years later the fraction of papers using logistic regression is likely to be much higher; it is unclear, however, whether the quality of reporting has improved. We therefore performed an analysis of the use of logistic regression in the obstetrics and gynecology literature in 2005 and 2006.

#### MATERIALS AND METHODS

Based on the Thomson Scientific impact factor we selected the four journals covering the field of obstetrics and gynecology with the highest impact factors: *Acta Obstetricia et Gynecologica Scandinavica*, *American Journal of Obstetrics and Gynecology*, *British Journal of Obstetrics and Gynecology*, and *Obstetrics & Gynecology*. We did not include journals with a main focus on infertility treatment, such as *Fertility & Sterility* or *Human Reproduction*, despite their high impact factors, because they cover only selected areas of obstetrics and gynecology. We performed a manual search of all articles published in 2005 and 2006 and extracted information on the method of statistical analysis applied in each publication. The statistical methods were classified into six groups: logistic regression, linear regression, Cox proportional hazards regression, Poisson regression, other statistical method, or no statistical method. Some papers used several statistical methods. Whenever logistic regression was used, the paper was classified into the logistic regression group; in other cases the classification was based on the most detailed method. We compared the information about the use of logistic regression from the manual search with the numbers retrieved by a MEDLINE search of these journals, to assess whether its use is always reflected in the indexed information.

The goal was to evaluate approximately 100 articles in detail. We randomly selected 13 articles that used logistic regression from each journal and year, for a total of 104 articles. To assess the quality of logistic regression reporting in the selected research articles, we used criteria that follow common guidelines and that have been used in previous research^{2,3,5} (Table 1):

1. *Criteria for variable selection* Variable selection can seriously affect the model estimates, and thus the process of constructing the final model should be explicitly described. Independent variables can be selected for the model on the basis of previously demonstrated clinical or biological importance or of statistical significance in bivariate analysis. We considered the selection of variables for inclusion in the logistic regression model, not for optimization of the model. We classified articles as fulfilling this reporting criterion if the variable selection process was described or the list of variables was mentioned in the Materials and Methods section.

2. *Coding of independent variables* The coding of independent variables is important for the interpretation of results. For continuous variables the units have to be reported, because the odds ratio expresses the change in odds per unit. If the original variables were recoded during analysis, the changes should be clearly described (and justified if necessary). We examined whether articles reported the units of continuous variables as well as the levels of all interval, ordinal, and binary variables. If the information was missing or incomplete for any of the variables, the publication was classified as not fulfilling the criterion.

3. *Ratio of outcome events to independent variables* Although the number of outcome events required for multivariable regression has been debated in the literature and there is no absolute rule, most authors recommend a ratio of at least 10 events per independent variable.^{4,8} When this ratio falls below 10, the estimates of the regression coefficients become unstable and possibly biased. Models with too many variables for the number of outcome events (or cases in a case-control study) can display a falsely close fit to the data, a phenomenon called overfitting.^{5} Articles were classified as fulfilling this criterion if the ratio was at least 10:1 for the final model. When several outcomes were examined in one paper, we based the classification on the lowest ratio.

4. *Conformity to linear gradient* When continuous or discrete ranked variables are included as independent variables in a logistic regression model, a linear relationship on the logit scale is implicitly imposed upon the model. Whereas small departures from this relationship might not affect the model strongly, U-shaped or J-shaped relationships (so called because of the form of their association with the outcome) can remain undetected. As a result, the model is misspecified and leads to false inferences. Several strategies have been proposed to avoid this problem: converting the variable in question into a categorical variable and assessing the changes in odds ratios across the categories; including higher-order polynomial terms (such as x^{2} or x^{3}) in the model; or testing the significance of the term x*ln(x) in the model.^{9} We assessed whether any of these strategies was used to test for a linear gradient among articles that included at least one variable treated as continuous in the model.

5. *Interactions* Interaction is the statistical term corresponding to the concept of effect modification. Effect modification occurs when the effect of a risk factor on an outcome differs depending on the value of another variable.^{10} Effect modification can be a serious problem (eg, given a strong association in one group and no association in another, an estimate ignoring this difference is biased for both groups). Effect modification can be examined by stratified analyses, but stratified analyses alone can only illustrate the difference; they do not allow conclusions about whether the effects differ statistically significantly. Whereas some data sets might be too small for meaningful detection of interactions, testing for interactions ensures that the model is well specified. To fulfill the interactions criterion, articles had to include an explicit test for interactions. Articles that explained why no interaction term was included in the final model were also classified as fulfilling the criterion.

6. *Collinearity* In contrast to the assessment of interactions, testing for multi-collinearity is a purely technical step in the analysis. The term collinearity is borrowed from linear regression; in the simple case of just two variables, it can be assessed by testing for correlation, whereas more complex collinearity requires other methods. Multi-collinearity is present when the information contained in one variable is totally or almost totally explained by the remaining variables or a subset of them. The inclusion of strongly correlated variables in a regression model can result in biased estimates and corresponding significance tests. Strong correlation can also affect the variable selection in stepwise models. Therefore, multi-collinearity should be tested for before the logistic regression is performed. Suitable methods are described, for example, in the textbook by Allison.^{11} We classified articles as fulfilling this criterion if they explicitly reported performing tests for collinearity (although tests for multi-collinearity would be required, we also accepted tests for simple collinearity).

7. *Statistical significance* We classified articles as fulfilling the criterion for reporting statistical significance if they reported *P* values, confidence intervals, or both for each variable's coefficient in the final model. Although sole use of *P* values should be discouraged and current recommendations advocate reporting both quantities,^{12} we did not evaluate whether confidence intervals were always reported, only whether either statistic was provided.

8. *Goodness of fit* Goodness-of-fit measures for logistic regression have generated some discussion over time, and the available solutions are not fully satisfactory.^{9,11} Several methods exist, including the Hosmer-Lemeshow goodness-of-fit summary statistic and equivalents of *R* ^{2} from linear regression. Sometimes goodness of fit is reported in terms of the predictive performance of the model, based on receiver operating characteristic curves or the fraction of correct predictions. Goodness of fit serves two purposes: first, it indicates whether or not the model is seriously misspecified, and second, it summarizes how well the variables included in the model explain the phenomenon under study. Articles that plainly reported using such measures met this criterion.

9. *Validation of the model* Strictly speaking, validation is necessary only in the context of predictive models and should be performed using external evidence.^{13} However, there is a broader understanding of validation that also applies to other uses of logistic regression.^{10} In small samples, or when some combinations of variables are rare, the estimates obtained from logistic regression models can be biased, especially when variable selection followed statistical significance criteria.^{14,15} Usually the goal of an analysis is not just to obtain information about the studied sample, but also to generalize to the population from which the sample was drawn. In a new sample from the same population, the same model would most likely yield different estimates, and the more extreme odds ratios from the first sample, based on its specific data configuration, are especially likely to move toward 1 in the second sample. This phenomenon is called the optimism of the original estimates and reflects a form of bias inherent in statistical modeling.^{16} Thus, for valid statistical inference, the stability of the model estimates should be assessed. When external validation is not possible, at least internal validation should be performed.^{15,17,18} The best-known methods include split-half validation, cross-validation, and bootstrapping. We classified the articles according to whether they mentioned any form of validation.
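Criterion 2 rests on a simple algebraic fact: because logistic regression is linear on the log-odds scale, an odds ratio reported per 1 unit of a continuous variable implies OR raised to the power k for a k-unit change. A minimal sketch with hypothetical numbers:

```python
def rescale_odds_ratio(or_per_unit: float, k: float) -> float:
    """Odds ratio for a k-unit change, given the per-unit odds ratio.

    A k-unit change multiplies the log odds by k, so the odds ratio is
    raised to the power k."""
    return or_per_unit ** k

# Hypothetical example: an odds ratio of 1.05 per additional week of
# gestation corresponds to about 1.22 per 4 weeks (1.05**4).
or_per_4_weeks = rescale_odds_ratio(1.05, 4)
```

This is why the units of a continuous variable must be reported: the same model yields very different-looking odds ratios depending on the unit chosen.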
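The 10:1 rule of thumb in criterion 3 is a straightforward ratio check; a sketch with hypothetical counts:

```python
def events_per_variable(n_events: int, n_predictors: int) -> float:
    """Ratio of outcome events to independent variables in the model."""
    if n_predictors < 1:
        raise ValueError("the model must contain at least one predictor")
    return n_events / n_predictors

# Hypothetical final model: 84 outcome events and 7 independent variables.
epv = events_per_variable(84, 7)    # 12 events per variable
meets_rule_of_thumb = epv >= 10     # satisfies the 10:1 recommendation
```

Note that the relevant numerator is the number of outcome events (or cases in a case-control study), not the total sample size.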
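The first strategy listed under criterion 4, categorizing the continuous variable and inspecting how the odds change across categories, can be sketched in a few lines. This is an illustrative pure-Python sketch on hypothetical data; a real analysis would enter the categorized variable into the model itself.

```python
def odds_by_category(values, outcomes, n_bins=4):
    """Split a continuous predictor into equal-sized ranked bins and return
    the odds of the outcome (events / non-events) in each bin. Roughly
    constant multiplicative steps between bins support a linear gradient on
    the logit scale; a U- or J-shaped pattern argues against it."""
    paired = sorted(zip(values, outcomes))
    size = len(paired) // n_bins
    odds = []
    for b in range(n_bins):
        chunk = paired[b * size:(b + 1) * size] if b < n_bins - 1 else paired[b * size:]
        events = sum(y for _, y in chunk)
        odds.append(events / (len(chunk) - events))
    return odds

# Hypothetical data: the risk rises steadily with the predictor.
values = list(range(40))
outcomes = [1] + [0] * 9 + [1, 1] + [0] * 8 + [1] * 4 + [0] * 6 + [1] * 5 + [0] * 5
odds = odds_by_category(values, outcomes)   # increases from bin to bin
```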
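For the simple two-variable case mentioned under criterion 6, collinearity can be screened with an ordinary correlation coefficient. A pure-Python sketch with hypothetical, nearly collinear predictors:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical near-collinear pair of candidate predictors:
gestational_age = [1, 2, 3, 4, 5]
birth_weight = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(gestational_age, birth_weight)   # close to 1: flag before modeling
```

Pairwise correlation only covers the simple case; multi-collinearity involving more than two variables requires other diagnostics, such as variance inflation factors.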
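Criterion 7 asks only that *P* values or confidence intervals be reported. For a single 2×2 table, the Wald interval on the log-odds scale is easy to reproduce (hypothetical counts):

```python
from math import exp, log, sqrt

def or_with_ci(a, b, c, d, z=1.96):
    """Odds ratio from a 2x2 table with a Wald 95% confidence interval.

    The standard error of ln(OR) is sqrt(1/a + 1/b + 1/c + 1/d); the
    interval is computed on the log scale and exponentiated back."""
    or_ = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)

# Hypothetical table: 30 events and 20 non-events among the exposed,
# 15 events and 35 non-events among the unexposed.
or_, lo, hi = or_with_ci(30, 20, 15, 35)   # OR 3.5; the interval excludes 1
```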
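The Hosmer-Lemeshow statistic named under criterion 8 compares observed with expected event counts across groups of ranked predicted probabilities. A simplified sketch on toy data (a real test typically uses 10 groups and compares the statistic against a chi-square distribution):

```python
def hosmer_lemeshow(probs, outcomes, groups=10):
    """Hosmer-Lemeshow chi-square statistic: rank observations by predicted
    probability, split them into groups, and sum (observed - expected)^2
    scaled by the expected binomial variance within each group."""
    paired = sorted(zip(probs, outcomes))
    size = len(paired) // groups
    stat = 0.0
    for g in range(groups):
        chunk = paired[g * size:(g + 1) * size] if g < groups - 1 else paired[g * size:]
        n = len(chunk)
        observed = sum(y for _, y in chunk)
        expected = sum(p for p, _ in chunk)
        p_bar = expected / n
        stat += (observed - expected) ** 2 / (n * p_bar * (1 - p_bar))
    return stat

# Perfectly calibrated toy predictions give a statistic of 0.
stat = hosmer_lemeshow([0.25] * 4 + [0.75] * 4, [0, 0, 0, 1, 0, 1, 1, 1], groups=2)
```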
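Of the internal validation methods named under criterion 9, bootstrapping is the simplest to sketch: resample subjects with replacement and watch how much the estimate moves. The sketch below uses hypothetical data and a single 2×2-table odds ratio; a full internal validation would refit the entire model in each resample and estimate the optimism of its performance.

```python
import random

def bootstrap_odds_ratios(exposures, outcomes, n_boot=200, seed=7):
    """Resample subjects with replacement and recompute the 2x2-table odds
    ratio each time, giving a rough picture of the estimate's stability."""
    rng = random.Random(seed)
    n = len(exposures)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        a = sum(1 for i in idx if exposures[i] and outcomes[i])
        b = sum(1 for i in idx if exposures[i] and not outcomes[i])
        c = sum(1 for i in idx if not exposures[i] and outcomes[i])
        d = sum(1 for i in idx if not exposures[i] and not outcomes[i])
        if a and b and c and d:              # skip degenerate resamples
            estimates.append((a * d) / (b * c))
    return estimates

# Hypothetical cohort of 100 subjects; the full-sample odds ratio is 3.5.
exposures = [1] * 50 + [0] * 50
outcomes = [1] * 30 + [0] * 20 + [1] * 15 + [0] * 35
boot_ors = bootstrap_odds_ratios(exposures, outcomes)
```

The spread of the bootstrap estimates around the original value indicates how much optimism to expect in a new sample from the same population.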

The primary criteria were evaluated as not fulfilled (0), fulfilled (1), or not applicable (2). Not applicable was used only for the linear gradient criterion, in articles in which there were no continuous independent variables or all continuous variables had been categorized. When the criteria were only partially fulfilled or the information was not clear, we classified them as not fulfilled. Evaluation of the criteria was performed independently by two authors (R.T.M. and A.D.). Cases of disagreement were reviewed and a consensus decision was made. The fractions fulfilling the criteria were calculated separately for each journal and combined into a joint weighted estimate according to the number of articles using logistic regression in each journal. Differences between the journals were tested with the Pearson χ^{2} test. Furthermore, a simple additive score based on the nine evaluated criteria was calculated, and the Wilcoxon test was used for comparisons between the journals.
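The joint weighted estimate described above can be sketched as follows (the per-journal fractions and article counts are hypothetical):

```python
def weighted_percentage(fractions, weights):
    """Combine per-journal fractions fulfilling a criterion into one joint
    estimate, weighting each journal by its number of articles that used
    logistic regression."""
    total = sum(weights)
    return 100 * sum(f * w for f, w in zip(fractions, weights)) / total

# Hypothetical: four journals fulfill a criterion at 60-90%, weighted by
# their counts of logistic regression articles.
pct = weighted_percentage([0.6, 0.7, 0.8, 0.9], [120, 150, 180, 274])
```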

We also extracted further information from the papers: the number of authors, the sample size, the use of multiple analyses in one paper, the inclusion of information about missing values, and whether or not the study was multi-center. The association between the number of authors and the quality score was assessed by the Spearman correlation. Additionally, we recorded qualitative information about common problems regarding the fulfillment of the criteria.

#### RESULTS

Of 2,234 articles printed in 2005 and 2006 in all four journals, 34.2% (724) used logistic regression as a method of statistical analysis. The proportion varied by journal: 25% in *British Journal of Obstetrics and Gynecology*, 26% in *Acta Obstetricia et Gynecologica Scandinavica*, 30% in *American Journal of Obstetrics and Gynecology*, and 42% in *Obstetrics & Gynecology* (Table 2). A MEDLINE search using the term “logistic regression” in these journals over the same period retrieved only half of the papers identified by the manual search. Linear regression was the second most common regression method, used in 8.6% of all articles during the 2-year span. Cox proportional hazards regression was used in 2.6% of articles, and Poisson regression in less than 1%. Other statistical methods, including descriptive statistics, *t* tests, the Wilcoxon rank-sum test, χ^{2} tests, analysis of variance, kappa statistics, and Kaplan-Meier life table analysis, were the primary means of analysis in 53.3% of articles. No statistical method was reported in 4.6% of articles.

Of all articles that used logistic regression, 104 were randomly selected for detailed analysis (13 articles from each journal for each year). The list of the papers and evaluation scores can be obtained from the authors upon request. The number of authors of these papers ranged from 1 to 14, with a median of 4. Sample size ranged from 32 to 5 million, with a median of 746 subjects. Just under half of the articles (45.2%) included information on missing values for independent variables, and over half of the studies (56.7%) reported data from multiple centers. In 15% of the papers the logistic regression model was declared predictive (the term “predictive model” was used or variables were referred to as “predictors”).

There were no significant differences between the journals in the percentage of articles meeting any given criterion. Minor differences in the reporting of variable selection were observed but were not statistically significant. The most commonly fulfilled criteria were statistical significance (weighted percentage: 96%), coding of variables (83%), and variable selection (76%) (Table 1). The criterion for the ratio of outcome events to independent variables was moderately often fulfilled (43%). In the majority of articles the following criteria were insufficiently reported: interactions (18%), linear gradient of continuous variables (9% of papers that included continuous variables in the analysis), goodness of fit (3.6%), multi-collinearity (0.46%), and validation of the model (0%). In 11 articles (11%), further variable selection was performed after the selected variables had been entered into the model, in most cases using a backward selection method. The maximum number of fulfilled criteria was eight (out of the highest possible score of nine), and the minimum was zero. Nine percent of the articles fulfilled six or more criteria; the majority fulfilled three to five (74.1%). There were no differences between the journals with respect to the sum score. There was an inverse relationship between the number of authors and the quality score (Spearman rho=–0.2, *P*=.044).

We also noticed that sometimes, in addition to the variables mentioned explicitly in the results table, additional variables such as age, parity, and race were listed in a note as included in the adjusted model, but their units or levels were not given. This caused problems for several criteria, including the ratio of outcome events to independent variables, conformity to a linear gradient, and coding of variables. Some articles failed to report explicitly the number of outcome events examined by logistic regression, and the numbers could be reconstructed only after extensive calculation. In some articles logistic regression was not the main method but was used only marginally, which might contribute to the low scores of these articles.

#### DISCUSSION

The fraction of articles using logistic regression as a means of statistical analysis in the obstetrics and gynecology literature has increased from 6.5% of all articles published in the four journals in 1995^{2} to 34.2% in 2005 and 2006. Although several of the criteria evaluated in this study were fulfilled in most papers (testing for statistical significance, coding, and variable selection), others were only rarely met (conformity to linear gradient, interactions, goodness of fit, collinearity testing, and validation of the model). Reporting of the coding of independent variables improved, from 63.8% of articles fulfilling this criterion in 1995 to 83.0% in 2005 and 2006. The fulfillment of the other criteria (conformity to a linear gradient, interactions, and goodness of fit) did not differ between this study and that of Khan et al.^{2}

The most commonly fulfilled criteria are very basic, and their fulfillment is clearly a precondition for publication. When they were not fulfilled, the failure was often partial, such as omitting the coding of one variable while reporting it correctly for the others. Less often fulfilled criteria concerned technical issues such as testing for a linear gradient, testing for multi-collinearity, and reporting goodness of fit. These criteria might not be perceived as very substantial; nevertheless, in certain situations their violation might invalidate the analysis. For example, testing for a linear gradient might reveal nonlinear associations, which would be of substantial interest in themselves.

The inadequacy and controversy related to the assessment of interaction/effect modification are long-standing and have been discussed at length elsewhere.^{9,19} Reservations toward the assessment of interactions include the inconvenience of reporting them and the difficulty of determining which interactions should be tested.^{10} Although these are reasonable concerns, one could argue that testing for interactions should be performed to ensure that the final model is correctly specified, because a strong interaction would invalidate the reported estimates. A statistically and clinically significant interaction can be a potentially important finding.^{20} Clearly, there is an issue of multiple testing, and thus caution in testing interactions is advised.^{9,19} An interaction can also be statistically significant but weak and not clinically important; in such cases, interpretation rules similar to those for minimal, albeit significant, findings in studies with very large sample sizes apply.

Validation is a technical issue with implications for the substantive conclusions. Smaller samples are especially prone to yield extreme results, but rare combinations of specific values can also occur in larger samples, and addressing the optimism of the estimates therefore seems justified.^{10,21} Validation was not performed in any of the 16 articles declaring predictive use of logistic regression. In some of these articles, the authors might have used the term *predictors* merely as a loose equivalent of “factors associated with.” Formal validation is necessary to assess the performance of predictive models but can, in general, provide additional insight into the stability of the estimates. Techniques like bootstrapping are not a general solution,^{17,18} but they are probably still underused, and standards in this area have yet to be developed.^{14}

The quality of reporting, surprisingly, decreased as the number of authors increased. One would expect larger research groups to be more likely to include authors with statistical or methodological competence, which has been found to improve the quality of reporting of statistical analyses.^{22} This was not the case in our study. The cause of this phenomenon is likely to be complex; in any case, attention to the statistical methods by all of an article's authors is warranted.

Our study has some limitations. First, we analyzed only articles from four leading journals in the field of obstetrics and gynecology, which might not reflect the current standard across the field as a whole. Second, some of the reporting criteria are not universally accepted as the standard, and authors might feel less obligated to fulfill them, although all previous evaluations of reporting quality have used them consistently.^{2–5} We do not imply that the analyses failed to consider any of the criteria; we focus solely on whether they were reported. There might also be some uncertainty about whether some of the criteria applied in a specific case, or whether the authors had good reasons to make exceptions, but an evaluation taking each specific situation into account would be difficult and potentially subjective.

In conclusion, despite the common use of logistic regression in the obstetrics and gynecology literature, several deficiencies in reporting quality remain. Authors should pay more attention to the limitations of small data sets with regard to the number of variables included in the analysis. Testing for multi-collinearity and for a linear gradient should be reported in the Materials and Methods section. Goodness of fit and validation should be considered as supplementary information in the Results section, and interactions (or the reasons for not assessing them) should be addressed in general. Formal validation is necessary for predictive models, and authors should pay more attention to the distinction between predictive and associative models. Important technical aspects should be reported in the paper itself. Many journals provide detailed checklists for the submission process; these could include criteria for statistical methods to help authors standardize the reporting of their analyses.^{23} Although the use of statistical methods is increasing, their reporting quality needs to be enhanced.