
In Search of a Pony: Sources, Methods, Outcomes, and Motivated Reasoning

Stone, Marc B., MD

doi: 10.1097/MLR.0000000000000895
Blog 2017

It is highly desirable to be able to evaluate the effect of policy interventions. Such evaluations should have expected outcomes based upon sound theory and be carefully planned, objectively evaluated and prospectively executed. In many cases, however, assessments originate with investigators’ poorly substantiated beliefs about the effects of a policy. Instead of designing studies that test falsifiable hypotheses, these investigators adopt methods and data sources that serve as little more than descriptions of these beliefs in the guise of analysis. Interrupted time series analysis is one of the most popular forms of analysis used to present these beliefs. It is intuitively appealing but, in most cases, it is based upon false analogies, fallacious assumptions and analytical errors.

US Food and Drug Administration, Silver Spring, MD

This article reflects the views of the author and should not be construed to represent FDA’s views or policies.

The author declares no conflict of interest.

Reprints: Marc B. Stone, MD, US Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD 20993. E-mail:

Sequential analysis is a technique used in pharmacovigilance that seems to have some value in detecting changes in the incidence of adverse events more rapidly than other methods without an excessive increase in false positive findings. It is intended to work prospectively; the statistical assessment is updated as events occur and are reported until a critical value is reached that indicates that the current rate of reporting is significantly different from preceding observations. This technique is concerned solely with the rate of occurrence of the event of interest; it tells us nothing by itself about what may have caused a change in the event rate. As evidence for a cause, it is purely circumstantial. To credibly associate a cause with the observed change, the circumstances must be prespecified and highly specific in either timing or outcome, and probably both; otherwise any other plausible explanation would have comparable merit. These requirements have limited the use of this technique largely to looking for adverse events as an immediate consequence of the introduction of a new drug or vaccine into a population. In this situation, observation is limited to patients who are known to be exposed to the drug or vaccine in question; the period of observation for an increased incidence of adverse events is purposely limited in order to greatly reduce the possibility of another cause coincidentally creating an observed increase in adverse events.
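The prospective logic just described can be reduced to a short sketch. The following is a deliberately simplified Poisson version of a sequential log likelihood ratio monitor; the rates, counts, critical value, and function name are all invented for illustration, and operational systems (eg, the maxSPRT used in vaccine safety surveillance) are considerably more elaborate:

```python
import math

def sequential_llr_monitor(observed_counts, expected_rate, elevated_rate, critical_value):
    """Simplified Poisson sequential test: after each reporting period,
    update the cumulative log likelihood ratio comparing an elevated
    event rate (H1) against the historical baseline rate (H0).
    Returns the 1-based period at which the ratio first crosses the
    critical value, or None if it never does."""
    llr = 0.0
    for period, count in enumerate(observed_counts, start=1):
        # Poisson log-likelihood contribution of this period under each hypothesis
        llr += (count * math.log(elevated_rate / expected_rate)
                - (elevated_rate - expected_rate))
        if llr >= critical_value:
            return period  # signal: reporting rate significantly above baseline
    return None  # surveillance ends without a signal

# Illustrative use: a baseline of 2 events per quarter, tested against a doubled rate.
quarterly_counts = [2, 1, 3, 2, 6, 7, 8]
signal_at = sequential_llr_monitor(quarterly_counts, expected_rate=2.0,
                                   elevated_rate=4.0, critical_value=3.0)
```

With these invented numbers the monitor stays quiet while the counts hover near the baseline and signals only after several elevated quarters in a row, which is precisely the behavior that makes the technique attractive for prospective, rather than retrospective, use.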

In their paper, “Near real-time surveillance for consequences of health policies using sequential analysis,” Lu and colleagues attempt to apply sequential analysis to an assessment of what they describe as a policy intervention with complete indifference to the limitations of this technique. The authors indirectly acknowledge that their study is not an attempt at actual prospective surveillance, presenting it as a proof of concept that simulates “Near real-time surveillance.” To call this paper a proof of concept is akin to painting a bulls-eye over a bullet hole and calling it a proof of concept of target shooting.



It is hard to see how researchers in real-time could act prospectively in the manner simulated by Lu and colleagues. The “policy exposure” that Lu and colleagues chose was not the discrete implementation of a policy but a number of independent actions prompted at different times by different pieces of information: the issue of several advisories by the Food and Drug Administration (FDA) in 2003–2004, followed by a boxed warning that was recommended in late 2004, but not enacted until 2005. This process in fact continued through an expansion of the boxed warning in 2007. After which action would researchers recognize that a “policy” has been implemented and begin surveillance? Would not subsequent actions severely interfere with any further assessment of that “policy”? FDA actions were preceded and accompanied by public controversy involving news reports, journal articles, lawsuits, etc. How would they distinguish the effect of an FDA advisory from ongoing public controversy?



Sensible researchers would next be stymied by the need to prospectively identify a time frame in which an effect of the policy would be likely to have occurred and the outcome could not reasonably be attributed to other factors. The outcomes being measured are not at all specific to the policy; over time it is inevitable that other factors would lead to increases in rates of psychotropic drug poisoning or completed suicide. The rationale by which the authors justify their period of observation places no restriction on when an effect of the policy could be expected to occur. This open-endedness, combined with the broad range of time during which these policy measures took place and the failure to specify an expected time lag between the policy actions and their alleged effects, is unscientific; it does not allow for falsifiability. The literature that the authors cite as evidence of adverse consequences of FDA actions alleges that these reactions occurred in 2005 and earlier, but Lu and colleagues claim that any increase in either psychotropic drug poisoning or completed suicide at any time between 2003 and 2010 is a result of the policy. The authors’ results suggest that an increase in psychotropic drug poisonings among adolescents did occur, but not until the fourth quarter of 2007, 4 years after the initial advisories and 3 years after the recommendation for a boxed warning for antidepressants. They also claim an increase in completed suicides among adolescents in the first quarter of 2006, but their Figure 3 shows that this is likely a false positive: the log likelihood ratio drops sharply the next quarter and does not approach the critical value again until late in 2008. Contrast this with the authors’ findings for psychotropic drug poisoning in their Figure 2, where the likelihood ratio continues to rise after crossing the critical value.







Researchers would need to have immediate access to a data source that would register outcomes as they occur and allow their analysis in “near real-time.” Lu and colleagues did not use such a data source for their “proof of concept.” Not only is their data source retrospective and rather old (2000–2010), it is a secondary collection constructed from a consortium of health plans with differing periods and terms of participation. New data arrive not in real-time but in quarterly tranches. Researchers would also be concerned with generalizability; the data source would need to be representative of the population of interest. The authors’ dataset is constructed arbitrarily; it is essentially a convenience sample. The authors point out that their sample has some demographic resemblance to the general US population, but it is the method of construction that creates generalizability, not the similarity of a few demographic measures. I could probably find a minor league baseball team that resembled the New York Yankees in average height, weight, age, and ethnicity, but that would not mean that the minor leaguers are comparable players.



Researchers would also need to prespecify a limited number of outcomes to monitor. The limitation is necessary because of multiplicity concerns; the more outcomes that are considered, the greater the likelihood of type I error, detecting a false positive signal. They would have to decide in advance what outcomes were both relevant and likely sensitive to the policy. The principal outcome selected by Lu and colleagues was hospital admission or emergency room treatment for “poisoning by psychotropic agents” (ICD-9 code 969) as a proxy measure for suicide attempts. They selected this diagnosis because an analysis of external injury diagnoses (E codes) relevant to self-harm in their research network database showed E-code completeness varied across study sites, across treatment settings, and across years.1 The investigators did not validate this diagnosis as an indicator of suicide attempt in their database, but cited a published study2 that they claim validates the use of this code as a proxy for suicide attempts, saying that this proxy measure is preferable to use of cause-of-injury codes in settings where use of those codes varies over time. This claim is extremely misleading for several reasons:

  • The proxy that Lu and colleagues use in their analysis is not the one tested and validated in the Patrick paper. In this study, the population was divided into a development sample and a validation sample derived from databases where the use of E codes was largely complete. A large number of candidate proxies were examined for their correlation with suicide attempts as defined by E codes. Those that were promising, that showed high levels of sensitivity and specificity, were then tested in the validation sample. Poisoning by psychotropic agents was one of the candidate proxies that did not make the cut; the proxy measure used by Lu and colleagues literally lacks validation.
  • The Patrick paper reported poisoning by psychotropic agents to have a sensitivity of 38.3% and a specificity of 99.3%, but this is in the development sample. The reason that these types of studies require a validation sample is that candidate indicators tend to be overfitted to the development sample and generally do not perform as well out of sample. The high specificity is neither surprising nor meaningful: the vast majority of hospitalizations do not involve attempted suicide, and very few of these are given code 969; the number of true negatives dwarfs the potential number of false positives.
  • The Patrick study only analyzed hospital admissions. Lu and colleagues applied their algorithm to both hospitalizations and emergency department (ED) visits. Their algorithm is unlikely to perform similarly in patients seen in the ED and discharged as in those who are hospitalized. Patients who are seen in the ED for psychotropic drug poisoning and released are very different from those who are hospitalized. Patients with unintentional poisonings who are medically stable are sent home; those who attempt suicide are likely to be admitted to the hospital for further psychiatric evaluation even if they are medically stable. The Drug Abuse Warning Network (DAWN) was a public health surveillance system that monitored drug-related hospital ED visits and hospital admissions from 2004 to 2011 in order to report on the impact of drug misuse and abuse.3 It also identified suicide attempts involving drugs. It applied proper survey sampling methods so its estimates can be considered representative of the US population. According to DAWN, only 11% of ED visits involving psychotropic drugs were suicide attempts, compared with 41% of hospital admissions. Marijuana accounted for 35% of the psychotropic drugs involved in ED visits but only 5% of psychotropic drugs involved in suicide attempts. (These are numbers for the entire US population, all ages, because the authors make a general claim for the validity of the proxy.)
  • Lu et al claim they needed their proxy measure because E-code completeness varied over sites and time. This is not necessarily a problem for a study if what is being studied is a change in rates of attempted suicide: if the variability across time is uncorrelated across sites, the variability will tend to cancel out. Judging from their published report1 this may well be the case. Compared with complete data, the incomplete data will be noisier but unbiased. Furthermore, there is no evidence that their proxy is any better than relying upon E codes:
    • The Patrick paper did not investigate whether any of their proxy measures varied across sites or over time.
    • E-code incompleteness means that many suicide attempts are not identified. The proxy measure used by Lu and colleagues is not just incomplete; it is incomplete by design, because most suicide attempts do not involve poisoning with psychotropic agents. The reported sensitivity of 38.3%, which could be an overestimate because it comes from the development sample, means that 61.7% of suicide attempts are missed. This incompleteness may not be a problem if the ratio of psychotropic drug poisonings to suicide attempts is constant over time. If we compare annual data on the number of ED visits involving psychotropic drugs (from DAWN) with estimates from CDC of the annual rate of all-method attempted suicide4 we can see (Fig. 1) that there was a substantial increase in psychotropic drug ED visits between 2004 and 2011 and a far more modest increase in attempted suicide. In fact, once population growth is accounted for, the correlation of hospital admissions and ED visits for psychotropic drug misuse with suicide attempts between 2004 and 2009 was strongly negative (−0.72). Even if this correlation were positive across other intervals, the inconsistency and instability of this relationship across time periods would make the proxy especially unsuitable for sequential analysis. If psychotropic drug poisonings are not a valid proxy for attempted suicide in properly sampled and validated national surveys, there is no reason to believe they are any better in the data used by the authors. Interestingly, in the national data, there is a large increase in psychotropic drug poisonings in 2008 that was not accompanied by a comparable increase in attempted suicide but does correspond to the increase in psychotropic drug poisonings beginning in late 2007 that was detected in the sequential analysis.
  • ICD-9 code 969 does not require that the poisoning be a result of intentional self-harm; it need not be a suicide attempt. According to the DAWN data, the proportion of hospital admissions involving psychotropic drugs that was attempted suicide, the positive predictive value, averaged 41% from 2004 to 2011 and ranged from 33% to 46% on an annual basis. This range of variability is likely to be greater if shorter time intervals (eg, quarterly or monthly) are used, as would be the case with sequential analysis. As the positive predictive value is not stable, the incidence of code 969 could change substantially without any change in attempted suicide and the number of suicide attempts coded with 969 could change without a corresponding change in the overall incidence of code 969. Between 2004 and 2009, the correlation between the rate of hospital admission for misuse of psychotropic drugs and the all-method rate of attempted suicide was strongly negative (−0.85).
  • Similarly, the ratio of suicide attempts involving psychotropic drugs (DAWN data) to all suicide attempts (CDC data) averaged 14% but ranged from 11% to 17% on an annual basis. Again, because this proportion is not stable, substantial changes in the rate of suicide attempts involving psychotropic drugs may not be reflected in the all-method rate and vice versa. In fact, between 2004 and 2009, the correlation between the 2 rates was negative (−0.42); the psychotropic drug rate and the rate by other methods had a strong inverse correlation (−0.62).
  • Even if psychotropic drug poisoning were a generally satisfactory proxy for attempted suicide, it makes little sense to use it as a proxy when the concern is that FDA actions discouraged the prescription of antidepressants and other psychiatric care. Antidepressants are psychotropic drugs, as are antipsychotics (antipsychotics with indications for the treatment of depression also received the boxed warning). People who present with poisoning from antipsychotics or antidepressants cannot reasonably be considered to have been discouraged by FDA actions from receiving these drugs. According to DAWN, nearly half of the psychotropic drugs involved in suicide attempts were antidepressants (29%) or antipsychotics (17%).
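The arithmetic behind the point about true negatives is worth making explicit. In the sketch below, the sensitivity and specificity are the development-sample figures quoted above (38.3% and 99.3%); the 1% prevalence of suicide attempts among hospital encounters is an assumed round number, for illustration only:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Bayes' rule: the fraction of code-positive encounters that are
    true suicide attempts, given the base rate of attempts."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Development-sample figures from the Patrick paper; the prevalence is an
# assumed illustrative value, not an estimate from any dataset.
ppv = positive_predictive_value(sensitivity=0.383, specificity=0.993,
                                prevalence=0.01)
```

Even with 99.3% specificity, at a 1% base rate roughly two-thirds of code-positive encounters would be false positives; high specificity measured against an ocean of true negatives guarantees very little.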




Lu and colleagues promote sequential analysis as a method of analysis of policy outcomes because it could provide results with less data than an interrupted time series analysis. However, the problem with interrupted time series analysis is not lack of data; rather it is an intuition based upon false analogies, fallacious assumptions and analytical errors. This analysis usually involves fitting a line or other curve to a trend over time in a variable before a policy intervention and comparing subsequent observations of the variable to the extrapolation of the curve. Although it is perfectly fine to use a fitted curve to aid in the description of a trend, it is an entirely different matter to turn the curve into a prediction:

  • Epidemiology is not physics; Newton’s laws of motion do not apply. Trends in epidemiology:
    • Are not obliged to follow mathematically specified trajectories.
    • Have no intrinsic momentum. It cannot be assumed that trends will continue by themselves; there must be reasons for continuation. A valid prediction requires better explanatory variables than the simple passage of time.
    • Unlike in space, have natural limits. For example, if the trend of interest is the increasing prevalence of antidepressant users in the population or proportion of the population diagnosed with depression, prevalence cannot exceed 100% and the rate of increase would certainly decline or cease well before prevalence approached 100%.
  • The null hypothesis is overly specific. The tested null hypothesis is not that the past trend continued unchanged after the policy intervention. The analysis discards all possible characterizations of the trend except one, and then tests whether the postpolicy observations match the specific values predicted. It is a test of the prediction, not the policy.
  • As this method is almost always applied retrospectively, when postpolicy data have been collected, cherry-picking can easily occur. For example, Bridge et al5 extrapolated a trend line in adolescent suicide rates from 1996 to 2003 and showed that these rates were higher than trend in 2004 and 2005. This trend, however, was heavily influenced by a decline in suicide rates from 1996 to 1999. Begin the trend line in 1999 and suicide rates in 2004 and 2005 are right with the trend.
  • An apparent change in trend may be nothing more than regression toward the mean. Even if a trend has theoretical justification for being linear (or curvilinear), overfitting to statistical noise will cause the predicted trend to differ from the true trend. Figure 2 shows a hypothetical example. The red line shows the true trend, the blue line is fitted to the prepolicy data, and the dashed line is its extrapolation. Whenever there is any difference between the estimated slope and the true slope, the difference between the observed and expected rates will grow larger in later observations and the more postpolicy observations that are available, the greater likelihood that the deviation from the extrapolated line will be calculated to be statistically significant. There are statistical techniques that avoid this problem but most authors (and reviewers) are unaware that this is a problem. In a 2014 paper6 Lu and colleagues applied interrupted time series analysis to the same dataset and chose the best fit to prepolicy data among a line and other curves, which makes the overfitting problem worse and creates multiplicity issues.
  • Demonstrating a change in trend before and after an intervention does not mean that the intervention is the reason for the difference:
    • When a line (or other mathematically defined curve) is used to represent a trend, any variables that are intended to represent the effects of the intervention will act to correct for non-“linearity” throughout the entire time series rather than specifically reflect any effect from the intervention. When a trend is roughly monotonic but not truly linear, a linear trend line will show statistical significance even though the fit is poor. Just because a linear model cannot be improved upon by adding quadratic or cubic terms does not mean that the trend is, in fact, linear. When a “linear” fit is poor because of non-“linearity” in the data, the fit will be substantially improved by dividing the time series at any point and applying separate “lines” to the 2 parts (Fig. 3).
    • If a true change occurs at any point in a time series, even if it does not correspond to the intervention, dividing the series into before and after periods at the point of the intervention absorbs the change and gives the erroneous impression that the change coincided with the intervention. Figure 4 shows another hypothetical example. The trend line for 2000–2004 is clearly different than the trend line for 2005–2010 but the change actually occurs in 2007. A similar result can be seen in the interrupted time series analysis of Lu and colleagues. The rate of psychotropic drug overdoses among both adolescents and young adults is not much different from the prewarning trend until 2007; this is essentially what is shown in their sequential analysis. By anchoring the interruption to 2005, Lu and colleagues were essentially assuming what they were claiming to prove.
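The pitfalls listed above can be demonstrated with a small simulation. The series below is invented: a gently accelerating quadratic with no interruption anywhere. Splitting it at an arbitrary “policy” year and fitting two separate lines nonetheless reduces the residual error, exactly as described for non-“linear” trends:

```python
def ols_sse(xs, ys):
    """Sum of squared residuals from a simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# A smooth, gently accelerating trend with NO interruption anywhere.
years = list(range(2000, 2011))
rates = [0.02 * (y - 2000) ** 2 + 1.0 for y in years]

single_fit = ols_sse(years, rates)
# Split the series at an arbitrary "policy" year and fit two separate lines.
split = years.index(2005)
two_line_fit = (ols_sse(years[:split], rates[:split])
                + ols_sse(years[split:], rates[split:]))
```

Because the two-segment fit can never do worse than the single line, an improved fit after splitting is evidence of nothing in particular; only a change whose location and behavior were prespecified can carry inferential weight.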




By presenting their sequential analysis as a proof of concept, Lu and colleagues are testing whether the analysis can detect an event that they know to have occurred when they know it to have occurred. If they do not have this knowledge, it is impossible to know whether the findings of the analysis, whether positive or negative, are true or false. They are treating their allegations that FDA actions had these specific results as incontrovertible fact. As I have already noted, the evidence they cite, as well as their own interrupted time series analysis, claims that the effect of the policy occurred several years earlier than what is shown in the sequential analysis. Even if we can somehow be persuaded that the findings are consistent with that evidence, does the evidence justify the certainty implied by the actions of Lu and colleagues? Their sources make 2 claims:

  • Most of the sources refer to a decline in antidepressant prescriptions written for adolescents in 2005. This did in fact occur, but there is serious reason to question whether the decline was due to FDA actions and whether such a decline should be expected to result in an increase in suicide attempts or attempted suicide.
    • The sources cited by Lu et al are striking examples of confirmation bias at work. They all assume that FDA actions were the cause of a decline in antidepressant prescribing to adolescents in 2005 without considering any other possible explanation, although many other explanations are plausible. For example, the market share represented by generic drugs rose from 40% to over 60% between 2004 and 2006.7 The increasing availability and market share of generic versions of SSRI antidepressants undoubtedly had an effect on the marketing strategy of brand name manufacturers. Annual promotional expenditures by antidepressant manufacturers declined by about one-third, or about $800 million, between 2004 and 2006.8 Pharmaceutical manufacturers have spent tens of billions of dollars on marketing and promotion for decades; it is hard to believe that they would have continued to do so without hard evidence that these expenditures increased sales by considerably more than they cost. In contrast, critics of FDA actions have no evidence for causation other than post hoc ergo propter hoc.
    • The change in industry promotional practices explains much more than the decline in antidepressant use among children and adolescents. Figure 5 shows trends in the prevalence of antidepressant use in the United States between 2002 and 2013 by age and sex. FDA actions were concerned only with an increased risk of suicidal thinking and behavior in children and adolescents; if they had an effect, the decline in antidepressant use should have been limited to young people. The FDA-action explanation cannot account for the fact that antidepressant use in the United States also declined from 2004 to 2005 in every age and sex subgroup through age 54 and increased in every subgroup aged 55 years and older. If, somehow, the reaction to FDA actions was so strong that it caused antidepressant prescribing to drop not just for children and adolescents but for every age group through age 54, why did prescription rates not also drop, or at least plateau, among adults 55 and older? Again, promotional activities provide a possible explanation: the Medicare Part D prescription drug benefit was being implemented at the same time, which may have prompted branded drug makers to refocus their still considerable antidepressant marketing efforts on older age groups, those eligible or soon to be eligible for Medicare.
    • As I have written before,10 the idea that the incidence of suicide should be inversely related to the prevalence of antidepressant use has intuitive appeal but is logically unsound. It is also empirically untrue. Among data sources that can provide more direct and valid measures of youth suicide and suicidal behavior, including CDC WISQARS and the Youth Risk Behavior Survey, there is no evidence of an increase over 2004 levels occurring before 2009.11 Figure 5 also shows national suicide rates by age and sex. They do not show an inverse relationship between the proportion of the population taking antidepressants and the suicide rate. Among the 220 year-to-year changes, the suicide rate and the prevalence of antidepressant use move in opposite directions in less than half (94, p=0.88 for an inverse relationship). Thirteen of the 20 age and sex subgroup time series shown in Figure 5 have a positive correlation of +0.50 or more, while only 5 show a negative correlation of ≤−0.50. Changes in antidepressant use would appear, if anything, to be a response to changes in the prevalence of depression and risk of suicide in the population, not a cause.
  • The second claim comes from a paper,12 which is a masterpiece of procrustean reasoning:
    • The paper first presents an interrupted time series analysis using annual data anchored between 2004 and 2005 that shows the prevalence of the diagnosis of depression in the pediatric population to be at trend from 1999 through 2004 but below trend in 2005. This analysis not only has all of the flaws that I have already described but is based upon a single postintervention data point.
    • The article then presents a second interrupted time series analysis (Fig. 6) that shows the proportionate distribution of diagnoses of depression in the pediatric population among 3 medical specialties. Instead of interrupting the series between 2004 and 2005, as in the first analysis, the authors divided the series at October 2003, 15 months earlier. They ignore the lack of change or a modest increase in 2 of the 3 specialty groups and focus solely on the change in the third group, which appears to be due to the absorption of a decline that began in November 2002, a year before the division point. Even without these flaws, this analysis is nonsensical for 2 reasons:
      • The data give the proportion of diagnoses by specialty, not the absolute number. The 3 specialties shown in the figure account for about 60% of the total. As the overall total must be 100%, there is a complementary group of medical specialists, not shown in the figure, for which an interrupted time series analysis would show the exact opposite of what is described by the authors.
      • According to the first analysis, the rate of diagnosis of depression was increasing between 2003 and 2004. As the overall number of diagnoses is increasing, any change in the proportion of diagnoses by specialty does not reflect a decline in the rate of diagnosis by any group of specialists; rather it shows differences in the rate of increase among these groups.
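For concreteness, the year-to-year tally described above, counting the changes in which antidepressant use and the suicide rate move in opposite directions, can be sketched as follows; the 2 short series are invented for illustration and are not the Figure 5 data:

```python
def opposite_direction_changes(series_a, series_b):
    """Count the year-to-year changes in which the two series move in
    opposite directions (one up while the other is down)."""
    opposite = 0
    total = len(series_a) - 1
    for i in range(total):
        da = series_a[i + 1] - series_a[i]
        db = series_b[i + 1] - series_b[i]
        if da * db < 0:  # strictly opposite signs
            opposite += 1
    return opposite, total

# Invented illustration: prevalence of antidepressant use (%) and suicide
# rate (per 100,000) in one hypothetical subgroup. Both rise together.
use = [8.1, 8.4, 8.2, 8.9, 9.0, 9.3]
suicide = [4.0, 4.2, 4.1, 4.4, 4.5, 4.6]
opposite, total = opposite_direction_changes(use, suicide)
```

Applied across all 20 subgroup series, a tally of this kind is what yields a count such as the 94 of 220 opposite-direction changes cited above.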




President Ronald Reagan liked to tell a story to illustrate the difference between a pessimist and an optimist. A pessimistic child is taken to a room full of brand new toys. She shrugs her shoulders and says, “If I start playing with these toys, they’ll just end up breaking.” An optimistic child is taken to a room filled with manure. He dives into the manure pile and starts digging, shouting, “There’s got to be a pony in here somewhere!” (Lest anyone be offended, let us remember that President Reagan considered himself to be an optimist.) Lu et al and other critics may not be optimists, but their approach to the question of the impact of FDA actions concerning the use of antidepressants in young people shows a similar unshakeable conviction built upon weak evidence (and an absence of effort to distinguish equine from bovine). They have adopted methods and data sources that serve as little more than descriptions of their beliefs in the guise of analysis. It is highly desirable to be able to evaluate the effect of policy interventions. Such evaluations, however, should not be motivated by credulousness and sensationalism. They should have expected outcomes based upon sound theory and be carefully planned, objectively evaluated, and prospectively executed.



1. Lu CY, Stewart C, Ahmed AT, et al. How complete are E-codes in commercial plan claims? Pharmacoepidemiol Drug Saf. 2014;23:218–220.
2. Patrick AR, Miller M, Barber CW, et al. Identification of hospitalizations for intentional self-harm when E-codes are incompletely recorded. Pharmacoepidemiol Drug Saf. 2010;19:1263–1275.
3. Substance Abuse and Mental Health Services Administration, “Drug Abuse Warning Network (DAWN).” 2011. ED Excel Files—National Tables. Available at: Accessed July 3, 2017.
4. Centers for Disease Control and Prevention. WISQARS (Web-based Injury Statistics Query and Reporting System). Available at: Accessed August 20, 2014.
5. Bridge JA, Greenhouse JB, Weldon AH, et al. Suicide trends among youths aged 10 to 19 years in the United States, 1996-2005. JAMA. 2008;300:1025–1026.
6. Lu CY, Zhang F, Lakoma MD, et al. Changes in antidepressant use by young people and suicidal behavior after FDA warnings and media coverage: quasi-experimental study. BMJ. 2014;348:g3596.
7. Ventimiglia J, Kalali HA. Generic penetration in the retail antidepressant market. Psychiatry (Edgmont). 2010;7:9–11.
8. Pamer CA, Hammad TA, Wu Y, et al. Changes in US antidepressant and antipsychotic prescription patterns during a period of FDA actions. Pharmacoepidemiol Drug Saf. 2010;19:158–174.
9. IMS Health Total Patient Tracker, 2014.
10. Stone MB. The FDA warning on antidepressants and suicidality—why the controversy. N Engl J Med. 2014;371:1668–1671.
11. Barber CW, Azrael D, Miller M. Study findings on FDA antidepressant warnings and suicide attempts in young people: a false alarm? BMJ. 2014;349:g5645.
12. Libby AM, Brent DA, Morrato EH, et al. Decline in treatment of pediatric depression after FDA advisory on risk of suicidality with SSRIs. Am J Psychiatry. 2007;164:884–891.

interrupted time series; sequential analysis; epidemiology; policy analysis

Copyright © 2018 Wolters Kluwer Health, Inc. All rights reserved.