The recent revelation that Duke University researcher Anil Potti, MD, falsely claimed he was a Rhodes scholar on his curriculum vitae shocked the oncology community. Yet more important for the field is evidence that his microarray data, which led to three clinical trials, contains numerous inaccuracies. Remarkably, scientific critics have been trying to alert oncologists to those experimental problems for three years with little success.
“To a certain extent, I find it ironic that we have been yelling about the science for three years, but that is not what got things rolling,” said Keith Baggerly, PhD, Associate Professor of Bioinformatics and Computational Biology at the University of Texas MD Anderson Cancer Center, who uncovered the errors with his MD Anderson colleague Kevin R. Coombes, PhD, Professor and Deputy Chair of Bioinformatics and Computational Biology. “But I am sufficiently opportunistic that if you give me a way to get attention paid to the science—or to get trials that I really disagree with suspended and have people look at them—I will take that opportunity.”
And take that opportunity he has. With the credentials scandal grabbing the field's attention, Dr. Baggerly, Rafael A. Irizarry, PhD, Professor of Biostatistics at Johns Hopkins Bloomberg School of Public Health, and nearly 40 other biostatisticians and bioinformatics specialists are using the spotlight to call for better standards and more reproducibility in high-throughput experiments, including microarray and proteomic work.
In a letter submitted to Nature, the group notes that independent replication of results is key to science, yet few published microarray studies contain enough detail to allow for it. To correct the problem, they ask journals to require that study authors provide more detailed methods and set out five areas that need specific attention.
Experts note that this is not the first time the field has been misled. While the potential new standards are debated in journals, blogs, and Internet discussion groups, they say, there are some simple questions and issues oncologists can consider to ensure the community is not misled again by sexy but erroneous data.
Making the Case for More Reproducibility
To illustrate the need for reproducibility, scientists can point to two cautionary examples in oncology: the work from Duke and a proteomics strategy for early detection of ovarian cancer. In each case, the high-throughput data looked impressive on the surface but was subsequently found to be full of mistakes. However, because the papers reporting the work did not contain enough methodological details to enable independent testing and replication, the problems came to light only after the oncology community had invested substantial resources in follow-up work or clinical trials.
“In both cases, the results, as initially reported, were clinically really important,” said Dr. Baggerly. In the case from Duke, Dr. Potti and colleagues claimed they were able to use gene expression signatures derived from cell lines to predict the best chemotherapy for an individual patient. With the proteomics data, Lance Liotta, MD, PhD, then at NCI, and Emanuel Petricoin, PhD, then at the FDA, and colleagues thought they had found a way to diagnose early-stage ovarian cancer based on a blood test.
“Since these would be wonderful [tools] if they worked, they also shared another characteristic,” Dr. Baggerly said. “Within just a few weeks of these papers appearing, we over in bioinformatics got several requests from MD Anderson clinician researchers saying they would like to do it here. But once we started digging into the data, we (a) could not reproduce the results reported and (b) identified other patterns within the data that indicated that something weird was going on.”
In the proteomics project, it appeared that the researchers ran the samples in batches, rather than randomly intermingling samples from controls, cases, and women with benign disease. And some experimental conditions seemed to change between the time the team started running the samples and finished. Therefore, controls looked different from cases and both looked different from benign disease, but the difference was due to the experimental design, not the underlying biology.
“If you've got machines that are sensitive enough to pick up subtle differences in mRNA or protein levels associated with a small change or early disease, I can guarantee that machine is going to be sensitive enough to detect that you changed reagent lots. It is going to pick up batch effects, and you have to keep that in mind,” Dr. Baggerly said.
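Dr. Baggerly's point about batch effects can be made concrete with a small simulation (all values here are hypothetical, not drawn from the studies discussed). When controls and cases are run in separate batches, a constant instrument shift between batches masquerades as a biological difference; randomly intermingling samples across batches lets the shift average out instead of tracking disease status.

```python
import random
import statistics

random.seed(0)

def run_sample(true_level, batch_shift):
    # Measured value = biology + instrument drift for that batch + noise.
    return true_level + batch_shift + random.gauss(0, 0.5)

# Hypothetical confounded design: controls run first (batch 1), cases run
# later (batch 2), after a reagent-lot change that adds 2.0 units.
# Biology is identical in both groups (true_level = 10 for everyone).
controls = [run_sample(10.0, batch_shift=0.0) for _ in range(50)]  # batch 1
cases    = [run_sample(10.0, batch_shift=2.0) for _ in range(50)]  # batch 2

diff_confounded = statistics.mean(cases) - statistics.mean(controls)

# Same biology, but samples randomly intermingled across the two batches:
# each group now mixes batch-1 and batch-2 runs, so the drift averages out.
controls_rand = [run_sample(10.0, random.choice([0.0, 2.0])) for _ in range(50)]
cases_rand    = [run_sample(10.0, random.choice([0.0, 2.0])) for _ in range(50)]

diff_randomized = statistics.mean(cases_rand) - statistics.mean(controls_rand)

print(f"confounded design: apparent case-control difference = {diff_confounded:.2f}")
print(f"randomized design: apparent case-control difference = {diff_randomized:.2f}")
```

The "biomarker" found by the confounded design is exactly the reagent-lot change Dr. Baggerly warns about; the biology contributes nothing.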
In Dr. Potti's experiments, the MD Anderson team uncovered multiple problems, some of which are easy to understand and others for which Dr. Baggerly says he has no easy explanation. In the first category, for example, the researchers appear to have shifted the data and the labels relative to one another, so that they were off by one.
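A toy illustration (with made-up numbers, not the Duke data) shows how an off-by-one shift between data and their labels destroys an otherwise clean result:

```python
# Hypothetical expression values: "S" (sensitive) samples run high,
# "R" (resistant) samples run low.
expression = [9.1, 8.7, 2.3, 9.4, 2.1, 2.5, 8.9, 2.2]
labels     = ["S", "S", "R", "S", "R", "R", "S", "R"]

def accuracy(values, labs, cutoff=5.0):
    # Call a sample "S" when its expression exceeds the cutoff.
    calls = ["S" if v > cutoff else "R" for v in values]
    return sum(c == l for c, l in zip(calls, labs)) / len(labs)

print("correct alignment:", accuracy(expression, labels))    # prints 1.0
shifted = labels[-1:] + labels[:-1]  # labels slid down by one row
print("off-by-one labels:", accuracy(expression, shifted))   # prints 0.25
```

With correct alignment the simple cutoff classifies every sample; slide the label column by one row and the same data score below chance.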
In the second category, the Duke investigators said their data showed that expression of a particular gene, ERCC1, correlated with response to some agents. However, the commercial microarray chip they said they used in their experiments does not include that gene. “I admit this is one for which I do not have a simple, charitable explanation,” Dr. Baggerly told OT.
New Standards for Reproducibility
Finding the problems in both projects took substantial amounts of time because the methods included in the papers were inadequate. Therefore, Drs. Baggerly and Coombes had to try to figure out and recreate what was done, rather than just retest the model. For the Duke experiments, Dr. Baggerly estimates they spent 1,500 hours trying to figure out what was done and what went wrong. And while he and Dr. Coombes were able to spend that time in these two instances because of the potential clinical importance, Dr. Baggerly says other mistakes and flawed results remain undetected in the literature—in part because the methods sections are insufficient to allow for reproducibility.
Numerous other specialists see the problem as well. For example, in a study published in 2009, investigators assessed 18 published microarray experiments and found enough detail to understand exactly what the authors did in just two of them.
“If you look at the really big picture—and this is the key point—the entire purpose of methods sections in science articles is to let someone else reproduce what you did,” said David F. Ransohoff, MD, Professor of Medicine and Cancer Epidemiology, Cancer Prevention, and Control at the University of North Carolina at Chapel Hill. “That is really why it is there. So I can see what you've done, build on that, or, if I want, see if it is right or wrong. And what has happened, as studies have gotten more complex, is that that is harder to do. But we, as a scientific field, may have to decide if the solution to that is to say that we are not going to try anymore, or to try to figure out how we can preserve that goal, which is a very important goal in science.”
Letter Submitted to Nature
With that concern in mind, the group of biostatisticians and bioinformatics experts, writing under the name “Scientists for Reproducible Research Working Group,” has submitted a letter to Nature requesting that journals require study authors to provide access to the entire data set used for the experiments, with adequate annotation. They also want inclusion of information on data provenance, including URLs and database accession numbers; publication of the software code used to analyze the data; step-by-step descriptions of all data manipulations that were done by hand and thus not included in the software code (the steps performed in point-and-click spreadsheet programs, for example); and any prespecified research plans.
Dr. Irizarry and others acknowledge that this may add some work for study authors but is in the best interest of science, and is really about documenting their own work. “Data provenance should be very clear,” he said. “It should be clear what you started with and, if it changes, how that transformation happened. That is not that easy. It is work, but you can't get around it.”
And having those exact methods available can also help researchers recheck their own data and correct any errors or omissions—before they put it in print. “You can get fooled by artifacts, if you get results that aren't right, but appear to be right, and you don't know how to check for problems,” he said.
There have been successful predictors discovered or developed through this type of technology, including MammaPrint and Oncotype DX. “I get a little ticked when people blame the technology, saying it is dirty and what not. But if you know how to use it and not get fooled by artifact and bias, you can do well with it,” he said.
“One metaphor I use is the difference between a shovel and a digger. I can make a hole with a shovel; it would take me a long time, but I can do it. A person who knows how to use a digger would do it much faster than me and make a fine hole. If I get on the digger, I will make a mess. I don't know how to use it. Is it the digger's problem? No. It's that I don't know how to use it.”
In addition to laying out what should be included in the methods section, the Scientists for Reproducible Research Working Group suggests in its letter to the editor, as well as in online discussions and interviews, that study authors who do not want to include the software code or some other aspect of the specified methods should have to explain the omission.
“If in submitting a paper to the journals, you choose not to supply the raw data or code, I want you to have to explain that decision,” said Dr. Baggerly. “Sort of like a disclosure of [financial] conflict of interest. If you say I am not supplying code because it might interfere with future intellectual property rights, then that is something I can use in evaluating how strongly I take the results in this paper.”
Otis Brawley, MD, Chief Medical Officer for the American Cancer Society, which funded some of Dr. Potti's follow-up work and clinical trials, agrees with the main goals of the Working Group. “I think [software] code and data do need to be available at a level they have not been, although I think they can keep something proprietary, but there has got to be a limit on that,” he said.
Looking Forward
As for Dr. Baggerly, he says that he doesn't plan on doing another intensive project on another group's data. But if the data are clinically important and he gets requests from MD Anderson clinicians to use the technique, he might. “If you make dramatic claims that have clinical relevance such that several investigators at MD Anderson want to do it—if it is that important—then you should be prepared that people want to check it and check it closely,” he said.
“I am hopeful that the next time this happens I will be in the position of saying ‘Wow this works. We should be using this tomorrow.’ That would be nice.”
Back to Basics
No one expects that the changes requested by the Working Group will be adopted quickly. Therefore, readers and data consumers have to assume some of the burden for making sure the data are valid before the community charges ahead with expensive research efforts or clinical trials based on the novel findings. And for that, the experts interviewed for this article emphasized a need to go back to basics, not only in the Methods sections of papers but also in the consumption of data.
For example, if the data look too good to be true, then people should be cautious. “Science usually proceeds in a rather gradual fashion,” said Donald Berry, PhD, Professor and Head of the Division of Quantitative Sciences at MD Anderson. “If all of a sudden we go from no predictive ability to extremely good predictability, it ought to lead you to think that something is not right there. Contamination can be perfect; biology is not perfect.”
Dr. Baggerly stressed that data consumers should not hesitate to ask if an independent research group has been able to replicate the findings. If another group hasn't yet reproduced the work, oncologists interested in the results should ask if the original research team road-tested their predictive model by running it on a completely blinded—and previously unseen—sample set.
The idea of holding back a data set and keeping it completely unavailable to model developers for a final road test is not new, Dr. Irizarry noted. “There is a long history of computer scientists and statisticians doing this sort of prediction” in which they take a numerical score of RNA expression or something else and predict an outcome, such as whether a person has cancer or how long they will live.
“For example, the postal service relies on zip code readers that evaluate black and white markings on paper and predict what number they represent. And in a more recent instance, Netflix, the movie rental company, held a contest with a $1 million prize for anyone who could develop a model that could predict what movie a particular viewer would like more accurately than their in-house model. In each case, the developers had access to one set of data for development and internal validation, and the judges held back a data set, completely unavailable to the developers, for external validation.”
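The development-versus-held-out setup Dr. Irizarry describes can be sketched in a few lines (the data and the single-marker “model” here are hypothetical):

```python
import random

random.seed(1)

# Hypothetical data: one marker that is genuinely, but imperfectly,
# higher in cases (label 1) than in controls (label 0).
def draw(n, mean):
    return [random.gauss(mean, 1.0) for _ in range(n)]

data = [(x, 1) for x in draw(100, 1.5)] + [(x, 0) for x in draw(100, 0.0)]
random.shuffle(data)

# The judges (or regulators) lock away half the data before developers see it.
development, held_out = data[:100], data[100:]

# "Model development": pick the cutoff that best separates the development set.
best_cutoff = max(
    (c / 10 for c in range(-20, 40)),
    key=lambda t: sum((x > t) == bool(y) for x, y in development),
)

def evaluate(dataset, cutoff):
    # Fraction of samples correctly called case/control by the cutoff.
    return sum((x > cutoff) == bool(y) for x, y in dataset) / len(dataset)

print("development accuracy:", evaluate(development, best_cutoff))
print("held-out accuracy:   ", evaluate(held_out, best_cutoff))
# The held-out number is the honest estimate; it is typically somewhat
# lower than the development number, which benefits from cutoff tuning.
```

Only the held-out figure says whether the predictor is ready for the clinic; the development figure is flattered by the tuning it was used for.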
Perhaps, he said, medical regulators should use a simple bar when evaluating new predictive tests. “Until you show you can do it, you don't use it,” Dr. Irizarry said. “This is not a new problem. It has been around for a while and it is doable.”
The problems that led to the overwhelming acceptance and use of both the ovarian cancer data (which no one thinks has any hint of misconduct) and the Duke data involved multiple layers beyond just the study authors. Dr. Brawley noted that the community used to be able to rely on the author's institution to oversee the quality of work. “But now, in the current age, when people are patenting research and every university has an office of intellectual property, the university, which is supposed to police and supervise the investigator and make sure they are doing things properly and with integrity, has a conflict of interest.”
In addition, financial interest on the part of the researchers could lead to errors as they hurry to claim intellectual property rights. “There can be a tendency to rush, and to get sloppy as one rushes to get a product or finding,” he said. “I really think our attempt to do this quickly and our attempt to do this with some secrecy, with proprietary nature to it—that is what is hurting us.”
Critics also point out that the journals that published the Duke data have been slow to react to the scientific concerns. Drs. Baggerly and Coombes sent letters first to the investigators themselves and, when the errors were not rectified, then to Nature Medicine, Lancet Oncology, and the Journal of Clinical Oncology, which published the work. Nature Medicine published their letter, but the other two journals chose not to, despite the scientific merit of the concerns.
“That is one of the things that was problematic about this,” Dr. Irizarry explained. “For me, as a statistician working in the trenches with this type of data, as soon as I saw [Drs. Baggerly and Coombes'] talk, it was clear to me that they were complaining about something that was fundamental and important. It was a little surprising to me that the journals did not ask for clarification or retract the paper.” The Lancet published an “expression of concern” in July of this year, but only after the problems with Dr. Potti's resume were discovered.
Meanwhile, Dr. Brawley says that the problems with the Duke experiments and Dr. Potti's resume are unlikely to alter the way ACS funds or evaluates grants in a major way. “It has been rare that we've been burned,” he said. “We encourage open conversations, but won't likely change how we give out grants—though we might go on the internet and search the Rhodes scholars.”