The Editors' Notepad
The goal of this blog is to help EPIDEMIOLOGY authors produce papers that clearly and effectively communicate their science.

Sunday, January 1, 2017

A well-prepared figure conveys information more effectively than tables or text. A poorly prepared figure discourages readers and uses page space inefficiently. Choosing when to use figures and preparing them well is therefore central to writing an effective manuscript.

Figures in Epidemiology usually convey either features of study design or study results. For figures of both types, the first consideration is whether a figure will convey the information more effectively than text or tables occupying the same amount of space. Figures occupy a lot of space in a manuscript—we budget 250 words per figure—so if the figure’s information can be conveyed as effectively in that many words or fewer, then text should be used instead of a figure. Similarly, when presenting study results, tables provide exact results whereas figures do not. If the figure’s information can be conveyed as effectively by a table, then a table should be used to gain the advantage of the exact data.

The advantage a figure offers over a table is the visual display of changes in information along the figure’s axes. If there is no compelling change in information along at least one axis, then a table would convey the information as effectively and more precisely. Compelling figures show changes in information along both axes, and sometimes in a third dimension as well. For example, meta-analyses often include a forest plot in which the point estimates, confidence intervals, and relative weights of each study are plotted, with the studies arranged along the vertical axis. You can find an example here. While figures of this design are quite common in the meta-analysis literature, the vertical axis has no function other than to separate studies from one another. Simply ranking the point estimates provides additional information, as shown here for a similar set of studies. Further information along the study axis can be added by plotting the inverse normal of the rank percentile as the vertical axis scale (shown here), instead of equally spacing the ranked studies. This stretches the outlying studies further apart and compresses the studies near the central tendency closer together, adding information to what is conveyed on the study axis. The point is not to advocate for a change to the way meta-analyses are presented, but rather to encourage authors to design figures that convey information along all of their axes.
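As a sketch of this spacing idea, the vertical positions can be computed as the inverse normal of each study's rank percentile. The point estimates below are invented for illustration; only the spacing rule comes from the text above:

```python
from statistics import NormalDist

# Hypothetical point estimates (e.g., log relative risks) for ten studies.
estimates = [0.12, -0.05, 0.33, 0.08, -0.21, 0.15, 0.02, 0.40, -0.10, 0.19]

# Rank the studies by point estimate (smallest first).
ranked = sorted(estimates)
n = len(ranked)

# Convert each rank to a percentile, then to an inverse-normal score.
# Outlying studies are stretched apart; central studies are compressed
# together, so the vertical axis itself now carries information.
positions = [NormalDist().inv_cdf((rank - 0.5) / n) for rank in range(1, n + 1)]

for est, pos in zip(ranked, positions):
    print(f"estimate {est:+.2f} -> vertical position {pos:+.2f}")
```

Plotting `ranked` against `positions` (rather than against equally spaced integers) produces the stretched-tails layout described above.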

Once the content of the figure and its axes has been decided, preparation of the figure itself comes next. In general, the quality of published figures would improve dramatically if all authors realized and reacted to one fundamental problem: the default settings of most graphics-preparation tools yield figures in which everything is too small. Line and axis thickness, marker size, font size, error bars: these are all too light or too small in default settings. Simply by making everything bigger and heavier, the quality of figures would improve. Try exaggerating these settings, then ratchet back a notch or two.
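For authors who prepare figures in matplotlib, for example, the defaults can be overridden in a matplotlibrc file (or via plt.rcParams). The values below are an illustration of "bigger and heavier," not a journal standard:

```
# Heavier-than-default settings; exaggerate, then ratchet back a notch.
lines.linewidth   : 2.5          # default 1.5
lines.markersize  : 9            # default 6
axes.linewidth    : 1.5          # default 0.8
font.size         : 14           # default 10
font.family       : sans-serif
axes.labelsize    : 16           # axis titles get the largest font
xtick.labelsize   : 13           # axis labels intermediate
ytick.labelsize   : 13
legend.fontsize   : 12           # legend and data labels smallest
errorbar.capsize  : 4            # default 0 (no caps on error bars)
```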

The space between the axes is valuable real estate – fill it. The space outside the axes is also valuable real estate, so fill that with a few large-font labels sparsely placed, rather than small sparse labels or (even worse) many small labels.

For all text elements of the figure, choose fonts that are easy to read, usually a sans-serif font such as Arial or Helvetica. Vary font sizes slightly so that the most important information has the largest font and the least important information has the smallest. Usually that means axis titles have the largest font, axis labels an intermediate font, and legend text and other text elements such as data labels the smallest font.

Avoid clutter in the figure. Never use figure titles because the caption will suffice. Do not outline the figure or the plotted areas because the axes will suffice. Make sure that every data element is important. For example, in a plot showing results stratified by gender, do not include a line for all genders combined unless that combined information is as important as the gender-specific information. Grid lines should also be avoided. If the location of plotted elements must be so closely inspected as to require grid lines, then the data are probably better suited to a table than to a figure.

Embed legends between the axes, rather than above or below, especially when the distribution of data leaves blank spaces between the axes. Using this empty space for the legend allows the plotted area to grow because no space is reserved for the legend. Legends are more effective when embedded in the figure than when embedded in the caption. Even better is to label plot elements with text placed directly next to each element, thereby eliminating the legend altogether.

Many authors present results in figure panels. These can be quite effective when used judiciously. To start, consider whether a single figure can be used instead of a figure panel by reducing the number of compared categories. If there are so many essential data categories that a figure panel is required, then always keep the scale of all axes constant in all panels of the figure. The point of a figure panel is to visually compare results within and across panels. If the axis scales change in different panels, then the visual comparison across panels will be misleading.

While we encourage authors to embolden their figures by making elements large and thick, we strongly discourage ornamentation such as shadows, shaded backgrounds, and word art. Three-dimensional figures are, in general, very difficult to comprehend. Unless a surface must be plotted, it is better to convey the third dimension as separate lines within the plot or in a figure panel.

Figure captions are critical to high-quality figures. A figure caption should describe what the reader will find in the figure and from what data it was generated. Readers should be able to picture the figure and understand the study setting in which the data were generated by reading the figure caption. Avoid duplicating the caption information in the main text of the manuscript, but be sure to define any abbreviations, even if they are also defined in the main text. Although Epidemiology disallows almost all abbreviations in the text (see our earlier blog), we will sometimes allow abbreviations in figures (data or axis labels, for instance) that we do not allow in the text.

Epidemiology accepts figures prepared in color. Figures printed in color incur extra charges, as described in the instructions for authors, unless the authors have also paid an Open Access fee. Authors can submit figures in color to appear in only the on-line version, and a gray-scale version to appear in print and in the PDF version. This type of submission incurs no extra charges, but then it is imperative that all figure elements are as easily identifiable in the gray-scale version as in the color version and that the figures are identical except for the color itself. When preparing color figures, authors must choose colors that make the figure content accessible to persons with color vision impairments. Be careful that the graphics software does not gratuitously add color in the form of a pastel background or other elements that add no information.
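One quick way to check whether plot colors will remain distinguishable in print is to compare the gray levels a gray-scale conversion would produce. This sketch (not journal policy; the colors are arbitrary examples) uses the common Rec. 601 luma weights:

```python
# Candidate plot colors as (R, G, B) values in 0-255.
colors = {
    "blue":   (0, 0, 255),
    "orange": (255, 165, 0),
    "red":    (255, 0, 0),
}

def luminance(rgb):
    """Approximate gray level after gray-scale conversion (Rec. 601 weights)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

# Colors whose gray levels are close will be hard to tell apart in the
# printed gray-scale version, no matter how distinct they look on screen.
for name, rgb in colors.items():
    print(f"{name}: gray level {luminance(rgb):.0f} of 255")
```

Varying line style or marker shape in addition to color is a more robust safeguard than relying on luminance differences alone.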

Our editors examine the quality of figures at different sizes during the editing process, but sometimes a figure that appears in page proofs looks fuzzy. We will then ask the author to submit a higher-resolution version. The best ways to avoid these last-minute inconveniences are to submit the highest-resolution figure that is practical (1200 dpi should be good for most line drawings) and to export the figure directly to a graphics format (.tif, .png, and .pdf work best) rather than pasting it into Word or PowerPoint first.

Creating a compelling figure requires a substantial investment of energy and creativity. The guidance above may help to avoid common pitfalls, but the quality of the figure will ultimately be determined primarily by the effort put into the creation.

Tuesday, November 1, 2016

Let’s say right up front that, under our hybrid publishing model, space is limited in the print version of EPIDEMIOLOGY. You already know this. We have a strict budget for print pages each issue, and competition for space is fierce. We have been working with Production to make the best use of this limited space, by thinking about the efficiency of the page layout and by keeping an eye on proofs to avoid mostly blank pages. A great way to advance the goal of space efficiency is to put content online, in supplementary digital content (SDC). Your editors may ask you to do this, for example with sensitivity or subgroup analyses, or you can opt to do so voluntarily. Shorter papers are often more engaging to read, authors save page charges, and the journal can publish more papers within its page budget. Everybody wins when papers are short, as long as they are also complete.

In contrast to printed content, online content is essentially unlimited, a service provided by the publisher for free use by authors. What can go online? Pretty much anything you have produced that supports what you have written in the main text. SDC is a good place to park large tables and figure panels, descriptions of study populations, details of methodology, and statistical computing code (which we encourage all authors to submit as SDC). You can also use color freely; color figures come with a fee in the printed journal, but are free in SDC. You, the author, are fully responsible for SDC. Although peer reviewers and editors look at it, we don’t copy-edit it; SDC goes up exactly as you have prepared it (which means it’s probably not a bad idea to save it as a PDF, rather than as editable or readily copied text). We create a link to the SDC and place it appropriately in the printed text. If it needs to be revised or corrected, you can email us a new version and we’ll just swap them out.

Our only restrictions: because of server limitations, each file has to be no more than 100 MB in size. Larger total amounts of content can be broken down into smaller files. In addition, labels of sections need to correspond to the way you refer to them in the text, and for that the journal has a convention:

eTable 1, eAppendix 2, eFigure 4 etc.

Most types of content will fit into these categories, with ‘eAppendix’ referring to any text that is not a table or a figure. Numbering them helps guide readers to the relevant content, especially when all the content is saved in a single file. As with tables and figures appearing in the main (printed) text, make sure they are cited in order.

We have been told that combining all the SDC as a readily downloadable file is helpful to readers, so we will usually ask you to combine them. Most file types, except for spreadsheets with formulas and PowerPoint files with animation, can be saved as PDFs and combined. Statistical computing code is usually in text format, and so can be exported or at least copied and pasted into a word-processing file and, from there, exported to PDF. If you have more than a handful of sections of SDC you may also want to consider including a table of contents at the top.

Because SDC is a separate document, it must, if you are citing other work, have its own bibliography. Number citations in SDC separately from those in the main text (a citation can appear in both the main-text bibliography and the SDC), but you need only one bibliography across all the SDC content. Our copyeditors will check your main-text bibliography to make sure there is a corresponding citation for each reference, and will flag any that have none, or that appear only in the SDC (which, again, they don’t edit). Please also note, however, that citations in SDC do NOT count toward the Science Citation Index or other indexing services.

Naturally, the less space each paper takes up, the more papers we can publish, and you, as authors, can help with this, too; it’s one small contribution you can make toward being a good member of the research community.

Tuesday, August 30, 2016

The typical outcomes paper in epidemiology involves a lot of numbers – multiple exposures and measures of exposure, subgroup analyses, and alternative modeling strategies. The standard of practice when making statistical comparisons is to place an effect estimate within a confidence interval, rather than using a p-value (Epidemiology generally allows p-values only for tests of trend or heterogeneity, and even then strongly discourages comparison with a Type 1 error rate). Outcomes papers thus tend to have three or four tables of data, often with more online, each with up to a dozen columns, but organized in intuitive, digestible, easy-to-follow chunks. If figures are possible, so much the better.

Writing the text of the Results section to summarize the tables and figures may feel like an afterthought. But it is still important, in part because you, as a researcher, know your data better than anyone else, and also because not all readers absorb information the same way. So it’s worth your time to think about what you want to highlight (hint: go beyond the obvious statements along the lines of x was associated with y, z was not associated with y).

I hope you’ll agree it’s also important to make the Results section appealing and useful to read. Many Results sections fail to mention the descriptive findings. These, however, help to put the study into context. How many people were eligible, how many participated, how many cases were observed, and what were the patterns of missingness? These and similar questions immediately help the reader understand who was studied and the quality of the evidence.

When transitioning to internal comparisons, one element to keep in mind is context. Even if you’ve done so in the Methods section, precede each result you give with a hint of what you were looking for in that step of your analysis. Just as important is the flow of language. Of course we don’t expect an epi Results section to read like Walt Whitman, but you’d be surprised how a strategy regarding the presentation of data can improve how well the reader engages with it.

I’ll start with an example of a sentence that, while not particularly long, is seriously hard work to get through:

Similar results were found for lung cancer, colorectal cancer, and breast cancer: lower consumption of jelly beans was associated with an estimated 4%-8% lower hazard ratio (95%CI 0.67 to 1.22, 0.76 to 1.34, and 0.92 to 1.13, respectively), although these estimates were imprecise.

Do you see how you have to go back and forth from the outcomes in the first line to the confidence intervals in the third to match them up, because of the “respectively” device? In addition, it’s hard to parse that range of percentages of lower risk: if there are only three outcomes, why not just give all three? (More about the imprecise estimates below.) To simplify, keep each outcome in the same phrase as its data:

Consumption of jelly beans was associated with a 4% lower hazard ratio (95% CI 0.67, 1.22) of lung cancer, 7% lower risk (95% CI 0.76, 1.34) of colorectal cancer, and 8% lower risk (95% CI 0.92, 1.13) of breast cancer, although the estimates were imprecise.

A second concern is the use of the percentage hazard ratio. It is too easily confused with a difference estimate of association, when in fact the associations are estimated on the ratio scale. Furthermore, it has different units than the CI, so you can’t automatically place it within the interval. An even better revision would be:

The hazard ratio associating consumption of jelly beans with lung cancer was 0.96 (95% CI 0.67, 1.22), with colorectal cancer was 0.93 (95% CI 0.76, 1.34), and with breast cancer was 0.92 (95% CI 0.92, 1.13), although the estimates were imprecise.
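The unit mismatch is easy to see with a little arithmetic: a hazard ratio converts to a percent-lower hazard as (1 − HR) × 100, a difference-scale quantity that cannot be placed inside ratio-scale confidence limits. A quick sketch with the hazard ratios from the example:

```python
# Hazard ratios from the example sentence, on the ratio scale.
hazard_ratios = {"lung": 0.96, "colorectal": 0.93, "breast": 0.92}

for cancer, hr in hazard_ratios.items():
    # Percent lower hazard: a difference-scale quantity with different
    # units than the ratio-scale confidence limits (e.g., 0.67 to 1.22).
    pct_lower = round((1 - hr) * 100)
    print(f"{cancer}: HR {hr} -> {pct_lower}% lower hazard")
```

This reproduces the 4%–8% range in the original sentence, and shows why "4% lower" cannot sit inside an interval reported as 0.67 to 1.22.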

Next, I hope this idea is not too radical, but consider not putting data in a sentence at all: leave the numbers in the table, if possible, and describe the results in words. That way, a reader can first read your simple summary, and then turn to the tables to pick out the details for him or herself. This strategy works best for secondary findings; results pertaining to the primary aim should always be reported with data. Revising the report of these secondary findings, the edit of the sentence would be:

Consumption of jelly beans was associated with imprecisely measured decreased hazards of lung, colorectal, and breast cancer (Table 3).

Finally, what exactly do the authors mean when they say that the estimates were imprecisely measured? The intervals were actually fairly narrow. We suspect they mean that the intervals include the null, which has nothing to do with the precision. The final, zen edit of the troublesome sentence would be:

The hazard ratios associating jelly beans with the incidence of lung, colorectal, and breast cancer were all near null (Table 3).

We invite you to look at a few outcomes papers and think about the above. Do you even read Results sections? If not, why not? What would you do differently? We’d be happy to discuss.

Take-home messages that will take you a long way toward a readable Results section:

  • Be sure to open the Results section with the descriptive findings.

  • As the topic sentence in each paragraph, provide a bit of context for each section of the analysis.

  • Keep the outcome with its data (avoid the dreaded “respectively”).

  • Break up long sentences containing a lot of data.

  • Be sure to use the measure of disease occurrence that you are estimating (“risk”, “rate”, “hazard”, etc).

  • For secondary findings, consider leaving effect estimates and confidence intervals out of the text altogether.

While the above recommendations are stylistic, here’s a reminder of a couple of additional requirements relevant to reporting of results in Epidemiology: Avoid causal language – verbs such as impact, affect, increase/decrease – in favor of the language of association. And avoid significance testing as follows:

  • Leave out p-values (except for tests of trend and heterogeneity, but even then do not compare with an acceptable Type 1 error rate)

  • Instead of “x was not significantly associated with y,” just say “x was not associated with y” or “x was associated with an imprecisely measured increase/decrease in y” or “the association of x with y was near null”

  • Avoid the word “significant” in non-statistical senses of the word, and instead choose from the less-loaded words “considerable,” “important,” “material,” “appreciable,” or “substantial.”

Null results are good! We have recently published an editorial seeking persuasively null results. You might even edit the result in the example one step further:

Consumption of jelly beans was not associated with decreased hazard of lung, colorectal, or breast cancer (Table 3).

Sorry, jelly beans.

Monday, June 6, 2016

My inaugural post to this blog discusses abbreviations and how we treat them at EPIDEMIOLOGY: mostly, I’m afraid, we avoid them, as you’ll know if you have worked with me. But today, I am happy to explain why. Epidemiologists, we are in this together.

In my role as Deputy Editor, also known as Science Wordsmith-in-Chief, I spend more time considering and (usually) spelling out abbreviations than on any other class of edits. That’s because, in addition to scientific accuracy, a top goal is to deliver papers that are clearly written and as effortless for our target audience to read as possible. 

And as someone with an epidemiology PhD whose training may have gotten a little rusty, I may be a useful test case. I’m sure, for some of you, reading a methods paper is like falling off a log. You do this stuff all the time. You can glance briefly at a formula consisting of stacks of Greek letters meaningfully embellished with bold and italics, and the concept behind a method for correcting for selection bias crystallizes in your mind in three dimensions. Similarly, a new regression model with a 10-syllable name attached to a 10-letter abbreviation sticks firmly in your mind. I know, because I trained with many of you and now I read and am impressed by your papers…which I have to read slowly. I envy you a bit, but never mind: mainly, I want to learn what you have to offer.

But because I don’t get to spend most of my days immersed in methods and biostatistics, it’s helpful to have an unfamiliar abbreviation spelled out each time it’s used. Our readers and I sometimes have to work to decipher and internalize the concept behind the method. Our work is easier when we can avoid thinking ‘Wait, what does that stand for?’ and having to scroll up, find, and re-read the definition…and usually lose the train of thought.

Overall, spelling out abbreviations advances our goal of publishing epidemiology papers that read like English, not like jargon. Therefore, please think of your wider community of colleagues and spell it out—our rule of thumb is whether the term would be understandable to someone outside your subspecialty. If you don’t, I will, and rather than use search-and-replace I will do it individually each time and look for ways to avoid the wordiness and awkward phrasings that sometimes arise. It does take time, though, and really, I suspect you can do it more smoothly and accurately than I can, if you do so as you write.

We understand there are other reasons you might want to use abbreviations. For example:

* To popularize a new method. We sympathize. But if the name of a method is really unwieldy when spelled out, an acronym will evolve naturally, and there may be workarounds (see below). Meanwhile, as above, giving broadly trained epidemiologists conceptual access to the method, without the hard work of repeatedly scrolling back to a definition, can also accomplish the goal of popularizing it.

* It’s the shorthand you use within your research team.

* To meet the word limit. Sorry, but you’re busted, and my colleagues who write a lot assure me there is always a way to shorten a paper that does not compromise clarity.

* To avoid typing. Really? OK, never mind, I can’t believe you would do this.

Meanwhile, there are additional reasons to spell out:

* To avoid ambiguity. As an example, MSM abbreviates “men who have sex with men” to one community of epidemiologists and “marginal structural modeling” to a second community. For a reader who is not an enshrined member of either community, the abbreviation is ambiguous without context to help.

* To make sentences flow better. Many abbreviations are more awkward to read and pronounce than their spelled-out forms.

* To avoid bureaucracy-speak, which is not a recognized dialect of English. Those who work for large government agencies should be particularly able to relate to this.

So, when will we allow an abbreviation?

* When it is likely to be familiar and unambiguous to most epidemiologists - I understand this is a judgment call, and in some cases my thinking has evolved.

* When it is impossibly unwieldy to read when spelled out.

* When it is used as a variable name in an adjacent equation (in which case it will also be italicized).

* In tables and figures, to help save space, but it must also be defined in a legend or caption.

* For study names and similar proper nouns.

If spelling out is moderately wordy or unwieldy, I will try to find a workaround (for example: ‘hereafter referred to as…’), such as a partial spelling out, or using pronouns. And finally, I often don’t make these decisions unilaterally, and will check with other editors.

Sunday, November 27, 2011

The recent publication in EPIDEMIOLOGY of a graph about semen quality over time [1], based on data that were somehow buried in a governmental report in Denmark, again raises the much-debated point of public access to data [2, 3, 4].
The mere fact of questioning a policy of public access to data seems like being ‘against motherhood and world peace’. Isn’t it true that “Science is about debates on findings,” “Science serves people, and people (taxpayers) paid for it,” and “Expensive research data should become available to others”? Yet the issues are more complex than the simple idea that ultimately we will all benefit from open access to data.
Firstly, what is meant by ‘data’? The original unprocessed MRI scans, blood, tissue, questionnaires? Or the processed data – determinations on blood, coded questionnaires? The cleaned data, with the possibility that the authors have already ‘massaged’ inconveniences? The analysis files, in which the authors have extensively repartitioned and recoded the data (another round of subjective choices)? Data should be without personal identifiers – of course – but in our digital age people can be identified by combinations of seemingly innocent bits of information. And, finally, should all discarded analyses, or discarded data, also become publicly available – to check what the authors ‘threw away’ and whether their action was ‘legitimate’?
Secondly, to what extent is the public as the taxpayer, or any organization that pays for the research, really the full owner of the data? Data exist because of ideas about how to collect and organize them. There is intellectual content, not just by the researchers, but also by their research surroundings, their departments, universities, and governmental organizations that make research intellectually possible. Data in themselves are not science. Giving your data to someone else is not an act of scientific communication. Science exists in reducing data according to a vision - some of which may develop during data analysis. Should researchers not have a grace period for the data they collected, or perhaps two: first a period in which they are the sole analysts, and then a period in which they share data only on conditions?
Thirdly, how protective can a researcher remain about her data? Should a researcher have the right to deny access to her data to particular other parties? Richard Smith, the former editor of the BMJ, stated in his blog that denying access is a wrong strategy – why fear open debate, it will only lead to better analyses? In his opinion, one should not deny data access even to the Tobacco Industry [5].
Reality is different: researchers know that when a party with huge financial interests wants access to data, there are three scenarios.
Scenario 1: they search and find some error somewhere in the data. This is always possible – no data are error-proof. The financially interested party will start a huge spin-doctoring campaign, proclaiming loudly in the media that the data are terrible. Remember the discussions on the climate reports?
Scenario 2: another analyst is hired by the interested party, and comes to the opposite conclusion. This is published with a lot of brouhaha. The original researcher writes a polite letter to the editor, explaining why the reanalysis was wrong. The hired analyst retorts by stating that it is the original analysis which was in error. Soon, only the handful of people who really know the data can still follow the argument. That is the signal for a new wave of spin-doctoring, in which medical doctors give industry-paid lectures stating that “even the experts do not know any more; we poor consumers should use common sense; most likely, nothing is the matter”. I witnessed this scenario in a controversy on adverse effects of oral contraceptives. A class action suit was deemed unacceptable by a UK court because, in a meta-analysis in which two competing analyses of the same data were entered (!!), the relative risk was 1.7.  This number fell short of the magical 2.0, which is wrongly held by many courts as proof that there is ‘more than 50% chance’ that the product caused the adverse effect [6].  Without studies and reanalyses directly sponsored by the industry, the overall relative risk was well over 2.0 [7]. This was money well spent by the companies!
Scenario 1 and 2 have a name: “Doubt is our product” as it was originally coined by the tobacco industry: it is not necessary to prove that the research incriminating your product is wrong – nor that the company is right – it suffices to sow doubt. [8]
Scenario 3 is that the financially interested party subpoenas the researcher to testify in court about every allegedly questionable aspect of the data. Detail upon detail is demanded. The researchers lose months (if not years) of research and of their personal lives. That scenario was played out against epidemiologists who did not find particular adverse effects of silicone breast implants [9]. It has recently been feared again as the next strategy of the tobacco industry in the UK [10].
Advocates of making data publicly available seem to live in an ideal dream world, in which for every Professor A whose PhD students always publish A, there exists a Professor B whose PhD students publish B. Such schools of thought combat each other scientifically with more or less equal weapons. Other scientists watch this contest and make up their mind as to who has the strongest arguments and data. This type of ‘normal science’ disappears when strong financial incentives exist. Then the weapons are no longer scientific publications, but public relations agents and lawyers. Of course, also in ‘normal science’, there are rivalries that can be strong. It happens that researchers do not want to share their complete data, or only part of the data under conditions. Often this is for the very simple reason that some sources of data, like blood samples, are finite.
Calls for making data publicly available need to take into account these scenarios. Some people hope that open information in the long run provides the ‘real’ truth. But in a shorter timescale, open information may also allow mischief by special interests, with plentiful resources, that are ruthless in their attempts to shape public policy. It seems difficult to ‘experiment’, i.e. to try open access to data for some time and then turn it back when the drawbacks seem too great.
An intermediary solution might be much easier to implement. Tim Lash and I, following the ideas of others, have proposed making public registries of existing data [11]. This would make it possible to start negotiating with the owners of the data about possible re-use. Such a registry might also facilitate the use of data in ways that were not originally planned. If controversy and distrust complicate the picture, trusted third parties can be sought to organize a reanalysis, with public input possible – a strategy recently proposed by a medical device maker [12].
In short, public access to data is much more complex than the proclamation of some principles that look so wonderfully scientific that nobody can argue against them.
Commentaries about this topic are greatly welcome. They can be published as a full guest blog of about 450 words maximum. Please mail to
[1] Bonde JP, Ramlau-Hansen CH, Olsen J. Trends in sperm counts: the saga continues. Epidemiology. 2011 Sep;22(5):617-9.
[2] Hernán MA, Wilcox AJ. Epidemiology, data sharing, and the challenge of scientific replication. Epidemiology. 2009 Mar;20(2):167-8.
[3] Samet JM. Data: to share or not to share? Epidemiology. 2009 Mar;20(2):172-4.
[4] Colditz GA. Constraints on data sharing: experience from the nurses' health study. Epidemiology. 2009 Mar;20(2):169-71.
[6] McPherson K. Epidemiology on trial--confessions of an expert witness. Lancet. 2002 Sep 21;360(9337):889-90.
[7] Kemmeren JM, Algra A, Grobbee DE. Third generation oral contraceptives and risk of venous thrombosis: meta-analysis. BMJ. 2001 Jul 21;323(7305):131-4.
[11] Lash TL, Vandenbroucke JP. Should preregistration of epidemiologic study protocols become compulsory? Reflections and a counterproposal. Epidemiology (in press).
[12] Krumholz HM, Ross JS. A model for dissemination and independent analysis of industry data. JAMA. 2011 Oct 12;306(14):1593-4.
Note: an earlier version of this blog was published as an opinion piece in the Dutch language newspaper NRC-Handelsblad in the Netherlands on 12 October 2011
© Jan P Vandenbroucke, 2011