A Thematic Survey on the Reporting Quality of Randomized Controlled Trials in Rehabilitation: The Case of Multiple Sclerosis

Background and Purpose: Optimal reporting is a critical element of scholarly communication. Several initiatives, such as the EQUATOR checklists, have raised authors' awareness about the importance of adequate research reports. On these premises, we aimed to appraise the reporting quality of published randomized controlled trials (RCTs) dealing with rehabilitation interventions. Given the breadth of such literature, we focused on rehabilitation for multiple sclerosis (MS), which was taken as a model of a challenging condition for rehabilitation professionals. A thematic methodological survey was performed to critically examine rehabilitative RCTs published in the last 2 decades in MS populations according to 3 main reporting themes: (1) basic methodological and statistical aspects; (2) reproducibility and responsiveness of measurements; and (3) clinical meaningfulness of the change. Summary of Key Points: Of the initial 526 RCTs retrieved, 370 satisfied the inclusion criteria and were included in the analysis. The survey revealed several sources of weakness affecting all the predefined themes: only 25.7% of the studies complemented the P values with the confidence interval of the change; 46.8% reported the effect size of the observed differences; 40.0% conducted power analyses to establish the sample size; 4.3% performed retest procedures to determine the outcomes' reproducibility and responsiveness; and 5.9% appraised the observed differences against thresholds for clinically meaningful change, for example, the minimal important change. Recommendations for Clinical Practice: The RCTs dealing with MS rehabilitation still suffer from incomplete reporting. Adherence to evidence-based checklists and attention to measurement issues and their impact on data interpretation can improve study design and reporting and thereby truly advance the field of rehabilitation in people with MS.
Video Abstract available for more insights from the authors (see the Video, Supplemental Digital Content 1 available at: http://links.lww.com/JNPT/A424).


INTRODUCTION
Researchers are increasingly called to enhance not only the quality of their research but also the completeness and transparency of the reports they attempt to publish. Optimal reporting permits higher replicability and allows readers to fully understand how a study was conceived, designed, and executed. If data collection and presentation are adequately reported, readers may be able to critically appraise and interpret the study findings. During the last 2 decades, several initiatives have been launched to increase the awareness of authors, active in the biomedical and clinical areas, about the importance of preparing adequate research reports. Among these, the EQUATOR (Enhancing the QUAlity and Transparency Of health Research, https://www.equator-network.org/) Network is the international reference for scientists from all research fields when using evidence-based reporting guidelines. EQUATOR maintains checklists for observational and experimental designs. Pertinent to the present communication, the CONSORT (Consolidated Standards of Reporting Trials) statement, which is part of the EQUATOR, was devised to alleviate problems arising from inadequate reporting of randomized controlled trials (RCTs). 1 At its core, the CONSORT consists of a minimum set of recommendations that help authors prepare their reports. Introduced in 1996, 2 it was developed and expanded in 2001 3 and further revised in its current 2010 version. 1 The CONSORT now stands as the reference checklist for RCTs. Several analyses have demonstrated the positive impact of adhering to reporting checklists such as, but not limited to, the CONSORT. Overall, they expand the reliability, utility, and impact of health research. 4,5 A recent methodological survey 6 critically assessed the reporting quality of 571 neurophysiological/transcranial magnetic stimulation articles dealing with the assessment of motor dysfunction in neurological populations.
Weaknesses in reporting and data presentation included issues relating to methodology, statistics, reproducibility, consistency, accuracy, and responsiveness of the relevant measurements affecting most of the studies surveyed.
As introduced previously, adherence to reporting checklists has been suggested as a promising avenue of development. Along with other major rehabilitation journals, the Journal of Neurologic Physical Therapy endorsed the EQUATOR initiative in 2014. 7 Since then, authors wishing to have their research reports considered for publication are required to comply with the pertinent checklist for their study design (eg, CONSORT, SPIRIT, STROBE, PRISMA, etc).
On these premises, we completed a methodological survey of the reporting quality of published RCTs dealing with rehabilitation interventions for people with multiple sclerosis (PwMS). Given the breadth of such literature (43 713 RCTs retrieved in the 2001-2020 period; key word: rehabilitation. Source: PubMed/MEDLINE; effective date: December 31, 2020), we decided to operate within our area of expertise, focusing on RCTs dealing with rehabilitation for multiple sclerosis (MS), which was taken as a model of a challenging condition for rehabilitation professionals, although all rehabilitation interventions are inherently difficult to study. Indeed, PwMS exhibit large day-to-day fluctuations in functioning, strength, and fatigue, which may translate into high variability and low reproducibility in the outcomes assessed. 8,9 If not adequately captured, such variability reduces the clinician's ability to evaluate PwMS and prevents optimal tracking of the changes following therapeutic interventions. In this regard, the responsiveness to change of a measurement, that is, the ability of an instrument to detect change over time in the construct to be measured, 10 is closely related to its test-retest reproducibility, making it an important element to consider and report in clinical trials. This is crucial for those populations who display unstable motor performances. 11 Therefore, no efforts should be spared in quantifying the measurement error that surrounds true scores, either by directly determining measurements' reproducibility or, when this is already established in the literature, by specifically referring to the psychometric features previously reported. Relatedly, whether investigators interpret their findings against established thresholds for clinically important change, such as the minimally important change (MIC), is another aspect that deserves attention, as it identifies the amount of change that a patient can perceive as practically beneficial for his or her functioning.
To date, however, reproducibility, responsiveness, and clinical importance appraisal of the changes are not part of standard reporting checklists.
Although the present work was conceived with the initiatives and structured checklists hosted by the EQUATOR Network in mind, this investigation primarily aimed to expand on the aforementioned issues, as they can affect the reporting quality of clinical trials.
In addition, we planned to examine the quality of methodological and statistical reporting of the studies, with a focus on specific statistical items relating to the reporting of changes observed following rehabilitation (ie, P value, confidence interval [CI], effect size [ES], type of ES, study power) while leaving other relevant aspects, such as randomization, concealment, blinding, etc, out of our analyses.
The general objective of this thematic survey was to appraise the reporting quality of RCTs on rehabilitation interventions for PwMS. To this aim, 3 main reporting themes were predefined as follows: (1) methodological and statistical aspects; (2) reproducibility and responsiveness of measurements; and (3) clinical meaningfulness of the change.

METHODS
Two decades of literature were vetted (2001-2020). Subgroup analyses were planned to compare the completeness of reporting across four 5-year intervals of publication date. Figure 1 depicts the PRISMA flowchart and the screening process for study selection. Table 1 summarizes the criteria used to check whether the included studies satisfied the requirements of methodological and statistical completeness.

Study Selection
Three electronic databases (PubMed/MEDLINE, Scopus, Web of Science) were searched for all available articles written in English. The search was restricted to the 20 years following the publication date of the seminal works that prompted evidence-based checklists to enhance the quality of scientific reports. 3,12 The initial search was undertaken by 3 authors (L.V., A.M., G.M.). The search included Medical Subject Headings, key words, and matching synonyms relevant to the topic. The search strategies employed in the databases are presented in Supplemental Digital Content 2, available at: http://links.lww.com/JNPT/A417.
Based on titles and abstracts, studies clearly out of scope were manually excluded. Animal studies were not considered. To be eligible for inclusion, articles had to meet the following criteria: enrollment of participants with a definite diagnosis of MS; administration of a rehabilitative intervention program (minimum duration: 2 weeks); and an RCT design. When the title or the abstract presented insufficient information to determine eligibility, the full text of each article was scrutinized. Based on the information in the full text, eligible studies were considered for data extraction. In case of disagreements, consensus was reached by discussion. To ensure homogeneity, weekly team meetings were held to cross-check the studies.

Data Extraction
A customized data extraction form was developed. The extracted information referred to whether the authors of the individual studies had satisfied the methodological and statistical requirements ( Table 1).
The manual extraction process was coded into 3 main themes: (1) methodological/statistical aspects, and results reporting, for example, power analysis, trial registration, reporting of the ES, CI of the difference/change, P value, exact P value (whether an exact or approximated value was reported), and study limitations; (2) reproducibility and responsiveness of measurements, for example, test-retest, reproducibility cited, standard error of measurement (SEM), and minimal detectable change; and (3) clinical meaningfulness of the observed differences, for example, minimal clinically important difference (MCID) or change (MCIC), aka MIC. Data were extracted dichotomously based on whether a criterion was satisfied or not, except for "ES type" and "retest type," for which more than 2 levels were considered (eg, for "ES type," whether a Cohen d or Hedges' g or eta was calculated). The completeness of reporting clinical information about PwMS, such as degree of MS-related disability and disease course, was also appraised.

Data Analysis
The collected data were exported into statistical software (SPSS 20, IBM Corp, Armonk, New York), and descriptive analyses were computed. To control for the expected differences in the quality of reporting depending on the publication date, four 5-year temporal intervals were predefined and compared using odds ratios adjusted for multiple comparisons. Odds ratios were also calculated comparing data by decade (2001-2010 vs 2011-2020). For all the comparisons, the significance level was set at a P value of less than 0.05.

RESULTS
The process of study selection is displayed in Figure 1. Of the initial 526 RCTs retrieved after removing duplicates, 370 satisfied the inclusion criteria (see Supplemental Digital Content 3, available at: http://links.lww.com/JNPT/A418). Main reasons for exclusion comprised administration of single-session interventions, short-term programs (<2 weeks), design other than RCT, and PwMS not assigned to rehabilitation. Figure 2 summarizes the main results of the analyses. From a methodological standpoint, a priori sample size calculations were provided in 148 of 370 RCTs (40.0%); a follow-up reassessment after discontinuing the intervention was planned in 128 (34.6%); standardized or unstandardized ES was reported in 173 (46.8%; of these, 138 studies reported the Cohen d, 29 the eta or partial eta, and 6 did not specify the ES type); the CI of the change was reported in 95 (25.7%); test-retest reproducibility of the measurements was directly determined in 16 (4.3%; of these, 5 studies examined same-day retest, 2 one-day retest, 3 one-week retest, 3 more-than-one-week retest, and 3 did not specify the time frame) and cited in the methods and/or discussion by referring to previous works dealing with the reproducibility of the outcomes employed in 55 (14.9%); measurements' responsiveness (ie, SEM; MDC) was determined in 70 (18.9%); and clinical meaningfulness of the observed change (ie, MCID/MCIC or MIC) was determined in 22 (5.9%) and cited in the methods and/or discussion by referring to previous works dealing with the clinical importance of the observed changes in 103 (27.8%). Trial registration in a registry prior to study commencement was declared in 141 (38.1%). Figure 2 summarizes data for each of the methodological, statistical, and clinical items surveyed. Finally, study limitations were clearly acknowledged in 309 studies (83.5%), where the most common study limitation acknowledged was the small sample size (177 studies of 309; 57.3%).
Regarding the reporting of disability degree, 299 of 370 studies (80.8%) presented this information by reporting the Expanded Disability Status Scale (EDSS) score. However, the median disability was often presented stand-alone or as a minimum-maximum range, without precise indices of dispersion. Regarding the reporting of the MS course, 286 of the 370 RCTs analyzed disclosed it; of these, however, 204 (71.3%) provided only merged data rather than results stratified by MS course.

DISCUSSION
The main finding of the present survey is that several sources of weakness emerged in the way authors reported methods and presented data from RCTs dealing with rehabilitation for PwMS. Lack of transparency involved all 3 predefined themes. Failure to report crucial clinical information, such as disability degree and MS course, was also found.

Study Methodology and Statistics

P Value
The survey showed that most of the studies reported the exact P value for the observed differences, in line with CONSORT recommendations. However, it should be noted that the P value is a unitless index of the plausibility of a result, conventionally dichotomized to judge statistical significance against a predefined threshold, generally 0.05. 13 Moreover, these group-level statistics could be accompanied by individual data analyses, especially when authors deal with small samples. Other indices of change, such as the CI and ES, have been recommended to complement P values, as they provide a representation of the magnitude of an effect. 14-16 P values alone do not provide information on the magnitude of change, which is ultimately what is needed to determine clinical meaningfulness. The CI width, in turn, indicates the degree of uncertainty, 16 with a narrow interval giving reassurance, whereas a wide interval reveals large uncertainty about the ES being examined.

Confidence Interval
Although the use of CIs has markedly increased in health research, 16-18 this did not apply to the MS rehabilitation literature surveyed here, as only 1 study in 4 (25.7%) complemented P values with CIs. To describe the amount of difference observed between the groups or the extent of the change in an outcome following an intervention, reporting the CI of the difference/change rather than that of the mean is advisable. The CI of the change has the advantage of conveying both statistical and clinical information to assist clinicians in determining the usefulness of the findings and in their decision making. 18,19 It also provides researchers and clinicians with a more informative view of how much of an effect an intervention had, compared with observing only statistical significance. 19 Importantly, CIs are appropriate for parametric and nonparametric analyses and for both individual studies and aggregated data in meta-analyses. Therefore, it is recommended that, when inferential statistics are performed, CIs of the change, both within- and between-group, accompany point estimates and conventional hypothesis tests.
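As an illustration of this recommendation, a minimal Python sketch of a pooled-variance t interval for a between-group difference in change scores follows. The function name and data are hypothetical, and this is a simplified two-arm case rather than a full trial analysis:

```python
import numpy as np
from scipy import stats

def ci_of_change(group_a, group_b, confidence=0.95):
    """CI of the between-group difference in change scores
    (pooled-variance t interval); a teaching sketch only."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    diff = a.mean() - b.mean()
    na, nb = len(a), len(b)
    # Pooled variance across the two arms
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    se = np.sqrt(sp2 * (1 / na + 1 / nb))
    t_crit = stats.t.ppf((1 + confidence) / 2, na + nb - 2)
    return diff - t_crit * se, diff + t_crit * se
```

A wide interval from such a computation would flag the large uncertainty discussed above, even when the P value crosses the significance threshold.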

Effect Size
Approximately half of the studies reported the ES for the observed differences. This finding can be directly compared with a recent survey on the reporting quality of neurophysiological/transcranial magnetic stimulation studies that assessed individuals with neurological conditions, including MS: only 4% of the articles reported ES of the differences/changes. 6 This comparison suggests that authors active in the rehabilitation field may be more aware of the importance of not solely relying on P values.
The ES is an estimate of the magnitude of the change in a score following an intervention. 20 Its use is increasingly recommended. 21 Among the many ES statistics available, the most employed in clinical trials are the raw mean difference and the standardized mean difference. Raw mean differences use the scale of the original measurement, which allows judging the magnitude of effect and comparing data across studies that used the same metric. However, measurement methods are often dissimilar across studies. Standardized ES are generally preferred as they yield indexes expressed in a common metric, that is, standard deviations. 21 Hedges' g and Cohen d are the 2 most common standardized mean difference statistics. They are similar, as both are computed from the mean and the standard deviation. The 2 statistics also have similar performances except for sample sizes less than 20, when Hedges' g performs better than Cohen d. 22 For this reason, Hedges' g may be preferable when samples are small.
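The 2 statistics just described can be sketched as follows (hypothetical data; the Hedges' g correction factor uses the common approximation rather than the exact gamma-function form):

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled SD (Cohen d)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                        / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled_sd

def hedges_g(a, b):
    """Cohen d with the small-sample bias correction J,
    making it preferable when total n is below about 20."""
    n = len(a) + len(b)
    j = 1 - 3 / (4 * (n - 2) - 1)  # approximate correction factor
    return j * cohens_d(a, b)
```

Because the correction factor J is always less than 1, Hedges' g shrinks the estimate slightly, counteracting the upward bias of Cohen d in small samples.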

Power Analysis
Forty percent of the studies surveyed performed power analyses to establish the minimum sample size. This percentage rose to 44.8% after considering the number of RCTs in which the authors declared that a pilot trial had been performed (41 of 370; 11.1%). Pilot feasibility studies are needed in ground-breaking studies lacking crucial a priori information, and, given their exploratory nature, they are generally not required to include a power analysis. In this regard, in their scoping review of clinical studies on physical activity and its benefits for PwMS, Learmonth and Motl 23 call for "more and more feasibility trials to substantially strengthen the foundation of research on exercise in MS prior to engage in large scale RCTs." However, given that fewer than half of the studies predefined the minimum sample size needed to reach adequate statistical power, the uncommon predefinition of the number of participants remains problematic. As a result, the findings generated tend to be associated with considerable uncertainty and potentially flawed conclusions, 24,25 so that, almost inevitably, readers have become familiar with the common conclusion that " . . . future studies over larger samples are needed to confirm the findings." Accordingly, the most common study limitation acknowledged was the small sample, with half of the investigations associated with low statistical power. For pilot/feasibility studies worthy of being developed into larger RCTs, overcoming this issue would ensure that the observed differences/changes are less biased, the error less inflated, and the findings more reliable. 16,26 In this perspective, when foundational research is available, authors should attempt to validate the findings of pilot studies at larger scales.
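For context, the kind of a priori sample size calculation discussed above can be sketched with the standard normal-approximation formula for a two-arm comparison. This is a planning approximation only; dedicated software (eg, G*Power) applies exact t-based corrections that add a participant or two per arm:

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants per arm to detect a standardized
    effect size with a two-sided test: n = 2 * ((z_a + z_b) / d)^2."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided alpha
    z_b = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_a + z_b) / effect_size) ** 2)
```

For a medium standardized effect (d = 0.5) at 80% power, this yields roughly 63 participants per arm, which illustrates why many small rehabilitation trials end up underpowered.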

Reproducibility and Responsiveness of Measurements
Only 4.3% of the studies performed retest procedures to determine measurements' reproducibility. Subgroup analyses showed a significant decrease in the number of studies performing such procedures in the 2011-2020 decade compared with 2001-2010, possibly because several common outcome measures have since been profiled in terms of their reproducibility and responsiveness. Indeed, not all intervention studies with PwMS need to conduct their own reliability analyses, as a number of relevant outcomes, mostly relating to gait, mobility, MS impact, and quality of life, have been established in terms of their psychometric properties. 27 Other outcomes that are psychometrically established in other populations (eg, the elderly, other neurological conditions) may not be as stable and reliable in PwMS. 8 In these selected cases, reproducibility analyses with multiple baselines would be advisable. Better reproducibility results in higher precision of measurements, which is considered a critical prerequisite for tracking changes. 28 Single measurements can be collectively distorted by measurement error, which involves accuracy of the measuring instruments, tester's expertise, patient variability over time, testing protocol, and environmental conditions where the test takes place. 29 It is, therefore, critical to outline the measurements' reproducibility, that is, to what extent the findings of a test remain stable at retest, in the absence of an intervention, over a period that may be considered clinically meaningful. Measurements' precision, often estimated by the SEM, is the ability of a test to produce exact values. Failure to outline reproducibility and measurement precision weakens the validity of the findings, undermining data analysis and interpretation, and practitioners' decision making. Hopkins 28 demonstrated that at least 50 subjects generally need to be tested over 3 or more trials to provide adequate precision for the estimate of the change in measurement error.
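The SEM mentioned above, and the minimal detectable change derived from it, follow simple formulas that can be sketched as follows (the baseline SD and ICC values in the comments are hypothetical inputs, not data from the surveyed trials):

```python
import math

def sem(sd_baseline, icc):
    """Standard error of measurement: SD * sqrt(1 - ICC).
    Higher test-retest reliability (ICC) shrinks the error band."""
    return sd_baseline * math.sqrt(1 - icc)

def mdc95(sem_value):
    """Minimal detectable change at 95% confidence:
    1.96 * sqrt(2) * SEM; sqrt(2) reflects error in both
    the baseline and the retest measurement."""
    return 1.96 * math.sqrt(2) * sem_value
```

For example, an outcome with a baseline SD of 10 points and an ICC of 0.91 has an SEM of 3 points, so an individual change smaller than about 8.3 points cannot be distinguished from measurement error.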
Efforts to determine reproducibility are still uncommon in research conducted on neurological populations. Deriu and colleagues 6 reported that only 5% of the 571 neurophysiological/transcranial magnetic stimulation studies reviewed planned retest procedures to establish measurements' reproducibility. This finding is in line with the present survey, although we found that a relatively larger number (14.9%) of RCTs dealing with MS rehabilitation tend to report measurements' reproducibility at least for the primary outcome when discussing the observed changes in that outcome. By doing so, the authors give reassurance that the reproducibility of the measurements considered is known and possibly under control. However, in several instances, the test-retest study that they refer to had been carried out in populations other than MS, which in some way undermines the very ground of such reassurance. As noted previously, this issue is even more relevant to PwMS, who are considered extremely variable in their neuromuscular performance 30 and display day-to-day fluctuations in their functioning, strength, and fatigue. 8,9 Accordingly, the poorly established reproducibility of measurements taken from other populations of persons with neurologic diseases may potentially weaken the power of the studies, their ability to detect clinically meaningful changes induced by rehabilitation, and their clinical implications.
Although carrying out time-consuming and patient-demanding retest measures may not always be practicable due to intrinsic and extrinsic difficulties related to PwMS status (for example, fatigue, tiredness, spasticity), establishing reproducibility for those measurements for which psychometric profiles are lacking is important and could significantly enhance the accuracy and precision of the measurements taken, thereby allowing optimal quantification of any changes induced by rehabilitation. 11

Clinical Meaningfulness of the Changes
Approximately 5% of the studies checked whether the observed change surpassed indexes of clinical importance, such as the MIC, which is the smallest change in an outcome that a patient would identify as meaningful. 10,31-33 For this index too, a significant decrease in the number of studies reporting it was observed from 2001-2010 to 2011-2020. Unlike reproducibility/responsiveness, for which a reduction of reporting in clinical trials is somewhat expected due to the accumulation of test-retest observational studies, the reduction of MIC reporting in the last decade is in sharp disagreement with the general impulse in the clinical research literature to aim for clinically meaningful rather than merely statistically significant results. 15,33 The MIC is currently considered the most appropriate estimate to evaluate changes over time within individuals or groups. 33 It can be determined in several ways, 34 mainly through anchor-based and distribution-based methods. 35 Briefly, the former require an independent standard or anchor (eg, the patient rating of change) that establishes whether the patient is better after treatment compared with baseline, according to his or her own experience. The distribution-based methods rely on expressing the magnitude of effect in terms of the underlying distribution, that is, by taking into account measures of variability of the findings, such as between-patient or within-patient variability. 31 Although the combined use of the 2 strategies is likely to enhance the interpretability of the change, the anchor-based approach is generally recommended, as it is more reflective of the patient's view. 33 Accordingly, reporting not only group-level but also individual-level data would allow the identification of responders, that is, those patients who managed to surpass a preset threshold for change, such as the MIC.
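The 2 families of MIC methods described above can be sketched in simplified form. This is a toy illustration with hypothetical scores and anchor ratings; the half-SD rule is only one of several distribution-based conventions, and real MIC estimation requires dedicated, adequately powered studies:

```python
import statistics

def mic_distribution(baseline_scores):
    """Distribution-based MIC proxy: half the baseline SD
    (one common convention among several)."""
    return 0.5 * statistics.stdev(baseline_scores)

def mic_anchor(change_scores, anchor_ratings,
               anchor_level="slightly improved"):
    """Anchor-based MIC sketch: mean change among patients who
    rated themselves minimally improved on a global rating anchor."""
    changes = [c for c, a in zip(change_scores, anchor_ratings)
               if a == anchor_level]
    return statistics.mean(changes) if changes else None
```

The anchor-based variant ties the threshold to the patient's own judgment of improvement, which is why it is generally preferred, whereas the distribution-based value reflects only statistical variability.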
A caveat to calculating the MIC or other indexes of clinical importance within any single clinical trial targeting PwMS is that these thresholds should be established through studies completed on adequate samples of participants, to avoid misleading cutoffs derived from underpowered RCTs. The MS Outcome Measures Task Force (https://www.neuropt.org/practice-resources/neurology-section-outcome-measures-recommendations/multiple-sclerosis) is a useful initiative that has reviewed the psychometric properties and clinical utility of a total of 63 measures for use in clinical practice, entry-level education, and research. We advocate referral to such initiatives when selecting outcome measures for clinical trials in the MS realm.
The debate on the clinical meaningfulness of changes, however, seems to have only marginally made its way into the MS rehabilitation literature, as only 5.9% of the reviewed studies attempted to determine the MIC. Importantly, we also found that almost 30% of the RCTs critically appraised their results against previously established MIC thresholds when discussing the amount of change detected and the practical importance of their findings. However, MIC cutoffs are still not available for many key clinical and functional outcomes, or exist only for populations other than MS, thus justifying continued research in this field.

Study Limitations
The first limitation of the survey is that we narrowed the focus to the MS rehabilitation field; therefore, the present findings cannot be directly generalized to other pathological populations. Future studies should aim to verify the generalizability of our findings in major neurological conditions other than MS. Second, the term "rehabilitation" that we used as the main key word in our search strategy is an umbrella term that encompasses a wide range of interventions but may not include the whole spectrum. This choice resulted in retrieving RCTs that mainly dealt with physical rehabilitation and physical therapy and, to a minor extent, cognitive, behavioral, and nutritional interventions. Another limitation relates to restricting the survey to articles written in English. In addition, the design chosen for this study (retrospective thematic survey) does not allow identifying and understanding the potential reasons why authors active in MS rehabilitation do not provide enough methodological and statistical details in their reports. Future studies using a qualitative interview design may better answer this relevant question.
Although the present work shares some of the items belonging to the structured checklists hosted by the EQUATOR Network, it also departs from that framework, as we aimed to expand on selected issues, such as reproducibility, responsiveness, and clinical meaningfulness, which are currently not covered in the checklists even though they can affect the quality of reporting of clinical trials. In this perspective, the themes proposed and examined here should not be viewed as alternatives to tools like those from the EQUATOR Network, which hopefully will soon include items for the assessment of reproducibility, responsiveness, and clinical importance. One final limitation is that the quality of the journals that published the articles surveyed here was not taken into account. Beyond the use of journal metrics, such as the impact factor, the H-index, or other emerging parameters such as the Scimago Journal Rank score, which are regarded as controversial ways to appraise the quality of a scientific journal, we admit that some difference in reporting quality may exist between major journals with strict methodological requirements (including mandatory adherence to the EQUATOR checklists) and relatively minor journals with no predefined reporting policies.

CONCLUSIONS
Despite the increasing awareness of the need for a complete and transparent reporting of clinical studies and the number of evidence-based initiatives to enhance its quality, RCTs dealing with MS rehabilitation still suffer from important limitations associated with methodological and statistical reporting, reproducibility of measurements, and clinical responsiveness. To counteract such weaknesses and potential threats to research validity and usability, we propose that not only major journals such as the Journal of Neurologic Physical Therapy but, overall, all the journals active at the intersection of neurorehabilitation, clinical neurophysiology, neurology, and neuroscience fully endorse valuable initiatives like those hosted by the EQUATOR Network by asking submitting authors to follow, complete, and upload the appropriate reporting guideline for the design of their study. Another initiative that shares many of the EQUATOR goals is the Physiotherapy Evidence Database (PEDro), which aims at facilitating evidence-based physiotherapy by promoting the best available evidence in physiotherapy clinical practice (https://pedro.org.au/). Trials indexed in PEDro are also rated for quality using the PEDro scale.
In line with EQUATOR and PEDro recommendations, the quality of reporting could be further enhanced by policies that mandate protocol registration in public registries, as well as data deposition and sharing. Along with increased compliance with structured guidelines for transparent reporting, we suggest that researchers active in the MS rehabilitation field spare no efforts in ensuring measurements that are not only accurate but also reproducible (via retest procedures, or by referring to already established thresholds) and responsive to change (by determining indexes of measurements' variability, or recalling available cutoffs), which are key prerequisites to outline the error zone that surrounds any measurement and that needs to be exceeded to interpret change as reasonably induced by the administered intervention. On these premises, the next step would be to take the patient's perspective into account by determining the least amount of change (ie, MIC) in a health or functional outcome that the patient would perceive as positively impacting his or her status. Hopefully, adding these actions would advance the field of rehabilitation in PwMS through enhancement of our ability to determine the clinical meaningfulness of the changes observed following rehabilitation.