Statistical analysis is necessary to quantify observed differences in demographics and histologic and molecular features. Professional help with statistical problems is available to many surgical pathologists and should certainly be sought; however, choosing and interpreting the relevant statistical tests is an invaluable skill for the surgical pathologist. The ability to perform statistical analysis should both enhance the understanding of the published literature and the ability to design and carry out one’s own research projects.

There are several reasons physicians may feel inadequate when it comes to understanding statistics. Familiarity with statistical analysis is perhaps best achieved by manipulation of a familiar data set, and there is often a time lapse between a didactic lecture in medical school and the subsequent application to the interpretation and planning of research. Also, some of the statistical tests necessary for understanding differences in diagnostic categories, such as interobserver reproducibility and assessment of error rates, are relatively esoteric and may not have been included in medical school statistics classes.

Mercifully, standard surgical pathology research requires a limited number of statistical approaches and mastering a core set of techniques will go a long way in helping the practicing pathologist understand the statistical and clinical significance of research results. Several software programs are available to assist in performing the range of statistical functions required for the practicing pathologist. These include: add-ons to existing software packages (eg, Excel), commercially available stand-alone packages (eg, SPSS, STATA, JMP), open-source implementation (see below), and web-based interfaces (eg, graphpad). Each has advantages and disadvantages and all will require some degree of training before they can be readily used.

As practicing surgical pathologists with no special training in statistics, we have found the free software package R to be useful. The open-source framework yields unparalleled flexibility and the community of users have developed an immense collection of packages capable of performing all ways of statistical testing from basic summary statistics to complex gene expression analysis. As a practical matter, because statisticians often use R, this may permit more meaningful consultations with statisticians. The disadvantage of R’s flexibility is a potentially steep learning curve. R executes functions through a command line interface and beginning users often struggle with basic aspects of file extensions, data importing, and manipulation of data tables. The user needs to develop at least a cursory understanding of object-oriented programming and learn the basic syntax (vocabulary) necessary to properly execute R commands.

The purpose of this manuscript is to familiarize the reader with the R program and show how it may be used to analyze surgical pathology data using papers recently published in the *American Journal of Surgical Pathology* (AJSP). After reviewing the following basic exercises; comparable data could easily be “plugged in” and meaningful statistical analysis undertaken. We also provide additional information that would permit interested readers to explore and develop greater facility with R.

#### DOWNLOAD R

Go to The Comprehensive R Archive Network (CRAN): http://cran.r-project.org/and download the latest release “Download R 2.14.2” for Linux, Mac, or Windows to your desktop. Run the installation file and install using the default settings. Be sure to note the directory where you route the files, especially if you choose not to include a desktop icon.

#### USING R

Click on and download the manual as a .pdf file from the left menu at CRAN: http://www.r-project.org/ and do the introductory session in Appendix. This can be skipped initially, but it will help familiarize the user with R’s capabilities. Next, click on the R icon on the desktop to open the application.

This will open the console. Now place the graphical user interface on the left-hand side of your desktop and this document on the right as shown in Figure 1. This way, commands in the document can be typed directly in R and the results observed. When we get to analysis of previously published surgical pathology papers it will be very helpful to have a hard copy of the paper as well for reference. Before we begin, let us create a folder on the desktop to store the data we will need for these exercises. On your desktop create a new folder called “R files.” Next, return to the R graphical user interface and change the working directory by selecting “File” then “change dir” from the pull-down menu. Follow the prompts and select the new (currently empty) folder you just created called “R files.” This will allow the program to find the files you will be referring to complete these exercises.

The command prompt is where the user enters instructions to the computer. The computer displays the “>” character to denote a line for the user to input an instruction.

Now you can execute your first command by double checking that you are in the correct working directory. At the command prompt (>) type:

>getwd() #press enter

*Note*. The computer ignores text that follows a # sign; this is a useful way to annotate your code.

The computer will output the result of the command in the line below, in this case the working directory, or the folder where it will take inputs and output data, for example:

“C:/Documents and Settings/Desktop/R files”

#### CREATE SOME SIMPLE OBJECTS

R is a statistical computing language that consists of defined functions (in this case statistical tests) performed on objects. An object is a collection of attributes (data) that you will define, upon which to perform statistical tests. For example, if you have an Excel spreadsheet that contains a list of cases and variables such as: patient age, sex, tumor stage, and IHC results, you would need to create an “object” in the R environment that contains all of your data. You may create a single object that contains all of the data or if you want to manipulate a portion of the data you may choose to create multiple objects each containing pieces of the data. You can type in these data at the command prompt, but more likely you will be importing files from a database application or an Excel spreadsheet format.

In order to explore the functionality of R, we will practice entering some data. This will demonstrate how R receives commands and delivers results. Then we will move on to pathology data sets.

A central tenet of R is that everything in R is an object and every object has a class. What you can do with an object depends on its class. You create objects by assigning values to them with the notation “<-” alternatively you may can also use an equal sign “=.”

For example:

> x<-50# or x=50

This creates an object x with an implicit “numeric” class, that is you can perform numeric operations on x.

> y<-5

> z<-x+y

When you enter the name of an object R returns its value:

> x

[1] 50

> y

[1] 5

> z

[1] 55

You can also enter what’s called an expression without making an assignment:

> x+z

[1] 105

Or find the class of an object:

> class(y)

[1] numeric

which will have more meaning as we go on.

Or get a list of all the objects you have created:

> ls ()

[1] x y z

You can scroll through, reexecute, and alter past commands by pressing the up arrow to the right of the keyboard.

Objects can also be a group of numbers, which you can create using defined functions. For example, a sequence from −100 to 100 in 0.5 integer steps can be entered with the function seq; rather than manually entering the entire series, which would be performed, albeit in a more laborious manner, with the function c() (which stands for concatenate).

> x<-seq(−100,100,0.5)

Assuming you know the class of your object, you can perform functions on it, for example you can raise the “x” object to the power of 2 (obviously this only works for an object of the class “numeric”):

> y<-x^2

and plot:

> plot(x,y)

You can also create a random normal distribution of 1000 elements with a mean of 5 and an SD of 1:

> x<-rnorm(1000,mean=5,sd=1)

and check to see what it looks like by typing “x” to get a simple list of the elements or plotting it as a histogram:

> hist(x)

If you create another random normal distribution with a slightly different mean:

> y<-rnorm(1000,mean=5.5,sd=0.5)

you could superimpose this on the original plot to see what it looks like (Fig. 2):

>hist(y,add=TRUE,col=red)

and then see whether there is a significant difference between x and y:

> t.test(x,y)

Welch Two Sample t-test

data: x and y

t=−13.7288, df=1461.31, p-value <2.2e-16

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

−0.5668884 −0.4251457

sample estimates:

mean of x mean of y

5.013888 5.509905

This brings up an important point. You may have noticed that we reused the variable named “x” several times in the preceding example. Reusing “x” overwrites any previously defined objects of the same name. The significance of this feature is that the sequence in which you define objects determines the resulting output of the functions you execute. For example, we plotted a histogram of “x” after creating a randomly distributed set of 1000 numbers. This histogram would have looked quite different had we entered the same command, hist(x), when x was defined as the number 50 (give it a try). The take home lesson is to be careful about naming your variables. In this case, if you would like to continue using “x,” you could name your second object x1, x2, …, etc.

You can learn about individual commands either by typing a question mark (?) before the command, which will call an HTML documentation page, or by leaving out specified parameters to utilize the default settings. For example:

> x<-rnorm(1000,mean=5,sd=1)

> mean(x)

[1] 5.009657

But:

> x<-rnorm(1000)

> mean(x)

[1] 0.01596601

> sd(x)

[1] 0.9794425

Showing that the default value for the mean of x is 0 and for the SD is 1.

#### IMPORTING DATA SETS

Rather than manually entering data into R, as in the previous examples, it is often desirable to import files that contain all the variables collected in a study set. R has several functions to import existing data files including: read.delim, read.csv, and read.table, each designed to address unique aspects of the import file structure.

Getting your data into the R environment is often a frustrating experience for the beginning user and there are several things you should consider before importing data. (1) What is the file format of the data you are importing? (2) Are there missing values? (3) Are there column headers? (4) Are the variables you are importing numerical or categorical? (5) What is the location of the files you wish to import?

Because most surgical pathology research studies will be conducted by entering data into a spreadsheet application (Microsoft Excel), we will use the function read.delim; a generic function for importing tabular data (the default setting for this function assumes a tab-delimited file format, that is a plain text file in which each number or data value is separated from the next by a tabbed space). Let us begin by uploading the supplementary file included with this review. First download the file (Supplemental Table 1, Supplemental Digital Content 1, http://links.lww.com/PAP/A2) and be sure to direct the file into the “R files” folder you just created on your desktop. Now create a new object in R which we will call “data” that contains a path to this file.

data<-read.delim(file=R_review.txt,header=T)

This command creates an object that contains all the data from a recent study in the form of a data frame (similar to opening an Excel file), where the rows correspond to individual cases and the columns to variables. If you execute this command correctly, a new line with a blank command prompt should appear. This illustrates a general design feature in R, in which there is no response when you are doing things, but incorrect entries will produce an error message.

To view the data in a more familiar format try the following:

>fix(data)

This command opens the R data editor in a new window that can be used to visualize the raw data, determine the attributes of each column (numerical or categorical) and edit erroneous or missing data.

#### PERFORMING BASIC SUMMARY STATISTICS

Most papers contain summary statistics of basic demographics of the study population. This establishes baseline differences in the cohort that may impact the study outcome. Let us work with some of the data we already imported to generate some basic summary statistics necessary for surgical pathology research. First, let us summarize the data.

>summary (data)

This is a partial print out of the summary data that should appear on your screen if the command was successful (your screen should also include data about AJCC staging, Grade, and IHC for CDX2 and β-catenin). As you can see, this command provides the range, mean, median, and first and third quartile for each continuous variable and provides absolute counts for each categorical variable in the data table. You will notice that not all of the information from the summary command makes intuitive sense. For instance the gross size, although recorded in centimeters, was not reported as you would expect for a continuous numerical variable. Before we correct this problem it is also possible to calculate statistics on individual columns in the data table. The following command computes the SD of Age. The “$” notation indexes the Age variable from the data frame (data).

>sd(data$Age)

[1] 14.51929

Next let us try to recalculate the mean gross size of the tumor.

> mean(data$Gross.Size.cm)

[1] NA

Warning message:

In mean.default(data$Gross.Size.cm):

argument is not numeric or logical: returning NA

The warning message provides some guidance about why this calculation was not successful. Let us take a look at our data frame to appreciate the characteristics of the data that are leading to this warning message.

>fix(data)

This command should bring up the R data editor. Next click on the column header title “Gross.Size.cm” which should bring up a window describing the variable type (Fig. 3). You will notice that this variable has been designated as a character variable. Unless otherwise specified during the import process, R will guess what type of variable you are trying to represent. In this case, because gross size includes missing values (designated as “NA”), R incorrectly interpreted this variable as a character variable. You can correct this by redesignating the type as “numeric” as illustrated in the above screen shot. After you close the data editor this change will be saved in your data frame. You can also correct this problem during the data import process (refer to documentation for read.table)

>?read.delim

There is one more step necessary before you can calculate the mean, which is to tell R that the data set contains missing data encoded as “na.” The exact treatment of missing data varies depending upon the function call and detailed explanations are available in the R documentation.

> mean(data$Gross.Size.cm,na.rm=T)

[1] 4.147458

#### PROPORTIONALITY TESTING (**χ** ^{2} AND FISHER EXACT TESTS)

To gain some practical insight into choosing and interpreting statistical tests, let us use a recent paper published in the *AJSP* ^{1} that describes the prognostic importance of tumor budding in colorectal carcinoma. To determine whether significant differences exist between groups of morphologic features it is necessary to construct a contingency table to test for significantly skewed proportions. Let us begin with a simple 2×2 table using 2 common methods; the Fisher exact test and the χ^{2} test.

From Table 1 in the paper by Wang et al., set up the following table in R:

Which you can do in this way:

>lymph<-matrix(c(52,28,19,29),ncol=2)

and then test by χ^{2}:

> chisq.test(lymph)

Pearson’s Chi-squared test with Yates’ continuity correction

data: lymph

X-squared=6.8507, df=1, p-value=0.00886

This is different from the reported *P*=0.005. You can repeat the test without the Yates correction to closer approximate the reported results.

> chisq.test(lymph,correct=FALSE)

You could also do a Fisher exact test:

> fisher.test(lymph)

Fisher’s Exact Test for Count Data

data: lymph

p-value=0.006111

alternative hypothesis: true odds ratio is not equal to 1

95 percent confidence interval:

1.272033 6.351190

sample estimates:

odds ratio

2.810414

In this instance both tests give similar results. In general, choosing which test is most appropriate will depend on the size of your data set, number of categories, and types of variables.

A slightly more complicated example from that same table is low budding in the rectum (18), left colon (12), and right colon (41) versus high budding in the rectum (10), left colon (23), and right colon (24).

>site.high<-c(10,23,24)

>site.low<-c(18,12,41)

> site.total<-site.high+site.low

then:

> prop.trend.test(site.high,site.total)

Chi-squared Test for Trend in Proportions

data: site.high out of site.total,

using scores: 1 2 3

X-squared=0.3017, df=1, p-value=0.5828

#### INTERRATER RELIABILITY (κ VALUES)

The use of interrater reliability as a metric of concurrence in surgical pathology offers several advantages. In comparison with reporting overall rates of agreement, κ statistics estimate the proportion of agreement that may be due to chance and allow the calculation of *P*-values and confidence intervals. Of particular relevance to surgical pathology research, κ values may be preferable to calculating sensitivities and specificities, which require an objective gold standard. It is important to remember that κ values are not probabilities or proportions of agreement and therefore, what represents a good value is open to debate.

First, download the “irr” package from the R site. This is most easily done by going to the “Packages” pull-down menu in R and selecting “Install package(s).” Pick a geographic location near you from the list and finally select the package(s) you want. You can also download a pdf from the R website (eg, http://cran.cnr.berkeley.edu/) that will explain the irr package and demonstrate its features.

Once you have done that you need to open the package in R:

> library(irr)

This gives you access to all the data and functions in this package. We will begin with an example from the psychiatry literature, included in this package. Then we will use a recent paper from the AJSP. Load the data set called “diagnoses”

> data(diagnoses)

Examine the data

> fix(diagnoses)

This is as seen in Figure 4. Alternatively you can export the data and view it a spreadsheet application such as Excel. The following command creates a tab delimited text file called “diagnoses” in your current working directory; which should be the “R files” folder on your desktop.

>write.table(diagnoses,file=diagnoses.txt,sep=\t,row.names=FALSE)

You can open this file, make changes in Excel, save the file as a tab-delimited text file, and then read it back into R:

>diagnoses.altered<-read.table(file=diagnoses_altered.txt,sep=\t,header=TRUE)

In whatever format you prefer; notice that the data consist of 6 different raters characterizing 30 patients into 5 mutually exclusive categories; in this case, psychiatric diagnoses.

Calculating κ values from these data is easy:

> kappam.fleiss(diagnoses)

Fleiss’ Kappa for m Raters

Subjects=30

Raters=6

Kappa=0.43

z=17.7

p-value=0

and with more information, giving κ values for each individual diagnostic category:

>kappam.fleiss(diagnoses, detail=TRUE)

Or to use an actual example from the AJSP: open the pdf of Van der Kwast.^{2} This article examines the diagnostic reproducibility of small atypical collections of glands in prostate needle biopsies by experts and community pathologists.

Highlight Table 2 and copy and paste into notepad. Open the file as a space-delimited file in Excel and do a few modifications until you have a file of the same format as the one exported from the above psychiatric example that is 1 column for every rater with 20 scores below. Divide Table 2 into 2 parts, 1 for the experts (a) and 1 for the group pathologists (b). Keep the column names for the rater but do not keep the row names for the cases. Then save these as text (tab-delimited) files Kwast_Table_2a and Kwast_Table_2b in R’s working directory. (Correctly formatted tables can be downloaded at Supplemental Digital Content 2, http://links.lww.com/PAP/A3).

Then import them into R with:

>kwast2a<-read.table(file=Kwast_table_2a.txt,sep=\t,header=TRUE)

And determine the κ values:

> kappam.fleiss(kwast2a,detail=TRUE)

Fleiss’ Kappa for m Raters

Subjects=20

Raters=5

Kappa=0.388

z=7.6

p-value=3.11e-14

It is totally trivial to repeat with Supplemental Table 2b.

In terms of interpreting the results of these statistical tests; Landis and Koch^{3} have proposed a rating scheme:

Although it is tempting to refer to scaled rating systems, such as this one, several factors aside from the level of interobserver agreement influence the magnitude of κ. These include the prevalence of conditions being rated, bias in class assignment, and lack of independence between ratings. These confounding factors make it difficult to compare κ statistics from one study to the next and underscore the fact that the κ values may misrepresent the implied value of a rating scheme.

The above examples used the Fleiss method to calculate κ values although there are other methods available that can be explored in the IRR package. The correct choice of statistical testing will depend on individual study characteristics including: the type of variables in the rating scheme, number of categories evaluated, and the number of raters.

#### SURVIVAL ANALYSIS

Survival data and functions also have their own library but this time it comes with the general R package and it is not necessary to download a new package like we did with “irr.” Instead just load the survival package from the internal library of functions.

> library(survival)

This package contains many of the essential functions necessary to analyze survival data. We will focus on univariate and multivariate analysis with Cox proportional hazards models and plotting Kaplan-Meier survival curves.

To explore these models we will use a data set of intestinal NK and T-cell lymphomas.^{4} This paper reported an overall 1-year survival of 36% and univariate regression analysis showed that NK-cell lineage, multicentricity, and perforation were associated with poor prognosis. NK-cell lineage was the only significant prognostic factor by multivariate analysis.

Table 1 gives enough data to perform an analysis. To use it, copy the table and paste into Notepad and save. Then open as a space-delimited file in Excel. Some modification is necessary but it is not too laborious. Then save again as a text (tab-delimited) file. You can also just use the table provided in Supplemental Digital Content 2, http://links.lww.com/PAP/A3 and Supplemental Digital Content 3, http://links.lww.com/PAP/A4.

Then import it into R:

>table1<-read.table(file=Chuang_table.txt,sep=\t,header=TRUE)

Attach the table so it is easier to work with column headings:

>attach(table1)

As we have done before, you can use the “fix()” command to view the table in the R data editor. To perform survival analysis, first we will need to create an object that contains the necessary information.

> x<-Surv(months,as.integer(Outcome)==2)

This function creates a “survival object” called x, which contains the follow-up time in months, and the status indicator for each patient (Outcome). The notation “as.integer” is necessary to coerce the Outcome variable into a number which the Surv function requires. Now the outcome variables are coded as numeric variables such that DOD(dead of disease)=2, AWD(alive with disease)=1, and NED(no evidence of disease)=3.

This object can now be manipulated to perform Cox proportional hazards modeling and plot Kaplan-Meier survival curves. The paper reported that NK-cell lineage was a significant adverse prognostic indicator in the context of intestinal lymphoma. We can perform a univariate analysis by diagnosis as follows:

>summary(coxph(x∼Dx))

Call:

coxph(formula=x∼Dx)

n=30

*R* ^{2}=0.145 (max. possible=0.987)

Likelihood ratio test=4.71 on 2 *df*, *P*=0.0948

Wald test=5.39 on 2 *df*, *P*=0.0677

Score (log-rank) test=5.97 on 2 *df*, *P*=0.0505

The above command constructs a Cox proportional hazard model stratified by diagnosis. The summary() function prompts R to perform 3 different tests for statistical significance. We can see from the results that the score test (logrank) is the only significant test at the *P*<0.05 level. These differences reflect alterative assumptions in calculating the baseline hazard of the overall population and which is most appropriate will depend on the data set. The results of the univariate analysis, reported in Table 3 of the paper by Chuang et al., do not match the current model.

Although the study included 2 cases of enteropathy-associated T-cell lymphoma, the univariate analysis combined the T-cell lymphomas into a single group of peripheral T-cell lymphomas. To recapitulate this analysis we will have to reassign cases 7 and 23.

> Dx[7]=Dx[23]=PTCL

Now we can repeat the previous hazard analysis.

summary(coxph(x∼Dx))

Call:

coxph(formula=x∼Dx)

n= 30

*R* ^{2}=0.134 (max. possible=0.987)

Likelihood ratio test=4.31 on 1 *df*, *P*=0.0379

Wald test=5.07 on 1 *df*, *P*=0.0244

Score (log-rank) test=5.6 on 1 *df*, *P*=0.0179

The hazard ratio comparing NK to peripheral T-cell lymphoma is given by the “exp(coef)” which in this case is 3.08 (very close to what was reported; 3.07) and the Wald test most closely approximates the reported *P*-value (0.025).

Now that we have successfully performed our first univariate analysis, you can repeat this process for each column in the data table to determine the hazard ratios and significance of the remaining reported variables. Specifically, the study reports multicentricity and perforation as significant additional univariate predictors of outcome. We find that perforation is associated with a 3.97 risk of dying (*P*=0.031; Wald test), whereas multicentricity is associated with a 2.9-fold risk (*P*=0.0268; log-rank test) both similar to the reported values.

The coxph() function can also be used to perform multivariate analysis. Let us create a multivariate model utilizing the variables found to be important in univariate analysis.

summary(coxph(x∼Dx+Perforation+Multicentricity))

Call:

coxph(formula=x∼Dx+Perforation+Multicentricity)

n=30

*R* ^{2}=0.35 (max. possible=0.987)

Likelihood ratio test=12.9 on 3 *df*, *P*=0.00479

Wald test=11.8 on 3 *df*, *P*=0.00824

Score (log-rank) test=13.9 on 3 *df*, *P*=0.00303

We find similar results to what was described in the paper. Notice that while the *P*-values for the overall model are significant (reported at the bottom) only NK-cell lineage (hazard ratio=3.05; *P*=0.033) and multicentricity (hazard ratio=2.677; *P*=0.04) remain significant multivariate predictors of survival.

In addition to calculating hazard ratios, R is capable of plotting Kaplan-Meier survival curves utilizing the same survival object we already created for Cox proportional hazard models.

> plot(survfit(x∼1))

This command simultaneously performs 2 functions on the survival object (x). First the function “survfit()” fits survival times to the Kaplan-Meier estimator. The second function plots the survival curve for intestinal lymphomas, with confidence intervals (Fig. 5). Using this approach we can confirm the reported 1-year survival for intestinal lymphomas.

Since we have already performed statistical modeling of the data with the Cox proportional hazards model we know that NK cell lineage is a significant predictor of survival. We can also show these results as a Kaplan-Meier plot

> surv.bydx<-survfit(x∼Dx)

>plot(surv.bydx)

Now that we have stratified survival curves we need to evaluate the significance of the differences in outcome based on diagnosis. The log-rank test is a statistical test that evaluates the significance of separation between 2 survival curves. This approach differs somewhat from the Cox proportional hazards model that assumes a baseline population hazard rather than comparing the survival between defined populations.

Type the following function to check for a statistical significant difference between NK and PTCL.

>survdiff(x∼Dx)

Call:

survdiff(formula=x∼Dx)

χ^{2}=5.9 on 1 *df*, *P*=0.0147

This confirms the significant difference in outcome between intestinal lymphomas stratified by lineage. Putting it all together, in addition to basic curve plotting, ability the plot() function can also create customized graphics.

For example by adding colored lines:

>plot(surv.bydx,lty=1:2,col=2:3,main=Survival in Intestinal Lymphoma)

There are 2 ways to add figure legends.

This command allows you to click on the plot to add where you want it:

>legend(locator(1),c(NK,PTCL),lty=1:2,col=2:3)

Or you can also provide the plot coordinates as illustrated and the *P*-value from the log-rank test.

>legend(60,0.4,p=0.0147)

You should now have a plot that looks like Figure 6. This can be saved as a publishable tif image using the pull-down menu under “file” in the R console.

##### If You Want to Learn More

1. The general introduction to R from the user's manual in Appendix is a good start. An introduction to R^{5} is also available through the help pull down menu in the R application.

2. Reference by Dalgaard^{6} is also excellent although much of the statistics is not necessarily applicable to surgical pathology.

3. Bioconductor^{7} is also a very helpful website for those involved in array research.