Operating room (OR) anesthesia involves clinical and managerial decision making that relies on communication over periods <5 minutes. Whereas traditionally many organizations have used overhead pagers and numerical pagers with local antennas for communication, such communication tools are inadequate for computer-assisted decision support. The analysis of latencies between message transmission and receipt is an essential component of practical and regulatory assessment of clinical and managerial decision-support systems^{1}^{–}^{3} in anesthesia.^{4} Quantifying latency precisely is important because of its sometimes large but consistently complicated (nonlinear) impact on resulting decisions.^{5},^{6} Such studies necessarily include both electronic and human components. Organizations implementing near-real-time notification systems (e.g., as part of a clinical decision-support system) need to compare methods of message delivery (e.g., alphanumeric pagers and voice communication devices) at their facility and select the most reliable approach.

Statistical methods are available to analyze these types of data. For a single communication process, percentages of latencies exceeding managerially important (although arbitrarily chosen) targets can be calculated along with corresponding confidence intervals.^{1},^{7} However, these methods are limited in that they are for single groups (i.e., not to compare percentiles among alternative communication methods). Furthermore, they do not quantify mean latency. The mean is important because for a finite population of patients, pages, and/or notifications, the total notification delay time is proportional to the mean. At first thought, an analysis of variance (ANOVA) would seem to be the appropriate statistical approach for the comparison of the mean latencies among several different communication systems. However, the probability distributions of latency measured in seconds or minutes are markedly nonnormal (e.g., log-normal) and the variances are substantially different among groups, violating important assumptions of this method. We show that generalized pivotal methods are a more appropriate method of analysis for studies of anesthesia providers' latencies of response to communication technologies.

## Motivation—Means of Log-Normally Distributed Data

Before performing a full ANOVA on means from several log-normal distributions, we demonstrate the failures of traditional analysis by using an example with a simple comparison of 2 means.

St. Jacques and colleagues compared the time it took anesthesiologists and certified registered nurse anesthetists at Vanderbilt to respond to alphanumeric page versus audible notification from their Vocera® hands-free voice over Internet devices (Vocera Communications, Inc., San Jose, CA).^{8} The following are *g* = 2 groups' times for response:

- For
*n*_{1} = 30 Vocera calls, the mean ± SD of observed latencies was 0.503 ± 1.252 minutes, giving a coefficient of variation of 2.49.
- For
*n*_{2} = 43 alphanumeric pages, the observed latencies were 1.968 ± 1.827 minutes, giving a coefficient of variation of 0.928.

From these data, the authors concluded that the latency of response to the alphanumeric pages was about 4-fold longer than with the Vocera calls.^{8} The data are reported to 3 significant digits so that readers can repeat the calculations, below.

Latencies of communication devices often follow 2-parameter log-normal distributions. Figure 1 shows 2 such distributions, with means and standard deviations matching those observed^{8} for Vocera and alphanumeric pager. Log-normal distributions (e.g., as shown in Fig. 1) are described by the parameter μ and ς^{2}, the mean and variance of the natural logarithms of the measurements in the time scale. If *X* represents latencies measured in the time scale (e.g., minutes), then the mean latency Mean (X) depends on both μ and ς:

The exp(μ) is the median of the distribution. The coefficient of variation in the time scale, *CV*, depends on ς, the SD in the log scale^{9}:

Rearranging equation (2) permits estimation of ς for the Vocera and alphanumeric pager data from the observed coefficients of variations:

Substituting in the measured values of *CV* = 2.49 and 0.928, respectively, the estimates for the standard deviations of latency in the log scale are 1.404 for Vocera and 0.788 for alphanumeric pager, respectively. Next, equation (1) is rearranged:

Substituting in the measured mean latencies in the time scale (0.503 and 1.968 minutes) and the estimated standard deviations in the log scale (1.404 and 0.788), the estimates of μ are obtained: −1.673 for Vocera and 0.367 for alphanumeric pager, respectively. The graphs of the log-transformed latencies in Figure 2 were created using these estimates.

The true values are unknown for the means and standard deviations in the time scale and in the log scale, only the sample means and sample standard deviations of the observations. Because this distinction is important, explicit terms are used for the sample estimates. The sample means in the time scale, given above, are represented by x[Combining Macron]_{1}=0.503 for the Vocera data and by x[Combining Macron]_{2}=1.968 for the alphanumeric pager data. The sample means and standard deviations in the log scale are represented by y[Combining Macron]_{1}=−1.673 and s_{1}=1.404 for the Vocera data and by y[Combining Macron]_{2}=0.367 and s_{2}=0.788 for the alphanumeric pager data.

Comparison of Figures 1 and 2 shows how difficult it is to visualize the relationship between the normal distribution's bell-shaped curve of the logarithm of latency (Fig. 2) and the latency itself in units of minutes (Fig. 1). Figure 3 plots distributions for the logarithm of latency, as in Figure 2, but uses a horizontal axis of time in minutes, rather than an axis of the natural logarithm of minutes.

The dotted lines in Figure 3 show the cumulative probability distribution function curves for the 2 log-normal distributions. The cumulative distribution curves show that the areas under each of the log-normal probability density functions equal 1. The median is the value of the latency for which the value of the cumulative probability distribution equals 50%. An important property of log-normal distributions is that the median in the log scale equals the mean in the log scale (i.e., both are μ). The medians are shown by the vertical lines. The means of the latencies themselves, in the time scale, are given by equation (1) and are shown by the black (Vocera) and red (alphanumeric pager) arrows in Figure 3.

What matters economically when choosing among devices for potential use in a notification system is the magnitude of differences in time units. For example, St. Jacques and colleagues refer repeatedly to the 4-fold difference in mean latency of response between groups. Although they reported the result of a statistical analysis (*P* < 0.001) once in a table,^{8} they focused appropriately on the magnitude of the difference. As summarized in the Introduction, above, the mean is the appropriate summary statistic, because for a finite population of patients, pages, and notifications, the total notification delay time is proportional to the mean. Because the sample sizes are not hundreds (see Discussion) and because there is considerable variance in the latency for both groups, the sample mean is reported along with the confidence interval for the mean.

The difference of the means in the log scale equals μ_{2}−μ_{1}. The sample (observed) difference equals y[Combining Macron]_{2}−y[Combining Macron]_{1}. The 100(1−α/2)% 2-sided confidence interval for μ_{2}−μ_{1} equals

where z_{α/2} is the α/2 percentile of the normal distribution. For example, if α=0.05, then the 95% confidence interval is obtained by using z_{α/2}=1.96. Substituting in the y[Combining Macron]_{1}, y[Combining Macron]_{2}, s_{1}, and s_{2} from above gives a 95% confidence interval for μ_{2}−μ_{1} of 1.48 to 2.59. The lower limit of the difference (1.48) is shown by the length of the lower green arrow in Figure 4. The upper limit of the difference (2.59) is shown by the length of the blue arrow. Taking the exponential gives the confidence interval in the time scale:

Substituting y[Combining Macron]_{1}, s_{1}, y[Combining Macron]_{2} and s_{2} into equation (3) results in the 95% confidence interval of 4.4 to 13.4. The interval straddles

=7.69, the ratio of the sample *medians*. However, our interest is in calculating the confidence interval for the ratio of the means: Mean(X_{2})/Mean(X_{1}). The ratio of the sample means is x[Combining Macron]_{2}/x[Combining Macron]_{2}=1.968/0.503=3.9. That ratio does not even fall within the range of the 95% confidence interval, 4.4 to 13.4 minutes! Only if the coefficients of variation in the time scale are identical between groups, or equivalently the standard deviations in the log scale are identical, is equation (3) a confidence interval for the ratio of the means in the time scale.

The confidence interval for the ratio of the means of 2 log-normal distributions can be obtained using generalized pivotal confidence intervals.^{10} Equations (4) and (5) in Appendix 1 give the steps to calculate a confidence interval for the ratio of 2 means using the pivotal method.^{11} Using those equations, the calculated 95% confidence interval for the ratio of the means in the time scale, Mean(X_{2})/Mean(X_{1}), is 1.4 to 7.7, which straddles the observed ratio x[Combining Macron]_{2}/x[Combining Macron]_{2} of 3.9.

This example highlights that to perform ANOVA of mean latencies, or any other log-normally distributed observation, it is not sufficient simply to take the logarithm of the observations, undertake the usual calculations, and then take the exponential of the result. That works only if the variances in the log scale are close to equal. That does not apply for communication latencies.

## Example Data

When a case was in an OR for longer than the scheduled duration, an instant message appeared on the anesthesia information management system (AIMS) workstation asking for the anesthesia provider's prediction of the time when the case would end.^{2} These messages were so-called “negotiated” interruptions, meaning that the computer announced its need to interrupt the respondent, and then the respondent chose whether to deal with the interruption immediately or later. Although implementation of the instant messages was done without education, notification, or buy-in, anesthesia providers sent a response for all messages.^{2}

We study the time to acknowledgment of the message and compare 2 levels of a treatment factor: whether the message was nonanchored or anchored. The nonanchored message (coded 0) asked for a time estimate. The anchored message (coded 1) gave recommended bounds for the time estimate, resulting in more words and statistical complexity, but an easier decision.

As described in the previous paper,^{2} there appeared qualitatively to be no difference in response time between treatments (anchored or not), based on presence of a prior message or not. Thus, the data were pooled. However, no statistical analysis was performed to support the validity of the pooling, because the appropriate statistical method was unclear to us.^{2} In this paper, we examine how to perform such a fixed-effects ANOVA.

The ANOVA was performed for the 472 of the 560 messages with response times under 5 minutes, and with 0 or 1 previous messages. Summary statistics for the 4 groups used in the ANOVA are given in Table 1. We consider the remaining 88 data in the following Mixture Distributions section. Those data include 20 messages with 2 prior messages and 2 messages with 3 prior messages.

Sample sizes are larger than 20 for all 4 groups (Table 1), just as for the data^{8} in the Motivation—Means of Log-Normally Distributed Data section. This criterion is important because generalized pivotal methods are inaccurate for comparing groups unless the sample sizes are larger than 10 per group (see Appendix 1).^{12}

The latency has a minimum of only 0.13 minutes, and the overall mean is 1.30 minutes. Because the minimum latency is close to zero, any shift to the left before the logarithm is taken would itself be close to zero. This shift is the log-normal distribution's third parameter and is effectively equal to zero (i.e., can be neglected).^{13} The result is that we study 2-parameter log-normal distributions, as shown in the preceding Motivation—Means of Log-Normally Distributed Data section and in Figures 1 to 4. Two-parameter log-normal probability plots of the observations are shown in Figure 5. The fit matches a log-normal for all 4 groups (Anderson–Darling test *P* ≥ 0.179).

The Appendix 2 gives Li's equations for the fixed-effects 2-way ANOVA with 2 levels for the treatment factor (referred to as *factor 1*) and *g* levels for the group factor (referred as *factor 2*).^{14}

## Analysis of Variance

Table 2 summarizes the results of the pivotal fixed-effects ANOVA. The test shows that the 4 mean durations are not statistically different (*P* = 0.683). At this stage, the analysis could stop, because there are no differences to explore. However, to permit comparison of results to traditional ANOVA in the next paragraph, we perform tests on interactions and main effects. As expected, the proportional effect of anchoring does not depend on the number of prior messages (*P* = 0.913). There are no significant effects due to the type of anchoring (*P* = 0.815) or due to the presence of prior messages (*P* = 0.335).

Table 2 also includes pivotal confidence intervals for the logarithm of the ratio of the 2 treatments (anchor versus no anchor) for each group *j*. The confidence intervals cover 0, suggesting that anchoring has no effect on the mean time to acknowledge the signal.

The pivotal method does not require that the variances of the log-transformed data are the same among treatment and group combinations (i.e., that the coefficients of variation are the same in the time scale). Running a standard ANOVA on log-transformed data and interpreting the results in terms of hypotheses about the means of observations in the original time scale only work if the variances of the log-transformed data are the same among groups (see Motivation—Means of Log-Normally Distributed Data section, above). Because means of log-normal distributions are functions of both the level and the variability of the log-transforms, an ANOVA on log-transformed data will provide incorrect conclusions about log-normal means if coefficients of variation differ among groups. Table 3 shows the results of an ANOVA on the log-transformed data. The *P* values are markedly different from those of Table 2. For example, Table 2 shows that previous messages do not have a statistically significant effect (*P* = 0.335). However, Table 3 shows that traditional ANOVA applied to logarithms obtains *P* = 0.012. An incorrect decision would be made if traditional ANOVA were used. The discrepancy reflects the heterogeneity of variances. From Table 1, for anchored data, the SD in the log scale equals 0.607 for no previous messages and 0.788 for previous messages. Squaring the terms, the variances are 0.368 and 0.622, respectively. The ratio of the variances equals 0.593, where 0.593 = 0.368/0.622. Using the *F* distribution, the confidence interval for the ratio equals 0.335 to 0.970, *P* = 0.037. Even small inequality in variances between groups can influence the interpretation of the results.

## Mixture Distributions

The results in Table 2 suggest that the 4 samples can be pooled. The 95% confidence interval for the overall mean latency of the 472 observations is obtained by calculating the 2.5% and 97.5% percentiles of the exponentials of the simulated values from equation (4) in Appendix 2. The interval extends from 1.23 to 1.39 minutes, appropriately straddling the observed overall latency of 1.30 minutes from Table 1.

Next, we add the information for the remaining latency data, as the preceding analysis used the 472 observations with 0 or 1 previous messages and latencies <5 minutes. There are another 66 observations with latencies longer than 5.0 minutes and 22 observations with 2 or 3 previous messages. The histogram of all 560 natural logarithms of latencies in Figure 6 shows that there is a bimodal distribution. The parameters of the mixture of 2 log-normal distributions were estimated, subject to the constraint of equal standard deviations in the log scale.^{†} There are 100*p* = 86% of messages with mean in the log scale of μ_{1} = 0.054. The other 100(1 − *p*) = 14% of messages have means in the log scale of μ_{2} = 2.048. The common estimated SD in the log scale is ξ = 0.601. Substituting these estimates into equation (1), the means are 1.3 minutes and 9.3 minutes, respectively. Thus, 86% of messages were responded to quickly, averaging 1.3 minutes. However, there was a second smaller (14%) population of messages with a slower response time, averaging 9.3 minutes. The red curve in Figure 6 shows this estimated distribution. The probability plot in Figure 7 shows the excellent fit. Whereas a single log-normal distribution has Kolmogorov–Smirnov test of fit *P* < 0.00001 (i.e., the data do not fit the proposed distribution), the mixture of 2 log-normal distributions has *P* = 0.347 (i.e., a reasonable fit).

Suppose that we had incorrectly treated the 560 observations as a sample from a single log-normal distribution. Then, the 95% confidence interval for the log-normal mean would have been calculated from the 2.5% and 97.5% percentiles of the exponentials of the simulated pivotal statistics from equation (4) in Appendix 1. Those limits are 1.95 to 2.33 minutes. Matching results in the Motivation—Means of Log-Normally Distributed Data section, the interval does not even include the sample mean of the 560 observations, 2.46 minutes. We include a small derivation in Appendix 3 showing why this behavior is observed.

Although the pivotal analysis is sensitive to violations of a single log-normal distribution, many distribution-free statistical methods are suitable for calculating the confidence interval for the mean of a positively valued right skewed distribution when the sample size is large.^{15} Banik and Kimbria recently performed Monte Carlo simulations to compare the performance of 18 such methods. For example, Chen's method^{16} estimates the 100(1 − α)% confidence interval by modifying the Student *t* distribution interval to include the coefficient of skewness. The simple equation (13) is given in Appendix 3. Applied to our data, Chen's 95% confidence interval is 1.96 to 2.94 minutes. Although these methods for single groups do not help with the ANOVA of the preceding section, they give us the confidence interval that we want for the pooled data.

## DISCUSSION

Anesthesia providers rely on receiving communications within minutes of their transmission, ideally <0.5 minutes for emergencies. The analysis of latencies between when messages are sent and when they are received will be an essential component of practical and regulatory assessment of clinical and managerial decision-support systems.^{1}^{–}^{3} Quantifying latency precisely is important because of its sometimes large but consistently complicated impact on decision making.^{5},^{6} Latency data including human response time have moderate sample sizes, large coefficients of variation (>1.70), and highly heterogeneous coefficients of variation among groups.

We showed that, for such data, ANOVA can obtain highly inaccurate results and correspondingly poor managerial decisions, whether performed in the time scale or in the log scale followed by taking the exponential of the result. In contrast, inference of mean latencies based on log-normal distributions can be performed using pivotal methods. Fixed-effects 2-way ANOVA can be extended to the comparison of means of log-normal distributions. Pivotal inference does not require that the coefficients of variation of the studied log-normal distributions are equal, and can be used to assess the interaction and the proportional main effects of 2 factors. We also showed that latency data including a human behavioral component can be bimodal in the log scale (i.e., a mixture of distributions). In such situations, ANOVA can be performed on the homogeneous segment of the data, followed by a single group analysis applied to all or portions of the data using the more robust method.

One alternative approach that would not be suitable is use of rank-based methods. First, these methods produce *P*-values, not confidence intervals, and the confidence intervals are needed to assess the practical importance of ratios of means. Second, the methods compare mean ranks between (among) groups, not means. For example, consider 2 groups following log-normal distributions with equal means, equal sample sizes of 50 subjects, but unequal coefficients of variation, 1.87 in one group and 0.81 in the other. The Wilcoxon test rejects the null hypothesis of equality at the α = 0.05 criterion for 68% of simulations.^{17} The value of 68% is far different from the appropriate 5.0% of simulations because although the means are equal, the medians are unequal, and the medians are close to the mean ranks that the Wilcoxon test compares.

Another alternative approach would be suitable if the sample sizes were much larger.^{18} First, sequential batches of latencies measured at equal time intervals (e.g., first 288, second 288, … ) are created. Second, the ratio of the means is calculated pairwise for each batch. Third, the analyses are performed using the ratios of the means, with the sample size being the number of batches. This method of batch means is how most statistical problems in operating room management are analyzed.^{19}^{–}^{21} The method is suitable for analyzing latencies when there are no humans involved (e.g., a computer sending messages through the 2 systems under comparison every 5 minutes and recording when messages were received on each device). For example, after 20 workdays, there could be 20 batches each with the pairwise comparison of 288 latencies, where 288 = 24 hours × 12 tests per hour. The method of batch means was not suitable for analyzing the data in Table 1 because the sample size of the smallest group was several-fold too small for the method. The same applied to the data^{8} in the Motivation—Means of Log-Normally Distributed Data section. For the 2-group comparison in the Motivation—Means of Log-Normally Distributed Data section, a Taylor series expansion method could be used with several batches, some small. However, the method would not be suitable for the ANOVA.^{22}

## RECUSE NOTE

Franklin Dexter is the Statistical Editor and section Editor for Economics, Education, and Policy for *Anesthesia & Analgesia*. This manuscript was handled by Steve Shafer, Editor-in-Chief, and Dr. Dexter was not involved in any way with the editorial process or decision.

## DISCLOSURES

**Name:** Johannes Ledolter, PhD.

**Contribution:** This author helped design the study, conduct the study, analyze the data, and write the manuscript.

**Attestation:** Johannes Ledolter approved the final manuscript.

**Name:** Franklin Dexter, MD, PhD.

**Contribution:** This author helped design the study, conduct the study, analyze the data, and write the manuscript.

**Attestation:** Franklin Dexter approved the final manuscript.

**Name:** Richard H. Epstein, MD, CPHIMS.

**Contribution:** This author helped conduct the study and write the manuscript.

**Attestation:** Richard H. Epstein approved the final manuscript.

† Maximum likelihood estimation was performed using the R function VGLM with option mix2normal1. Accessed January 8, 2011: http://rgm2.lab.nig.ac.jp/RGM2/R_man-2.9.0/library/VGAM/man/mix2normal1.html

Cited Here...

## REFERENCES

1. Epstein RH, Dexter F, Ehrenfeld JM, Sandberg WS. Implications of event entry latency on anesthesia information management system decision support systems. Anesth Analg 2009;108:941–7

2. Dexter F, Epstein RH, Lee JD, Ledolter J. Automatic updating of times remaining in surgical cases using Bayesian analysis of historical case duration data and instant messaging updates from anesthesia providers. Anesth Analg 2009;108:929–40

3. Epstein RH, Ekbatani A, Kaplan J, Shechter R, Grunwald Z. Development of a staff recall system for mass casualty incidents using cell phone text messaging. Anesth Analg 2010;110:871–8

4. Poon EG, Kuperman GJ, Fiskio J, Bates DW. Real-time notification of laboratory data requested by users through alphanumeric pagers. JAMIA 2002;9:217–22

5. Dexter F, Macario A, Traub RD. Enterprise-wide patient scheduling information systems to coordinate surgical clinic and operating room scheduling can impair operating room efficiency. Anesth Analg 2000;91:617–26

6. Epstein RH, Dexter F, Piotrowski E. Automated correction of room location errors in anesthesia information management systems. Anesth Analg 2008;107:965–71

7. Dexter F, Bayman EO, Epstein RH. Statistical modeling of average and variability of time to extubation for meta-analysis comparing desflurane to sevoflurane. Anesth Analg 2010;110:570–80

8. Jacques P St., France DJ, Pilla M, Lai E, Higgins MS. Evaluation of a hands-free wireless communication device in the perioperative environment. Telemed J e-Health 2006;12:42–9

9. Limpert E, Stahel WA, Abbt M. Log-normal distributions across the sciences: keys and clues. BioScience 2001;51:341–52

10. Weerahandi S. Generalized confidence intervals. J Am Stat Assoc 1993;88:899–905

11. Krishnamoorthy K, Mathew T. Inferences on the means of lognormal distributions using generalized p-values and generalized confidence intervals. J Stat Plan Infer 2003;115:103–21

12. Krishnamoorthya K, Lua F, Mathew T. A parametric bootstrap approach for ANOVA with unequal variances: fixed and random models. Comput Stat Data Anal 2007;51:5731–42

13. Spangler WE, Strum DP, Vargas LG, May JH. Estimating procedure times for surgeries by determining location parameters for the lognormal model. Health Care Manag Sci 2004;7:97–104

14. Li X. A generalized p-value approach for comparing the means of several lognormal distributions. Statist Probab Letters 2009;79:1404–8

15. Banik S, Kibria BMG. Comparison of some parametric and nonparametric type one sample confidence intervals for estimating the mean of a positively skewed distribution. Commun Stat Simulat 2010;39:361–89

16. Chen L. Testing the mean of skewed distribution. JASA 1995;90:767–72

17. Zhou XH, Gao S, Hui SL. Methods for comparing the means of two independent log-normal samples. Biometrics 1997;53:1129–35

18. Law AM, Kelton WD. Simulation Modeling and Analysis. 2nd ed. New York McGraw-Hill, 1991: 551–3

19. Masursky D, Dexter F, O'Leary CE, Applegeet C, Nussmeier NA. Long-term forecasting of anesthesia workload in operating rooms from changes in a hospital's local population can be inaccurate. Anesth Analg 2008;106:1223–31

20. Dexter F, Epstein RH, Marcon E, Ledolter J. Estimating the incidence of prolonged turnover times and delays by time of day. Anesthesiology 2005;102:1242–8

21. Wachtel RE, Dexter F. Influence of the operating room schedule on tardiness from scheduled start times. Anesth Analg 2009;108:1889–901

22. Ledolter J, Dexter F. Analysis of interventions influencing or reducing patient waiting while stratifying by surgical procedure. Anesth Analg 2011;112:950–7

23. Student The probable error of a mean. Biometrika 1908;6:1–25

### APPENDIX 1

#### From Motivation—Means of Log-Normally Distributed Data Section

The following steps calculate a 100(1−α) percent confidence interval for the ratio of 2 means^{11}:

For j=1 to 2

For k = 1 to m (e.g., m = 100,000)

Generate Z_{j}^{k}: a normally distributed random number with mean 0 and variance 1

Generate U_{j}^{k}: the square root of a χ^{2} distributed random number with n_{j}−1 *df*.

where y[Combining Macron]_{j} and s_{j} are the sample means and sample standard deviations of the log-transformed data

Next *k*

Next *j*

Calculate the 100(α/2) and 100(1−α/2) percentiles of the simulated quantities

, *k*=1, … *m*.

Equation (4) for T_{meanj}^{k} follows from the same 2 statistical properties as Student's *t* test^{23}:

follows a normal distribution (Z_{j}), and (n_{j}−1)s_{j}^{2}/σ_{j}^{2} follows a χ^{2} distribution (U_{j}^{2}). Data and R code are available as a Web supplement (see Supplemental Digital Content 1, http://links.lww.com/AA/A292).

We performed Monte Carlo simulations of equations (4) and (5) for pairs of log-normal distributions with common μ = *1.0 and σ*^{2} = 0.1. Whereas for *n*_{1} = *n*_{2} = 40, coverage of α = 0.050 confidence intervals was 0.050 (SE 0.003), for *n*_{1} = *n*_{2} = 20 coverage was 0.046, and for *n*_{1} = *n*_{2} = 10 coverage was 0.040. Simulated coverage was no different for 2 normal distributions with the same parameters. Krishnamoorthya and colleagues showed similar deterioration of coverage for smaller sample sizes and normal distributions while using different parameter values and numbers of groups.^{11},^{12} To understand results, we consider the extreme *n*_{1} = *n*_{2} = 2 for which our simulations of 2 normal distributions demonstrated very poor coverage (0.004), in comparison with exact coverage for a single normal distribution (0.050). The generalized probability value of the pivotal test statistic is

where t_{1} and t_{2} have *t* distributions with 1 degree of freedom and

is the pooled variance. Under the assumption of independent normal random variables with equal means and variances,

has a *t* distribution with

, where W_{1}=a_{1}t_{1} is Cauchy with scale

and W_{2} = a_{2}t_{2} is Cauchy with scale

. Hence W=W_{1}+W_{2} is Cauchy with scale

ranging from a minimum of 1 (when 1 sample variance dominates the other) to a maximum of ✓2 (when the sample variances are the same). Consequently,

, where *F* is the cdf of the Cauchy distribution with scale

and

, where *G* is the cdf of the *t* distribution with 2 *df.* For scale λ = ✓2 and *w* = 0.025, *F*^{−1}(0.025) = −17.9692 and *GF*^{−1}(0.025) = *G*(−17.9692) =0.0015. For scale λ = 1 (which implies a *t* distribution with 1 degree of freedom) and *w*=0.025, *F*^{−1}(0.025) = −12.7062, and *GF*^{−1}(0.025) = *G*(−12.7062) = 0.0031. For a 2-sided test with significance level 0.05, the rejection probability is doubled, implying bounds 0.003≤P[GPV≤0.05]≤0.006, matching the above simulated value of 0.004. Thus, failure to achieve the 0.05 significance level results from reliance on separate variance estimates from small sample sizes in comparison with assuming that the variances are equal. This is the so-called Behrens–Fisher problem (e.g., Student *t* distribution with unequal versus equal variances).

### APPENDIX 2

#### From Analysis of Variance Section

Our analysis of the fixed effects 2-way ANOVA with 2 levels for the treatment factor (referred to as *factor 1*) and *g* levels for the group factor (referred to as *factor 2*) starts with an overall test of equality of the *2g* means, *E*(*X*_{ij}) = exp(μ_{ij} + 0.5σ_{ij}^{2}) = exp(η_{ij}), *i* = 1, 2 and j=1, 2, …, *g*. The null hypothesis is written as H_{0}:η_{11}=η_{21}=…=η_{1g}=η_{2g}. For our data, the 2nd factor (previous message) has *g* = 2 groups.

Li defines the vector of pivotal statistics

, with elements^{14}

where y[Combining Macron]_{ij} and s_{ij} are the sample means and sample standard deviations of the log-transformed data for the (ij)^{th} treatment/group combination, and the Z_{ij} and U_{ij}^{2} are independent standard normal and χ^{2} (with (n_{ij}−1) *df*) random variables. For simplicity of notation, the superscript *k* corresponding to *k* = 1, 2, …, *m* simulations is not displayed. The term

follows a Student *t* distribution and thus has expectation of zero. Furthermore,

. Therefore, from equation (6), the elements of the mean (vector) are^{14}

Because the variance of the *t* distribution equals

and variance

, the diagonal covariance (matrix) V=V(T_{η[Combining Circumflex Accent]}) has elements^{14}

The null hypothesis H_{0}:η_{11}=η_{21}=…=η_{1g}=η_{2g} is expressed as H_{0}:H_{A}η=0, where 0 = (0, 0…, 0)′ is a 2*g*−1 × 1 vector of zeros, η=(η_{11}, η_{21}, …, η_{1g}, η_{2g})′, and H_{A} is the 2*g*−1 × *g* constraint matrix

Li^{14} showed that the null hypothesis can be tested by computing the test statistic

For each of the *m* simulations, the vector of pivotal statistics T_{η[Combining Circumflex Accent]} and the value of the test statistic TS_{A} are calculated. The generalized probability value P[TS_{A}>0] is estimated from the proportion of the simulated TS_{A}>0. The null hypothesis H_{0}:η_{11}=η_{21}=…=η_{1g}=η_{2g} is rejected when that estimated P[TS_{A}>0]<0.05. Accepting the null hypothesis implies that all means are the same and that neither factor has an effect on the response.

We adapt Li's general testing approach in equation (10) to test interaction and main effects. For each level *j* of the group factor (factor 2), the log-ratio of the 2 treatment means,

, expresses the proportional effect of treatment. Interaction is absent if these *g* log-ratios are the same. We write the null hypothesis of no interaction H_{0}:η_{21}−η_{11}=η_{22}−η_{12}=…=η_{2g}−η_{1g} as H_{0}:H_{INT}η=0, where 0=(0, 0,…, 0)′ is a *g* − 1 × 1 vector of zeros, η=(η_{11}, η_{21}, …, η_{1g}, η_{2g})′, and H_{INT} is the *g* − 1 × 2*g* constraint matrix

The test statistic for interaction is obtained from equation (10), replacing the constraint matrix H_{A} with H_{INT}. Rejection of H_{0} implies that the proportional treatment effects depend on the group (2nd) factor. When there is appreciable interaction, it can be misleading to talk about main effects of treatment (factor 1) and group (factor 2). If the 2nd factor represents groups that cannot be controlled, such as the number of previous messages, confidence intervals of the proportional treatment effects would be reported separately for each group.

There is no main effect of treatment (factor 1) if η_{21}+η_{22}+…+η_{2g}=η_{11}+η_{12}+…+η_{1g} or (η_{21}−η_{11})+(η_{22}−η_{12})+…+(η_{2g}−η_{1g})=0 (i.e., the sum or average of the *g* proportional treatment effects is zero). The null hypothesis is expressed as H_{0}:H_{F1}η=0, with the 2*g* × 1 constraint matrix H_{F1}=(−1, 1, −1, 1, …, −1, 1).

There is no main effect of group (factor 2) if the *g* sums η_{1j}+η_{2j}, j=1, 2, …, g, are the same. The null hypothesis is expressed as H_{0}:H_{F2}η=0 with H_{F2} the *g* − 1 × 2*g* matrix

### APPENDIX 3

#### From Mixture Distributions Section

The 95% confidence interval for the mean calculated using equation (4) and all 560 observations did not straddle the sample mean. To understand why, let *W* follow the Bernoulli distribution with parameter *p*. The pivotal approach assuming a single log-normal distribution X=exp(Y) effectively assumes a normally distributed Y=WY_{1}+(1−W)Y_{2} with mean

and variance

The assumed mean Mean(X)=exp(μ_{M}+0.5σ_{M}^{2})=1.99 minutes, which is within the estimated 95% confidence interval of 1.95 to 2.33 minutes. However, the 1.99 minutes is smaller than the sample mean of 2.46 minutes, which is not within the confidence interval. Instead of relying on the pivotal method, which is inaccurate in this context, we apply Chen's method to estimate the 100(1 − α)% confidence interval for the mean. Chen's method modifies the Student *t* distribution interval to include the coefficient of skewness^{16}:

where x[Combining Macron], s_{x}, and [Combining Circumflex Accent]γ are the sample mean, SD, and skewness of the observations of the single group in the time scale. Banik and Kimbria^{15} showed that for a log-normal distribution with n=100 and σ_{M}=1.0 (for our data, *n* = 560 and [Combining Circumflex Accent]σ_{M}=0.92), Chen's method covers the true mean for 93.8% of simulations, close to the desired 95.0% rate.^{15}