Requirement	Statistical Analysis
Section	3.2.3.3
JIRA Task	EIR-63

Reviewed For	Date
Conventional spacing between sections	2016-08-31

Introduction

Insights into the origins of a disease outbreak (or into scientific studies generally) typically emerge during statistical analysis of the data. The “Statistics” folder in the Command Explorer of Classic Analysis provides simple descriptive statistics and graphing tools. More sophisticated inferential statistics, such as linear and logistic regression, are available in “Advanced Statistics”, along with tools especially useful in epidemiology, such as Kaplan-Meier survival curves and Cox Proportional Hazards. Descriptive statistics under complex sampling models are also supported.

Data Analysis Command Reference

The Epi Info™ User Guide provides a reference guide to the commands available in Classic Analysis and, by extension, the Visual Dashboard introduced with Epi Info™ 7.0. Each entry in the Command Reference contains:

a brief description of the command's purpose and function;
the execution syntax, a formal description of the command's arguments and parameters, including which are mandatory and which are optional, and any constraints on allowed values;
optional additional comments; and
an example of command usage, including arguments and actual parameters.

The following table lists the available Classic Analysis (CA) Statistics commands with their Visual Dashboard (VD) equivalents. It describes the purpose and output of each command and links to the corresponding sections of the CA and VD User Guides.

Classic Analysis and Visual Dashboard Command Summary

Statistics
Classic Analysis Menu Command	Visual Dashboard Gadget	Purpose	User Guide Links
Analysis Commands→ Statistics→List	Add Analysis Gadget→Line List	Provides tabular display of selected variables.	CA, VD
Analysis Commands→ Statistics→Frequencies	Add Analysis Gadget→Frequency	Tabulates frequencies of dependent variable(s), optionally stratified and weighted by other variables.	CA, VD
`N/A`	Add Analysis Gadget→Word Cloud	Calculates the frequencies of individual "words" from a (preferably text) variable, then displays an image composed of those words, scaled and colored to reflect their frequencies in the data, with more frequent words having a larger font and the colors "striped" ^[1] to provide gradations of scale within a given point size.	`N/A`, VD
`N/A`	Add Analysis Gadget→Combined Frequency	Tabulates frequencies of all Boolean variables in a group. The group can be defined in Form Designer using the Group Field or in Visual Dashboard using Defined Variables→Create Variable Group feature. Groups can be created in Classic Analysis using the DEFINE command with the GROUPVAR keyword; however, the Combined Frequency analysis has not (yet) been implemented in CA.	`N/A`, VD
Analysis Commands→ Statistics→Tables	Add Analysis Gadget→M × N / 2 × 2 Table	Tabulates the number and frequency of subjects with respect to a dependent variable (outcome) vs. a risk factor, allowing weighting of the former and stratification of the latter; and calculates odds-based and risk-based estimates with 95% confidence intervals, as well as statistical tests, including chi-square and 1- and 2-tailed T-tests.	CA, VD
`N/A`	Add Analysis Gadget→ Matched Pair-Case Control	Given pairs of individuals matched for physical, demographic, or other relevant characteristics but differing in affected status, calculates the following statistics based on a dichotomous exposure variable: chi-square and 2-tailed P-value (McNemar and corrected); 1-tailed and 2-tailed P-values (Fisher's Exact); and odds-based parameters: Odds Ratio (Estimate, Lower, and Upper) and Exact (Lower and Upper).	`N/A`, VD
Analysis Commands→ Statistics→Means	Add Analysis Gadget→Means	Tabulates means for a continuous variable with weighting, stratification, and cross-tabulation by value, calculating descriptive statistics and the following tests: ANOVA, a Parametric Test for Inequality of Population Means; T-Test, Pooled (Equal Variances) and Satterthwaite (Unequal Variances); Mann-Whitney/Wilcoxon Two-Sample Test (Kruskal-Wallis test for two groups); Bartlett's Test for Inequality of Population Variances [Classic Analysis only].	CA, VD
`N/A`	Add Analysis Gadget→ Duplicates List	Creates a list of duplicate records by sorting on one or more user-selected variables and omitting unique records. The user may select additional variables to be displayed in Duplicates List that are not used for determining uniqueness but are useful for the identification of specific records for data management (e.g., correction or removal) or other purposes.	VD
Analysis Commands→Statistics→Summarize (Aggregate→Average, Aggregate→Count, Aggregate→First, Aggregate→Last, Aggregate→Maximum, Aggregate→Minimum, Aggregate→Standard Deviation, Aggregate→Sum, Aggregate→Variance)	`N/A`	Summarizes the dataset by aggregating multiple descriptive statistics in a tabular format, optionally grouping and/or weighting by selected variables. Available aggregates: Average, Count, First, Last, Maximum, Minimum, Standard Deviation, Sum, Variance.	CA
Analysis Commands→ Statistics→Graph (Graph Type→Area, Graph Type→Bar, Graph Type→Bubble, Graph Type→Column, Graph Type→Epi Curve, Graph Type→Line, Graph Type→Pie, Graph Type→Scatter)	Add Analysis Gadget→Charts (Area Chart, Column Chart, Epi Curve Chart, Line Chart, Pie Chart, Scatter Chart, Aberration Detection Chart, Pareto Chart)	Creates graphs or charts of various types based on one or more variables in the dataset. Available types: Area; Bar [Classic Analysis only]; Bubble [Classic Analysis only]; Column; Epi Curve; Line; Pie; Scatter; Aberration Detection Chart [Visual Dashboard only]; Pareto Chart [Visual Dashboard only].	CA, VD

Note:

Color "striping" is the practice of mapping a repeating sequence of colors to a numerical scale so that small increments, which might not otherwise be apparent, become visible. In the Word Cloud, two parameters are used to visualize word frequency: font size (six point sizes that users can readily distinguish from one another) and color (dark green, light green, red, orange, yellow, and brown). In an example with a dynamic range of 3% to 21% (18 percentage points), a linear scale might map the 36 gradations of size and color (0.5 percentage points each) as follows: 10 pt font: 3.0 - 3.5% = dark green, 3.5 - 4.0% = light green, ... , 5.0 - 5.5% = yellow, 5.5 - 6.0% = brown; 12 pt font: 6.0 - 6.5% = dark green, ... ; 20 pt font: 18.0 - 18.5% = dark green, ... , 20.5 - 21.0% = brown.
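
The linear mapping in this note can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Visual Dashboard implementation: the function name `stripe` is hypothetical, and the intermediate font sizes (14, 16, and 18 pt) are an assumption, since the note names only the 10, 12, and 20 pt sizes.

```python
# Illustrative sketch of color "striping" (not Epi Info source code).
# Assumption: six evenly spaced font sizes; the note names only 10, 12, and 20 pt.
FONT_SIZES_PT = [10, 12, 14, 16, 18, 20]
COLORS = ["dark green", "light green", "red", "orange", "yellow", "brown"]


def stripe(freq_pct, low=3.0, high=21.0):
    """Map a word frequency (in percent) to a (font size, color) pair.

    The range [low, high] is divided into len(FONT_SIZES_PT) * len(COLORS)
    equal-width gradations; the colors repeat within each font size.
    """
    n_bins = len(FONT_SIZES_PT) * len(COLORS)   # 36 gradations
    width = (high - low) / n_bins               # 0.5 percentage points here
    index = int((freq_pct - low) / width)
    index = max(0, min(index, n_bins - 1))      # clamp values at or beyond the ends
    size_index, color_index = divmod(index, len(COLORS))
    return FONT_SIZES_PT[size_index], COLORS[color_index]


if __name__ == "__main__":
    for pct in (3.2, 5.7, 6.2, 20.7):
        print(pct, stripe(pct))
```

With the default range, a frequency of 3.2% falls in the first gradation (10 pt, dark green) and 20.7% in the last (20 pt, brown), matching the endpoints given in the note.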

Advanced Statistics

Advanced Statistics
Classic Analysis Menu Command	Visual Dashboard Gadget	Purpose	User Guide Links
Analysis Commands→ Advanced Statistics→ Linear Regression	Add Analysis Gadget→ Advanced Statistics→ Linear Regression	Enables the user to evaluate how well one or more continuous variables predict another continuous variable. Assuming a linear relationship between a dependent variable and one or more independent variables, the method calculates independent variable coefficient(s) and y-intercept, as well as least-squares goodness of fit and standard error, F-tests, and P-values for coefficient(s) and constant.	CA, VD
Analysis Commands→ Advanced Statistics→ Logistic Regression	Add Analysis Gadget→ Advanced Statistics→ Logistic Regression	Maps the independent variable(s) to a probability for the dependent variable (bounded so that 0 ≤ p ≤ 1) through the logistic function, 1 / (1 + e^(-βX)). The coefficients are solved for using an iterative optimization procedure, such as gradient descent, to minimize the difference between the predicted and observed values. The goodness of fit is evaluated with Z-score and P-value. Logistic regression is similar to linear regression but is appropriate for a dichotomous dependent variable.	CA, VD
Analysis Commands→ Advanced Statistics→ Kaplan-Meier Survival	`N/A`	Provides a nonparametric method for estimating the probability of survival at a given time during a study. It is calculated from time and test group variables as well as the "censored" variable, which indicates incomplete information on a subject (who, for example, either "survived" or could not be tracked for the duration of the study period). The plotted curve gives a visual impression of the probability of survival as a function of time. However, the main purpose of the test is to compare survival curves under different study regimens, evaluated using a log-rank- or Wilcoxon-derived P-value.	CA
Analysis Commands→ Advanced Statistics→Cox Proportional Hazards	`N/A`	Performs a time-to-event analysis that is widely used and applicable to many types of clinical studies. In addition to the data required for Kaplan-Meier Survival, Cox Proportional Hazards allows the use of multiple predictor variables with potential interactions. The routine provides P-values for the comparison of test groups, and hazard ratios, confidence intervals, Z-scores, and P-values for the predictor variables. This allows for the evaluation of more complex models than with Kaplan-Meier.	CA
Analysis Commands→ Advanced Statistics→ Complex Sample Frequencies	Add Analysis Gadget→ Advanced Statistics→ Complex Sample Frequencies	Calculates the frequency of the specified variable given an additional variable representing the Primary Sampling Unit (PSU).	CA, VD
Analysis Commands→ Advanced Statistics→ Complex Sample Tables	Add Analysis Gadget→ Advanced Statistics→ Complex Sample Tables	Calculates the relationship between exposure and outcome variables, allowing for weighting and stratification by other variables, given an additional variable representing the PSU. This allows for the analysis of the 2 × 2 table in light of the specified sampling procedure.	CA, VD
Analysis Commands→ Advanced Statistics→ Complex Sample Means	Add Analysis Gadget→ Advanced Statistics→ Complex Sample Means	Calculates the mean of the specified variable given an additional variable representing the PSU.	CA, VD
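
As an illustration of the logistic mapping described in the Logistic Regression row, the following Python sketch evaluates 1 / (1 + e^(-βX)) for a given set of coefficients. It shows only the prediction step, not the fitting procedure and not Epi Info code; the function names and the example coefficient values are hypothetical.

```python
import math


def logistic(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(intercept, coefficients, values):
    """Map independent-variable values to a probability between 0 and 1.

    The linear predictor is intercept + sum(coefficient * value); the logistic
    function then bounds the result to the interval [0, 1].
    """
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    return logistic(z)


if __name__ == "__main__":
    # Hypothetical coefficients chosen so that the linear predictor is exactly 0.
    print(predict_probability(-1.0, [0.5], [2.0]))
```

A linear predictor of 0 corresponds to a probability of 0.5; large positive or negative predictors approach 1 and 0 respectively, which is why the output suits a dichotomous dependent variable.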

Future Development

CA linear regression does not work when "confidence limits" are specified using the pull-down menu in the Command Explorer dialog. Similarly, the corresponding commands in the form "REGRESS WEIGHT=HEIGHT PVALUE=<val>" fail when using the PVALUE parameter and reasonable arguments such as 90%, 95%, and 99%.
Add to Classic Analysis (as appropriate) the Visual Dashboard features that it does not yet have, including the Word Cloud “graph”, Combined Frequency, Matched Pair Case Control (Epi Info™ 3 implemented this functionality as the command “MATCH”), and Duplicates List.
