Statistical Analysis
Introduction
Insights into the origins of a disease outbreak (or scientific studies, generally) typically emerge during the statistical analysis of the data. The “Statistics” folder in the Command Explorer of Classic Analysis provides simple descriptive statistics and graphing tools. More sophisticated inferential statistics, such as linear and logistic regression analysis, are available in “Advanced Statistics”, along with tools especially useful in epidemiology, such as Kaplan-Meier Survival curves and Cox Proportional Hazards. Descriptive statistics under complex sampling models are also supported.
Data Analysis Command Reference
The Epi Info™ User Guide provides a reference guide to the commands available in Classic Analysis and, by extension, the Visual Dashboard introduced with Epi Info™ 7.0. Each entry in the Command Reference contains:
- a brief description of the command's purpose and function;
- the execution syntax, a formal description of the usage of command arguments and parameters, including constraints on allowed values (which are mandatory and which are optional);
- optional additional comments; and
- an example of command usage, including arguments and actual parameters.
The following table shows the available Classic Analysis (CA) Statistics commands with their Visual Dashboard (VD) equivalents. The table describes the purpose and output of each command and provides links to the corresponding sections of CA and VD User Guides for the commands.
Classic Analysis and Visual Dashboard Command Summary
Statistics | |||
---|---|---|---|
Classic Analysis Menu Command | Visual Dashboard Gadget | Purpose | User Guide Links |
Analysis Commands→ Statistics→List | Add Analysis Gadget→Line List | Provides tabular display of selected variables. | CA, VD |
Analysis Commands→ Statistics→Frequencies | Add Analysis Gadget→Frequency | Tabulates frequencies of dependent variable(s), optionally stratified and weighted by other variables. | CA, VD |
N/A | Add Analysis Gadget→Word Cloud | Calculates the frequencies of individual "words" from a (preferably text) variable, then displays an image composed of those words, scaled and colored to reflect their frequencies in the data, with more frequent words having a larger font and the colors "striped" [1] to provide gradations of scale within a given point size. | N/A , VD |
N/A | Add Analysis Gadget→Combined Frequency | Tabulates frequencies of all Boolean variables in a group. The group can be defined in Form Designer using the Group Field or in Visual Dashboard using Defined Variables→Create Variable Group feature. Groups can be created in Classic Analysis using the DEFINE command with the GROUPVAR keyword; however, the Combined Frequency analysis has not (yet) been implemented in CA. | N/A , VD |
Analysis Commands→ Statistics→Tables | Add Analysis Gadget→M × N / 2 × 2 Table | Tabulates the number and frequency of subjects with respect to a dependent variable (outcome) vs. a risk factor, allowing weighting of the former and stratification of the latter. Also calculates odds-based and risk-based estimates with 95% confidence intervals, and statistical tests including chi-square and 1- and 2-tailed T-tests. | CA, VD |
N/A | Add Analysis Gadget→ Matched Pair-Case Control | Given pairs of individuals, matched for physical, demographic, or other relevant characteristics but differing in affected status, calculates statistics based on a dichotomous exposure variable. | N/A , VD |
Analysis Commands→ Statistics→Means | Add Analysis Gadget→Means | Tabulates means for a continuous variable with weighting, stratification, and cross-tabulation by value, and calculates descriptive statistics. | CA, VD |
N/A | Add Analysis Gadget→ Duplicates List | Creates a list of duplicate records by sorting on one or more user-selected variables and omitting unique records. The user may select additional variables to be displayed in Duplicates List that are not used for determining uniqueness but are useful for the identification of specific records for data management (e.g., correction or removal) or other purposes. | VD |
Analysis Commands→ Statistics→Summarize | N/A | Summarizes dataset by aggregating multiple descriptive statistics in a tabular format and optionally grouping by and/or weighting by selected variables. | CA |
Analysis Commands→ Statistics→Graph | Add Analysis Gadget→Charts | Creates graphs or charts of various types based on one or more variables in the dataset. | CA, VD |
Note:
- Color "striping" is the practice of mapping a repeating sequence of colors to a numerical scale to facilitate the visualization of small increments that might not be apparent otherwise. In the case of the Word Cloud, there are two parameters used to visualize word frequency: font size (six point sizes that users can readily distinguish from one another) and color (dark green, light green, red, orange, yellow, and brown). In an example with a dynamic range of 3% to 21% (18 percentage points), a linear scale might map the 36 gradations of size and color as follows: 10 pt font: 3.0 - 3.5% = dark green, 3.5 - 4.0% = light green, ... , 5.0 - 5.5% = yellow, 5.5 - 6.0% = brown; 12 point font: 6.0 - 6.5% = dark green, ... , 20 point font: 18.0 - 18.5% = dark green, ... , 20.5‒21.0% = brown.
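The mapping described in the note can be sketched in a few lines of Python. This is only an illustration of the linear scale in the example, not the Visual Dashboard's actual implementation: the function name `stripe` is hypothetical, and the intermediate point sizes (14, 16, 18) are assumed, since the note names only 10, 12, and 20.

```python
# Six font sizes; 14, 16, and 18 are assumed (the note names only 10, 12, 20).
FONT_SIZES = [10, 12, 14, 16, 18, 20]
# Color sequence as listed in the note, repeated within each font size.
COLORS = ["dark green", "light green", "red", "orange", "yellow", "brown"]


def stripe(freq_pct, low=3.0, high=21.0):
    """Map a word frequency (in percent) to a (font size, color) pair."""
    steps = len(FONT_SIZES) * len(COLORS)       # 36 gradations
    width = (high - low) / steps                # 0.5 percentage points each
    index = int((freq_pct - low) / width)
    index = max(0, min(index, steps - 1))       # clamp to the valid range
    size_idx, color_idx = divmod(index, len(COLORS))
    return FONT_SIZES[size_idx], COLORS[color_idx]


print(stripe(3.2))    # falls in the 3.0 - 3.5% band
print(stripe(20.7))   # falls in the 20.5 - 21.0% band
```

Each font size spans six consecutive bands, so the color sequence repeats once per size; that repetition is what gives the "striped" appearance.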
Advanced Statistics
Advanced Statistics | |||
---|---|---|---|
Classic Analysis Menu Command | Visual Dashboard Gadget | Purpose | User Guide Links |
Analysis Commands→ Advanced Statistics→ Linear Regression | Add Analysis Gadget→ Advanced Statistics→ Linear Regression | Enables the user to evaluate how well one or more continuous variables predict another continuous variable. Assuming a linear relationship between a dependent variable and one or more independent variables, the method calculates independent variable coefficient(s) and y-intercept, as well as least-squares goodness of fit and standard error, F-tests, and P-values for coefficient(s) and constant. | CA, VD |
Analysis Commands→ Advanced Statistics→ Logistic Regression | Add Analysis Gadget→ Advanced Statistics→ Logistic Regression | Maps the independent variable(s) to a probability for the dependent variable (bounded so that 0 ≤ p ≤ 1) through the logistic function, 1 / (1 + e^(−βX)). The coefficients are solved for using an iterative optimization procedure, such as gradient descent, to minimize the difference between the predicted values and those already observed. The goodness of fit is evaluated with Z-score and P-value. Logistic regression is similar to linear regression but is appropriate for a dichotomous dependent variable. | CA, VD |
Analysis Commands→ Advanced Statistics→ Kaplan-Meier Survival | N/A | Provides a nonparametric method for estimating the probability of survival at a given time in the duration of a study. It is calculated given time and test group variables as well as the "censored" variable, which indicates incomplete information on a subject (who, for example, either "survived" or could not be tracked for the duration of the study period). The plotted curve gives a subjective feel for the probability of survival as a function of time. However, the main purpose of the test is to compare survival curves under different study regimens, evaluated using a log-rank- or Wilcoxon-derived P-value. | CA |
Analysis Commands→ Advanced Statistics→Cox Proportional Hazards | N/A | Performs a time-to-event analysis that is widely used and applicable to many types of clinical studies. In addition to the data required for Kaplan-Meier Survival, Cox Proportional Hazards allows the use of multiple predictor variables with potential interactions. The routine provides P-values for the comparison of test groups, and hazard ratios, confidence intervals, Z-scores, and P-values for the predictor variables. This allows for the evaluation of more complex models than with Kaplan-Meier. | CA |
Analysis Commands→ Advanced Statistics→ Complex Sample Frequencies | Add Analysis Gadget→ Advanced Statistics→ Complex Sample Frequencies | Calculates the frequency of the specified variable given an additional variable representing the Primary Sampling Unit (PSU). | CA, VD |
Analysis Commands→ Advanced Statistics→ Complex Sample Tables | Add Analysis Gadget→ Advanced Statistics→ Complex Sample Tables | Calculates the relationship between exposure and outcome variables, allowing for weighting and stratification by other variables, given an additional variable representing the PSU. This allows for the analysis of the 2 × 2 table in light of the specified sampling procedure. | CA, VD |
Analysis Commands→ Advanced Statistics→ Complex Sample Means | Add Analysis Gadget→ Advanced Statistics→ Complex Sample Means | Calculates the mean of the specified variable given an additional variable representing the PSU. | CA, VD |
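
The Logistic Regression entry in the table above describes mapping predictors to a probability through the logistic function and solving for the coefficients iteratively. A minimal Python sketch of that idea follows. It assumes a single predictor, plain gradient descent on the mean log-loss as the measure of difference, and made-up data. The names `logistic`, `fit_logistic`, `doses`, and `ill` are hypothetical, and the sketch is not a description of how Epi Info™ itself fits the model.

```python
import math


def logistic(z):
    """Logistic function 1 / (1 + e^(-z)), arranged to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, learning_rate=0.1, iterations=5000):
    """Fit p = logistic(b0 + b1 * x) by gradient descent on the mean log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = logistic(b0 + b1 * x) - y   # predicted minus observed
            g0 += err                         # gradient w.r.t. intercept
            g1 += err * x                     # gradient w.r.t. slope
        b0 -= learning_rate * g0 / n
        b1 -= learning_rate * g1 / n
    return b0, b1


# Hypothetical data: exposure dose (x) and a dichotomous outcome (y).
doses = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ill = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(doses, ill)
for dose in (0.5, 4.0):
    print(f"dose {dose}: predicted probability {logistic(b0 + b1 * dose):.2f}")
```

Whatever values the coefficients take, the predicted probability stays between 0 and 1, which is why this model suits a dichotomous dependent variable where a straight-line fit would not.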
Future Development
- CA linear regression does not work when "confidence limits" are specified using the pull-down menu in the Command Explorer dialog. Similarly, the corresponding commands in the form "REGRESS WEIGHT=HEIGHT PVALUE=<val>" fail when using the PVALUE parameter with reasonable arguments such as 90%, 95%, and 99%.
- Add to Classic Analysis (as appropriate) the Visual Dashboard features it does not yet have, including Word Cloud “graph”, Combined Frequency, Matched Pair Case Control (Epi Info™ 3 had this functionality implemented as the command “MATCH”), and Duplicates List.