Statistical Analysis

Requirement: Statistical Analysis
Section: 3.2.3.3
JIRA Task: EIR-63

Reviewed For: Conventional spacing between sections
Date: 2016-08-31

Introduction

Insights into the origins of a disease outbreak (or, more generally, into any scientific study) typically emerge during the statistical analysis of the data. The “Statistics” folder in the Command Explorer of Classic Analysis provides simple descriptive statistics and graphing tools. More sophisticated inferential statistics, such as linear and logistic regression analysis, are available in “Advanced Statistics”, as are tools especially useful in epidemiology, such as Kaplan-Meier Survival curves and Cox Proportional Hazards. Descriptive statistics under complex sampling models are also supported.

 

Data Analysis Command Reference

The Epi Info™ User Guide provides a reference guide to the commands available in Classic Analysis and, by extension, the Visual Dashboard introduced with Epi Info™ 7.0.  Each entry in the Command Reference contains:

  1. a brief description of the command's purpose and function;
  2. the execution syntax, a formal description of the usage of command arguments and parameters, including which are mandatory, which are optional, and any constraints on allowed values;
  3. optional additional comments; and
  4. an example of command usage, including arguments and actual parameters.

The following table shows the available Classic Analysis (CA) Statistics commands with their Visual Dashboard (VD) equivalents. For each command, the table describes the purpose and output and provides links to the corresponding sections of the CA and VD User Guides.

Classic Analysis and Visual Dashboard Command Summary

Statistics

Classic Analysis Menu Command: Analysis Commands→Statistics→List
Visual Dashboard Gadget: Add Analysis Gadget→Line List
Purpose: Provides tabular display of selected variables.
User Guide Links: CA, VD

Classic Analysis Menu Command: Analysis Commands→Statistics→Frequencies
Visual Dashboard Gadget: Add Analysis Gadget→Frequency
Purpose: Tabulates frequencies of dependent variable(s), optionally stratified and weighted by other variables.
User Guide Links: CA, VD

Classic Analysis Menu Command: N/A
Visual Dashboard Gadget: Add Analysis Gadget→Word Cloud
Purpose: Calculates the frequencies of individual "words" from a (preferably text) variable, then displays an image composed of those words, scaled and colored to reflect their frequencies in the data, with more frequent words having a larger font and the colors "striped" [1] to provide gradations of scale within a given point size.
User Guide Links: N/A, VD

Classic Analysis Menu Command: N/A
Visual Dashboard Gadget: Add Analysis Gadget→Combined Frequency
Purpose: Tabulates frequencies of all Boolean variables in a group. The group can be defined in Form Designer using the Group Field or in Visual Dashboard using the Defined Variables→Create Variable Group feature. Groups can be created in Classic Analysis using the DEFINE command with the GROUPVAR keyword; however, the Combined Frequency analysis has not (yet) been implemented in CA.
User Guide Links: N/A, VD

Classic Analysis Menu Command: Analysis Commands→Statistics→Tables
Visual Dashboard Gadget: Add Analysis Gadget→M × N / 2 × 2 Table
Purpose: Tabulates the number and frequency of subjects with respect to a dependent variable (outcome) vs. a risk factor, allowing weighting of the former and stratification of the latter; and calculates odds-based and risk-based estimation of 95% confidence interval and statistical tests, including chi-square and 1- and 2-tailed T-tests.
User Guide Links: CA, VD

Classic Analysis Menu Command: N/A
Visual Dashboard Gadget: Add Analysis Gadget→Matched Pair-Case Control
Purpose: Given pairs of individuals, matched for physical, demographic, or other relevant characteristics, but differing in affected status, calculates the following statistics based on a dichotomous exposure variable:
  1. chi-square and 2-tailed P-value (McNemar & corrected),
  2. 1-tailed and 2-tailed P-values (Fisher's Exact), and
  3. Odds-based parameters:
    1. Odds Ratio (Estimate, Lower, and Upper)
    2. Exact (Lower and Upper)
User Guide Links: N/A, VD

Classic Analysis Menu Command: Analysis Commands→Statistics→Means
Visual Dashboard Gadget: Add Analysis Gadget→Means
Purpose: Tabulates means for a continuous variable with weighting, stratification and cross-tabulation by value, calculating descriptive statistics and:
  1. ANOVA, a Parametric Test for Inequality of Population Means
  2. T-Test, Pooled (Equal Variances) and Satterthwaite (Unequal Variances)
  3. Mann-Whitney/Wilcoxon Two-Sample Test (Kruskal-Wallis test for two groups)
  4. Bartlett's Test for Inequality of Population Variances [Classic Analysis only]
User Guide Links: CA, VD

Classic Analysis Menu Command: N/A
Visual Dashboard Gadget: Add Analysis Gadget→Duplicates List
Purpose: Creates a list of duplicate records by sorting on one or more user-selected variables and omitting unique records. The user may select additional variables to be displayed in Duplicates List that are not used for determining uniqueness but are useful for the identification of specific records for data management (e.g., correction or removal) or other purposes.
User Guide Links: VD

Classic Analysis Menu Command: Analysis Commands→Statistics→Summarize +
  • Aggregate→Average
  • Aggregate→Count
  • Aggregate→First
  • Aggregate→Last
  • Aggregate→Maximum
  • Aggregate→Minimum
  • Aggregate→Standard Deviation
  • Aggregate→Sum
  • Aggregate→Variance
Visual Dashboard Gadget: N/A
Purpose: Summarizes dataset by aggregating multiple descriptive statistics in a tabular format and optionally grouping by, and/or weighting by selected variables.
  1. Average
  2. Count
  3. First
  4. Last
  5. Maximum
  6. Minimum
  7. Standard Deviation
  8. Sum
  9. Variance
User Guide Links: CA

Classic Analysis Menu Command: Analysis Commands→Statistics→Graph +
  • Graph Type→Area
  • Graph Type→Bar
  • Graph Type→Bubble
  • Graph Type→Column
  • Graph Type→Epi Curve
  • Graph Type→Line
  • Graph Type→Pie
  • Graph Type→Scatter
Visual Dashboard Gadget: Add Analysis Gadget→Charts +
  • Area Chart
  • Column Chart
  • Epi Curve Chart
  • Line Chart
  • Pie Chart
  • Scatter Chart
  • Aberration Detection Chart
  • Pareto Chart
Purpose: Creates graphs or charts of various types based on one or more variables in the dataset.
  1. Area
  2. Bar [Classic Analysis only]
  3. Bubble [Classic Analysis only]
  4. Column
  5. Epi Curve
  6. Line
  7. Pie
  8. Scatter
  9. Aberration Detection Chart [Visual Dashboard only]
  10. Pareto Chart [Visual Dashboard only]
User Guide Links: CA, VD
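
As an illustration of the odds-based and risk-based estimates that the Tables command reports for a 2 × 2 table, the following minimal Python sketch computes an odds ratio, a risk ratio, approximate 95% confidence limits, and an uncorrected chi-square. The function name, the cell layout, and the choice of log-based (Woolf-style) intervals are assumptions made for this example; Epi Info™ may use different interval methods, and this is not its implementation.

```python
import math


def two_by_two_estimates(a, b, c, d, z=1.96):
    """Summarize a 2 x 2 table (illustrative sketch, not Epi Info code).

    Assumed cell layout:
        a = exposed, ill        b = exposed, not ill
        c = unexposed, ill      d = unexposed, not ill
    z = 1.96 gives approximate 95% confidence limits.
    All four cells must be non-zero, or the divisions below fail.
    """
    # Odds ratio with log-based confidence limits.
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    or_limits = (
        odds_ratio * math.exp(-z * se_log_or),
        odds_ratio * math.exp(z * se_log_or),
    )

    # Risk ratio with log-based confidence limits.
    risk_ratio = (a / (a + b)) / (c / (c + d))
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    rr_limits = (
        risk_ratio * math.exp(-z * se_log_rr),
        risk_ratio * math.exp(z * se_log_rr),
    )

    # Uncorrected Pearson chi-square with 1 degree of freedom.
    n = a + b + c + d
    chi_square = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )

    return {
        "odds_ratio": odds_ratio,
        "odds_ratio_limits": or_limits,
        "risk_ratio": risk_ratio,
        "risk_ratio_limits": rr_limits,
        "chi_square": chi_square,
    }


# Hypothetical counts: 20 of 30 exposed and 5 of 30 unexposed subjects were ill.
result = two_by_two_estimates(20, 10, 5, 25)
print(result)
```

With these hypothetical counts the odds of illness are about ten times higher, and the risk about four times higher, among the exposed.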

Note:

  1. Color "striping" is the practice of mapping a repeating sequence of colors to a numerical scale to facilitate the visualization of small increments that might not be apparent otherwise. In the case of the Word Cloud, two parameters are used to visualize word frequency: font size (six point sizes that users can readily distinguish from one another) and color (dark green, light green, red, orange, yellow, and brown). In an example with a dynamic range of 3% to 21% (18 percentage points), a linear scale might map the 36 gradations of size and color, each 0.5 percentage points wide, as follows: 10-point font: 3.0 - 3.5% = dark green, 3.5 - 4.0% = light green, ... , 5.0 - 5.5% = yellow, 5.5 - 6.0% = brown; 12-point font: 6.0 - 6.5% = dark green, ... ; 20-point font: 18.0 - 18.5% = dark green, ... , 20.5 - 21.0% = brown.
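
The linear mapping in this note can be sketched in a few lines of Python. The color order and the 3% to 21% range come from the example above; the function name is hypothetical, and the intermediate point sizes (14, 16, and 18) are an assumption, since the note names only the 10-, 12-, and 20-point fonts. This is a sketch of the arithmetic, not the Word Cloud gadget's actual code.

```python
# Six font sizes; 14, 16, and 18 are assumed (the note names only 10, 12, 20).
FONT_SIZES = [10, 12, 14, 16, 18, 20]
# The repeating color sequence, in the order given in the note.
COLORS = ["dark green", "light green", "red", "orange", "yellow", "brown"]


def stripe(frequency_pct, low=3.0, high=21.0):
    """Map a word frequency (in percent) to a (font size, color) pair."""
    steps = len(FONT_SIZES) * len(COLORS)   # 36 gradations
    width = (high - low) / steps            # 0.5 percentage points each
    index = int((frequency_pct - low) / width)
    index = max(0, min(steps - 1, index))   # clamp values at or beyond the ends
    # Colors cycle fastest; the font size steps up after each full color cycle.
    size_index, color_index = divmod(index, len(COLORS))
    return FONT_SIZES[size_index], COLORS[color_index]


for pct in (3.2, 5.7, 6.2, 18.2, 20.7):
    print(pct, stripe(pct))
```

Because the color index cycles once per font size, two words whose frequencies differ by as little as half a percentage point are drawn differently even when they share a point size.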


Advanced Statistics

Classic Analysis Menu Command: Analysis Commands→Advanced Statistics→Linear Regression
Visual Dashboard Gadget: Add Analysis Gadget→Advanced Statistics→Linear Regression
Purpose: Enables the user to evaluate how well one or more continuous variables predict another continuous variable. Assuming a linear relationship between a dependent variable and one or more independent variables, the method calculates independent variable coefficient(s) and y-intercept, as well as least-squares goodness of fit and standard error, F-tests, and P-values for coefficient(s) and constant.
User Guide Links: CA, VD

Classic Analysis Menu Command: Analysis Commands→Advanced Statistics→Logistic Regression
Visual Dashboard Gadget: Add Analysis Gadget→Advanced Statistics→Logistic Regression
Purpose: Maps the independent variable(s) to a probability for the dependent variable (bounded so that 0 ≤ p ≤ 1) through the logistic function, 1 / (1 + e^(-X)), where X is the linear combination of the independent variable(s). The coefficients are solved for using an iterative optimization procedure such as gradient descent to minimize the difference between the predicted values and those already observed. The goodness of fit is evaluated with Z-score and P-value. Logistic regression is similar to linear regression but is appropriate for a dichotomous dependent variable.
User Guide Links: CA, VD

Classic Analysis Menu Command: Analysis Commands→Advanced Statistics→Kaplan-Meier Survival
Visual Dashboard Gadget: N/A
Purpose: Provides a nonparametric method for estimating the probability of survival at a given time in the duration of a study. It is calculated given time and test group variables as well as the "censored" variable, which indicates incomplete information on a subject (who, for example, either "survived" or could not be tracked for the duration of the study period). The plotted curve gives a visual impression of the probability of survival as a function of time. However, the main purpose of the test is to compare survival curves under different study regimens, evaluated using a log-rank- or Wilcoxon-derived P-value.
User Guide Links: CA

Classic Analysis Menu Command: Analysis Commands→Advanced Statistics→Cox Proportional Hazards
Visual Dashboard Gadget: N/A
Purpose: Performs a time-to-event analysis that is widely used and applicable to many types of clinical studies. In addition to the data required for Kaplan-Meier Survival, Cox Proportional Hazards allows the use of multiple predictor variables with potential interactions. The routine provides P-values for the comparison of test groups, and hazard ratios, confidence intervals, Z-scores, and P-values for the predictor variables. This allows for the evaluation of more complex models than with Kaplan-Meier.
User Guide Links: CA

Classic Analysis Menu Command: Analysis Commands→Advanced Statistics→Complex Sample Frequencies
Visual Dashboard Gadget: Add Analysis Gadget→Advanced Statistics→Complex Sample Frequencies
Purpose: Calculates the frequency of the specified variable given an additional variable representing the Primary Sampling Unit (PSU).
User Guide Links: CA, VD

Classic Analysis Menu Command: Analysis Commands→Advanced Statistics→Complex Sample Tables
Visual Dashboard Gadget: Add Analysis Gadget→Advanced Statistics→Complex Sample Tables
Purpose: Calculates the relationship between exposure and outcome variables, allowing for weighting and stratification by other variables, given an additional variable representing the PSU. This allows for the analysis of the 2 × 2 table in light of the specified sampling procedure.
User Guide Links: CA, VD

Classic Analysis Menu Command: Analysis Commands→Advanced Statistics→Complex Sample Means
Visual Dashboard Gadget: Add Analysis Gadget→Advanced Statistics→Complex Sample Means
Purpose: Calculates the mean of the specified variable given an additional variable representing the PSU.
User Guide Links: CA, VD
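
To make the Logistic Regression entry concrete, here is a minimal Python sketch of the logistic function and of fitting a single predictor by gradient descent. The function names, learning rate, step count, and data are all hypothetical, and gradient descent is used only because the description offers it as an example of an iterative procedure; nothing here is meant to reproduce the optimizer or output of Epi Info™.

```python
import math


def logistic(x):
    """Logistic function 1 / (1 + e^(-x)); the result lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_logistic(xs, ys, rate=0.1, steps=5000):
    """Fit p = logistic(intercept + slope * x) by plain gradient descent.

    xs are values of one continuous predictor; ys are 0/1 outcomes.
    rate and steps are arbitrary illustrative settings, not tuned values.
    """
    intercept = slope = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_intercept = grad_slope = 0.0
        for x, y in zip(xs, ys):
            # Difference between the predicted probability and the observation.
            error = logistic(intercept + slope * x) - y
            grad_intercept += error
            grad_slope += error * x
        # Step against the average gradient of the log-likelihood loss.
        intercept -= rate * grad_intercept / n
        slope -= rate * grad_slope / n
    return intercept, slope


# Hypothetical data: exposure level and whether each subject became ill (1) or not (0).
doses = [1, 2, 3, 4, 5, 6, 7, 8]
ill = [0, 0, 1, 0, 1, 0, 1, 1]

intercept, slope = fit_logistic(doses, ill)
print("intercept:", intercept, "slope:", slope)
print("predicted probability at dose 8:", logistic(intercept + slope * 8))
```

A positive fitted slope means the predicted probability of the outcome rises with the predictor, which is what these hypothetical data suggest.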



Future Development

  1. Fix CA linear regression, which does not work when "confidence limits" are specified using the pull-down menu in the Command Explorer dialog. Similarly, commands of the form "REGRESS WEIGHT=HEIGHT PVALUE=<val>" fail when the PVALUE parameter is given reasonable arguments such as 90%, 95%, and 99%.
  2. Add to Classic Analysis (as appropriate) the features currently available only in Visual Dashboard, including Word Cloud “graph”, Combined Frequency, Matched Pair Case Control (Epi Info™ 3 implemented this functionality as the command “MATCH”), and Duplicates List.