Data Processing

RequirementData Processing
Section3.2.3.2
JIRA Task

EIR-61 - Getting issue details... STATUS

Reviewed ForDate
Conventional spacing between sections2016-08-31

Introduction

In Classic Analysis, Data Processing encompasses the steps required to prepare data for analysis using the Epi Info™ 7 suite of statistical tools. Commands may be selected from the Command Explorer tree and configured using the command definition dialog, a graphical interface for selecting parameters and arguments. The process begins by loading project data into memory. The project's data tables can be analyzed directly, or joined or merged with tables from other projects (as well as the current project). Classic Analysis provides a number of tools to render the raw data suitable for statistical analysis. New variables may be defined and assigned values based on form variables directly or manipulated to rescale, recode or reformat the data to make them compatible with subsequent procedures or statistical tests. Simple arithmetic and logical operators are available for operations on numeric and Boolean variables.  More complex mathematical operators are implemented as functions, as are the tools for manipulating text, dates and other variable types.  Classic Analysis also supports an internal scripting language that can be typed into the Program Editor manually or generated using commands from the Command Explorer and configured with the corresponding dialog. Programs may be executed, all or in part, using the Program Editor or in batch mode, outside of the Epi Info™ application. Tables at the end of this section list and summarize the functions and operators available in Classic Analysis with links to their User Guide entries where more information can be found.


Data Management

Overview

In Classic Analysis, Data Processing begins by loading project data into memory, either directly, with the READ command, or by executing a program or user-defined command that in turn loads the data. This and other data management functions are located under the Analysis Commands→Data tab in Command Explorer.  READ supports loading data from a standard Epi Info 7 project, as well as ASCII flat files, Microsoft Access 2002-2003 (.mdb), Access 2007 (.accdb) files, SQL Server Databases, and other file formats. Once the data  has been loaded, it can be merged (MERGE command) with other data or related (RELATE command) to other tables in a join operation based on the tables’ key fields.  The user may delete individual records (DELETE RECORDS command) based on a simple query, either permanently or only as "targeted for deletion"; records in the latter category can be undeleted (UNDELETE RECORDS command) using a similar mechanism.  Data can be written out (WRITE command) to a new table in the current project or to any of the other database or file formats.  Modifications as a result of any other commands (SELECT, ASSIGN, RECODE, etc.) are only in effect for the data in memory.  Changes are not made to the underlying data in the database.  Modifications to the data in the database are saved only after the WRITE command is run.  If the modifications are to be visible in the form in the Enter module, the updates written to a table must be merged into the form’s data tables using the MERGE command.


Functional Requirements

The following list contains the requirements for the commands described under Data Management and located under the Analysis Commands→Data tab in Command Explorer The command names in all-caps are written as they appear (and as they would be used) in the Program Editor.  In most cases, this corresponds to the command name in Command Explorer.  In cases where the names differ, the Command Explorer name will follow the command in parentheses. This convention will be used throughout this section.

  1. The READ command shall enable the user to load data in the following formats:
    1. Epi Info™ 7 Project (.prj), from which, the user may select:
      1. Epi Info™ Forms (which merges the data tables from each page in the form),
      2. other tables in the selected project database, 
    2. MS Access 2002-2003 (.mdb), 
    3. MS Access 2007 (.accdb), 
    4. MS Excel 1997-2003 (.xls) workbook,
    5. MS Excel 2007 (.xlsx) workbook,
    6. CSV text files,
    7. mySQL databases,
    8. postgreSQL databases, and
    9. SQL Server databases.
  2. The RELATE command shall enable the user to join another form or table to the current data in memory[1]
  3. The MERGE command shall enable the user to:
    1. append and merge the data currently in memory with another table comprised of identical or largely overlapping fields[2], and
    2. update existing records with the new data based on one (or more) key fields[3].
  4. The WRITE command shall enable the user to write one or more selected variables to:
    1. create a new table or append to or replace data in an existing table in a database in the following formats:
      1. MS SQL Server,
      2. PostgreSQL, and 
      3. MySQL, as well as to:
    2. create, append to, or replace the following file formats:
      1. Epi Info 7 Project (.prj),
      2. MS Access 2002-2003 (.mdb), 
      3. MS Access 2007 (.accdb), 
      4. MS Excel 1997-2003 (.xls) workbook,
      5. MS Excel 2007 (.xlsx) workbook, and
      6. CSV text files.
  5. The DELETE (Delete File/Table) command shall enable the user to:
    1. delete an arbitrary file from the file system,
    2. delete a named table from the current data set, and 
    3. delete a named table from one of the supported databases listed in 2a.
  6. The DELETE (Delete Records) command shall enable the user to delete permanently or tag for deletion records identified by querying for:
    1. an expression containing one or more numeric variables or a literal value (constant) by a comparison operator such as >, ≥, =, ≤, < or Missing;
    2. an expression containing one or more text variables related to another text string by a comparison operator such as >, ≥, =, ≤, <, LIKE or Missing;
    3. an expression containing one or more Boolean variables evaluated by the comparison operators = or != to TRUE or FALSE, or Missing; and
    4. a combination of two or more comparison operations of the above types joined by AND or OR.
  7. The UNDELETE Records command shall enable the user to remove the tag-for-deletion flag for records identified by queries described in 6.


Notes:

  1. The related form or table may be in the same or different project.  In order to relate two forms, one or more key fields are required.  The relationship between the forms may be a one-to-one or one-to-many (i.e., a parent-child relationship).  An example of this would be READ the Sample.prj Surveillance form, then RELATE to the Sample.prj RLyme form using the GlobalRecordId :: FKey as the key fields for the relationship.  RELATE, in effect, makes the current table wider by adding new fields to the table currently in memory
  2. The MERGE command is typically used to combine data from similar copies of the same project (e.g., separate data-collection efforts using the same or slightly modified versions of the same form).  The tables have largely the same fields but the data to be merged in have additional records (to be appended) and/or modified records (to be updated).  
  3. Like the RELATE command, MERGE requires one or more keys to establish which records the data sets have in common for the purpose of updating records. Unlike RELATE, the MERGE command makes a permanent change to the underlying database by adding or updating records.  MERGE, in effect, permanently lengthens the current table (and underlying database) by adding new records (and updating records, when possible).


Variable Management and Manipulation

Overview

In the use of the statistical tools in Classic Analysis, the user is not limited to the variables defined in the original form.  The user may DEFINE new variables with specific scope and type, then ASSIGN values to them.  All variables may be assigned constant values or values based on expressions using one or more of the original form variables or new variables, previously defined (using the DEFINE command) and assigned (using the ASSIGN command).  Allowable expressions depend on the type of variable being assigned.  Numeric variables may be assigned expressions involving other numeric variables and the simple numerical operators +-*, and /, supplemented by functions including explogelog10, sincosetc.  Text variables may be assigned expressions using string functions such as findtext or substring.  The remaining variable types–including date, time and Boolean–all have type-appropriate functions for operations on those variables.  

There are also functions, such as txttodate and txttonum, that convert variables of one type to another (in this case, converting text variables into date or numeric types, respectively).  Variables can also be assigned new values using RECODE. RECODE is particularly useful for converting continuous numeric variables into discrete values or Boolean values.  One may also UNDEFINE a variable and remove it from the system.  A collection of variables may be aliased using DEFINEGROUP; this allows all variables in the group to be referenced using the alias for statistical functions.  Finally, the DISPLAY command produces a data dictionary for the selected variables, showing variable name, field type, field prompt, source table, and other potentially useful information.

Functional Requirements

The following list contains the requirements for the commands described under Variable Management and Manipulation and located under the Analysis Commands→Variables tab in the Command Explorer.

  1. The DEFINE command shall enable the user to create new variables with user-defined names and the following properties:
    1. Scope, for each variable, one of the following:
      1. Standard, persists until the next invocation of READ;
      2. Global, persists until termination of Classic Analysis[1]; or
      3. Permanent, created in file "<installDir>\Configuration\EpiInfo.Config.xml", persists until undefined (using the UNDEFINE command).
    2. Type, for each variable, one of the following:
      1. Number,
      2. Text,
      3. Date,
      4. Time,
      5. DateTime,
      6. Boolean,
      7. PhoneNumber,
      8. YesNo,
      9. Unknown,
      10. GUID, or
      11. Object (requiring a DLL Object Definition)[2].
  2. The UNDEFINE command shall enable the user to remove a variable of any Type and Scope from the data table or config file where it is defined and stored.
  3. The ASSIGN command shall enable the user to set the value of any variable to:
    1. a constant consistent with the variable's type (as enumerated in 1b, above), or
    2. an expression that evaluates to a value consistent with the variable's type (as enumerated in 1b, above) using one or more of the functions and operators defined in the tables Summary of Functions and Summary of Operators.
  4. The RECODE command shall enable the user to:
    1. map a continuous numeric variable to finite number of discrete values corresponding to ranges of numeric values;
    2. map a continuous date, date/time, or time variable to a finite number of discrete values corresponding to date, date/time, or time intervals;
    3. map a text variable containing arbitrary values to a finite number of discrete values corresponding to ranges based on lexicographic order;
    4. map a variable with a number of discrete values to an equal or smaller number of discrete values in a variable of the same or different type; and
    5. automatically generate numeric intervals given a start, end, and interval size.
  5. The DEFINE <alias> GROUPVAR <var1> [<var2> ... ] (DefineGroup) command shall enable the user to create a variable that serves as an alias for the names of other variables.
  6. The DISPLAY command shall enable the user to:
    1. choose variables, 
      1. selected from those that are:
        1. currently available,
        2. currently available and user-defined,
        3. currently available and from form fields,
        4. currently available and manually-selected, and 
        5. from each form in the current project;
      2. generate and save the code to display (and, optionally, write to the database) a table with the following parameters:
        1. PageNumber,
        2. Prompt,
        3. Field Type,
        4. Variable (name),
        5. Value,
        6. Format/Value,
        7. Special Value, and
        8. Table;
    2. form metadata that 
      1. comes from:
        1. the current database, or
        2. one browsed from the file system;
      2. can be used to generate and save the code to display (and, optionally, write to the database) a table with the following parameters:
        1. DataTableName,
        2. Type,
        3. Link,
        4. Parent View,
        5. Child Type, and
        6. Child Tables;
    3. create tables that
      1. come from:
        1. the current database, or
        2. one browsed from the file system;
      2. can be used to generate and save the code to display (and, optionally, write to the database) a table with the following parameters:
        1. DataTableName,
        2. Type,
        3. Number of Fields,
        4. Link,
        5. Primary Key, and
        6. Description.

Notes:

  1. Described in Windows Task Manager→Processes as: Image Name = "Analysis.exe *32", Description = "Epi.Windows.Analysis".
  2. The variable type selections made in the DEFINE dialog (as listed above) are degenerate with respect to the code generated in the Program Editor.  For example, variables defined as Boolean or YesNo both generate the code: "DEFINE <varName> YN", indicating there is no difference in the internal representation of those variable types.  This is borne out in the Field Type indicated by the "DISPLAY DBVARIABLES" command ("Checkbox") and by the command "LIST <var1> <var2> GRIDTABLE", which shows a checkbox field.  Additional development work is required to make variable types consistent from Form Designer through to Classic Analysis (and probably Visual Dashboard as well). [Future Work]


Record Selection and Manipulation

Overview

The Command Explorer folder Analysis Commands→Select/If contains commands to temporarily Select and Sort data records based on form variables and variables defined and assigned using the DEFINE and ASSIGN commands, respectively, from the Analysis Commands→Variables folder discussed in the previous subsection.  SELECT may be used to filter a dataset based on a selection criterion defined using literal values or an expression involving one or more of the available variables, functions and operators as described in the linked tables located at the end of this page.  Multiple SELECT commands can be executed to progressively narrow the available records; CANCEL SELECT removes the (accumulated) criteria, returning the record count to its original value. The IF command allows the user to build an if/then/else statement using a conditional expression similar to the selection criterion used with the SELECT command.  The condition is logically evaluated for each record and one of the two outcomes is executed, depending on the evaluation of the condition. The commands can be simple variable assignments or some of the more complex operations from the Command Explorer.  Finally, the SORT command allows the user to order data records using one or more variables as ascending or descending sort keys.  The records may be returned to their original order using CANCEL SORT.

Functional Requirements

The following list contains the requirements for the commands described in Record Selection and Manipulation and located under the Analysis Commands→Select/If tab in Command Explorer.

  1. The SELECT <criteria> (Select) command shall enable the user to:
    1. define a conditional expression (the SELECT clause) containing literal values or one or more variables potentially modified by available functions and/or operators, and
    2. modify the content of the data table in memory by: 
      1. selecting (or retaining) records that cause the conditional expression to evaluate as TRUE (due to the values of the record's variable(s) in the conditional expression), and
      2. ignoring (i.e., deleting or excluding) records that cause the conditional expression to evaluate as FALSE.
  2. SELECT without arguments (Cancel Select) shall reverse the effect of the SELECT command.
  3. The IF command shall enable the user to:
    1. define a conditional expression (the IF clause) containing one or more variables plus any required functions and/or operators,
    2. define an action (the THEN clause) that is executed when the IF clause evaluates as TRUE, and
    3. define an action (the ELSE clause) that is executed when the IF clause evaluates as FALSE.
  4. The actions in the THEN and ELSE clauses of an IF command shall support:
    1. all functions and operators listed in the Functions and Operators Reference, and
    2. all commands listed in this section (Classic Analysis: Data Processing), including another IF command, forming a nested structure.
  5. The SORT <key> (Sort) command shall enable the user to:
    1. identify one or more variables from the data table to act as sort keys,
    2. maintain the order of selection to reflect the precedence of sort keys, and
    3. toggle the sort order for (each) given key (ascending or descending).
  6. SORT without arguments (Cancel Sort) shall reverse the effect of the SORT command.


Output, User Interaction and User-Defined Commands

Overview

The Command Explorer folders Analysis Commands→Output and Analysis Commands→User Interaction are intended primarily for use within a program rather than interactive use.  HEADER and TYPE are used to write to the output stream text that is either supplied in the command itself or read from a file.  ROUTEOUT causes the HTML, usually written just to the output window, to be written to the named file as well.  CLOSEOUT terminates output to the file and closes it.  PRINTOUT sends the named file or current output file to the default printer.  STORING OUTPUT opens a configuration dialog in which parameters associated with output files and archives are set.  The user interaction commands allow a program to collect information from the user interactively (DIALOG), emit a beep (BEEP) and terminate the execution of the program (QUIT). Analysis Commands→User-Defined Commands provides a mechanism for grouping Classic Analysis commands under a single name, which can then be executed with the command "CALL <Name>".  The folder also contains methods for running Windows executables (EXECUTE) and programs using the Classic Analysis program interpreter (RUNPGM).

Functional Requirements

The following list contains the requirements for the commands described in this section's overview and located in the Command Explorer under Analysis Commands→Output and Analysis Commands→User Interaction.  The command names in all-caps are written as they appear (and as they would be used) in the Program Editor.  In most cases, this corresponds to the command name in Command Explorer.  In cases where the names differ, the Command Explorer name with follow the command in parentheses. 

  1. The HEADER command shall enable the user to specify a text string to:
    1. label the output frame tab, and 
    2. be printed to the output stream with a specific font size (relative to the default size), style and color.
  2. The TYPEOUT (Type) command shall enable the user to:
    1. specify a text string or (browse for) the name of a file, the content of which is to be printed to the output stream,
    2. with a specific font size (relative to the default size), style and color.
  3. The ROUTEOUT command shall enable the user to:
    1. specify a file name into which program output will be written (or forked, in interactive mode).
  4. The CLOSEOUT command shall enable the user to:
    1. close the file opened by ROUTEOUT command.
  5. The PRINTOUT command shall enable the user to send to the default printer the contents of:
    1. a specified file, or
    2. the current file opened with the ROUTEOUT command.
  6. The (Storing Output)[1] command shall enable the user to set the following parameters (interactive only):
    1. output file prefix,
    2. output file sequence,
    3. results folder name,
    4. archive folder name, and
    5. limits for properties of file that will be flagged if exceeded:
      1. age in days,
      2. number of results, and 
      3. file size (in KB).
  7. The (Storing Output)[1] command shall enable the user to perform the following operations on:
    1. files in the selected results folder:
      1. view,
      2. archive, and 
      3. delete; and
    2. files in the selected archive folder:
      1. view and
      2. delete.
  8. The  DIALOG command shall enable the user to create the following three dialog types with the indicated properties:
    1. "Simple"
      1. with text strings for:
        1. title and
        2. prompt.
    2. "Get Variable"
      1. with text strings for:
        1. title and
        2. prompt.
      2. pull-down menus for:
        1. input variable and
        2. variable type. 
      3. additional input fields and/or menus, as indicated, dependent on variable type:
        1. date, date/time, date, Boolean, YesNo, unknown, GUID, object:
          1. none;
        2. number:
          1. input mask;
        3. text:
          1. dialog type, with the following options:
            1. text entry,
            2. multiple choice,
            3. variable list,
            4. view list,
            5. database list,
            6. file open, and
            7. file save;
          2. length;
        4. phone number; 
          1. input mask;
    3. "List of Values",
      1. with text strings for:
        1. title and
        2. prompt.
      2. with pull-down menus for:
        1. input variable, populated with the names of available variables;
        2. variable type, populated with number, text, date, date/time, time, Boolean, phone number, YesNo, unknown, GUID, object;
        3. show table, populated with the names of all tables associated with project; and
        4. show variable, populated with the fields associated with the table selected in "show table", above (8.c.ii.3).
  9. The BEEP command shall enable the user to:
    1. cause the system to emit a "beep" (ASCII "bell" character, 0x07).
  10. The QUIT (Quit Program) command shall cause the program to terminate its execution (in batch mode or when the code is executed in the Program Editor), or (when in interactive mode):
    1. prompts the user with the dialog "Exit the program?", and
    2. allows the user to click:
      1. "Cancel", which returns the user to the program;
      2. "Save Only", which writes the command to the Program Editor but does not actually terminate the program;
      3. "OK", which terminates the program; or
      4. "Help", which opens the default browser to the "Quit" page of the User Guide.
  11. The CMD (Define Command) command shall enable the user to:
    1. name the command,
    2. associate that name with a list of Classic Analysis commands, and
    3. generate the command list with a menu-driven system similar to how commands are chosen from Command Explorer, which then creates a dialog to prompt the user for parameters.
  12. The CALL (User Command) command shall enable the user to execute a command created using CMD (Define Command) and saved to the current project file.
  13. The RUNPGM (Run Saved Program) command shall enable the user to run a command that has been saved to a .pgm7 file (using the Classic Analysis command interpreter).
  14. The EXECUTE (Execute File) command shall enable the user to execute a Window executable (.exe) file.

Notes:

  1. The "Storing Output" command is only available for interactive use.  There is no corresponding Program Editor command.


Future Development

Functional Requirements

  1. The ROUND functional shall enable the user to specify rounding precision:
    1. to the left or right of a fixed decimal point given the variable and an exponent indicating rounding to the nearest 10n or, if negative, 10n th, i.e.ROUND( <var> [, <int>] )e.g.:
      1. ROUND( 3.1415926535, -4) = 3.1416,
      2. ROUND( 299792458, 3 ) = 299792000, or
    2. based on the desired number of significant figures, i.e.ROUND( <var>[[, <int>], "SIG"] )e.g.:
      1. ROUND( 3.1415926535, 3, "SIG" ) = 3.14,
      2. ROUND( 299792458, 3, "SIG" ) = 300000000.




Functions and Operators Reference

The Epi Info™ 7 User Guide provides a reference for Functions & Operators available in Classic Analysis. Each entry contains:

  1. a brief description of the command's purpose and function;
  2. the execution syntax, a formal description of the usage of command arguments and parameters, including constraints on allowed values (i.e., which are mandatory and which are optional);
  3. optional additional comments; and
  4. an example of command usage, including arguments and actual parameters.

The following table shows the available Classic Analysis (CA) functions and operators, describing the purpose and output of each, with links to the corresponding sections of the User Guide. 


Summary of Functions

Summary of Functions for Variable Manipulation

Classic Analysis Function

PurposeUser Guide Links

ABS

Returns |x|, the absolute value of x.CA
DAYReturns the numeric day given a date variable.CA
DAYS (date1, date2)

Returns the difference in days between two date or date/time variables (result is negative if date2 is before date1)

CA
EXISTSGiven a fully-qualified file name, returns TRUE if it exists, FALSE if it does not.CA
EXPReturns ex given numeric variable x.CA
FILEDATEGiven a fully-qualified file name, returns the date of creation or last modification (whichever is later).CA
FINDTEXTReturns the position (starting from 1) of the first matching character of a substring within another text string or zero if the substring is not found.CA
FORMATReturns a formatted string (text), given a variable of arbitrary type according to the supplied format specification; see link for additional information.CA
HOURReturns the numeric hour given a date/time or time variable.CA
HOURS (time1, time2)

Returns the difference in hours between two date/time or time variables (result is negative if time2 is before time1).

CA
LNReturns loge(x), the natural logarithm of x.CA
LOGReturns log10(x), the base 10 logarithm of x.CA
MINUTES (time1, time2)

Returns the difference in minutes between two date/time or time variables (result is negative if time2 is before time1).

CA
MONTHReturns the numeric day given a date variable.CA
MONTHS (date1, date2)

Returns the difference in months between two date or date/time variables (result is negative if date2 is before date1)

CA
NUMTODATEReturns the date corresponding to the three numeric arguments: year, month, and day.CA
NUMTOTIMEReturns the time corresponding to the three numeric arguments: hour, minute, and second.CA
RECORDCOUNTReturns the number of records in the current dataset view.CA
RNDReturns a random integer (based on the truncated [cf. TRUNC] value of a random floating-point number) within the range specified by the arguments: min and max.CA
ROUND

Returns an integer given a floating-point value such that:

if(x-TRUNC(x) < 0.5){ return(TRUNC(x)) }else{ return(TRUNC(x)+1) }.

CA
SECONDS (time1, time2)

Returns the difference in seconds between two date/time or time variables (result is negative if time2 is before time1).

CA
SIN, COS, TANReturns the values of the trigonometric functions sin(x), cos(x), or tan(x), where the numeric variable x represents degrees of arc.CA
SUBSTRINGReturns the substring from a text variable given position of the first character and the desired length.CA
SYSTEMDATEReturns the current system date to a date or date/time variable.CA
SYSTEMTIMEReturns the current system time to a time variable.CA
TRUNCReturns the integer portion of a floating point value.CA
TXTTODATEReturns a date variable corresponding to a text argument "in any format that can be recognized as a date", e.g. "7/4/1976", "Jul 4, 1976".CA
TXTTONUMReturns the numeric variable corresponding to the given text string.CA
UPPERCASEReturns the given text string with its alphabetic characters translated to uppercase.CA
YEARReturns the numeric year given a date variable.CA
YEARS(date1, date2)Returns the difference in years between two date or date/time variables (result is negative if date2 is before date1)CA


Summary of Operators

Summary of Operators for Variable Manipulation

Classic Analysis Operator

PurposeUser Guide Links
& (ampersand)string concatenationCA
= (equals sign, assignment)variable assignmentCA
+, -, *, /, % (arithmetic)addition, subtraction, multiplication, division, modulusCA
=, >, >=, <, <=, <>, LIKE (comparison)equals, greater than, greater than or equals, less than, less than or equals, not equals, approximate string match with wildcardsCA
LIKE (with SELECT)approximate string match with wildcards in SELECT statementsCA
ANDlogical ANDCA
NOTlogical negationCA
ORlogical ORCA
XORexclusive ORCA