
matread_file

Keywords: variables, files, input, output
This topic discusses the format of files to be read by matread() and
read().  They are plain text files which contain named data sets, each
starting with one or more header lines, such as are written by
matprint() and matwrite().

matread() and read() behave identically except that read() does not
print a warning message when the name requested belongs to a macro
rather than a data set.  In the following, you can substitute 'read()'
for 'matread()'.

No more than 50 numerical items can be on any single line of a data set
read by matread() or read().

A single file can contain one or more named data sets corresponding to
any type of variable except GRAPH.  This includes REAL, LOGICAL and
CHARACTER data, as well as NULL variables, macros and structures.  Data
sets can have coordinate labels and/or descriptive notes.  See topics
'variables', 'logic', 'NULL', 'notes'.  The file can also have macros
readable by macroread() mixed in among the data sets.  See topics
'macro_files' and macroread().

Every data set must have one or more header lines which give the data
set name and information about its structure and internal format.  The
first line starts with the name and is called the 'name line'.
matread(fileName, setName) searches the file for the first line which
starts with 'setName'.  Such a line is assumed to be the name line for
the data set.

Any data set may optionally be followed by a line starting '%setName%',
where 'setName' is its name.  When this is done, keyword 'ENDED' should
appear on the name line of the data set.  When the data set has
associated labels or notes, '%setName%' should follow them.  This usage
allows matread() to skip past a data set without examining individual
lines.  Even if a line in such a data set starts with the name of
another data set, matread() will not "see" it and mistakenly treat it as
a name line.

A line starting _E_N_D_O_F_M_A_C_R_O_S_ terminates a search for a data
set or a macro.  You can put help or other information after this line
without the danger that a line might be mistaken for the start of a data
set.
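
The search rules above can be sketched outside MacAnova.  The following
Python function is only an illustration, not MacAnova's own code, and
the name find_name_line is hypothetical.  It assumes that a name line is
recognized by comparing its first blank-delimited word with setName, and
that an ENDED data set is not skipped when the requested name is one of
its own '$' parts (such as x$LABELS), since those can be read directly.

```python
END_MARKER = "_E_N_D_O_F_M_A_C_R_O_S_"

def find_name_line(lines, set_name):
    """Return the index of the name line for set_name, or None."""
    target = set_name.lower()          # case is ignored in the search
    skip_until = None                  # '%name%' line that ends a skipped set
    for i, line in enumerate(lines):
        stripped = line.strip()
        if skip_until is not None:
            # Inside an ENDED data set: ignore lines up to its terminator.
            if stripped.lower().startswith(skip_until):
                skip_until = None
            continue
        if stripped.startswith(END_MARKER):
            return None                # nothing after this line is searched
        tokens = stripped.split()
        if not tokens or tokens[0].startswith(")"):
            continue                   # blank line or comment line
        name = tokens[0].lower()
        if name == target:
            return i
        ended = any(t.upper() == "ENDED" for t in tokens[1:])
        # Assumption: do not skip a set whose labels or notes are wanted.
        if ended and not target.startswith(name + "$"):
            skip_until = "%" + name + "%"
    return None
```

A real reader would also have to avoid treating ordinary data lines as
name lines; this sketch makes no attempt to do so.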

                       Format of data set header
The first header line for each data set starts with its name, followed
by dimension information and possibly keywords.  This line may be
optionally followed by descriptive comment lines starting with ')'.
These may also provide information on the number of values per line and
coding for MISSING values.  Thus a data set starts with the following
general form:
   Name Dims Keywords
   ) 0 or more descriptive or comment lines starting with ')', referred
   ) to as 'comment lines' below
   ) .....

Name is the name of the data set ('mydata', say) to be matched to
setName, the second argument to matread().  Case is ignored in searching
for the name, so that matread(fileName, "mydata") will find mydata,
MyData, or MYDATA, for example.

Dims, the dimensions of the data set, is a list of 1 or more positive
integers.  When Dims is a single number ('mydata 20'), the data set is a
vector of length Dims or a structure with Dims components.  When it is
two numbers, nrows and ncols ('mydata 20 5'), the data set is an nrows by
ncols matrix.  If Dims consists of p >= 3 numbers, the data set is a
p-dimensional array.

It is also acceptable for Dims to be '0', in which case no data is
expected.  When such a data set is read by matread(), the comment lines
are printed and NULL is returned.  A useful convention is to have the
first "data set" on a file be empty, with the comments describing the
remainder of the file.  See topic 'NULL'.
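
If you create such files from another program, the general form above is
easy to produce.  The following Python sketch, which is not part of
MacAnova, builds the text of a REAL matrix data set in the default row
by row layout.  The function name format_real_matrix and its arguments
are hypothetical, and the output should be checked with matread() before
you rely on it.

```python
def format_real_matrix(name, rows, comments=()):
    """Return the text of a REAL matrix data set, one row per line."""
    nrows = len(rows)
    ncols = len(rows[0]) if rows else 0
    if nrows == 0 or ncols == 0:
        raise ValueError("need at least one row and one column")
    if any(len(r) != ncols for r in rows):
        raise ValueError("all rows must have the same length")
    if ncols > 50:
        # Longer rows would need a format comment that splits each row
        # over several lines; that case is not handled in this sketch.
        raise ValueError("no more than 50 values are allowed per line")
    out = [f"{name} {nrows} {ncols}"]              # name line: Name Dims
    out += [") " + c for c in comments]            # optional comment lines
    out += [" ".join(repr(float(v)) for v in row) for row in rows]
    return "\n".join(out) + "\n"
```

For example, format_real_matrix("mydata", [[1, 2.5], [3, 4]]) yields the
name line 'mydata 2 2' followed by the lines '1.0 2.5' and '3.0 4.0'.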

        Keywords that may be put on the first line of the header
  Keyword                            Description
  CHARACTER           The data set is a CHARACTER variable in either "by
                      fields" or "by lines" format (see below)
  COLS or COLUMNS     The data follow in transposed form.  For a matrix,
                      this is in column by column order, each column
                      starting on a new line.
  ENDED               A line starting '%setName%', where 'setName' is
                      the name of the data set, immediately follows the
                      data set.  This does not affect how the data set
                      is read.
  FORMAT              Indicates that a Fortran format starting with '('
                      will follow the last comment line.  It is ignored
                      by MacAnova but might help a program written in
                      Fortran to read the data.
  LABELS              The data set has coordinate labels which follow
                      the data in the file (see below)
  LOCKED              If the result of matread(), macroread() or read()
                      is assigned to a variable, that variable will be
                      locked.  See topic 'locks'.
  LOGICAL             The data set is a LOGICAL variable, with False and
                      True represented as zero and non-zero values,
                      respectively.  Most commonly these are 0 and 1.
  NOTES               The data set has attached descriptive notes which
                      follow the data in the file (see below).
  NULL                The data set is a NULL variable, containing no
                      data, although there may be comment lines.  Dims
                      must be 0.  There can be no other keywords.  See
                      topic 'NULL'.
  QUOTED              The data set is a CHARACTER variable in "by quoted
                      fields" format (see below).
  REAL                The data set is a REAL variable.  This is the
                      default and hence REAL is never required.
  ROWS                The data follow a row at a time (constant value
                      for first subscript), each row starting on a new
                      line.  This is the default and hence ROWS is never
                      required.
  STRUCTURE           The data set is a structure.

Upper and lower case letters are not distinguished in these keywords,
so, for example, 'macro' and 'Macro' are both recognized as the same as
'MACRO'.
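
As an illustration of how a name line breaks down into Name, Dims and
Keywords, here is a small Python sketch.  It is not MacAnova's parser;
the function name parse_name_line is hypothetical, and it assumes that
Dims is the run of integers directly after the name and that everything
after that is a keyword.

```python
def parse_name_line(line):
    """Split a name line into (name, dims, keywords)."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty name line")
    name = tokens[0]
    dims = []
    i = 1
    # Dims: the integers that immediately follow the name.
    while i < len(tokens) and tokens[i].isdigit():
        dims.append(int(tokens[i]))
        i += 1
    if not dims:
        raise ValueError("name line has no dimensions")
    # Keywords are case insensitive, so store them in upper case.
    keywords = {t.upper() for t in tokens[i:]}
    if "COLS" in keywords:             # COLS and COLUMNS are synonyms
        keywords.discard("COLS")
        keywords.add("COLUMNS")
    return name, tuple(dims), keywords
```

Applied to 'sampledata 4 3 COLUMNS LABELS NOTES ENDED', this returns the
name 'sampledata', the dimensions (4, 3) and the four keywords.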

A vector (single dimension specified) is treated like a matrix with a
single column.  That is, if COLS or COLUMNS is specified, all elements
should be on one line; otherwise, every element must be on its own line.

                      Conventions on Comment lines
  ) LOGICAL      The data are to be interpreted as being LOGICAL, with
                 zero and non-zero values translated to F and T,
                 respectively.  This is retained for backward
                 compatibility and is ignored for CHARACTER, NULL or
                 STRUCTURE data sets.

  )"%f %f... %f" specifies a format for each row of a REAL or LOGICAL
                 data set that is analogous to that used by scanf in the
                 C programming language.  If the data set is CHARACTER,
                 '%f' is replaced by '%s'; see below.  Let N1 and Nk be
                 the first and last dimensions.  If there are fewer
                 "%f"'s than Nk (or fewer than N1 if COLS or COLUMNS is
                 specified), then this indicates that, for each value of
                 the first index (last index with COLS or COLUMNS),
                 there are several lines in the file containing data.
                 Each such line, except possibly the last, must have the
                 same number of data items as there are %f's.  If no
                 explicit format is given, one with Nk or N1 (if COLS or
                 COLUMNS is on line 1 of the header) %f's is assumed.
                 No more than 50 values can be put on a single line.

  )"NNx%f %f ... %f" where NN is an integer, causes the first NN
                 characters of each line to be skipped.  This allows you
                 to skip case labels or line numbers.  Example: )"12x%f
                 %f" skips 12 characters at the start of each line.

  )"%s %s ... %s" specifies a format for each row of a CHARACTER data
                 set.  If present, the data will be expected to be in
                 "by fields" format or "by quoted fields" (if QUOTED is
                 on header line) format (see below).  The number of
                 "%s"'s is the maximum number of elements that will be
                 read per line.

  )"NNx%s %s ...%s" where NN is an integer causes the first NN
                 characters of each line read to be skipped before
                 scanning for CHARACTER data in "by fields" or "by
                 quoted fields" format.

  ) MISSING XX   where XX is a number indicates that XX in the data set
                 is to be read as MISSING.  The default missing value
                 code is -99999.9999.  Because only integers can be
                 guaranteed to be represented exactly in the computer,
                 it is preferable for XX to be an integer, positive or
                 negative.  Example: ) MISSING -99 specifies MISSING is
                 coded as -99.  This is ignored for data sets that are
                 not REAL or LOGICAL.  MISSING must be all upper case.
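
The comment line conventions can be illustrated with a short Python
sketch.  It is not MacAnova code; the names parse_comments and
read_values are hypothetical.  It handles only the format and MISSING
conventions, and it uses NaN to stand in for MISSING.

```python
import math
import re

def parse_comments(comment_lines):
    """Return (skip, per_line, missing) found in ')' comment lines.

    per_line is None when no explicit format is given.
    """
    skip, per_line, missing = 0, None, -99999.9999
    for line in comment_lines:
        body = line.strip()[1:].strip()            # drop the leading ')'
        fmt = re.fullmatch(r'"(?:(\d+)x)?\s*(%[fs](?:\s*%[fs])*)\s*"', body)
        if fmt:
            skip = int(fmt.group(1) or 0)          # NN in "NNx..."
            per_line = fmt.group(2).count("%")     # number of %f's or %s's
            continue
        miss = re.match(r"MISSING\s+(\S+)", body)  # upper case only
        if miss:
            try:
                missing = float(miss.group(1))
            except ValueError:
                pass                               # ordinary comment text
    return skip, per_line, missing

def read_values(line, skip, missing):
    """Parse one line of REAL data; values equal to missing become NaN."""
    values = [float(tok) for tok in line[skip:].split()]
    return [math.nan if v == missing else v for v in values]
```

With the comment lines ') MISSING -99' and )"4x%f %f %f %f",
parse_comments() returns a skip of 4, 4 values per line and a missing
value code of -99.0.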

                         Character data formats
There are three possible formats for CHARACTER data, 'by lines', 'by
fields' and 'by quoted fields'.  'Quoted' means enclosed in "'s as in
"Regression analysis".

 By lines
   Each element starts on a new line and is not quoted.  If an element
   extends over more than one line, each line except the last must end
   with '\'.  This format is signaled by the presence of CHARACTER on
   the header and the absence of any ")%s..."  format among the comment
   lines.

 By fields
   Each stretch of "non-white" characters on a line is considered to be
   an element.  This format is signaled by the presence of CHARACTER on
   the header and the presence of a ")%s..." format among the comment
   lines.  The number of fields on a line is the number of "%s"'s in the
   format.  Commas are treated as non-white characters and do not
   separate fields.

 By quoted fields
   Each element must be enclosed in quotes ("...") and elements in a
   line are separated by spaces, tabs, and possibly a comma.  This
   format is signaled by the presence of QUOTED on the header.  If there
   is no ")%s..." format among the comment lines, the number of items
   expected per line is the size of the last dimension or 1 if the data
   set is a vector.  If COLUMNS or COLS is on the header, the default
   number expected per line is the size of the first dimension.  If
   there is a ")%s..."  format, the number of elements expected per line
   is no more than the number of "%s"'s in the format, with any
   additional lines, except the last, having the same number of
   elements.
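
The difference between "by fields" and "by quoted fields" can be seen in
the following Python sketch, which is an illustration only and not
MacAnova code.  Both function names are hypothetical.  The sketch
assumes that a quoted element contains no embedded double quote, since
this topic does not say how one would be written.

```python
import re

def split_by_fields(line, skip=0):
    """'By fields': every run of non-white characters is one element."""
    return line[skip:].split()

def split_by_quoted_fields(line, skip=0):
    """'By quoted fields': every "..." pair encloses one element."""
    return re.findall(r'"([^"]*)"', line[skip:])
```

In "by fields" format the line 'a,b c' gives the two elements 'a,b' and
'c', because a comma does not separate fields.  In "by quoted fields"
format, the spaces, tabs or comma between the quoted elements are simply
discarded.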

                     Format for structure data sets
The name line for a structure data set must be of the form
   strName   ncomps  STRUCTURE
where strName is the name of the structure and integer ncomps > 0 is the
number of components.  There must follow ncomps data sets, each in one
of the formats just described, or in the format for a structure.  Each
component must have a name of the form strName$compName, where compName
is the name of the component.  If a component is a structure, then the
names of its components would thus be strName$compName$compName1.

Each structure component *must* be preceded by at least one blank line.
An individual component can be read like any other data set by
specifying its full name, "mystruc$b", for example.

                      Format for labels and notes
If a variable with name "x", say, has coordinate labels (see topic
'labels'), the header line must contain keyword "LABELS", and the labels
for all coordinates must be in a CHARACTER vector with name "x$LABELS"
immediately following the data associated with x.  When ndims(x) > 1,
the labels for the first dimension come first, followed by those for the
second dimension, and so on, all in one vector.  Thus the length of the
vector normally matches the sum of the dimensions of x.  Because they
are written in the usual form for a CHARACTER vector, you can read the
labels without reading the data by matread(fileName, "x$LABELS").

Labels may be "expanded" similarly to the expansion done by setlabels().
Specifically, if the length of x$LABELS is less than the sum of the
dimensions of x, any label starting with '@' or of the form '(', '[',
'{', '<', '/', or '\' is expanded to the length of the appropriate
dimension.  For example, labels vector("@[", "X1", "X2") for a 10 by 2
matrix are equivalent to vector(rep("@[",10), "X1","X2").  See topics
'labels', setlabels().
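
The expansion rule can be sketched in Python as follows.  This is an
illustration, not MacAnova code, and the names are hypothetical.  It
assumes the labels are taken dimension by dimension and that a dimension
whose first available label is expandable is given that single label
repeated; how MacAnova itself resolves ambiguous cases is not described
here.

```python
EXPANDABLE = {"(", "[", "{", "<", "/", "\\"}

def is_expandable(label):
    """True for labels starting with '@' or equal to a bracket form."""
    return label.startswith("@") or label in EXPANDABLE

def expand_labels(labels, dims):
    """Expand a short label vector to length sum(dims)."""
    if len(labels) == sum(dims):
        return list(labels)            # already complete, nothing to do
    out, pos = [], 0
    for n in dims:
        if pos < len(labels) and is_expandable(labels[pos]):
            out += [labels[pos]] * n   # one label stands for the dimension
            pos += 1
        else:
            out += labels[pos:pos + n] # n ordinary labels
            pos += n
    if pos != len(labels) or len(out) != sum(dims):
        raise ValueError("labels do not match dimensions")
    return out
```

For a 10 by 2 matrix, expand_labels(["@[", "X1", "X2"], (10, 2)) gives
"@[" repeated 10 times followed by "X1" and "X2".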

If variable x has attached notes (see topic 'notes'), the header line
must contain keyword "NOTES", and the notes must be in a CHARACTER
vector with name "x$NOTES" immediately following the data or labels.
Because notes are written in the usual form for a CHARACTER vector, you
can read them without reading the data by matread(fileName, "x$NOTES").

                      Example data file, data.txt

  info           0
  ) Sample data file containing REAL data set sampledata,
  ) CHARACTER data set samplechars, and structure mystruc

  sampledata     4     3 COLUMNS LABELS NOTES ENDED
  ) Small REAL data set with one missing value coded as -99.
  ) Each line contains data for one column (COLUMNS on header)
  ) MISSING -99
  ) '4x' in the following format skips 4 characters (variable label)
  )"4x%f %f %f %f"
  Temp   34.5   45.2  23.1   20.1
  Conc   .170   -99  .883    .401
  Secs    3.5   4.7   3.2     5.8

  sampledata$LABELS    4   QUOTED COLUMNS
  ) Labels for sample data in quoted format by columns
  ) Labels are expanded to "@" "@" "@" "@" "Temp" "Conc" "Secs"
  )"%s %s %s %s"
   "@" "Temp" "Conc" "Secs"

  sampledata$NOTES     1 CHARACTER
  ) Notes for sampledata in "by lines" format
  Small REAL data set with one missing value.
  %sampledata%

  samplechars    2     4 CHARACTER
  ) 2 by 4 CHARACTER matrix with each row in 2 lines containing
  ) 3 and 1 unquoted fields
  )"%s %s %s"
  This is by-fields
  format
  without any double
  quotes

  mystruc  2  STRUCTURE
  ) this is a structure with two components, a and b
  ) The blank line before the header of each component is required

  mystruc$a  2 QUOTED COLUMNS
  ) character vector of length 2
  ) Two quoted fields
  "The quick brown fox" "Jumps over the lazy dog"

  mystruc$b  2 STRUCTURE
  ) This component is a structure with two components, pi and e

  mystruc$b$pi  1   1
  ) 1 by 1 matrix
  3.14159265358979

  mystruc$b$e   1
  ) vector of length 1
  2.71828182845905

              Examples of reading data sets from data.txt

  Cmd> sampledata <- matread("data.txt","sampledata", quiet:T)

  Cmd> print(sampledata) # note that labels were read
  sampledata:
              Temp         Conc         Secs
  (1)         34.5         0.17          3.5
  (2)         45.2      MISSING          4.7
  (3)         23.1        0.883          3.2
  (4)         20.1        0.401          5.8

  Cmd> getnotes(sampledata) # notes were retrieved as well as data
  (1) "Small REAL data set with one missing value."

  Cmd> notes <- matread("data.txt", "sampledata$notes"); notes
  (1) "Small REAL data set with one missing value."

  Cmd> samplechars <- matread("data.txt","samplechars",quiet:T)

  Cmd> print(samplechars)
  samplechars:
  (1,1) "This"
  (1,2) "is"
  (1,3) "by-fields"
  (1,4) "format"
  (2,1) "without"
  (2,2) "any"
  (2,3) "double"
  (2,4) "quotes"

  Cmd> mystruc <- matread("data.txt","mystruc",quiet:T)

  Cmd> print(mystruc)
  mystruc:
  component: a
  (1) "The quick brown fox"
  (2) "Jumps over the lazy dog"
  component: b
    component: pi
  (1,1)       3.1416
    component: e
  (1)       2.7183

  Cmd> mystruc_b <- matread("data.txt", "mystruc$b",quiet:T)

  Cmd> print(mystruc_b)
  mystruc_b:
  component: pi
  (1,1)       3.1416
  component: e
  (1)       2.7183

In these examples, 'quiet:T' suppresses echoing the header line and
comments.  See matread().

See also topics matread(), matprint(), matwrite(), 'files',
'macro_files'.


Gary Oehlert 2003-01-15