Keywords:
variables, files, input, output
This topic discusses the format of files to be read by matread() and
read(). They are plain text files which contain named data sets, each
starting with one or more header lines, such as are written by
matprint() and matwrite().
matread() and read() behave identically except that read() does not
print a warning message when the name requested belongs to a macro
rather than a data set. In the following, you can substitute 'read()'
for 'matread()'.
No more than 50 numerical items can be on any single line of a data set
read by matread() or read().
A single file can contain one or more named data sets corresponding to
any type of variable except GRAPH. This includes REAL, LOGICAL and
CHARACTER data, as well as NULL variables, macros and structures. Data
sets can have coordinate labels and/or descriptive notes. See topics
'variables', 'logic', 'NULL', 'notes'. The file can also have macros
readable by macroread() mixed in among the data sets. See topics
'macro_files' and macroread().
Every data set must have one or more header lines which give the data
set name, information about its structure and internal format. The
first line starts with the name and is called the 'name line'.
matread(fileName, setName) searches the file for the first line which
starts with 'setName'. Such a line is assumed to be the name line for
the data set.
Any data set may optionally be followed by a line starting '%setName%',
where 'setName' is its name. When this is done, keyword 'ENDED' should
appear on the name line of the data set. When the data set has
associated labels or notes, '%setName%' should follow them. This usage
allows matread() to skip past a data set without examining individual
lines. Even if a line in such a data set starts with the name of
another data set, matread() will not "see" it and mistakenly treat it as
a name line.
A line starting _E_N_D_O_F_M_A_C_R_O_S_ terminates a search for a data
set or a macro. You can put help or other information after this line
without the danger that a line might be mistaken for the start of a data
set.
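The search rules above can be sketched in Python. This is only an illustration, not MacAnova's actual code. The function name find_name_line is hypothetical. Comparing the whole first word rather than a prefix is an assumption, and so is the rule that a '%setName%' skip is not applied when the requested name is a part of that data set such as 'setName$NOTES'.

```python
END_MARKER = "_E_N_D_O_F_M_A_C_R_O_S_"

def find_name_line(lines, set_name):
    """Return the index of the name line for set_name, or None if not found."""
    target = set_name.lower()          # case is ignored when matching names
    skip_until = None                  # '%name%' terminator being skipped to
    for i, line in enumerate(lines):
        if skip_until is not None:
            # Inside an ENDED data set that was not requested: ignore
            # every line up to and including its '%name%' terminator.
            if line.lower().startswith(skip_until):
                skip_until = None
            continue
        if line.startswith(END_MARKER):
            return None                # nothing after this marker is searched
        tokens = line.split()
        if not tokens or tokens[0].startswith(")"):
            continue                   # blank line or comment line
        name = tokens[0].lower()
        if name == target:
            return i
        ended = any(t.upper() == "ENDED" for t in tokens[1:])
        # Assumption: do not skip when looking for a part of this data set,
        # e.g. 'x$LABELS' or 'x$NOTES', which precedes '%x%'.
        if ended and not target.startswith(name + "$"):
            skip_until = "%" + name + "%"
    return None
```

With this sketch, a data line that happens to start with the requested name is ignored when it lies between an ENDED name line and the matching '%setName%' line.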
Format of data set header
The first header line for each data set starts with its name, followed
by dimension information and possibly keywords. This line may be
optionally followed by descriptive comment lines starting with ')'.
These may also provide information on the number of values per line and
coding for MISSING values. Thus a data set starts with the following
general form:
Name Dims Keywords
) 0 or more descriptive or comment lines starting with ')', referred
) to as 'comment lines' below
) .....
Name is the name of the data set ('mydata', say) to be matched to
setName, the second argument to matread(). Case is ignored in searching
for the name, so that matread(fileName, "mydata") will find mydata,
MyData, or MYDATA, for example.
Dims, the dimensions of the data set, is a list of 1 or more positive
integers. When Dims is a single number ('mydata 20'), the data set is a
vector of length Dims or a structure with Dims components. When it is
two numbers, nrows and ncols, ('mydata 20 5') the data set is an nrows by
ncols matrix. If Dims consists of p >= 3 numbers, the data set is a
p-dimensional array.
It is also acceptable for Dims to be '0', in which case no data is
expected. When such a data set is read by matread(), the comment lines
are printed and NULL is returned. A useful convention is to have the
first "data set" on a file be empty, with the comments describing the
remainder of the file. See topic 'NULL'.
Keywords that may be put on the first line of the header
Keyword          Description
CHARACTER        The data set is a CHARACTER variable in either "by
                 fields" or "by lines" format (see below).
COLS or COLUMNS  The data follow in transposed form. For a matrix,
                 this is in column by column order, each column
                 starting on a new line.
ENDED            A line starting '%setName%', where 'setName' is
                 the name of the data set, immediately follows the
                 data set. This does not affect how the data set
                 is read.
FORMAT           Indicates that a Fortran format starting with '('
                 will follow the last comment line. It is ignored
                 by MacAnova but might help a program written in
                 Fortran to read the data.
LABELS           The data set has coordinate labels which follow
                 the data in the file (see below).
LOCKED           If the result of matread(), macroread() or read()
                 is assigned to a variable, that variable will be
                 locked. See topic 'locks'.
LOGICAL          The data set is a LOGICAL variable, with False and
                 True represented as zero and non-zero values,
                 respectively. Most commonly these are 0 and 1.
NOTES            The data set has attached descriptive notes which
                 follow the data in the file (see below).
NULL             The data set is a NULL variable, containing no
                 data, although there may be comment lines. Dims
                 must be 0. There can be no other keywords. See
                 topic 'NULL'.
QUOTED           The data set is a CHARACTER variable in "by quoted
                 fields" format (see below).
REAL             The data is a REAL variable. This is the default
                 and hence REAL is never required.
ROWS             The data follow a row at a time (constant value
                 for first subscript), each row starting on a new
                 line. This is the default and hence ROWS is never
                 required.
STRUCTURE        The data set is a structure.
Upper and lower case letters are not distinguished in these keywords,
so, for example, 'macro' and 'Macro' are both recognized as the same as
'MACRO'.
A vector (single dimension specified) is treated like a matrix with a
single column. That is, if COLS or COLUMNS is specified, all elements
should be on one line; otherwise, every element must be on a separate
line.
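As a rough illustration of the 'Name Dims Keywords' layout, the following Python sketch splits a name line into its parts and derives the default number of values per line. The function parse_name_line and the keys of the dictionary it returns are hypothetical names chosen for this example; MacAnova itself may interpret the line differently in edge cases.

```python
def parse_name_line(line):
    """Split a name line into name, dimensions and upper-cased keywords."""
    tokens = line.split()
    name, rest = tokens[0], tokens[1:]
    dims = []
    while rest and rest[0].isdigit():      # Dims: leading integers
        dims.append(int(rest.pop(0)))
    keywords = {t.upper() for t in rest}   # keyword case is not significant
    by_columns = bool(keywords & {"COLS", "COLUMNS"})

    if "STRUCTURE" in keywords:
        kind = "STRUCTURE"
    elif "NULL" in keywords or dims == [0]:
        kind = "NULL"                      # Dims of 0: no data expected
    elif keywords & {"CHARACTER", "QUOTED"}:
        kind = "CHARACTER"
    elif "LOGICAL" in keywords:
        kind = "LOGICAL"
    else:
        kind = "REAL"                      # REAL is the default

    per_line = None
    if kind not in ("STRUCTURE", "NULL"):
        # A vector is treated like a matrix with a single column.
        shape = dims if len(dims) > 1 else [dims[0], 1]
        per_line = shape[0] if by_columns else shape[-1]
    return {"name": name, "dims": dims, "keywords": keywords,
            "kind": kind, "by_columns": by_columns, "per_line": per_line}
```

For example, parse_name_line("mydata 20 5") reports a REAL data set with dims [20, 5] and 5 values expected per line, while parse_name_line("mydata 20 COLS") reports 20 values on one line.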
Conventions on Comment lines
) LOGICAL           The data are to be interpreted as being LOGICAL, with
                    zero and non-zero values translated to F and T,
                    respectively. This is retained for backward
                    compatibility and is ignored for CHARACTER, NULL or
                    STRUCTURE data sets.
)"%f %f ... %f"     specifies a format for each row of a REAL or LOGICAL
                    data set that is analogous to that used by scanf in the
                    C programming language. If the data set is CHARACTER,
                    '%f' is replaced by '%s'; see below. Let N1 and Nk be
                    the first and last dimensions. If there are fewer
                    "%f"'s than Nk (or fewer than N1 if COLS or COLUMNS is
                    specified), then this indicates that, for each value of
                    the last index (first index with COLS or COLUMNS),
                    there are several lines in the file containing data.
                    Each such line, except possibly the last, must have the
                    same number of data items as there are %f's. If no
                    explicit format is given, one with Nk or N1 (if COLS or
                    COLUMNS is on line 1 of the header) %f's is assumed.
                    No more than 50 values can be put on a single line.
)"NNx%f %f ... %f"  where NN is an integer, causes the first NN
                    characters of each line to be skipped. This allows you
                    to skip case labels or line numbers. Example:
                    )"12x%f %f" skips 12 characters at the start of each
                    line.
)"%s %s ... %s"     specifies a format for each row of a CHARACTER data
                    set. If present, the data will be expected to be in
                    "by fields" format or "by quoted fields" (if QUOTED is
                    on header line) format (see below). The number of
                    "%s"'s is the maximum number of elements that will be
                    read per line.
)"NNx%s %s ... %s"  where NN is an integer causes the first NN
                    characters of each line read to be skipped before
                    scanning for CHARACTER data in "by fields" or "by
                    quoted fields" format.
) MISSING XX        where XX is a number indicates that XX in the data set
                    is to be read as MISSING. The default missing value
                    code is -99999.9999. Because only integers can be
                    guaranteed to be represented exactly in the computer,
                    it is preferable for XX to be an integer, positive or
                    negative. Example: ) MISSING -99 specifies MISSING is
                    coded as -99. This is ignored for data sets that are
                    not REAL or LOGICAL. MISSING must be all upper case.
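The comment-line conventions can be illustrated with a small Python sketch that scans the comment lines of one data set for a format, a MISSING code and the LOGICAL flag. The name parse_comments and the returned keys are hypothetical, and the regular expressions are only one plausible reading of the rules above. In particular, matching ') LOGICAL' only in upper case is an assumption, since the text requires upper case only for MISSING.

```python
import re

# )"NNx%f %f ... %f" or )"%s %s ... %s": optional skip count, then specifiers
FORMAT_RE = re.compile(r'^\)"(\d+x)?\s*(%[fs](?:\s*%[fs])*)\s*"')
MISSING_RE = re.compile(r'^\)\s*MISSING\s+(\S+)')   # MISSING must be upper case
LOGICAL_RE = re.compile(r'^\)\s*LOGICAL\b')

def parse_comments(comment_lines):
    """Collect format, MISSING code and LOGICAL flag from comment lines."""
    info = {"skip": 0, "per_line": None, "conv": None,
            "missing": -99999.9999, "logical": False}
    for line in comment_lines:
        m = FORMAT_RE.match(line)
        if m:
            info["skip"] = int(m.group(1)[:-1]) if m.group(1) else 0
            specs = re.findall(r"%[fs]", m.group(2))
            info["per_line"] = len(specs)   # number of %f's or %s's
            info["conv"] = specs[0][1]      # 'f' (numeric) or 's' (CHARACTER)
            continue
        m = MISSING_RE.match(line)
        if m:
            try:
                info["missing"] = float(m.group(1))
            except ValueError:
                pass                        # descriptive text, not a code
            continue
        if LOGICAL_RE.match(line):
            info["logical"] = True
    return info
```

Lines that match none of the patterns are treated as purely descriptive comments and leave the defaults in place.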
Character data formats
There are three possible formats for CHARACTER data, 'by lines', 'by
fields' and 'by quoted fields'. 'Quoted' means enclosed in "'s as in
"Regression analysis".
By lines
Each element starts on a new line and is not quoted. If an element
extends over more than one line, each line except the last must end
with '\'. This format is signaled by the presence of CHARACTER on
the header and the absence of any ")%s..." format among the comment
lines.
By fields
Each stretch of "non-white" characters on a line is considered to be
an element. This format is signaled by the presence of CHARACTER on
the header and the presence of a ")%s..." format among the comment
lines. The number of fields on a line is the number of "%s"'s in the
format. Commas are treated as non-white characters and do not
separate fields.
By quoted fields
Each element must be enclosed in quotes ("...") and elements in a
line are separated by spaces, tabs, and possibly a comma. This
format is signaled by the presence of QUOTED on the header. If there
is no ")%s..." format among the comment lines, the number of items
expected per line is the size of the last dimension or 1 if the data
set is a vector. If COLUMNS or COLS is on the header, the default
number expected per line is the size of the first dimension. If
there is a ")%s..." format, the number of elements expected per line
is no more than the number of "%s"'s in the format, with any
additional lines, except the last, having the same number of
elements.
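The three CHARACTER layouts differ mainly in how a line is cut into elements. The Python sketch below shows one way to express each rule. The function names are hypothetical, quotes embedded inside a quoted field are not handled, and joining continued lines with a newline is an assumption, since the text above does not say how the pieces of a continued element are combined.

```python
import re

def split_by_fields(line, max_fields):
    """'By fields': each run of non-white characters is one element.
    Commas are ordinary characters and do not separate fields."""
    return line.split()[:max_fields]

def split_quoted_fields(line):
    """'By quoted fields': elements are enclosed in double quotes and
    separated by spaces, tabs and possibly a comma."""
    return re.findall(r'"([^"]*)"', line)

def join_by_lines(lines):
    """'By lines': one element per line; a trailing backslash means the
    element continues on the next line (joined here with a newline)."""
    elements, current = [], ""
    for line in lines:
        if line.endswith("\\"):
            current += line[:-1] + "\n"
        else:
            elements.append(current + line)
            current = ""
    if current:                      # last line ended with a backslash
        elements.append(current.rstrip("\n"))
    return elements
```

For instance, split_by_fields("a,b c", 3) gives two elements, "a,b" and "c", whereas split_quoted_fields('"a", "b c"') gives "a" and "b c".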
Format for structure data sets
The name line for a structure data set must be of the form
strName ncomps STRUCTURE
where strName is the name of the structure and integer ncomps > 0 is the
number of components. There must follow ncomps data sets, each in one
of the formats just described, or in the format for a structure. Each
component must have a name of the form strName$compName, where compName
is the name of the component. If a component is a structure, then the
names of its components would thus be strName$compName$compName1.
Each structure component *must* be preceded by at least one blank line.
An individual component can be read like any other data set by
specifying its full name, "mystruc$b", for example.
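The naming scheme strName$compName$compName1 encodes the nesting of components in the names themselves. As a sketch of that idea, the following Python function rebuilds nested dictionaries from full names. The name nest_components is hypothetical, and the sketch assumes it is given only the non-structure ("leaf") components, each already read into a Python value.

```python
def nest_components(named_values):
    """Build nested dicts from (full_name, value) pairs.

    Each full name is split at '$'; all parts but the last name the
    enclosing structures, and the last part names the component.
    """
    root = {}
    for full_name, value in named_values:
        parts = full_name.split("$")
        node = root
        for part in parts[:-1]:
            node = node.setdefault(part, {})   # create structure on demand
        node[parts[-1]] = value
    return root
```

Given the pairs ("mystruc$a", ...), ("mystruc$b$pi", ...) and ("mystruc$b$e", ...), the result has one top-level entry "mystruc" whose component "b" is itself a dictionary with entries "pi" and "e".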
Format for labels and notes
If a variable with name "x", say, has coordinate labels (see topic
'labels'), the header line must contain keyword "LABELS", and the labels
for all coordinates must be in a CHARACTER vector with name "x$LABELS"
immediately following the data associated with x. When ndims(x) > 1,
the labels for the first dimension come first, followed by those for the
second dimension, and so on, all in one vector. Thus the length of the
vector normally matches the sum of the dimensions of x. Because they
are written in the usual form for a CHARACTER vector, you can read the
labels without reading the data by matread(fileName, "x$LABELS").
Labels may be "expanded" similarly to the expansion done by setlabels().
Specifically, if the number of labels in x$LABELS is less than the sum of
dimensions of x, any label starting with '@' or of the form '(', '[',
'{', '<', '/', or '\' is expanded to the length of the appropriate
dimension. For example, labels vector("@[", "X1", "X2") for a 10 by 2
matrix are equivalent to vector(rep("@[",10), "X1","X2"). See topics
'labels', setlabels().
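The expansion rule can be sketched in Python as follows. The function names are hypothetical, and the way the sketch decides which label belongs to which dimension (walking through the dimensions in order and treating an expandable label as standing for a whole dimension) is an assumption; the text above does not spell out how ambiguous cases are resolved.

```python
EXPANDABLE = {"(", "[", "{", "<", "/", "\\"}

def is_expandable(label):
    """True for labels starting with '@' or equal to one of ( [ { < / \\ ."""
    return label.startswith("@") or label in EXPANDABLE

def expand_labels(labels, dims):
    """Expand a short label vector to sum(dims) labels, dimension by dimension."""
    if len(labels) >= sum(dims):
        return list(labels)              # already complete: nothing to expand
    out, pos = [], 0
    for n in dims:
        if pos < len(labels) and is_expandable(labels[pos]):
            out.extend([labels[pos]] * n)    # one label stands for n copies
            pos += 1
        else:
            out.extend(labels[pos:pos + n])  # n explicit labels
            pos += n
    return out
```

With labels "@[", "X1", "X2" and dimensions 10 and 2, the sketch returns ten copies of "@[" followed by "X1" and "X2", matching the example given above.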
If variable x has attached notes (see topic 'notes'), the header line
must contain keyword "NOTES", and the notes must be in a CHARACTER
vector with name "x$NOTES" immediately following the data or labels.
Because notes are written in the usual form for a CHARACTER vector, you
can read them without reading the data by matread(fileName, "x$NOTES").
Example data file, data.txt
info 0
) Sample data file containing REAL data set sampledata,
) CHARACTER data set samplechars, and structure mystruc
sampledata 4 3 COLUMNS LABELS NOTES ENDED
) Small REAL data set with one missing value coded as -99.
) Each line contains data for one column (COLUMNS on header)
) MISSING -99
) '4x' in the following format skips 4 characters (variable label)
)"4x%f %f %f %f"
Temp 34.5 45.2 23.1 20.1
Conc .170 -99 .883 .401
Secs 3.5 4.7 3.2 5.8
sampledata$LABELS 4 QUOTED COLUMNS
) Labels for sample data in quoted format by columns
) Labels are expanded to "@" "@" "@" "@" "Temp" "Conc" "Secs"
)"%s %s %s %s"
"@" "Temp" "Conc" "Secs"
sampledata$NOTES 1 CHARACTER
) Notes for sampledata in "by lines" format
Small REAL data set with one missing value.
%sampledata%
samplechars 2 4 CHARACTER
) 2 by 4 CHARACTER matrix with each row in 2 lines containing
) 3 and 1 unquoted fields
)"%s %s %s"
This is by-fields
format
without any double
quotes
mystruc 2 STRUCTURE
) this is a structure with two components, a and b
) The blank line before the header of each component is required

mystruc$a 2 QUOTED COLUMNS
) character vector of length 2
) Two quoted fields
"The quick brown fox" "Jumps over the lazy dog"

mystruc$b 2 STRUCTURE
) This component is a structure with two components, pi and e

mystruc$b$pi 1 1
) 1 by 1 matrix
3.14159265358979

mystruc$b$e 1
) vector of length 1
2.71828182845905
Examples of reading data sets from data.txt
Cmd> sampledata <- matread("data.txt","sampledata", quiet:T)
Cmd> print(sampledata) # note that labels were read
sampledata:
           Temp        Conc        Secs
(1)        34.5        0.17         3.5
(2)        45.2     MISSING         4.7
(3)        23.1       0.883         3.2
(4)        20.1       0.401         5.8
Cmd> getnotes(sampledata) # notes were retrieved as well as data
(1) "Small REAL data set with one missing value."
Cmd> notes <- matread("data.txt", "sampledata$notes"); notes
(1) "Small REAL data set with one missing value."
Cmd> samplechars <- matread("data.txt","samplechars",quiet:T)
Cmd> print(samplechars)
samplechars:
(1,1) "This"
(1,2) "is"
(1,3) "by-fields"
(1,4) "format"
(2,1) "without"
(2,2) "any"
(2,3) "double"
(2,4) "quotes"
Cmd> mystruc <- matread("data.txt","mystruc",quiet:T)
Cmd> print(mystruc)
mystruc:
component: a
(1) "The quick brown fox"
(2) "Jumps over the lazy dog"
component: b
component: pi
(1,1) 3.1416
component: e
(1) 2.7183
Cmd> mystruc_b <- matread("data.txt", "mystruc$b",quiet:T)
Cmd> print(mystruc_b)
mystruc_b:
component: pi
(1,1) 3.1416
component: e
(1) 2.7183
In these examples, 'quiet:T' suppresses echoing the header line and
comments. See matread().
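To make the handling of COLUMNS, the '4x' skip and the MISSING code in sampledata concrete, here is a Python sketch that turns the three data lines into the 4 by 3 table printed above. It is not MacAnova's reader. The function read_real is a hypothetical helper, it handles only two-dimensional REAL data, and it represents MISSING as None.

```python
def read_real(data_lines, dims, per_line, skip=0,
              missing=-99999.9999, by_columns=False):
    """Read a 2-D REAL data set into a list of rows; MISSING becomes None."""
    nrows, ncols = dims
    values = []
    for line in data_lines:
        # Drop the first 'skip' characters, then take at most per_line items.
        for token in line[skip:].split()[:per_line]:
            x = float(token)
            values.append(None if x == missing else x)
    if len(values) != nrows * ncols:
        raise ValueError("wrong number of values for the stated dimensions")
    if by_columns:
        # File order is column by column: column c holds values
        # c*nrows .. c*nrows + nrows - 1.
        return [[values[c * nrows + r] for c in range(ncols)]
                for r in range(nrows)]
    return [values[r * ncols:(r + 1) * ncols] for r in range(nrows)]

sample_lines = [
    "Temp 34.5 45.2 23.1 20.1",
    "Conc .170 -99 .883 .401",
    "Secs 3.5 4.7 3.2 5.8",
]
table = read_real(sample_lines, (4, 3), per_line=4, skip=4,
                  missing=-99, by_columns=True)
```

Here table[1] is [45.2, None, 4.7], corresponding to row (2) of the printed sampledata, with the -99 code read as MISSING.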
See also topics matread(), matprint(), matwrite(), 'files',
'macro_files'.
Gary Oehlert
2003-01-15