Setting Up an Aster Model
Data Sets in the Aster Package
The example data sets in the aster contributed package for R are shown in the aster package index, which as of version 0.7-4 listed the following data sets.
data set | description |
---|---|
aphid | Life History Data on Uroleucon rudbeckiae |
chamae | Life History Data on Chamaecrista fasciculata |
chamae2 | Life History Data on Chamaecrista fasciculata |
echin2 | Life History Data on Echinacea angustifolia |
echinacea | Life History Data on Echinacea angustifolia |
The echinacea data set was analyzed in the paper Geyer, Wagenius, and Shaw (2007), and complete details of the analysis are shown in Appendix D of University of Minnesota School of Statistics Technical Report 644.
All of the other data sets (aphid, chamae, chamae2, and echin2) were analyzed in the paper Shaw, Geyer, Wagenius, Hangelbroek, and Etterson (2008), and complete details of the analysis are shown in University of Minnesota School of Statistics Technical Reports 658, 661, and 666.
Here we will follow the analysis of the first paper because
- it is simpler, and
- the echinacea data set is in rawer form, about the form in which a typical user would enter data into R, whereas the others are more processed; the processing steps are ones we want to illustrate, not skip.
Looking at a Data Set
This shows the names of the variables in the data set and also their types (all are integer except for one, which is a factor).
The on-line help describes these variables.
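A quick way to look, assuming the aster package is installed (using sapply over the columns is one of several ways to list the types):

```r
library(aster)            # contributed package containing the data
data(echinacea)           # load the example data set
names(echinacea)          # variable names
sapply(echinacea, class)  # types: "integer" for all but one "factor"
```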
Reading in a Data Set
These data are provided in the source for the R package
as a plain text file
named "echinacea.txt"
in which columns are variables and each column is headed by the variable name.
If this file were your data, you could read it as follows, assuming the file is in the current working directory (under Linux) or in the directory set as the working directory using the menu items of the R GUI app (on OS X or Windows).
echinacea <- read.table("echinacea.txt", header = TRUE)
Reshaping the Data
The on-line
help for the aster
function shows an analysis using these
data in the examples section.
The first thing we need to do is reshape
the data.
- In the original form we have many response variables.
- In order to analyze the data all of these variables must be strung out into one vector.
- Moreover, each predictor variable must be duplicated so that each component of the new response vector corresponds to each of its predictors.
The reshape function (on-line help) does this job. It is not particularly easy to understand, but one can usually just follow this one example. The first part (before the comment) does the reshaping. The rest looks at what we have done.
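The reshaping can be sketched as follows, modeled on the example in the on-line help for the aster function; the vector vars of response variable names is an assumption matching the graph described in the next section:

```r
vars <- c("ld02", "ld03", "ld04", "fl02", "fl03", "fl04",
          "hdct02", "hdct03", "hdct04")
redata <- reshape(echinacea, varying = list(vars), direction = "long",
                  timevar = "varb", times = as.factor(vars), v.names = "resp")
# add the root variable, all components equal to one
redata <- data.frame(redata, root = 1)
# look at what we have done: a crude check that the lengths are right
nrow(redata) == nrow(echinacea) * length(vars)
```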
- The data frame redata has fewer variables than the original data frame echinacea, but each is a longer vector.
- All of the response variables in echinacea are gone. They have been incorporated in the single vector resp in redata. A crude check of the latter is that the lengths are right: nrow(redata) is equal to nrow(echinacea) * length(vars).
- Besides resp, there are three more variables in redata that are not in echinacea.
  - The variable varb is a factor vector which indicates, for each component of resp, which original variable (which column of echinacea) the response data came from.
  - The variable id is a numeric vector which indicates, for each component of resp, which original individual (which row of echinacea) the response data came from.
  - The variable root, which we explicitly created, has all components equal to one. The purpose of root will be explained in the next section.
Specifying an Aster Model
In order to specify an aster model, we need to specify the graph of the
graphical model and the families
, that is the conditional distributions
that correspond to the arrows of the model.
For these data, the graph is Figure 1.1 of TR 644 (p. 6), but the variable names are different. In that TR we used single-letter names with subscripts (as in math), whereas in R we use multicharacter variable names (as in computer languages). The correspondence is
math name | computer name |
---|---|
M2, …, M4 | ld02, …, ld04 |
F2, …, F4 | fl02, …, fl04 |
H2, …, H4 | hdct02, …, hdct04 |
The conditional distributions are all Bernoulli (binomial with sample size one) except for the arrows from fl02 to hdct02, and similarly with 2 replaced by 3 and 4 (three arrows in all), which are zero-truncated Poisson.
The following code sets up this graphical model (again copied from the on-line help for the aster function).
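A sketch of the setup, following the example in the aster help page (vars is defined here so the snippet is self-contained; in the package's default family coding, 1 is Bernoulli and 3 is zero-truncated Poisson):

```r
vars <- c("ld02", "ld03", "ld04", "fl02", "fl03", "fl04",
          "hdct02", "hdct03", "hdct04")
# pred[j] gives the index in vars of the predecessor of vars[j],
# with 0 meaning the predecessor is the root
pred <- c(0, 1, 2, 1, 2, 3, 4, 5, 6)
# fam[j] gives the family code for the arrow into vars[j]
fam <- c(1, 1, 1, 1, 1, 1, 3, 3, 3)
```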
- The fam vector says that all the ld0x and fl0x variables are conditionally Bernoulli given their predecessors. This is because the corresponding element of fam is 1, which is the default code for Bernoulli. (This default can be changed, if necessary.) The hdct0x variables are conditionally zero-truncated Poisson given their predecessors, because the corresponding element of fam is 3, the default code for zero-truncated Poisson.
- The predecessor of the first element of vars, which is ld02, is root, because zero in the pred vector is the code for root, and the corresponding root value is the constant 1 (because we set all elements of the root variable in our data frame to be 1); hence this is really an unconditional distribution, because it is constant.
- The predecessor of the second element of vars, which is ld03, is the first element of vars, which is ld02, because the second element of the pred vector is 1, indicating the first element of the vars vector.
- And so forth.
Fitting Aster Models
Now that we are all set up, we are ready to fit some models.
If we were using R, these variables would be defined and ready for use. Because we are using Rweb, which doesn't remember anything between submissions, we have to put everything above in each Rweb form (set up everything each time).
Rather than fit the models in the on-line help for the aster function, we will fit the models in the Biometrika paper. These are explained in the paper and in Chapter 1 of TR 644 (which is the first draft of the paper), and all of the model fitting is shown in gory detail in Appendix D of TR 644.
Little Model
We start with the model with formula
resp ~ varb + level:(nsloc + ewloc)
The variables varb, nsloc, and ewloc are already in redata; the variable level is defined as follows.
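One way to define it, a sketch assuming the layer can be read off by stripping the trailing digits from varb (so ld02 and ld03 both map to ld, and so forth):

```r
# strip digits from the node labels, leaving the layer: ld, fl, or hdct
level <- gsub("[0-9]", "", as.character(redata$varb))
redata <- data.frame(redata, level = as.factor(level))
```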
So level indicates the layer of the graph, or the kind of variable (mortality indicator, flowering indicator, head count) of each component of the response. The model formula above includes the following terms.
- varb, which is a factor that indicates which original variable (which node of the graph) an element of the response corresponds to. This is tantamount to having a different intercept for each node of the graph, which makes sense because they are different kinds of variables having nothing to do with each other (except for being measurements on the same individual).
- level:(nsloc + ewloc), which is the interaction of
  - level, which is a factor that indicates which layer of the graph an element of the response corresponds to,
  - nsloc, which is the north-south location of the individual, and
  - ewloc, which is the east-west location of the individual.
The term level:(nsloc + ewloc)
indicates the same
thing as the two terms
level:nsloc + level:ewloc
.
The interaction of a factor and a numerical variable (like these)
is the same thing as saying the linear predictor depends linearly on
nsloc
and ewloc
and the linear relationship
has a different slope in each layer
(and also a different intercept
because varb
is in the model).
Enough preliminaries, ready to fit a model!
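A sketch of the fit; the object name aout1 is our choice, and the call follows the aster function's argument order aster(formula, pred, fam, varvar, idvar, root, data):

```r
aout1 <- aster(resp ~ varb + level:(nsloc + ewloc),
               pred, fam, varb, id, root, data = redata)
summary(aout1, show.graph = TRUE)
```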
From looking at the signif stars
it doesn't look like all these
parameters are necessary, but that is a bogus approach to model selection.
A careful investigation detailed in
Appendix D of
TR 644 shows that this is the most reasonable model
that does not include pop
as a predictor, and hence the
sensible smallest model under discussion.
Middle Model
We continue with the model specified in the paper and tech report by formula
resp ~ varb + level:(nsloc + ewloc) + hdct * pop - pop
but it is perhaps more simply specified by the formula
resp ~ varb + level:(nsloc + ewloc) + hdct:pop
which involves another new variable hdct, the indicator of the head count level, created as follows.
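One way to create it (a sketch; it flags the components of the response that lie in the head count layer):

```r
# indicator: 1 for components of resp at the hdct nodes, 0 elsewhere
hdct <- as.integer(grepl("hdct", as.character(redata$varb)))
redata <- data.frame(redata, hdct = hdct)
```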
Let's try it.
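A sketch of the fit, assuming the setup above (aout2 is our name for the fitted object):

```r
aout2 <- aster(resp ~ varb + level:(nsloc + ewloc) + hdct:pop,
               pred, fam, varb, id, root, data = redata)
summary(aout2)
```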
The variable pop
is scientifically important. It is the
prairie remnant from which the individual came. All of the individuals
were grown in a common garden, but grown from seeds that came from different
remnants. The question of interest is whether population of origin has an
effect on fitness. Because we are fitting unconditional aster models
(the default), we want to put pop
in the model only at the
bottom layer of the graphical model, at the head count nodes, the sum of
which is observed fitness (pedantically, the best available surrogate
or proxy for fitness). The best explanation of this issue is
Section 3.2 of
TR 669.
Tests of Model Comparison
Again from looking at the signif stars
it doesn't seem there is
much going on with these populations, but that is not the right way to
do a test of model comparison.
This is the right way.
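The right way is a likelihood ratio test comparing the two fitted models, assuming they are stored in aout1 and aout2 as above:

```r
# likelihood ratio test: smaller model first, then larger model
anova(aout1, aout2)
```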
The anova
command shows the larger model fits considerably
better than the smaller one. Hence population has a statistically
significant effect on fitness (P = 0.01).
Confidence Intervals for Means
We also want to look at mean fitness in the various populations.
We want mean fitness for one typical
individual in each population.
We take a typical
individual to be one having nsloc
and ewloc
equal to zero, which is in the center of the garden.
To predict for these individuals, which don't exist, we create a new data frame that gives the data for them. We create response data for them, although it is not used in the prediction. This response data is checked for having possible values, so we make every value equal to one (which is a possible value for all the distributions we are using). We start with data in wide format and reshape it, just as we did for the real data.
We have new data for seven individuals, one for each remnant population.
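A sketch of this construction; it assumes pop is the population factor in echinacea and that vars (defined above) lists the nodes of the graph:

```r
# one hypothetical individual per remnant population, at the garden center
newdata <- data.frame(pop = levels(echinacea$pop))
for (v in vars) newdata[[v]] <- 1  # dummy responses, all possible values
newdata$root <- 1
newdata$ewloc <- 0
newdata$nsloc <- 0
# reshape to long format, just as for the real data
renewdata <- reshape(newdata, varying = list(vars), direction = "long",
                     timevar = "varb", times = as.factor(vars),
                     v.names = "resp")
# the model formula also needs level and hdct
renewdata <- data.frame(renewdata,
    level = as.factor(gsub("[0-9]", "", as.character(renewdata$varb))),
    hdct = as.integer(grepl("hdct", as.character(renewdata$varb))))
```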
Now we have a problem: we don't want to predict the mean for one response variable, but for the sum of three. Fortunately, this problem has been foreseen. The predict function for aster models allows this. (The predict functions for linear and generalized linear models should also do this, but don't.)
Unfortunately, we stumble here over one of the really bad design decisions of the programmer (Geyer) of the aster software. In theory, what we want is if ζ is the vector of predicted thingummies (conditional or unconditional means or conditional or unconditional canonical parameters) and A is a given matrix, then we want a prediction of the linear function A ζ.
For some reason that sort of made sense until thought about deeply, the matrix A must be supplied as a three-way array.
- The first index runs over what would be the row index if A were an actual matrix; that is, it runs over the things being predicted, here fitness for the seven individuals, one representing each population.
- The second and third indices run over what would be the column index if A were an actual matrix; that is, they run over elements of the response vector.
  - The second index runs over individuals.
  - The third index runs over nodes of the graph, labeled by components of vars.
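A sketch of setting up the array and calling predict; it assumes the fit aout2 and the new data renewdata from above, and that the array is passed through the amat argument with se.fit = TRUE to get standard errors:

```r
nind <- nrow(newdata)   # seven individuals, one per population
nnode <- length(vars)   # nine nodes of the graph
amat <- array(0, c(nind, nind, nnode))
# predicted fitness for individual k is the sum of its head count nodes
for (k in 1:nind)
    amat[k, k, grep("hdct", vars)] <- 1
pout <- predict(aout2, varvar = varb, idvar = id, root = root,
                newdata = renewdata, se.fit = TRUE, amat = amat)
pout$fit     # estimated mean fitness, one per population
pout$se.fit  # corresponding standard errors
```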