Setting Up an Aster Model

Data Sets in the Aster Package

The example data sets in the aster contributed package for R are shown in the aster package index, which as of version 0.7-4 showed the following data sets

aphid Life History Data on Uroleucon rudbeckiae
chamae Life History Data on Chamaecrista fasciculata
chamae2 Life History Data on Chamaecrista fasciculata
echin2 Life History Data on Echinacea angustifolia
echinacea Life History Data on Echinacea angustifolia

The echinacea dataset was analyzed in the paper Geyer, Wagenius, and Shaw (2007) and complete details of the analysis are shown in Appendix D of University of Minnesota School of Statistics Technical Report 644.

All of the other data sets (aphid, chamae, chamae2, and echin2) were analyzed in the paper Shaw, Geyer, Wagenius, Hangelbroek, and Etterson (2008) and complete details of the analysis are shown in Univ. of Minnesota Sch. of Statistics Technical Reports 658 and 661 and 666.

Here we will follow the analysis of the first paper because it is

  1. simpler and
  2. the echinacea data set is in rawer form, about the form a typical user would input data into R, whereas the others are more processed, steps we want to illustrate not skip.

Looking at a Data Set

shows the variable names of the variables in the dataset and also their types (all are integer except for one that is factor. The on-line help describes these variables.

Reading in a Data Set

These data are provided in the source for the R package as a plain text file named "echinacea.txt" in which columns are variables and each column is headed by the variable name.

If this file were your data, you could read it, assuming this file was in the current working directory under linux or the directory set as the working directory using the menu items for the R GUI app on OS X or Windows, as follows

echinacea <- read.table("echinacea.txt", header = TRUE)

Reshaping the Data

The on-line help for the aster function shows an analysis using these data in the examples section.

The first thing we need to do is reshape the data.

  1. In the original form we have many response variables.
  2. In order to analyze the data all of these variables must be strung out into one vector.
  3. Moreover, each predictor variable must be duplicated so that each component of the new response vector corresponds to each of its predictors.
The R function reshape (on-line help) does this job. It is not particularly easy to understand but one can usually just follow this one example.

The first part (before the comment) does the reshaping. The rest looks as what we have done.

Specifying an Aster Model

In order to specify an aster model, we need to specify the graph of the graphical model and the families, that is the conditional distributions that correspond to the arrows of the model.

For these data, the graph is Figure 1.1 of TR 644 (p. 6), but the variable names are different. In that TR we used single letter names with subscripts (like math) and in R we use multicharacter variable names (like computer languages). The correspondence is

math name computer name
M2, …, M4 ld02, …, ld04
F2, …, F4 fl02, …, fl04
H2, …, H4 hdct02, …, hdct04

The conditional distributions are all Bernoulli (binomial with sample size one) except for the arrows from fl02 to hdct02 and similarly with 2 replaced by 3 and 4 (three arrows).

The following code sets up this graphical model (again copied from the on-line help for the aster function)

Fitting Aster Models

Now that we are all set up, we are ready to fit some models.

If we were using R these variables would be defined and ready for use.

Because we are using Rweb, which doesn't remember anything between submissions we will have to put everything above in each Rweb form (set up everything each time).

Rather than fit the models in the on-line help for the aster function, we will fit the models in the Biometrika paper, which are explained in the paper and Chapter 1 of TR 644 (which is the first draft of the paper), and all of the model fitting is shown in all gory detail in and Appendix D of TR 644.

Little Model

We start with the model with formula

resp ~ varb + level:(nsloc + ewloc)

The variables varb, nsloc and ewloc are already in redata, the variable level is defined as follows.

So level indicates the layer of the graph, or the kind of variable (mortality indicator, flowering indicator, head count indicator) of each component of the response.

which includes the following.

The parentheses in the interaction just group. The term level:(nsloc + ewloc) indicates the same thing as the two terms level:nsloc + level:ewloc.

The interaction of a factor and a numerical variable (like these) is the same thing as saying the linear predictor depends linearly on nsloc and ewloc and the linear relationship has a different slope in each layer (and also a different intercept because varb is in the model).

Enough preliminaries, ready to fit a model!

From looking at the signif stars it doesn't look like all these parameters are necessary, but that is a bogus approach to model selection. A careful investigation detailed in Appendix D of TR 644 shows that this is the most reasonable model that does not include pop as a predictor, and hence the sensible smallest model under discussion.

Middle Model

We continue with the model specified in the paper and tech report by formula

resp ~ varb + level:(nsloc + ewloc) + hdct * pop - pop

but is perhaps more simply specified by the formula

resp ~ varb + level:(nsloc + ewloc) + hdct:pop

which involves another new variable hdct the indicator of the head count level created as follows.

Let's try it.

The variable pop is scientifically important. It is the prairie remnant from which the individual came. All of the individuals were grown in a common garden, but grown from seeds that came from different remnants. The question of interest is whether population of origin has an effect on fitness. Because we are fitting unconditional aster models (the default), we want to put pop in the model only at the bottom layer of the graphical model, at the head count nodes, the sum of which is observed fitness (pedantically, the best available surrogate or proxy for fitness). The best explanation of this issue is Section 3.2 of TR 669.

Tests of Model Comparison

Again from looking at the signif stars it doesn't seem there is much going on with these populations, but that is not the right way to do a test of model comparison.

This is the right way.

The anova command shows the larger model fits considerably better than the smaller one. Hence population has a statistically significant effect on fitness (P = 0.01).

Confidence Intervals for Means

We also want to look at mean fitness in the various populations.

We want mean fitness for one typical individual in each population. We take a typical individual to be one having nsloc and ewloc equal to zero, which is in the center of the garden.

To predict for these individuals that don't exist, we create a new data frame that gives the data for them. We create response data for them, although this is not used in the prediction. This response data is checked for having possible values, so we make every value equal to one (which is possible for the distributions we are using). We start with a data in wide format and reshape it, just like we did for the real data.

We have new data for seven individuals, one for each remnant population.

Now we have a problem that we don't want to predict the mean for one response variable, but the sum of three. Fortunately, this problem has been forseen. The predict function for aster models, allows this. (The predict functions for linear and generalized linear models should also do this, but don't.)

Unfortunately, we stumble here over one of the really bad design decisions of the programmer (Geyer) of the aster software. In theory, what we want is if ζ is the vector of predicted thingummies (conditional or unconditional means or conditional or unconditional canonical parameters) and A is a given matrix, then we want a prediction of the linear function A ζ.

For some reason that sort of made sense until thought about deeply, the matrix A must be supplied as a three-way array.