Setting Up an Aster Model
Data Sets in the Aster Package
The example data sets in the aster contributed package for R are shown in the aster package index, which as of version 0.7-4 listed the following data sets.
data set | description |
---|---|
aphid | Life History Data on Uroleucon rudbeckiae |
chamae | Life History Data on Chamaecrista fasciculata |
chamae2 | Life History Data on Chamaecrista fasciculata |
echin2 | Life History Data on Echinacea angustifolia |
echinacea | Life History Data on Echinacea angustifolia |
The echinacea data set was analyzed in the paper Geyer, Wagenius, and Shaw (2007), and complete details of the analysis are shown in Appendix D of University of Minnesota School of Statistics Technical Report 644.
All of the other data sets (aphid, chamae, chamae2, and echin2) were analyzed in the paper Shaw, Geyer, Wagenius, Hangelbroek, and Etterson (2008), and complete details of the analysis are shown in University of Minnesota School of Statistics Technical Reports 658, 661, and 666.
Here we will follow the analysis of the first paper because
- it is simpler, and
- the echinacea data set is in rawer form, about the form in which a typical user would enter data into R, whereas the others are more processed; the processing steps are ones we want to illustrate, not skip.
Looking at a Data Set
This shows the names of the variables in the data set and also their types (all are integer except for one, which is a factor).
The on-line help describes these variables.
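A quick way to look, assuming the aster package is installed (using sapply over the columns is one of several ways to list the types):

```r
library(aster)            # contributed package containing the data
data(echinacea)           # load the example data set
names(echinacea)          # variable names
sapply(echinacea, class)  # types: "integer" for all but one "factor"
```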
Reading in a Data Set
These data are provided in the source for the R package
as a plain text file
named "echinacea.txt"
in which columns are variables and each column is headed by the variable name.
If this file were your data, you could read it as follows, assuming the file is in the current working directory (under Linux) or in the directory set as the working directory using the menu items of the R GUI app (on OS X or Windows).
echinacea <- read.table("echinacea.txt", header = TRUE)
Reshaping the Data
The on-line
help for the aster
function shows an analysis using these
data in the examples section.
The first thing we need to do is reshape
the data.
- In the original form we have many response variables.
- In order to analyze the data all of these variables must be strung out into one vector.
- Moreover, each predictor variable must be duplicated so that each component of the new response vector corresponds to each of its predictors.
The reshape function (on-line help) does this job. It is not particularly easy to understand, but one can usually just follow this one example. The first part (before the comment) does the reshaping. The rest looks at what we have done.
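The reshaping can be sketched as follows, modeled on the example in the on-line help for the aster function; the vector vars of response variable names is an assumption matching the graph described in the next section:

```r
vars <- c("ld02", "ld03", "ld04", "fl02", "fl03", "fl04",
          "hdct02", "hdct03", "hdct04")
redata <- reshape(echinacea, varying = list(vars), direction = "long",
                  timevar = "varb", times = as.factor(vars), v.names = "resp")
# add the root variable, all components equal to one
redata <- data.frame(redata, root = 1)
# look at what we have done: a crude check that the lengths are right
nrow(redata) == nrow(echinacea) * length(vars)
```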
- The data frame redata has fewer variables than the original data frame echinacea, but each is a longer vector.
- All of the response variables in echinacea are gone. They have been incorporated in the single vector resp in redata. A crude check of the latter is that the lengths are right: nrow(redata) is equal to nrow(echinacea) * length(vars).
- Besides resp, there are three more variables in redata that are not in echinacea.
  - The variable varb is a factor vector which indicates, for each component of resp, which original variable (which column of echinacea) the response data came from.
  - The variable id is a numeric vector which indicates, for each component of resp, which original individual (which row of echinacea) the response data came from.
  - The variable root, which we explicitly created, has all components equal to one. The purpose of root will be explained in the next section.
Specifying an Aster Model
In order to specify an aster model, we need to specify the graph of the
graphical model and the families
, that is the conditional distributions
that correspond to the arrows of the model.
For these data, the graph is Figure 1.1 of TR 644 (p. 6), but the variable names are different. In that TR we used single-letter names with subscripts (as in math), whereas in R we use multicharacter variable names (as in computer languages). The correspondence is
math name | computer name |
---|---|
M2, …, M4 | ld02, …, ld04 |
F2, …, F4 | fl02, …, fl04 |
H2, …, H4 | hdct02, …, hdct04 |
The conditional distributions are all Bernoulli (binomial with sample size one) except for the arrows from fl02 to hdct02, and similarly with 2 replaced by 3 and 4 (three arrows in all), which are zero-truncated Poisson.
The following code sets up this graphical model (again copied from the on-line help for the aster function).
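A sketch of the setup, following the example in the aster help page (vars is defined here so the snippet is self-contained; in the package's default family coding, 1 is Bernoulli and 3 is zero-truncated Poisson):

```r
vars <- c("ld02", "ld03", "ld04", "fl02", "fl03", "fl04",
          "hdct02", "hdct03", "hdct04")
# pred[j] gives the index in vars of the predecessor of vars[j],
# with 0 meaning the predecessor is the root
pred <- c(0, 1, 2, 1, 2, 3, 4, 5, 6)
# fam[j] gives the family code for the arrow into vars[j]
fam <- c(1, 1, 1, 1, 1, 1, 3, 3, 3)
```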
- The fam vector says that all the ld0x and fl0x variables are conditionally Bernoulli given their predecessors. This is because the corresponding element of fam is 1, which is the default code for Bernoulli. (This default can be changed, if necessary.) The hdct0x variables are conditionally zero-truncated Poisson given their predecessors, because the corresponding element of fam is 3, the default code for zero-truncated Poisson.
- The predecessor of the first element of vars, which is ld02, is root, because zero in the pred vector is the code for root, and the corresponding root value is the constant 1 (because we set all elements of the root variable in our data frame to be 1); hence this is really an unconditional distribution, because it is constant.
- The predecessor of the second element of vars, which is ld03, is the first element of vars, which is ld02, because the second element of the pred vector is 1, indicating the first element of the vars vector.
- And so forth.
Fitting Aster Models
Now that we are all set up, we are ready to fit some models.
If we were using R, these variables would be defined and ready for use. Because we are using Rweb, which doesn't remember anything between submissions, we have to put everything above in each Rweb form (set up everything each time).
Rather than fit the models in the on-line help for the aster function, we will fit the models in the Biometrika paper. These are explained in the paper and in Chapter 1 of TR 644 (which is the first draft of the paper), and all of the model fitting is shown in gory detail in Appendix D of TR 644.
Little Model
We start with the model with formula
resp ~ varb + level:(nsloc + ewloc)
The variables varb, nsloc, and ewloc are already in redata; the variable level is defined as follows.
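One way to define it, a sketch assuming the layer can be read off by stripping the trailing digits from varb (so ld02 and ld03 both map to ld, and so forth):

```r
# strip digits from the node labels, leaving the layer: ld, fl, or hdct
level <- gsub("[0-9]", "", as.character(redata$varb))
redata <- data.frame(redata, level = as.factor(level))
```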
So level indicates the layer of the graph, or the kind of variable (mortality indicator, flowering indicator, head count) of each component of the response. The model formula above includes the following terms.
- varb, which is a factor that indicates which original variable (which node of the graph) an element of the response corresponds to. This is tantamount to having a different intercept for each node of the graph, which makes sense because they are different kinds of variables having nothing to do with each other (except for being measurements on the same individual).
- level:(nsloc + ewloc), which is the interaction of
  - level, which is a factor that indicates which layer of the graph an element of the response corresponds to,
  - nsloc, which is the north-south location of the individual, and
  - ewloc, which is the east-west location of the individual.
The term level:(nsloc + ewloc)
indicates the same
thing as the two terms
level:nsloc + level:ewloc
.
The interaction of a factor and a numerical variable (like these)
is the same thing as saying the linear predictor depends linearly on
nsloc
and ewloc
and the linear relationship
has a different slope in each layer
(and also a different intercept
because varb
is in the model).
Enough preliminaries, ready to fit a model!
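A sketch of the fit; the object name aout1 is our choice, and the call follows the aster function's argument order aster(formula, pred, fam, varvar, idvar, root, data):

```r
aout1 <- aster(resp ~ varb + level:(nsloc + ewloc),
               pred, fam, varb, id, root, data = redata)
summary(aout1, show.graph = TRUE)
```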
From looking at the signif stars
it doesn't look like all these
parameters are necessary, but that is a bogus approach to model selection.
A careful investigation detailed in
Appendix D of
TR 644 shows that this is the most reasonable model
that does not include pop
as a predictor, and hence the
sensible smallest model under discussion.
Middle Model
We continue with the model specified in the paper and tech report by formula
resp ~ varb + level:(nsloc + ewloc) + hdct * pop - pop
but it is perhaps more simply specified by the formula
resp ~ varb + level:(nsloc + ewloc) + hdct:pop
which involves another new variable hdct, the indicator of the head count level, created as follows.
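One way to create it (a sketch; it flags the components of the response that lie in the head count layer):

```r
# indicator: 1 for components of resp at the hdct nodes, 0 elsewhere
hdct <- as.integer(grepl("hdct", as.character(redata$varb)))
redata <- data.frame(redata, hdct = hdct)
```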
Let's try it.
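A sketch of the fit, assuming the setup above (aout2 is our name for the fitted object):

```r
aout2 <- aster(resp ~ varb + level:(nsloc + ewloc) + hdct:pop,
               pred, fam, varb, id, root, data = redata)
summary(aout2)
```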
The variable pop
is scientifically important. It is the
prairie remnant from which the individual came. All of the individuals
were grown in a common garden, but grown from seeds that came from different
remnants. The question of interest is whether population of origin has an
effect on fitness. Because we are fitting unconditional aster models
(the default), we want to put pop
in the model only at the
bottom layer of the graphical model, at the head count nodes, the sum of
which is observed fitness (pedantically, the best available surrogate
or proxy for fitness). The best explanation of this issue is
Section 3.2 of
TR 669.
Tests of Model Comparison
Again from looking at the signif stars
it doesn't seem there is
much going on with these populations, but that is not the right way to
do a test of model comparison.
This is the right way.
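The right way is a likelihood ratio test comparing the two fitted models, assuming they are stored in aout1 and aout2 as above:

```r
# likelihood ratio test: smaller model first, then larger model
anova(aout1, aout2)
```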
The anova
command shows the larger model fits considerably
better than the smaller one. Hence population has a statistically
significant effect on fitness (P = 0.01).
Confidence Intervals for Means
We also want to look at mean fitness in the various populations.
We want mean fitness for one typical
individual in each population.
We take a typical
individual to be one having nsloc
and ewloc
equal to zero, which is in the center of the garden.
To predict for these individuals, which don't exist, we create a new data frame that gives the data for them. We create response data for them, although it is not used in the prediction. This response data is checked for having possible values, so we make every value equal to one (which is a possible value for all the distributions we are using). We start with data in wide format and reshape it, just as we did for the real data.
We have new data for seven individuals, one for each remnant population.
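A sketch of this construction; it assumes pop is the population factor in echinacea and that vars (defined above) lists the nodes of the graph:

```r
# one hypothetical individual per remnant population, at the garden center
newdata <- data.frame(pop = levels(echinacea$pop))
for (v in vars) newdata[[v]] <- 1  # dummy responses, all possible values
newdata$root <- 1
newdata$ewloc <- 0
newdata$nsloc <- 0
# reshape to long format, just as for the real data
renewdata <- reshape(newdata, varying = list(vars), direction = "long",
                     timevar = "varb", times = as.factor(vars),
                     v.names = "resp")
# the model formula also needs level and hdct
renewdata <- data.frame(renewdata,
    level = as.factor(gsub("[0-9]", "", as.character(renewdata$varb))),
    hdct = as.integer(grepl("hdct", as.character(renewdata$varb))))
```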
Now we have a problem: we don't want to predict the mean for one response variable, but for the sum of three. Fortunately, this problem has been foreseen. The predict function for aster models allows this. (The predict functions for linear and generalized linear models should also do this, but don't.)
Unfortunately, we stumble here over one of the really bad design decisions of the programmer (Geyer) of the aster software. In theory, what we want is if ζ is the vector of predicted thingummies (conditional or unconditional means or conditional or unconditional canonical parameters) and A is a given matrix, then we want a prediction of the linear function A ζ.
For some reason that sort of made sense until thought about deeply, the matrix A must be supplied as a three-way array.
- The first index runs over what would be the row index if A were an actual matrix; that is, it runs over the things being predicted, here fitness for the seven individuals, one representing each population.
- The second and third indices run over what would be the column index if A were an actual matrix; that is, they run over elements of the response vector.
  - The second index runs over individuals.
  - The third index runs over nodes of the graph, labeled by components of vars.
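A sketch of setting up the array and calling predict; it assumes the fit aout2 and the new data renewdata from above, and that the array is passed through the amat argument with se.fit = TRUE to get standard errors:

```r
nind <- nrow(newdata)   # seven individuals, one per population
nnode <- length(vars)   # nine nodes of the graph
amat <- array(0, c(nind, nind, nnode))
# predicted fitness for individual k is the sum of its head count nodes
for (k in 1:nind)
    amat[k, k, grep("hdct", vars)] <- 1
pout <- predict(aout2, varvar = varb, idvar = id, root = root,
                newdata = renewdata, se.fit = TRUE, amat = amat)
pout$fit     # estimated mean fitness, one per population
pout$se.fit  # corresponding standard errors
```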