\documentclass[11pt,twoside,notitlepage]{article}
\usepackage{indentfirst}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{natbib}
\usepackage{url}
\usepackage{graphicx}
\usepackage[utf8]{inputenc}
% remove room for headers, add to textheight
\addtolength{\textheight}{\headheight}
\addtolength{\textheight}{\headsep}
\setlength{\headheight}{0 pt}
\setlength{\headsep}{0 pt}
% adjust so right and left hand pages have different margins
% not sure why these particular numbers were picked or if they
% still make sense
% \showthe\evensidemargin
% \showthe\oddsidemargin
% \showthe\textwidth
% \evensidemargin 15 pt
% \oddsidemargin 61.5 pt
% \textwidth 392.5 pt
\evensidemargin 28.75 pt
\oddsidemargin 75.25 pt
\textwidth 365 pt
%%%%% NEW go with 1.25 inch margin on all sides
\setlength{\textheight}{\paperheight}
\addtolength{\textheight}{- 2 in}
\setlength{\topmargin}{0.25 pt}
\setlength{\headheight}{0 pt}
\setlength{\headsep}{0 pt}
\addtolength{\textheight}{- \topmargin}
\addtolength{\textheight}{- \topmargin}
\addtolength{\textheight}{- \footskip}
\setlength{\oddsidemargin}{0.25 in}
\setlength{\evensidemargin}{0.25 in}
\addtolength{\textwidth}{- \oddsidemargin}
\addtolength{\textwidth}{- \evensidemargin}
\begin{document}
\vspace*{0.9375in}
\begin{center}
{\bfseries More Supporting Data Analysis for \\
``Unifying Life History Analysis for Inference \\
of Fitness and Population Growth''} \\
By \\
Ruth G. Shaw, Charles J. Geyer, Stuart Wagenius, \\
Helen H. Hangelbroek, and Julie R. Etterson \\
Technical Report No.~661 \\
School of Statistics \\
University of Minnesota \\
% April 20, 2005 \\
\today
\end{center}
\thispagestyle{empty}
\cleardoublepage
\setcounter{page}{1}
\thispagestyle{empty}
\begin{abstract}
This technical report (TR) gives details of a data reanalysis backing
up a paper having the same authors as this TR
and having the title that is quoted in the title of this TR. This
reanalysis was not in the first submission of the paper, which instead
had analyses given in Chapters~3 and~4 of TR~658. This analysis is for
the second submission (to the same journal, \emph{American Naturalist})
of that paper.
Unlike the first analyses, these reanalyses directly estimate the fitness
landscape rather than quantities related to it.
The two analyses are also much more alike than the two analyses for the
first submission. Both estimate exactly the same quantities, although
one has to work harder to do so.
In an unrelated issue, we also give an example of subsampling a component
of fitness and its affect on parameter estimates. This issue was mentioned
in the first draft of the paper, but this is the first worked example
illustrating this method.
\end{abstract}
\thispagestyle{empty}
\cleardoublepage
\setcounter{page}{1}
<>=
options(keep.source = TRUE, width = 70)
ps.options(pointsize = 15)
@
\section{Creating this Document}
This document is created from its source file \texttt{tr661.Rnw} using
the R \texttt{Sweave} command and the \LaTeX\ document preparation system.
First do
\begin{verbatim}
Sweave("tr661.Rnw")
\end{verbatim}
if you have downloaded the file, or do
\begin{verbatim}
Sweave(url("http://www.stat.umn.edu/geyer/aster/tr661/tr661.Rnw"))
\end{verbatim}
otherwise.
This step takes an hour and a half on a fairly fast computer because
of the Monte Carlo calculation in Section~\ref{sec:land-sim},
and this step needs to be redone until the statements
\texttt{print(ok)} on pages \pageref{pg:ok2}
and \pageref{pg:ok1} print \texttt{TRUE}.
Then process the output, \texttt{tr661.tex} and several files with
suffixes \texttt{pdf} and \texttt{eps} in the usual fashion (which
depends on your system and installation).
\section{Introduction}
The analysis presented in this technical report is one more attempt
to do full justice to the \emph{Chamaecrista} data described below.
As the experiment was designed there
were multiple components of fitness. For each plant that survived
to that stage, fruits were counted (\texttt{fruit}) and then a random
sample of fruits of size 3 was taken and the seeds in those fruits counted
(\texttt{seed}). This experimental design does not fit aster models perfectly
(not the fault of the experimenters because the experiment was done before
aster models were described). It would have been better if seeds were
counted for all fruits or for a fraction $p$ of fruits.
Nevertheless, we do what we can. Using a Monte Carlo calculation we can
still estimate the fitness surface that corresponds to any aster model
we decide fits the data. We can use the parametric bootstrap to
carry out statistical tests or confidence intervals, although these no
longer have a simple relationship to the parameters of the fitted aster
model (as they would if the experimental design had been more favorable
to aster analysis).
In Section~\ref{sec:both} we perform an aster analysis in which
both components of fitness, \texttt{fruit} and \texttt{seed} are used,
and fitness is deemed to be \verb@fruit * seed / 3@. The multiplication
in this definition complicates estimation of expected fitness. The
aster software can calculate the expectation of any linear combination
of components of fitness, but it cannot calculate expectations of nonlinear
functions of components of fitness. Fortunately, expectations that cannot
be calculated exactly can be approximated by Monte Carlo. This takes time
but is not otherwise problematic.
In Section~\ref{sec:single} we perform an aster analysis in which
\texttt{fruit} is deemed fitness. This illustrates the typical situation
in which a linear combination of fitness components is deemed fitness
and no Monte Carlo calculation is needed.
\section{Analysis involving Both Components of Fitness} \label{sec:both}
\subsection{Data}
We reanalyze a subset of the data analyzed by \citet{es}.
These data are in the \texttt{chamae} dataset in the \texttt{aster}
contributed package to the R statistical computing environment \citep{rcore}.
Individuals of \emph{Chamaecrista fasciculata} (common name, partridge pea)
were obtained from three locations in the country and planted in three field
sites. Of the complete data we only reanalyze here individuals
planted in one field site (Minnesota).
These data are already in ``long'' format, no need to use the \texttt{reshape}
function on them to do aster analysis. We will, however, need the
``wide'' format for Lande-Arnold analysis \citep{la}. So we do that, before
making any changes (we will add newly defined variables) to \texttt{chamae}.
<>=
library(aster)
data(chamae)
chamaew <- reshape(chamae, direction = "wide", timevar = "varb",
v.names = "resp", varying = list(levels(chamae$varb)))
names(chamaew)
@
For each individual, many characteristics were measured, three of which we
consider phenotypic characters (so our $z$ is three-dimensional), and others
which combine to make up an estimate of fitness.
The three phenotypic characters are reproductive stage (\verb@STG1N@),
log leaf number (\verb@LOGLVS@), and log leaf thickness (\verb@LOGSLA@).
``At the natural end of the growing season, [they] recorded total pod number
and seed counts from three representative pods; from these measures, [they]
estimated [fitness]'' \citep[further explained in their note 12]{es}.
% There are several complications in the estimation of fitness.
% Some fruits (pods) had already dehisced by the time data were collected,
% so seeds could not be counted. The number of dehisced fruits are not recorded
% in the data we are working with (although they could be reconstructed from
% the original data). Some individuals had fewer than three (non-dehisced)
% fruits to count.
% An aster model does not allow missing data as opposed to structural zeros,
% which are data that are necessarily zero because other data are zero, for
% example, that dead individuals have no fruits and individuals that have
% zero fruits also have zero seeds. Structural zeros are allowed, but true
% missing data, random variables that are part of the statistical model and
% that can have multiple possible values given the observed data, are not.
% We know how to handle missing data in theory \citep{g,sg}, but this would
% require Monte Carlo likelihood approximation (MCLA), a complication we
% do not wish to introduce here, for which no computer implementation
% currently exists (for aster models, MCLA has been implemented in many
% other contexts).
Although aster model theory in the published version of
\citet{gws} does allow conditionally multinomial response variables,
versions of the \texttt{aster} package up through 0.7-2, the current
version at the time this was written, do not.
Multinomial response, if we could use it, would allow us to deal
individuals having seeds counted from 0, 1, 2, or 3 fruits.
% To avoid the missing data issue, we ignore dehisced fruits, treating
% them for the purposes of this example has having no seeds.
% In contrast, \citet{es} imputed fitness in certain cases.
To avoid multinomial response, we remove individuals with
seeds counted for only one or two fruits (there were only four such).
\begin{figure}
\begin{center}
\setlength{\unitlength}{0.4 in}
\begin{picture}(2.65,2.15)(-2.25,-2.10)
\put(0,0){\makebox(0,0){$1$}}
\put(-1,-1){\makebox(0,0){\ttfamily fecund}}
\put(-2,-2){\makebox(0,0){\ttfamily seed}}
\put(0,-2){\makebox(0,0){\ttfamily fruit}}
\multiput(-0.25,-0.25)(-1,-1){2}{\vector(-1,-1){0.5}}
\multiput(-0.75,-1.25)(1,-1){1}{\vector(1,-1){0.5}}
\end{picture}
\end{center}
\caption{Graph for \emph{Chamaecrista} Aster Data.
Arrows go from parent nodes to child nodes.
Nodes are labeled by their associated variables.
The only root node is associated with the constant variable $1$.
\texttt{fecund} is Bernoulli (zero indicates no seeds, one indicates
nonzero seeds). If \texttt{fecund} is zero, then so are the other variables.
If \texttt{fecund} is nonzero, then \texttt{fruit} (fruit count)
and \texttt{seed} (seed count) are conditionally independent,
\texttt{fruit} has a two-truncated negative binomial distribution,
and \texttt{seed} has a zero-truncated negative binomial distribution.}
\label{fig:graph:chamae}
\end{figure}
Figure~\ref{fig:graph:chamae} shows the graph of the aster model we use
for these
data. Fruit count (\texttt{fruit}) and seed count (\texttt{seed}) are
dependent only in that if one is zero, then so is the other (we only model
fruit count for individuals who have seeds, because fruit count for other
individuals is irrelevant). Given that neither is zero (when
\verb@fecund == 1@), they are conditionally independent.
Given that fruit count is nonzero, it is at least three
(by our data modifications).
The conditional distribution of \texttt{seed} given that it is nonzero
is what is called zero-truncated negative binomial,
which is negative binomial conditioned on being greater than zero.
By analogy we call the conditional distribution of \texttt{fruit} given
that it is nonzero, two-truncated negative binomial,
which is negative binomial conditioned on being greater than two.
\subsection{Aster Analysis}
We need to choose the non-exponential-family parameters (sizes)
for the negative binomial distributions, since the \verb@aster@ package
only does maximum likelihood for exponential family parameters.
We start with the following values, which were chosen with knowledge
of the maximum likelihood estimates for these parameters, which we find
in Section~\ref{sec:mle}. The values that are found then are written
out to a file and loaded here if the file exists, so after several
runs (of \texttt{Sweave}) we are reading in here the maximum likelihood
values of these non-exponential-family parameters.
<>=
options(show.error.messages = FALSE, warn = -1)
try(load("chamae-alpha.rda"))
options(show.error.messages = TRUE, warn = 0)
ok <- exists("alpha.fruit") && exists("alpha.seed")
if (! ok) {
alpha.fruit <- 3.0
alpha.seed <- 15.0
}
print(alpha.fruit)
print(alpha.seed)
@
Then we set up the aster model framework.
<>=
vars <- c("fecund", "fruit", "seed")
pred <- c(0,1,1)
famlist <- list(fam.bernoulli(), fam.poisson(),
fam.truncated.negative.binomial(size = alpha.seed, truncation = 0),
fam.truncated.negative.binomial(size = alpha.fruit, truncation = 2))
fam <- c(1,4,3)
@
We can now fit our first aster model.
<>=
out1 <- aster(resp ~ varb + BLK, pred, fam, varb, id, root,
data = chamae, famlist = famlist)
summary(out1, show.graph = TRUE)
@
The ``response'' \verb@resp@ is a numeric vector containing all the
response variables (\verb@fecund@, \verb@fruit@, and \verb@seed@).
The ``predictor'' \verb@varb@ is a factor with three levels distinguishing
with \verb@resp@ which original response variable an element is.
The predictor \verb@BLK@ has not been mentioned so far.
It is block within the field where the plants were grown.
% Delete as per Ruth criticism
% One might think we should use \verb@varb * BLK@ but this uses up
% too many parameters when we have not yet added the predictors of interest.
% <>=
% out1foo <- aster(resp ~ varb * BLK, pred, fam, varb, id, root,
% data = chamae, famlist = famlist)
% summary(out1foo)
% anova(out1, out1foo)
% @
% Despite the statistically significant improvement (based on the chi-square
% approximation to the log likelihood ratio,
% we do not want to
% use up all our degrees of freedom before we put the predictors of
% interest in the model.
Now we add phenotypic variables.
<>=
out2 <- aster(resp ~ varb + BLK + LOGLVS + LOGSLA + STG1N,
pred, fam, varb, id, root, data = chamae, famlist = famlist)
summary(out2)
@
One might think we should use \verb@varb * (LOGLVS + LOGSLA + STG1N)@
but it turns out
this is too many parameters and the Fisher information is ill conditioned,
as shown by the need to use the \verb@info.tol@ argument.
<>=
out2foo <- aster(resp ~ BLK + varb * (LOGLVS + LOGSLA + STG1N),
pred, fam, varb, id, root, data = chamae, famlist = famlist)
summary(out2foo, info.tol = 1e-11)
anova(out2, out2foo)
@
Despite the statistically significant improvement (based on the chi-square
approximation to the log likelihood ratio, which may not be valid with
such an ill-conditioned Fisher information), we do not adopt this model
(\verb@out2foo@) either.
Although we cannot afford 9 parameters (3 levels of \verb@varb@ times 3
predictor variables) for the interaction, we can afford 6,
only putting the phenotype variables in at level \verb@fruit@ and \verb@seed@.
Because we are fitting an unconditional aster model, the effects of these
terms are passed down to \verb@fecund@.
See the example in \citet{gws} for discussion of this phenomenon.
<>=
foo <- as.numeric(as.character(chamae$varb) == "fruit")
chamae$LOGLVSfr <- chamae$LOGLVS * foo
chamae$LOGSLAfr <- chamae$LOGSLA * foo
chamae$STG1Nfr <- chamae$STG1N * foo
foo <- as.numeric(as.character(chamae$varb) == "seed")
chamae$LOGLVSsd <- chamae$LOGLVS * foo
chamae$LOGSLAsd <- chamae$LOGSLA * foo
chamae$STG1Nsd <- chamae$STG1N * foo
out6 <- aster(resp ~ varb + BLK + LOGLVSfr + LOGSLAfr + STG1Nfr +
LOGLVSsd + LOGSLAsd + STG1Nsd, pred, fam, varb, id, root, data = chamae,
famlist = famlist)
summary(out6)
@
When we analyzed the Minnesota-Minnesota subset alone (the subset of
these data consisting of only the Minnesota population) the
there was no statistically significant effect of the phenotypic predictors
on seed count. In these data that effect is significant.
<>=
out5 <- aster(resp ~ varb + BLK + LOGLVSfr + LOGSLAfr + STG1Nfr,
pred, fam, varb, id, root, data = chamae, famlist = famlist)
summary(out5)
anova(out5, out6)
@
A similar test
<>=
out4 <- aster(resp ~ varb + BLK + LOGLVSsd + LOGSLAsd + STG1Nsd,
pred, fam, varb, id, root, data = chamae, famlist = famlist)
summary(out4)
anova(out4, out6)
@
shows that the effect of these variables on fruit is significant.
Now we consider quadratic terms. Since the variable \verb@STG1N@
has only a few values
<>=
sort(unique(chamae$STG1N))
tabulate(chamae$STG1N)
@
there is little sense adding terms quadratic in this variable.
The test
<>=
out7 <- aster(resp ~ varb + BLK + LOGLVSfr + LOGSLAfr + I(LOGLVSfr^2) +
I(LOGSLAfr^2) + I(LOGLVSfr * LOGSLAfr) + STG1Nfr + LOGLVSsd +
LOGSLAsd + STG1Nsd,
pred, fam, varb, id, root, data = chamae, famlist = famlist)
summary(out7, info.tol = 1e-9)
anova(out6, out7)
@
shows that there appears to be a quadratic effect on fruit.
The similar test
<>=
out8 <- aster(resp ~ varb + BLK + LOGLVSsd + LOGSLAsd + I(LOGLVSsd^2) +
I(LOGSLAsd^2) + I(LOGLVSsd * LOGSLAsd) + STG1Nsd + LOGLVSfr +
LOGSLAfr + STG1Nfr,
pred, fam, varb, id, root, data = chamae, famlist = famlist)
summary(out8, info.tol = 1e-9)
anova(out6, out8)
@
shows that there appears to also be a quadratic effect on seed.
And
<>=
out9 <- aster(resp ~ varb + BLK + LOGLVSfr + LOGSLAfr + I(LOGLVSfr^2) +
I(LOGSLAfr^2) + I(LOGLVSfr * LOGSLAfr) + STG1Nfr + LOGLVSsd + LOGSLAsd +
I(LOGLVSsd^2) + I(LOGSLAsd^2) + I(LOGLVSsd * LOGSLAsd) + STG1Nsd,
pred, fam, varb, id, root, data = chamae, famlist = famlist)
summary(out9, info.tol = 1e-9)
anova(out6, out7, out9)
anova(out6, out8, out9)
@
Shows that the model that is quadratic in the effects on both fruit and
seed is supported by the data. There is some question about these because
the Fisher information is close to singular (as evidenced by our need
to supply the \verb@info.tol@ argument to the \verb@summary@ command),
but we will go with \verb@out9@ as out ``best fitting'' model.
\subsection{Maximum Likelihood Estimation of Size} \label{sec:mle}
The \verb@aster@ function does not calculate the correct likelihood
when the size parameters are considered unknown, because it drops
terms that do not involve the exponential family parameters.
However, the full log likelihood is easily calculated in R.
<>=
x <- out9$x
logl <- function(alpha.fruit, alpha.seed, theta, x) {
x.fecund <- x[ , 1]
theta.fecund <- theta[ , 1]
p.fecund <- 1 / (1 + exp(- theta.fecund))
logl.fecund <- sum(dbinom(x.fecund, 1, p.fecund, log = TRUE))
x.fruit <- x[x.fecund == 1, 2]
theta.fruit <- theta[x.fecund == 1, 2]
p.fruit <- (- expm1(theta.fruit))
logl.fruit <- sum(dnbinom(x.fruit, size = alpha.fruit,
prob = p.fruit, log = TRUE) - pnbinom(2, size = alpha.fruit,
prob = p.fruit, lower.tail = FALSE, log = TRUE))
x.seed <- x[x.fecund == 1, 3]
theta.seed <- theta[x.fecund == 1, 3]
p.seed <- (- expm1(theta.seed))
logl.seed <- sum(dnbinom(x.seed, size = alpha.seed,
prob = p.seed, log = TRUE) - pnbinom(0, size = alpha.seed,
prob = p.seed, lower.tail = FALSE, log = TRUE))
logl.fecund + logl.fruit + logl.seed
}
@
We then calculate the profile likelihood for the two size parameters
(\verb@alpha.fruit@ and \verb@alpha.seed@), maximizing over the other
parameters. Evaluating the profile log likelihood on a grid of points.
We do not do this if the results would be the same as we got last time
and have stored in the variable \verb@logl.seq@.
\label{pg:ok2}
<>=
ok <- exists("alpha.fruit.save") && (alpha.fruit.save == alpha.fruit) &&
exists("alpha.seed.save") && (alpha.seed.save == alpha.seed) &&
exists("coef.save") && isTRUE(all.equal(coef.save, coefficients(out9)))
print(ok)
alpha.fruit.seq <- seq(1.5, 3.5, 0.25)
alpha.seed.seq <- seq(10, 30, 0.5)
if (! ok) {
logl.seq <- matrix(NA, nrow = length(alpha.fruit.seq),
ncol = length(alpha.seed.seq))
for (i in 1:length(alpha.fruit.seq)) {
for (j in 1:length(alpha.seed.seq)) {
famlist.seq <- famlist
famlist.seq[[3]] <- fam.truncated.negative.binomial(size =
alpha.seed.seq[j], truncation = 0)
famlist.seq[[4]] <- fam.truncated.negative.binomial(size =
alpha.fruit.seq[i], truncation = 2)
out9.seq <- aster(out9$formula, pred, fam, varb, id, root,
data = chamae, famlist = famlist.seq, parm = out9$coefficients)
theta.seq <- predict(out9.seq, model.type = "cond",
parm.type = "canon")
dim(theta.seq) <- dim(x)
logl.seq[i, j] <- logl(alpha.fruit.seq[i], alpha.seed.seq[j],
theta.seq, x)
}
}
}
##### interpolate #####
alpha.fruit.interp <- seq(min(alpha.fruit.seq), max(alpha.fruit.seq), 0.01)
alpha.seed.interp <- seq(min(alpha.seed.seq), max(alpha.seed.seq), 0.01)
logl.foo <- matrix(NA, nrow = length(alpha.fruit.interp),
ncol = length(alpha.seed.seq))
for (i in 1:length(alpha.seed.seq))
logl.foo[ , i] <- spline(alpha.fruit.seq, logl.seq[ , i],
n = length(alpha.fruit.interp))$y
logl.bar <- matrix(NA, nrow = length(alpha.fruit.interp),
ncol = length(alpha.seed.interp))
for (i in 1:length(alpha.fruit.interp))
logl.bar[i, ] <- spline(alpha.seed.seq, logl.foo[i, ],
n = length(alpha.seed.interp))$y
imax.fruit <- row(logl.bar)[logl.bar == max(logl.bar)]
imax.seed <- col(logl.bar)[logl.bar == max(logl.bar)]
alpha.fruit.save <- alpha.fruit
alpha.seed.save <- alpha.seed
alpha.fruit <- alpha.fruit.interp[imax.fruit]
alpha.seed <- alpha.seed.interp[imax.seed]
coef.save <- coefficients(out9)
##### save #####
if (! ok) {
save(alpha.fruit, alpha.seed, alpha.fruit.save, alpha.seed.save,
coef.save, logl.seq, file = "chamae-alpha.rda", ascii = TRUE)
}
@
At the end of this chunk we save the maximum likelihood estimates
in a file which is read in at the beginning of this document.
We also save some extra information so there is no need to do this
step every time if there is no change in the alphas.
Figure~\ref{fig:contour:too} (page~\pageref{fig:contour:too})
shows the profile log likelihood for the size parameters.
\begin{figure}
\begin{center}
<