Which model, saturation or triplets, fits the data better? Since this is made-up data, simulated from a saturation model, we know that the saturation process should fit better. But is the difference between the two models large enough to detect statistically? And what statistical procedures do we use?
A standard method of testing nonnested hypotheses was proposed by
Cox (1961, 1962). Kent (1986) compares Cox's test with other procedures.
Here we follow a different notion also suggested by Cox (1962) and followed
up by Atkinson (1970) of embedding the two models in a supermodel that
contains both. This is particularly easy when both models are exponential
families. The supermodel is just the family having the vector canonical
statistic that is the union of the statistics for the two models. The
supermodel for the saturation and triplets models is the exponential family
with the four-dimensional canonical statistic
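t(x) = (n(x), s(x), w(x), u(x)). The saturation model is the submodel
obtained by setting the second and third components of the canonical
parameter (the coefficients of s(x) and w(x)) to zero, and the triplets
model is the submodel obtained by setting the fourth component (the
coefficient of u(x)) to zero.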
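As a minimal sketch of the embedding, the following Python fragment represents each submodel by the components of the combined canonical parameter that it leaves free and pads the rest with zeros. The names `STATS`, `SUBMODELS`, `embed`, and `log_unnormalized_density` are illustrative assumptions, not notation from the text; the only values taken from the text are the ordering of t(x) and the saturation MLE quoted in the next paragraph.

```python
# Order of the combined canonical statistic t(x) = (n(x), s(x), w(x), u(x)).
STATS = ("n", "s", "w", "u")

# Components each submodel leaves free; all others are fixed at zero.
# (Hypothetical bookkeeping for illustration only.)
SUBMODELS = {
    "saturation": ("n", "u"),
    "triplets": ("n", "s", "w"),
    "combined": STATS,
}


def embed(model, free_params):
    """Return the four-dimensional canonical parameter of a submodel."""
    free = SUBMODELS[model]
    if len(free_params) != len(free):
        raise ValueError(f"{model} model needs {len(free)} parameters")
    lookup = dict(zip(free, free_params))
    # Components not free in this submodel are set to zero.
    return tuple(lookup.get(name, 0.0) for name in STATS)


def log_unnormalized_density(theta, t):
    """Inner product of canonical parameter and canonical statistic."""
    return sum(th * ti for th, ti in zip(theta, t))


# Saturation MLE embedded in the combined model.
print(embed("saturation", (4.212, 0.3626)))  # (4.212, 0.0, 0.0, 0.3626)
```

Storing every model as a point in the same four-dimensional parameter space is what allows one sampler, the one for the combined process, to serve all three models.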
The first order of business is to find the MLE in the combined model. We start at the MLE in the saturation model, which is (4.212, 0, 0, 0.3626), and do two MCL iterations
giving three to four significant figures. For comparison, the parameter values for the three models are
As expected, the fit in the combined model is much closer to the fit in the saturation model than to the fit in the triplets model, when we make the comparison in the canonical parameter space.
We will test saturation versus combined and triplets versus combined. If one null hypothesis can be rejected and the other not, then we declare the null hypothesis that cannot be rejected to be the one that fits. If both null hypotheses can be rejected, then we declare that neither model fits. If neither null hypothesis can be rejected, then we declare that both models fit well, and there is no statistically significant difference between them. This is common garden-variety statistics in action, but lest the reader think this inference is easy, Figures 1.4 and 1.5 (following pages) invite an attempt at doing the same inference by eye. It's not an easy task without statistics.
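The decision rule just described can be written down in a few lines. This is only a sketch: the function name `compare_nonnested`, the wording of the returned strings, and the default significance level of 0.05 are assumptions made for illustration, since the text does not fix a level.

```python
def compare_nonnested(p_saturation, p_triplets, alpha=0.05):
    """Combine two tests of a submodel (null) against the combined model.

    p_saturation: P-value for saturation versus combined.
    p_triplets:   P-value for triplets versus combined.
    alpha:        significance level (0.05 is an assumed default).
    """
    reject_saturation = p_saturation < alpha
    reject_triplets = p_triplets < alpha
    if reject_saturation and reject_triplets:
        return "neither model fits"
    if reject_triplets:
        return "saturation model fits"
    if reject_saturation:
        return "triplets model fits"
    return "both models fit; no significant difference"


# Hypothetical P-values, not results from the text.
print(compare_nonnested(0.40, 0.001))  # saturation model fits
```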
Figure: Simulated point patterns from the saturation process and scatter
plot of the distribution of the canonical statistics n(x) and u(x).
Three of the point patterns are simulations from the maximum likelihood model;
the lower right pattern is the observed data (Figure 1.1).
Letters in the scatter plot mark the four patterns; D is the observed data.
Figure: Simulated point patterns from the triplets process and scatter
plot of the distribution of the canonical statistics n(x) and s(x).
Three of the point patterns are simulations from the maximum likelihood model;
the lower right pattern is the observed data (Figure 1.1).
Letters in the scatter plot mark the four patterns; D is the observed data.
We must now collect three new samples, one for each of these parameter
values, using the sampler for the combined process, in order to use reverse
logistic regression. We need new samples because we need the four-dimensional
canonical statistic output by the sampler for the combined process.
Collecting samples of size 10000 at spacing 200 and running reverse logistic
regression gives
for the log inverse
normalizing constants of the three distributions with estimated Monte Carlo
error variance
Now the log likelihood ratio for two models with parameters
and
and log normalizing constants
and
estimated
by reverse logistic regression is
,
and the MC error variance is estimated by the delta method using
(1.48) and
The deviance, twice the log likelihood ratio, has an asymptotic chi-square distribution if the usual asymptotics hold. Assuming they do hold, the results are
Again we find the result we expected. The saturation model fits the data well. Adding the other two canonical statistics to the model improves the fit no more than one expects whenever two parameters are added to a model. The triplets model does not fit.
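The calculation behind these tests can be sketched as follows. Several things here are assumptions rather than statements from the text: the function names; the sign convention, in which psi denotes the log of the normalizing constant so that the log likelihood is the inner product of t(x) and the parameter minus psi (with log inverse normalizing constants the sign of the psi term flips); and the treatment of the parameter values as fixed when computing the Monte Carlo variance. The degrees of freedom are the number of parameters added: two for saturation versus combined, one for triplets versus combined.

```python
import math


def log_likelihood_ratio(t_obs, theta_alt, theta_null, psi_alt, psi_null):
    """log L(theta_alt) - log L(theta_null) in an exponential family.

    Assumes psi = log of the normalizing constant; negate the psi
    arguments if log inverse normalizing constants are supplied.
    """
    inner = sum(t * (a - b) for t, a, b in zip(t_obs, theta_alt, theta_null))
    return inner - (psi_alt - psi_null)


def mc_variance_of_llr(var_alt, var_null, cov):
    """Monte Carlo variance of psi_alt - psi_null.

    The difference is linear in the estimates, so the delta method
    reduces to the usual variance-of-a-difference formula.
    """
    return var_alt + var_null - 2.0 * cov


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution for df = 1 or 2."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or 2 are needed here")


def deviance_test(llr, df):
    """Return (deviance, asymptotic P-value) for a log likelihood ratio."""
    deviance = 2.0 * llr
    return deviance, chi2_sf(deviance, df)


# Hypothetical log likelihood ratio, not a result from the text.
deviance, p_value = deviance_test(1.2, df=2)
print(deviance, p_value)
```

Because the deviance is twice the log likelihood ratio, its Monte Carlo variance is four times the value returned by `mc_variance_of_llr`, which indicates whether the Monte Carlo error is small enough not to affect the conclusion of the test.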