R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet" Copyright (C) 2014 The R Foundation for Statistical Computing Platform: x86_64-unknown-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > > pdf(file = "gam-plots.pdf") > > # stealing stuff from Chapter 5 of > # Simon Wood (2006) > # Generalized Additive Models: An Introduction with R > # Chapman & Hall/CRC Texts, Boca Raton, FL. > # uses the R package mgcv which is one of the "recommended packages that > # comes with every R installation (no need for install.packages or for > # using the "packages" menu in the GUI's -- it is already installed). > > # note: in this package have multiple ways to specify smooths in formulas > # s(x), s(x, y), te(x, y), ti(x, y), see help for s, te, ti, t2, smooth.terms > > # Cherry trees, see help(trees) > > # using the default, thin-plate splines > > library(mgcv) Loading required package: nlme This is mgcv 1.8-3. For overview type 'help("mgcv-package")'. > data(trees) > out1 <- gam(Volume ~ s(Height) + s(Girth), + family = Gamma(link = log), data = trees) > print(out1) Family: Gamma Link function: log Formula: Volume ~ s(Height) + s(Girth) Estimated degrees of freedom: 1.00 2.42 total = 4.42 GCV score: 0.008082356 > summary(out1) Family: Gamma Link function: log Formula: Volume ~ s(Height) + s(Girth) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.27570 0.01492 219.6 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(Height) 1.000 1.000 31.32 3.92e-06 *** s(Girth) 2.422 3.044 219.28 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.973 Deviance explained = 97.8% GCV = 0.0080824 Scale est. = 0.006899 n = 31 > plot(out1, residuals = TRUE) > > # using cubic regression splines (number of knots fixed ???) > > out2 <- gam(Volume ~ s(Height, bs = "cr") + s(Girth, bs = "cr"), + family = Gamma(link = log), data = trees) > summary(out2) Family: Gamma Link function: log Formula: Volume ~ s(Height, bs = "cr") + s(Girth, bs = "cr") Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.27570 0.01492 219.6 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(Height) 1.000 1.000 31.32 3.91e-06 *** s(Girth) 2.419 3.035 219.50 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.973 Deviance explained = 97.8% GCV = 0.0080805 Scale est. = 0.006898 n = 31 > plot(out2, residuals = TRUE) > > predict(out1) - predict(out2) 1 2 3 4 5 1.581844e-04 5.209911e-05 2.180389e-06 4.110952e-05 3.667582e-05 6 7 8 9 10 3.044918e-05 -4.155989e-06 9.920721e-07 -1.355421e-05 -3.068825e-05 11 12 13 14 15 -3.267077e-05 -2.612674e-05 -2.612674e-05 1.915406e-05 3.066144e-05 16 17 18 19 20 -1.212661e-04 -1.151295e-04 -4.974142e-05 -1.663593e-05 -2.422028e-05 21 22 23 24 25 -1.919146e-05 -7.464358e-07 3.340554e-05 -2.581612e-05 3.197808e-05 26 27 28 29 30 1.444099e-04 7.843352e-05 -1.187383e-04 -1.634183e-04 -1.634183e-04 31 2.888061e-04 > # not a lot of difference > > # using 2-D thin plate splines > > out3 <- try(gam(Volume ~ s(Height, Girth), + family = Gamma(link = log), data = trees)) Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : A term has fewer unique covariate combinations than specified maximum degrees of freedom > # humpf, guess we have to follow the book > out3 <- try(gam(Volume ~ s(Height, Girth, k = 25), + family = Gamma(link = log), data = trees)) > # it works, but why do we need k = 25, what is it doing? > summary(out3) Family: Gamma Link function: log Formula: Volume ~ s(Height, Girth, k = 25) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.27583 0.01579 207.4 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(Height,Girth) 4.791 6.487 161.5 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.966 Deviance explained = 97.7% GCV = 0.0093576 Scale est. = 0.0077321 n = 31 > plot(out3, residuals = TRUE) > > # is this any better ? > anova(out1, out3) Analysis of Deviance Table Model 1: Volume ~ s(Height) + s(Girth) Model 2: Volume ~ s(Height, Girth, k = 25) Resid. Df Resid. Dev Df Deviance 1 26.578 0.18417 2 25.209 0.19183 1.3688 -0.0076599 > # apparently this is nonsense ????????? > anova(out3) Family: Gamma Link function: log Formula: Volume ~ s(Height, Girth, k = 25) Approximate significance of smooth terms: edf Ref.df F p-value s(Height,Girth) 4.791 6.487 161.5 <2e-16 > # no use > # read p. 195 of Wood book !!!!!!!!!!!!!!!!!!!!!!! > # would have to bootstrap to get P-value, but how ????? > # just getting a valid P-value is an open research question > > # still following book > out4 <- try(gam(Volume ~ te(Height, Girth, k = 5), + family = Gamma(link = log), data = trees)) > summary(out4) Family: Gamma Link function: log Formula: Volume ~ te(Height, Girth, k = 5) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.27584 0.01522 215.2 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value te(Height,Girth) 3 3 377.2 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.973 Deviance explained = 97.7% GCV = 0.0081971 Scale est. = 0.0071821 n = 31 > plot(out4, residuals = TRUE) > > # still following book, new data set, brain image > library(gamair) > data(brain) > brain <- brain[brain$medFPQ > 5e-3, ] > bout0 <- gam(medFPQ ~ s(X, Y, k = 100), data = brain) > summary(bout0) Family: gaussian Link function: identity Formula: medFPQ ~ s(X, Y, k = 100) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.25030 0.03029 41.27 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(X,Y) 86.78 96.45 8.547 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.337 Deviance explained = 37.4% GCV = 1.5207 Scale est. = 1.4353 n = 1564 > gam.check(bout0) Method: GCV Optimizer: magic Smoothing parameter selection converged after 7 iterations. The RMS GCV score gradiant at convergence was 5.235846e-06 . The Hessian was positive definite. The estimated model rank was 100 (maximum possible: 100) Model rank = 100 / 100 Basis dimension (k) checking results. Low p-value (k-index<1) may indicate that k is too low, especially if edf is close to k'. k' edf k-index p-value s(X,Y) 99.00 86.78 0.82 0 > # looks terrible so go with book. > bout1 <- gam(medFPQ ~ s(X, Y, k = 100), data = brain, + family = Gamma(link = log)) > summary(bout1) Family: Gamma Link function: log Formula: medFPQ ~ s(X, Y, k = 100) Parametric coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.12031 0.01922 6.258 5.07e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Ref.df F p-value s(X,Y) 60.61 77.82 4.798 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.307 Deviance explained = 26.4% GCV = 0.62169 Scale est. = 0.57802 n = 1564 > gam.check(bout1) Method: GCV Optimizer: outer newton full convergence after 3 iterations. Gradient range [1.056991e-09,1.056991e-09] (score 0.6216871 & scale 0.5780217). Hessian positive definite, eigenvalue range [0.006576209,0.006576209]. Model rank = 100 / 100 Basis dimension (k) checking results. Low p-value (k-index<1) may indicate that k is too low, especially if edf is close to k'. k' edf k-index p-value s(X,Y) 99.000 60.611 0.847 0 > > vis.gam(bout1, plot.type = "contour") > > > proc.time() user system elapsed 4.943 0.082 5.330