--- title: "Stat 3701 Lecture Notes: Matrices, Arrays, and Data Frames in R" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: html_document: number_sections: true mathjax: "https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML" pdf_document: number_sections: true --- # License This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). # R * The version of R used to make this document is `r getRversion()`. * The version of the `rmarkdown` package used to make this document is `r packageVersion("rmarkdown")`. * The version of the `Matrix` package used to make this document is `r packageVersion("Matrix")`. # Matrices The plural of *matrix* is *matrices* (it's Latin originally; this is like the plurals of *alumnus* and *alumna* being *alumni* and *alumnae*, respectively, or the plural of *datum* being *data*). R, like MATLAB, can be considered a matrix language. It understands matrices better than most other computer languages. It has syntax especially for matrices and a lot of functions that understand matrices. The recommended package `Matrix` (recommended packages come with every R installation) understands even more about matrices. Here is a matrix ```{r} fred <- matrix(1:6, nrow = 3, ncol = 2) fred ``` From this example we see one very important thing to know about R. In R matrices, the components are stored in FORTRAN order (first index goes the fastest, rather than C or C++ order). This is the order R stuffs a vector into a matrix (as we see above). If you don't want that order, there is an optional argument for that. ```{r} matrix(1:6, ncol = 2, byrow = TRUE) ``` (we omitted the argument `nrow` because R can figure out that it must be the length of the first argument divided by `ncol`). Also, as we have seen before, R uses one-origin indexing like FORTRAN rather than zero-origin indexing like C and C++. 
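To make the storage order concrete, here is a quick check (a sketch; one-index extraction, covered further below, walks the matrix in storage order):

```{r}
# fred is stored column by column (FORTRAN order), so the [i, j]
# component sits at position (j - 1) * nrow(fred) + i of the
# underlying vector
i <- 2
j <- 2
fred[i, j]
fred[(j - 1) * nrow(fred) + i]
as.vector(fred)  # the underlying vector, in storage order
```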
Storage order is important when you want to run a matrix through a function that doesn't respect matrices. ```{r} as.character(fred) ``` Oops! Lost our matrix information. ```{r} frank <- matrix(as.character(fred), nrow = nrow(fred), ncol = ncol(fred)) frank ``` But there is actually a better, less obvious way ```{r} storage.mode(fred) <- "character" fred ``` # Matrix Operations ## Multiplication ### Matrix times Matrix Matrix multiplication is different from ordinary multiplication. If $A$ and $B$ are matrices, then $C = A B$ is only defined if the column dimension of $A$ is the same as the row dimension of $B$, in which case if we denote the components of $A$ by $a_{i j}$ and similarly for $B$ and $C$ $$ c_{i k} = \sum_j a_{i j} b_{j k}, $$ where we deliberately leave the bounds of the summation on $j$ unspecified, the intention being that it goes over all possible values of $j$, which, since $j$ is both the column index of $A$ and the row index of $B$, run from one up to the column dimension of $A$, equivalently the row dimension of $B$ (these being the same, as we already said above). ### Non-Commutative The commutative law of multiplication says $x y = y x$ for all numbers $x$ and $y$. It does not apply to matrix multiplication. If $A$ and $B$ are matrices, then * $A B$ is only defined when the column dimension of $A$ is the same as the row dimension of $B$, and * $B A$ is only defined when the column dimension of $B$ is the same as the row dimension of $A$. So it may be that either, both, or neither of $A B$ and $B A$ are defined. And, even when both are defined, it may be that $A B \neq B A$. ### Matrix times Vector Vectors are considered as matrices with one column in order that multiplication of matrices and vectors make sense. This is a stupid mathematician's trick. Vectors are not really matrices.
We only consider them as one-column matrices (called *column vectors* to emphasize we are doing this trick) in order to be able to use our previous definition of matrix multiplication here too. ### Linear Functions Abstractly, matrix multiplication corresponds to vector-to-vector linear functions. If $A$ is a matrix with $r$ rows and $c$ columns, then the function $f$ defined by $$ f(x) = A x, \qquad \text{for all vectors $x$ of dimension $c$}, $$ maps vectors whose dimension is $c$ to vectors whose dimension is $r$. A function like this is called a *linear function* by mathematicians. ### Affine Functions Again suppose $A$ is a matrix with $r$ rows and $c$ columns. And suppose $b$ is a vector of dimension $r$. The function $g$ defined by $$ g(x) = b + A x, \qquad \text{for all vectors $x$ of dimension $c$}, $$ also maps vectors whose dimension is $c$ to vectors whose dimension is $r$. A function like this is called an *affine function* by mathematicians. Clearly every linear function is an affine function, but an affine function defined by the displayed equation in this section is a linear function if and only if $b = 0$. ### Mathematicians versus Everybody Else Everybody besides mathematicians, and even mathematics teachers teaching courses below the level of the course called *linear algebra* (which is not a prerequisite for this course), uses different terminology. * When mathematicians say *affine function* everybody else says *linear function*. * When mathematicians say *linear function* everybody else says *linear function* too. So everybody else does not distinguish between the two kinds of functions. If we specialize to scalar-to-scalar functions (so $A$ is a one-by-one matrix and $b$ is a vector of length one) we get $$ f(x) = a x $$ $$ g(x) = b + a x $$ and it is the latter that everybody calls *linear function* with *slope* $a$ and *intercept* $b$. And the former is just the case when the intercept is zero, and that gets no special terminology.
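In R code (a quick sketch; the names `a`, `b`, `f`, and `g` here are ours, just for illustration):

```{r}
# everybody else's "linear function" with slope a and intercept b
a <- 2
b <- 3
g <- function(x) b + a * x
g(0:4)
# the mathematician's linear function is the special case b = 0
f <- function(x) a * x
f(0:4)
```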
### Linear Models Except everybody else is sloppy. In the term *linear models*, the statistical models fit by the R function `lm`, the "linear" means the mathematician's linear rather than everybody else's linear. These are models where the mean vector of the response vector has the form $$ \mu = M \beta, $$ where $M$ is the so-called *model matrix* (a function of predictor data) and $\beta$ is the so-called *coefficients* vector. And this is a linear function in the mathematician's sense. ### Composition Matrix multiplication corresponds to the *composition* of linear (in the mathematician's sense) functions. If $$ f(x) = A x $$ $$ g(x) = B x $$ then the *composition* of $f$ and $g$, written $g \circ f$, is the function $h$ defined by $$ h(x) = g(f(x)) = B A x, $$ that is, * $A$ is the matrix that corresponds to the linear function $f$, * $B$ is the matrix that corresponds to the linear function $g$, and * $B A$ is the matrix that corresponds to the linear function $h = g \circ f$. Of course, the composition is only defined when the matrix multiplication is (when the column dimension of $B$ is the same as the row dimension of $A$). ### R After that long theoretical digression we can get back to computing. The R operator for matrix multiplication is `%*%`. ```{r error=TRUE} fred <- matrix(1:6, ncol = 2) sally <- matrix(1:8, nrow = 2) fred sally fred %*% sally sally %*% fred ``` If you check it out, you will find that the R results agree with the mathematical definition. R vectors are not matrices. ```{r} lou <- 1:2 is.matrix(lou) ``` but ```{r} fred %*% lou ``` works. If it bothers you that R is doing something tricky here, guessing that you want `lou` to be treated as a column vector (a one-column matrix) even though it really isn't, you can say ```{r} fred %*% cbind(lou) ``` instead (more on the R function `cbind` below). ## Addition If $A$ and $B$ are matrices having the same dimensions, then $A + B$ is defined by componentwise addition.
Here mathematics and R agree. ```{r} herman <- matrix(-(1:6), ncol = 2) herman herman + fred ``` But what does R do when the dimensions are not the same? ```{r error = TRUE} fred + sally ``` Oh! It does the right thing. R and math agree here, as they did for matrix multiplication. ## Scalar Multiplication This too is defined componentwise. ```{r} 3 * fred ``` ## Matrices are Really Vectors Mathematically, any thingummies that can be added and multiplied by scalars *are* vectors (provided that they satisfy the axioms for vector spaces, which are found near the beginning of any linear algebra book, but which we don't need to fuss about here or anywhere in this course). Of course, matrices are rather special vectors because, unlike other vectors, matrices can also be multiplied (sometimes, depending on dimensions). The collection of all square matrices (meaning row and column dimensions are the same) of the same dimension is called a *linear algebra*. It is the mathematical structure that gives courses called "linear algebra" their name. These objects can be added, multiplied by scalars, and also multiplied by each other. # Special Binary Operators We take a break from discussing matrices to discuss "special" binary operators. In R these are kind of like operator overloading in C++ but much easier to use, much easier to create, and much more powerful. In C++ you can only overload the operators that C++ already has. This leads to highly unintuitive overloadings like overloading the shift operators `>>` and `<<` to mean something in the I/O (input-output) context. In R you can define new binary operators: `%foo%`, where `foo` is replaced by any valid R name, is a new operator.
You define it this way ```{r} "%foo%" <- function(x, y) 42 # or whatever fred %foo% sally ``` The quotation marks in the function definition are necessary, because without them R would treat `%foo%` as a use of a binary operator, which would mean that it would have to have arguments on either side, and it doesn't. So without quotation marks this would be an error. There are a bunch of these built into R, of which we have already met one: `%*%`, the matrix multiplication operator. Here are all of them in the R core. | function | package | what it does | other functions | | ----- |----- | ----- | ----- | | `%/%` | Primitive | integer division | | | `%%` | Primitive | remainder from integer division | | | `%*%` | Primitive | matrix multiplication | `crossprod` and `tcrossprod` | | `%in%` | base | set-theoretic element of operator | `is.element` | | `%o%` | base | outer product of vectors | `outer` | | `%x%` | base | Kronecker product of matrices | `kronecker` | "Primitive" isn't an R package but something lower level. Primitives are the basic functions on which all other R functions are built. Their R definitions look like ```{r} get("%*%") ``` "Other functions" refers to other functions that do exactly the same thing or almost the same thing. ```{r} c("red", "white", "blue", "chartreuse", "puce", "aqua") %in% colors() is.element(c("red", "white", "blue", "chartreuse", "puce", "aqua"), colors()) ``` Since the other "other functions" involve matrices, we are back to talking about matrices. # More on Matrices ## Transpose The *transpose* of a matrix having components $a_{i j}$ is the matrix having components $a_{j i}$, that is, transposing a matrix turns rows into columns and vice versa. The R function that does this is called `t`. ```{r} fred t(fred) ``` The mathematical notation for the transpose of the matrix $A$ is $A^T$ or $A'$ (some people like one, some the other). ## Outer Product Now we can explain the outer product of vectors or matrices.
The outer product of $x$ and $y$ is $x y^T$ (in math) or `x %*% t(y)` (in R). This is always defined for numeric vectors, but may or may not be defined for matrices (it depends on their dimensions). ```{r} 1:3 %o% 1:5 outer(1:3, 1:5) 1:3 %*% t(1:5) ``` But, unlike `%o%`, the function `outer` has an optional argument that allows one to change the arithmetic operation. The outer product of the vector having components $x_i$ and the vector having components $y_j$ is the matrix having components $x_i y_j$. But the R function `outer` allows you to change the multiplication to any binary operation. ```{r error=TRUE} outer(1:3, 1:5, "+") outer(1:3, 1:5, max) ``` Ouch! That is a particularly user-hostile error message. Usually I can understand R error messages, but not that one. Looking at the examples for the help page for `outer` shows we need quotation marks. ```{r error=TRUE} outer(1:3, 1:5, "max") ``` Oops! That's not the issue. Try three. ```{r error=TRUE} outer(1:3, 1:5, "pmax") ``` Oh! The functions have to be vectorizing, so `pmax` instead of `max`. (See their help pages for the difference.) So the big difference between `%o%` and `outer` is that the latter is a higher-order function, and hence way more powerful. ## Kronecker Product We're not even going to explain this; it is beyond the scope of this course. ## Dimension R matrices, like other R objects, know what they are ```{r} is.matrix(fred) dim(fred) nrow(fred) ncol(fred) ``` So `nrow(fred)` is short for `dim(fred)[1]` and `ncol(fred)` is short for `dim(fred)[2]`. Rows and columns can also have names. ```{r} rownames(fred) <- c("red", "white", "blue") colnames(fred) <- c("car", "truck") fred ``` ## Indexing ### Two Indices The plural of *index* is *indices* (it's Latin originally; this is like *matrix* and *matrices*).
Since matrices have two indices they have two slots for subscripts ```{r} fred fred[1, 1] fred[1, ] fred[ , 1] fred[ , ] ``` But of course, all of these vectorize, and can be any of the four types of indexing that R understands (positive integers, negative integers, logical, or character). ```{r} fred[1:2, 1] fred[-1, -1] fred[1:nrow(fred) <= 2, 1] fred["red", "car"] ``` ### The Square Brackets Function is a Footgun The R function named `"["` is a footgun when applied to matrices (and arrays, which we haven't covered yet) because sometimes it produces a vector and sometimes a matrix (and sometimes an array), and sometimes you know what it is going to do and sometimes you don't. It is almost always wrong, when `fred` is a matrix whose dimensions you do not know and `i` and `j` are vectors specifying indices, to say `fred[i, j]` because you do not know whether it will return a vector or a matrix. If it doesn't matter whether the result is a vector or a matrix, then maybe you're OK. But how do you know the result doesn't matter? How do you know everything you may ever do with the result? Especially when you are writing functions that others may use, the square brackets function is a footgun. Two-thirds of the time or more, when I say `fred[i, j]` inside a function I am writing, it is a source of bugs. So you can do it if you want, but don't say I didn't warn you. The way to get a result that you can depend on is to say `fred[i, j, drop = FALSE]`, which does not "drop" the matrixness when the result has one of the dimensions equal to one. ```{r} fred[1:2, 1] fred[1:2, 1, drop = FALSE] ``` (the default value of `drop` is `TRUE`). Summary: Always use `drop = FALSE` when you are inside a function or whenever you are not *sure* it makes no difference in what you are doing. ### One Index You can always use one index with matrices. But it may or may not do what you want. ```{r} fred[1:4] ``` But sometimes it is useful.
The R functions `row` and `col` tell us the row and column indices of each position in a matrix. ```{r} row(fred) col(fred) ``` and they allow us to do rather complicated things. ```{r} alfie <- fred %*% sally alfie ``` Note that we have row names inherited from `fred` but no column names (they would have been inherited from `sally` if it had had any, but it didn't). ```{r} alfie[abs(row(alfie) - col(alfie)) > 1] <- 0 alfie ``` Here we use the square brackets function on the left-hand side of an assignment (so this is really using the function called `"[<-"`) and we are also using logical indexing. This could also have been done (TIMTOWTDI) ```{r} library(Matrix) alphie <- band(fred %*% sally, -1, 1) alphie ``` But I, being an old fart who has been using R since long before the `Matrix` package existed, prefer the former, even though the former is *not functional programming* and the latter is. Also they are not quite the same. ```{r} class(alfie) class(alphie) identical(alfie, as.matrix(alphie)) ``` ### One-Index Example: Hodges-Lehmann Estimator A *Hodges-Lehmann* estimator associated with a procedure for making confidence intervals is the point one gets by letting the confidence level go to zero. Sometimes this is already a well-known estimator. The Hodges-Lehmann estimator associated with $t$ confidence intervals is the sample mean. Sometimes it is something new. The Hodges-Lehmann estimator associated with the confidence intervals associated with the Wilcoxon signed rank test is the median of the Walsh averages, which are the averages of pairs $$ \frac{X_i + X_j}{2}, \qquad 1 \le i \le j \le n, $$ where the data points are $X_i$, $i = 1$, $\ldots,$ $n$. Note that $i = j$ is allowed. We can make these as follows.
First we need some data to try it out on ```{r} set.seed(42) x <- rcauchy(50) ``` and then we can make the Walsh averages as follows ```{r} o <- outer(x, x, "+") / 2 w <- o[row(o) <= col(o)] ``` there is another way (TIMTOWTDI) ```{r} w2 <- o[upper.tri(o, diag = TRUE)] identical(w, w2) ``` (there is also a function `lower.tri` to extract lower triangles). Then the Hodges-Lehmann estimator is ```{r} median(w) ``` ## Solving Linear Equations The equation $A x = b$, where $x$ and $b$ are vectors (not necessarily of the same dimension) and $A$ is a matrix, can also be thought of as a set of linear equations $$ \sum_j a_{i j} x_j = b_i, \qquad i = 1, \ldots, n. $$ Suppose $A$ and $b$ are known and $x$ is the unknown to be solved for. Then we know from somewhere or other (high-school?) that if the number of unknowns is the same as the number of equations there is (maybe) a solution, and we have been taught how to find it by hand. R can do that faster. ```{r error=TRUE} barbie <- matrix(1:9, 3) ken <- 10:12 solve(barbie, ken) eigen(barbie) ``` Hmmmm. That example didn't show what we wanted it to show, but it does show that R is smart enough to detect when the equations *do not* have a solution. We don't want to explain eigenvalues and eigenvectors here (more beyond the scope of this course), but they do show the reason that the equations have no solution is that ```{r} barbie[1, ] + barbie[3, ] - 2 * barbie[2, ] ``` is zero, so some of the equations are linear combinations of the others. Try again. ```{r error=TRUE} barbie[3, 3] <- 20 barbie ken solve(barbie, ken) ``` and this is indeed the solution ```{r error=TRUE} x <- solve(barbie, ken) identical(barbie %*% x, ken) ``` Oops! That didn't work because the attributes are different. `barbie %*% x` is a matrix and `ken` is a vector. ```{r error=TRUE} identical(as.vector(barbie %*% x), ken) ``` It still didn't work. Why? ```{r} as.vector(barbie %*% x) - ken typeof(barbie %*% x) typeof(ken) ``` Oh!
They don't compare equal because the types don't match: one is double and the other is integer. ```{r} identical(as.numeric(barbie %*% x), as.numeric(ken)) ``` **Warning!** Never compare numeric variables for equality and expect to get the answer you expect. It would have been much safer to say ```{r} all.equal(as.numeric(barbie %*% x), as.numeric(ken)) ``` The R function `identical` insists on exact bit-for-bit equality. This is unlikely to be the case when the objects are the results of complicated calculations. The R function `all.equal` compares with a tolerance. It says they are equal if they are *almost equal*. More on this later when we cover computer arithmetic. ### Matrix inverse Division of matrices is not defined, but some square matrices do have inverses. The inverse of the matrix $A$ (if it exists) is written $A^{-1}$, and when it exists we have $A x = b$ if and only if $x = A^{-1} b$. The R function `solve` also calculates matrix inverses, when the second argument is missing. (So here is another example, where R could not do this unless the R function named `missing` existed. It is also an example of R's rather bizarre use of what C++ would call method overloading. Why not just use a different name for the function that calculates inverses?) ```{r} barbie.inverse <- solve(barbie) barbie.inverse barbie.inverse %*% ken solve(barbie, ken) ``` You should rarely, if ever, calculate a matrix inverse. If you want to solve linear equations, use the R function `solve` with two arguments rather than `solve` with one argument followed by a matrix multiplication. The former is 1. more accurate (as we see above), and 2. faster. Summary: don't use matrix inverse; use solution of linear equations instead. ### Identity Matrix The matrix that "does nothing" in matrix multiplication, like zero "does nothing" in addition of scalars and one "does nothing" in multiplication of scalars, is called the *identity matrix*.
If $A$ is a square matrix and $I$ is the identity matrix of the same dimension, then $A I = I A = A$. And $B$ is the matrix inverse of $A$ if and only if $A B = B A = I$. The identity matrix is created by the R function `diag`. ```{r} diag(3) ``` This is a highly unusual usage of this function. Usually it is used to extract or assign to the diagonal of a matrix (see the help page). (This is also an example where perhaps two different function names should have been used, but it's too late to fix that now. This remark is here because Ansu Chatterjee said this in casual conversation.) ## R Package Matrix This is a "recommended package" meaning it comes with every R installation. It is especially useful in dealing with *sparse* matrices (those with most components equal to zero). It can be vastly more efficient dealing with sparse matrices than the rest of R is. But all of that is beyond the scope of this course. ## More Higher-Order Functions: `apply` and `sweep` ### Applying Functions that Return Scalars The R function `apply` is a higher-order function that applies another function to the rows or columns of a matrix (or array, more on that later). ```{r} fred apply(fred, 1, mean) apply(fred, 2, mean) rowMeans(fred) colMeans(fred) ``` Since there are the convenience functions `rowMeans` and `colMeans`, we don't need to use `apply` with `mean` as its third argument. There are also the convenience functions `rowSums` and `colSums`, so we don't need to use `apply` with `sum` as its third argument. But that's it for the convenience functions. For every other function we want to apply we need `apply`. ```{r} fred apply(fred, 1, max) apply(fred, 2, max) ``` And ```{r} apply(fred, 1, function(x) quantile(x, 0.75)) apply(fred, 2, function(x) quantile(x, 0.75)) apply(fred, 1, quantile, probs = 0.75) apply(fred, 2, quantile, probs = 0.75) ``` For those who have not seen the term *quantile* before, we explain that it is the same as a *percentile* except without the percent. 
The 0.75 *quantile* of a numeric vector is the same as the 75th *percentile* of a vector. In statistics and probability theory we never use percents because they yield much more confusion than clarity. Most of the public cannot do arithmetic on percentages correctly. So we always insist that probabilities are numbers between zero and one and never percentages. Similarly in probability and statistics we always use quantiles and never percentiles. Even the term *95% confidence interval* has a competing term that does not use percents: a confidence interval with *0.95 coverage probability*. This example shows us two ways to apply a function that has more than one argument. We can use an anonymous function, or we can use the `...` argument of `apply` to pass other arguments to the function being applied. ### Applying Functions that Return Vectors Apply can also be used when the result of the function being applied is not a scalar. ```{r} apply(fred, 1, quantile, probs = (1:3)/4) apply(fred, 2, quantile, probs = (1:3)/4) ``` Sometimes this doesn't do what you think it should. The help for `apply` says > If each call to `FUN` returns a vector of length `n`, then `apply` > returns an array of dimension `c(n, dim(X)[MARGIN])` if `n > 1`. > If `n` equals `1`, `apply` returns a vector if `MARGIN` has length > 1 and an array of dimension `dim(X)[MARGIN]` otherwise. If `n` is > `0`, the result has length 0 but not necessarily the `correct` > dimension. So the dimension of the result that has length `n` is always the first one (the row dimension if the result is a matrix). Sometimes this is the transpose of what you think it *ought* to do. But it is what it is documented to do. ### Sweep Following Apply The R function `sweep` can be used with `apply` to further process a matrix. Suppose we want to subtract means to make rows of a matrix have mean zero.
That is done by ```{r} moo <- apply(fred, 1, mean) fred.mean.free.rows <- sweep(fred, 1, moo) rowMeans(fred.mean.free.rows) ``` But `sweep` is a higher-order function because its fourth optional argument is the function to be used instead of subtraction (see the help page). ## More on Quantiles The *q*th quantile of a continuous random variable $X$ is any number $x$ (which need not be unique) such that $$ \Pr(X \le x) = q $$ The *q*th quantile of a discrete random variable $X$ is any number $x$ (which need not be unique) such that $$ \begin{aligned} \Pr(X \le x) & \ge q \\ \Pr(X \ge x) & \ge 1 - q \end{aligned} $$ But the R function `quantile` doesn't use this definition (or doesn't *only* use this definition) because it is thinking of the vector whose quantiles it is calculating as a random sample from a population and it is trying to *estimate* the quantile of the *population*. And it has nine (!?) different ways to do that. The theoretically correct definition given above corresponds to `type = 1` or `type = 2`. More TIMTOWTDI. The default is (currently) `type = 7`. ## What is a Matrix *Really?* A matrix is just an atomic vector that has certain *attributes*. ```{r} mode(fred) class(fred) is.atomic(fred) is.vector(fred) is.matrix(fred) attributes(fred) ``` I'm surprised. I thought `is.vector(fred)` was going to say `TRUE`. Shows what I know. We can make a matrix without using any of the matrix-oriented functions. ```{r} fred.too <- 1:6 attributes(fred.too) <- list(dim = 3:2, dimnames = list(c("red", "white", "blue"), c("car", "truck"))) identical(fred, fred.too) ``` TIMTOWTDI. But you shouldn't do the latter in real life. Too tricky. It might confuse people looking at your code. # Arrays Arrays are just like matrices but with more dimensions. ```{r} alice <- array(1:24, dim = 2:4) alice ``` This is a little hard to read, and it gets even harder when there are more dimensions.
You can think of this array as being a three-dimensional table, and what R has printed as being four "slices" of the table, each of which is two-dimensional, hence a matrix. The printout is trying to say the same thing as ```{r} alice[ , , 1] alice[ , , 2] alice[ , , 3] alice[ , , 4] ``` As was the case with matrices, R stuffs vectors into an array in FORTRAN order (first index changes the fastest, second index next fastest, and so forth) rather than C/C++ order. According to their help pages, `apply`, `sweep`, and `outer` work on arrays. But I can never remember exactly how they work and always have to look up in the help or try little experiments to figure out what they do. ```{r} apply(alice, 1, mean) apply(alice, 1:2, mean) ``` So the second argument of `apply`, which is named `MARGIN`, gives the indices of the array which are to be indices of the result. In our first example (`MARGIN = 1`) the result is the average over the components of the array having specified values of the first index. It is the same as ```{r} mean(alice[1, , ]) mean(alice[2, , ]) ``` In our second example (`MARGIN = 1:2`) the result is the average over the components of the array having specified values of the first and second indices. It is the same as ```{r} mean(alice[1, 1, ]) mean(alice[2, 1, ]) mean(alice[1, 2, ]) mean(alice[2, 2, ]) mean(alice[1, 3, ]) mean(alice[2, 3, ]) ``` except that it stuffs these six numbers into a two by three matrix corresponding to the dimensions for these indices in the original array. I hope it is clear from this example that * `apply` is really powerful, and * `apply` is tricky (use with care). When you use `sweep` and `apply` together, you generally want their `MARGIN` arguments to agree. ```{r} foo <- sweep(alice, 1:2, apply(alice, 1:2, mean)) foo apply(foo, 1:2, mean) ``` That is all I have to tell you about `apply`, `sweep`, and `outer` applied to arrays. They are not easy to use.
They are just at the edge of the level of expertise we want you to develop in this course. In thirty years of using R and its predecessor S, I have used `apply` on arrays a lot, `sweep` on arrays rarely (less than once a year), and `outer` on arrays never (but now that I know it works on arrays, I may try to use it in the future). Unlike matrices, arrays do not have a counterpart in mathematics but do have counterparts in FORTRAN, C, C++, and other computer languages. Every computer language I have ever heard of has arrays. They are a basic storage structure. But they are not used anywhere near as much in R as in other languages, because they don't work well with R models (which is another handout). Like matrices, arrays are really just atomic vectors with a `dim` attribute that says what their dimensions are. They may also have a `dimnames` attribute that gives labels for their whatevers (I don't know what to call them; a matrix has rows and columns; an array has rows, columns, and whatever comes after rows and columns, but I don't think there is a word for that). # Tables An R object of class `"table"` is just an array that is intended to be treated as a contingency table, suitable data for Stat 5421 (Categorical Data Analysis). ```{r table} library(CatDataAnalysis) data(exercise_8.18) foo <- xtabs(counts ~ ., data = exercise_8.18) class(foo) foo ``` Note that tables use the slick trick of the `dimnames` having `names`, so the dimensions themselves have names as well as the categories along each dimension. ```{r table.dim} dimnames(foo) ``` Also there is the function `ftable` to simplify printing of tables. ```{r table.ftable} ftable(foo) ``` # Data Frames ## Introduction A data frame is something that prints out like a matrix, but it isn't a matrix. It is best thought of as a collection of named vectors all of the same length. It was originally designed for data for R models (which is another handout).
Typically the response and predictor vectors for a model must be the same length. Data frames are designed to hold such vectors. ## Package `datasets` There are many data frames that come with R and are used for examples. ```{r} foo <- eapply(as.environment("package:datasets"), is.data.frame) foo <- unlist(foo) length(foo) sum(foo) ``` This says there are `r length(foo)` datasets in the R package named `datasets`, which is always already attached when R starts (so there is no need to say `library(datasets)` before trying to access one), and `r sum(foo)` of these are objects of class `"data.frame"`. The R function `eapply` used here applies an R function, here `is.data.frame`, to all the objects stored in an R environment, which we haven't covered yet, so we shall say no more here about that. ## Toy Data Frame Here is an example data frame. ```{r} nam <- c("alice", "bob", "clara", "dan", "eve", "fred") dat <- data.frame(name = nam, gpa = pmin(4.0, rnorm(length(nam), 3.0, 0.5)), class = sample(c("freshman", "sophomore", "junior", "senior"), length(nam), replace = TRUE)) dat ``` As we already said, data frames are especially designed to work with R functions that fit models, and many such have an argument, usually named `data`, that takes a data frame in which to find data. For example ```{r} summary(aov(gpa ~ class, data = dat)) ``` This does the analog of a two-sample, independent sample, $t$-test assuming equality of variances for the difference of means of two populations, when two is replaced by four (because there are four classes, except that how many classes we have in these data is random, so sometimes three classes). But modeling is the subject of another handout, so back to data frames. ## Indexing (Extracting, Subscripting) Data frames, as we have already said and will say some more, are really lists. But they look like matrices when printed. The reason is that data frames are not *just* lists.
```{r}
class(dat)
```

They know they are data frames, and R generic functions like `print` and the extractor functions (`"["`, `"[["`, `"[<-"`, `"[[<-"`, `"$"`) know how to deal especially with data frames. So the list extractors work on data frames as if they were lists, and the matrix extractors work on data frames as if they were matrices.

```{r}
dat$name
dat[["name"]]
dat[[1]]
dat[1, 1]
dat[1, -1]
dat[dat$gpa > 3, ]
```

Great! But what was that nonsense about "levels"?

## Factors

For a bit we stop talking about data frames and start talking about factors, which is R's name for the kind of object it uses for categorical data. Where other people say "categorical data" (as in the title of STAT 5421 here at U of M) R says `factor`.

Factors are somewhat obsolete. If you have a variable in a model that has class `"character"`, then this will be treated like a factor in regression. Here is a logistic regression example.

```{r factor.ex1}
n <- 30
y <- sample(0:1, n, replace = TRUE)
x <- sample(c("red", "white", "blue"), n, replace = TRUE)
x
out1 <- glm(y ~ x, family = binomial)
summary(out1)
```

(No surprise nothing is significant. We simulated `y` to be independent of `x`.)

But this idiom does not work if the categorical variable has numeric values.

```{r factor.ex2, error = TRUE}
n <- 30
x <- match(x, c("red", "white", "blue"))
x
out <- glm(y ~ x, family = binomial)
summary(out)
```

Oops! R function `glm` treated `x` like a quantitative variable rather than a categorical one (there should have been three regression coefficients, rather than two). R function `factor` tells R to treat `x` correctly.

```{r factor.ex3, error = TRUE}
n <- 30
x <- factor(x)
x
out2 <- glm(y ~ x, family = binomial)
summary(out2)
```

And, despite the regression coefficients being different, we have fit the same model as we did initially.
```{r ex13}
all.equal(out1$fitted, out2$fitted)
```

If we look up `??factor` we find that R has the following functions, all documented on one help page, that deal with factors: `factor`, `ordered`, `is.factor`, `is.ordered`, `as.factor`, `as.ordered`, and `addNA`.

The categories of the categorical variable are called the "levels" of the factor in R-speak. Although the `See Also` section of `?factor` doesn't tell you this, the R functions `levels` and `nlevels` tell you what the levels of a factor are and how many of them there are. The `See Also` section of `?levels` tells us that there are also functions `relevel` and `reorder` (which I have never heard of before) that are useful in futzing with the levels of a factor.

The R function `ordered` makes ordered factors, which is R-speak for what other people call ordered categorical data. They are also beyond the scope of this course (should be mentioned at least briefly in STAT 5421).

We can use R function `levels` to put back the original levels (values) of `x`.

```{r levels}
levels(x) <- c("red", "white", "blue")
x
out3 <- glm(y ~ x, family = binomial)
summary(out3)
all.equal(out1$fitted, out3$fitted)
```

### What is a Factor *Really?*

```{r}
attributes(x)
x.strip <- x
attributes(x.strip) <- NULL
x.strip
```

So an R factor object is *really* an integer vector. All the rest of the information is in the attributes.

The way one gets a factor whose levels are character strings back to its original form is to run it through `as.character` or `as.vector`.

```{r}
as.character(x)
as.vector(x)
```

You can never be really sure what a factor is going to be treated as (the integer vector it really is or the character vector you think it is). So running a factor through `as.character` is a good idea whenever you want the character vector that is how you think of the factor. (R might do this by some sort of default coercion, but it also might not.)

### Avoiding Factors

Factors are much less essential than they used to be.
If a variable of type character is used as a predictor variable in an R model, then it is treated as if it had been converted to a `factor` variable. Moreover, R has had this behavior with a warning since version 2.4.0 (released October 2006), and the warning was removed in R 3.0.0 (released April 2013).

Also, character vectors used to take up a lot more space than the corresponding factors, but since version 2.6.0 (released October 2007) R has only one copy of every unique character string in any character vectors. All character vectors containing that string just point to it (in the C code implementing R).

Since version 2.4.0 (released October 2006) the R functions `as.data.frame`, `data.frame`, `expand.grid`, `rbind`, `read.table`, `read.csv`, `read.csv2`, `read.delim`, and `read.delim2` have an argument `stringsAsFactors` which says whether the default behavior of converting character vectors to factors should be done when producing a data frame. The default for this is

```{r}
args(data.frame)
```

and the documentation `?data.frame` and `?options` explain that to never have default conversion of strings to factors one does

```{r eval=FALSE}
options(stringsAsFactors = FALSE)
```

(except we did not actually do this for this document, that is, the preceding code chunk had `eval=FALSE` as an option). And if you want to have that whenever you start up R, you can put it in your `.Rprofile` file (`?Rprofile` explains that).

If this seems an incredible mess, it is. This shows the consequences of bad design decisions. They are very hard to partially undo and impossible to completely undo (while maintaining backward compatibility so you do not break working code and have users hate you). Lest anyone think your humble author is being snooty here, I admit to having made bad design decisions myself.

As of R version 4.0.0 the default for

```{r}
options("stringsAsFactors")
```

changes from `TRUE` to `FALSE`. As explained above, for the most part this is unnoticed by users but can be a pain.
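The difference is easy to demonstrate directly, since `data.frame` also accepts `stringsAsFactors` as an argument (a minimal sketch; the variable name `color` is made up for illustration):

```{r}
color <- c("red", "white", "blue")
d.old <- data.frame(color = color, stringsAsFactors = TRUE)  # old default behavior
d.new <- data.frame(color = color, stringsAsFactors = FALSE) # new default behavior
class(d.old$color)
class(d.new$color)
```

The same character data come back as a factor under the old default and as a character vector under the new one, which is exactly the sort of silent difference that makes the same code behave differently on different versions of R.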
I spent two days debugging a problem that arose because I was using R-4.0.0 and a user I had given my code to was using an earlier version of R, so we had different defaults of `stringsAsFactors`. But now that almost every R user is using R version 4.0.0 or later, this is less of an issue.

### Needing Factors

But the whole notion of factors is not completely unnecessary. If someone insists on factor levels that are numeric or logical, then `data.frame` and friends cannot know that this variable is intended to be a factor. One has to tell them that explicitly. If `fred` is a data frame containing a variable named `sally`, then there are a number of ways to do so (TIMTOWTDI).

```{r}
fred <- fred.save <- fred2 <- data.frame(sally = 1:10, herman = 11:20)
fred2$sally <- as.factor(fred2$sally)
fred3 <- transform(fred, sally = factor(sally))
fred4 <- within(fred, sally <- factor(sally))
identical(fred, fred.save) # have not clobbered original
identical(fred, fred2)     # supposed to be FALSE
identical(fred2, fred3)    # supposed to be TRUE
identical(fred3, fred4)    # supposed to be TRUE
```

## What is a Data Frame, *Really?*

```{r}
dat.strip <- dat
attributes(dat.strip) <- NULL
dat.strip
typeof(dat.strip)
attributes(dat)
```

A data frame is *really* just a list having certain attributes and certain properties that are enforced by the function `data.frame` and friends.

## Reading a Data Frame

By "reading" we mean doing I/O: getting stuff from outside of R into R, in this case specifically as a data frame. The standard methods for doing this are `read.DIF`, `read.fortran`, `read.fwf`, `read.table`, `read.csv`, `read.csv2`, `read.delim`, and `read.delim2`, some of which were mentioned above. There are even more methods in the recommended package `foreign` for reading even crazier formats (such as those from proprietary statistical packages).

These read from a file or from a "readable text-mode connection" which may be a URL (internet address). We saw the latter in action during the first quiz.
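A file is not actually required; these functions will also read text that is already in R. Here is a minimal sketch using the `text` argument of `read.table` (the variable names `x` and `y` are made up for illustration):

```{r}
# a whitespace-separated table held in a character string
txt <- "x y
1 10
2 20
3 30"
bar <- read.table(text = txt, header = TRUE)
bar
class(bar)
```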
Here's another example.

```{r}
foo <- read.table("http://www.stat.umn.edu/geyer/3701/data/array-ex1.txt",
    header = TRUE)
head(foo)
```

Looking at that URL shows what the data format is. It is a whitespace-separated plain text file with variable names in the first row and elements of the variables in the rest of the rows, each column being one variable (name and data).

Many users are in the (bad IMHO) habit of using spreadsheets (Microsoft Excel, for example) as data editors. So then how does one get the data out of the spreadsheet and into R? Spreadsheets can write CSV (comma separated values) format, and R can read that. (To make a CSV file, select "Save As" on the File menu, and then choose the output format to be CSV.)

```{r}
foo.too <- read.csv("http://www.stat.umn.edu/geyer/3701/data/array-ex1.csv")
identical(foo, foo.too)
```
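R can also write CSV files itself, so we can check the whole round trip without a spreadsheet. A minimal sketch using a temporary file (note that `write.csv` writes row names by default, so we turn that off to get back exactly what we started with):

```{r}
baz <- data.frame(x = 1:3, y = c("a", "b", "c"))
tmp <- tempfile(fileext = ".csv")
write.csv(baz, tmp, row.names = FALSE) # row.names = FALSE avoids a spurious extra column
baz.too <- read.csv(tmp)
identical(baz, baz.too)
```

In R 4.0.0 or later (where `stringsAsFactors` defaults to `FALSE`) this should print `TRUE`; under the old default `baz.too$y` would come back as a factor and the comparison would fail.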