For our example we will use the data from Example 4.3 in the textbook (Moore), which is available at a URL, so we can use the URL external data entry method.
The R function plot (on-line help) draws scatter plots and many other kinds of plots.
When given just two arguments, which are numeric vectors of the same length, as in the first line of the example, it makes a scatter plot.
Looking in the data URL linked above, we see that the variables named PctS and SATV seem to be the ones in the data set that are the x and y coordinates of the plot made in the book, so those are the variables we use.
(And we see that our plot does match the one in the book, so we must
have been right.)
The next command adds the nice labels (also copied from the figure in the book) to the scatter plot using the optional arguments xlab and ylab. The plot is the same; only the labels are different (the default is to use the variable names themselves, which aren't very informative).
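As a minimal sketch of the two plot calls described above: the numbers below are made-up stand-ins, not the textbook data (which would normally be read from the data URL), and the label text is hypothetical rather than copied from the book's figure.

```r
# Made-up stand-in values, NOT the textbook data set.
PctS <- c(5, 9, 12, 23, 48, 60, 68, 75, 80)           # x coordinates
SATV <- c(590, 580, 570, 540, 520, 505, 500, 498, 495) # y coordinates

# Two numeric vectors of the same length give a scatter plot.
plot(PctS, SATV)

# Same plot, but with explicit axis labels (hypothetical wording).
plot(PctS, SATV,
     xlab = "Percent of graduates taking the SAT",
     ylab = "Mean SAT verbal score")
```

Without xlab and ylab, the axes would be labeled PctS and SATV.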
We will redo the example of the preceding section, this time painting some of the points red, as the book does in Example 4.6 (Figure 4.3, p. 87).
The R function points (on-line help) adds points to an already existing plot (drawn by plot).
The optional argument pch = 19 given to both plot and points says to make solid dots rather than hollow dots (which is the default). The red and black hollow dots were too hard to distinguish in class.
The optional argument col = "red" given to points makes those points red.
Many more colors are available.
The commands
    PctS.South <- PctS[Southern]
    SATV.South <- SATV[Southern]
make the variables that are the x and y coordinates of the points we want to paint red. (In the homework problem in which you need to do this, those variables are already made. So you don't need statements like this.)
The variable Southern is logical (values TRUE and FALSE).
Hence PctS[Southern] and SATV[Southern] are instances of Logical Vector Indexing.
They pick out the cases for which Southern is TRUE.
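The following sketch puts the pieces above together: logical indexing to pick out a subset, then points to paint that subset red on an existing plot. All data values, including which cases are marked Southern, are made up for illustration and are not the textbook data.

```r
# Made-up stand-in values, NOT the textbook data set.
PctS <- c(5, 9, 12, 23, 48, 60, 68, 75, 80)
SATV <- c(590, 580, 570, 540, 520, 505, 500, 498, 495)
# Hypothetical logical variable: TRUE marks the cases to paint red.
Southern <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)

# Logical vector indexing keeps only the cases where Southern is TRUE.
PctS.South <- PctS[Southern]
SATV.South <- SATV[Southern]

# Draw all points as solid black dots, then overplot the subset in red.
plot(PctS, SATV, pch = 19)
points(PctS.South, SATV.South, pch = 19, col = "red")
```

Because points only adds to an existing plot, the call to plot has to come first.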
As an aside, the variable Southern is clearly irrelevant to the pattern in this scatter plot.
Careful study of the scatter plot, or knowledge about the ACT and SAT, tells you that there are two groups of states. The commands
    State[PctS < 30]
    State[PctS > 50]
    State[30 <= PctS & PctS <= 50]
should tell you all about that.
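To see how these three comparisons split the cases, here is a small sketch. The State labels and PctS values are hypothetical placeholders, not the real data.

```r
# Hypothetical labels and made-up percentages, NOT the textbook data.
State <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
PctS  <- c(5, 9, 12, 23, 48, 60, 68, 75, 80)

low    <- State[PctS < 30]               # cases with a low percentage
high   <- State[PctS > 50]               # cases with a high percentage
middle <- State[30 <= PctS & PctS <= 50] # cases in between

low
high
middle
```

The three conditions are mutually exclusive and cover every value, so each case lands in exactly one of the three groups.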
The R function cor (on-line help) calculates correlations. Using the same data as we used for the scatter plot, we also calculate the correlation as follows.
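A sketch of the calculation, again with made-up stand-in values rather than the textbook data. It also checks cor against the usual definition of correlation as the sum of products of standardized values divided by n - 1.

```r
# Made-up stand-in values, NOT the textbook data set.
PctS <- c(5, 9, 12, 23, 48, 60, 68, 75, 80)
SATV <- c(590, 580, 570, 540, 520, 505, 500, 498, 495)

# Correlation computed by the built-in function.
r <- cor(PctS, SATV)

# The same quantity by hand: standardize each variable, multiply, sum,
# and divide by n - 1.
zx <- (PctS - mean(PctS)) / sd(PctS)
zy <- (SATV - mean(SATV)) / sd(SATV)
r.byhand <- sum(zx * zy) / (length(PctS) - 1)

r
r.byhand
```

The order of the arguments does not matter: cor(SATV, PctS) gives the same value.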
In this section, we combine the two things we already learned (the R functions plot and cor) with another trick, simulation, which is a fancy name for making up data (with certain properties).
Here we make up data following the bivariate normal distribution (which we haven't learned anything about yet, but will).
The R function rnorm (on-line help) simulates normally distributed data, by default standard normal.
Because the data are simulated, the plot will be different each time you click the Submit button.
The first two lines assign a sample size n and a simulation truth correlation cor.true.
The latter is not the actual correlation but rather the correlation in an ideal probability model for the data. Later, when we learn the distinction between populations and samples and between parameters and statistics (Chapter 10 in the textbook), we will say that cor.true is the parameter and the value calculated by cor(x, y) is the statistic.
For now, you can consider n and cor.true to be just two knobs that you can fiddle with to get different simulations.
To get more points in the plot, increase n.
(Just edit the text in the web form and resubmit.)
To get a different shaped plot, change cor.true.
Remember that a correlation must be between −1 and 1.
Redo with different values of cor.true to get an idea how correlation works.
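For readers working outside the web form, here is one way such a simulation could be written. The names n and cor.true come from the text; the construction of y from two independent rnorm draws is a standard way to get bivariate normal data with a given correlation, and is an assumption here, not necessarily the code behind the form. The call to set.seed is included only to make the sketch reproducible.

```r
set.seed(42)     # assumption: fixed seed for reproducibility; omit for a new plot each run

n <- 100         # sample size (knob 1)
cor.true <- 0.8  # simulation truth correlation (knob 2), must be between -1 and 1

x <- rnorm(n)    # standard normal
z <- rnorm(n)    # independent standard normal noise
# Mixing x and z with these weights gives y variance 1 and
# correlation cor.true with x in the ideal probability model.
y <- cor.true * x + sqrt(1 - cor.true^2) * z

plot(x, y)
cor(x, y)        # the statistic: close to, but not equal to, cor.true
```

Increasing n makes cor(x, y) tend to land closer to cor.true; changing cor.true changes the shape of the point cloud.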
More precisely, this section is about how correlation does not measure non-linear association.
We redo our simulation example with non-linear association.
Now correlation makes no sense, so we have no cor.true (if we did have it, it would be zero).
The only thing to fiddle with is the number of points n.
The R function curve (on-line help) adds curves to plots, in this case the curve with equation y = x^2, which is the red curve in the plot and which we will eventually (perhaps; this is not in the textbook) learn to say is the true regression function for the simulated data.
There is a clear association between the two variables. Y values are high at each end of the plot and low in the middle, that is, high for X values that are large positive or large negative and low for X values near zero.
But since the association is non-linear, correlation doesn't measure it.
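A sketch of a simulation along these lines follows. The noise standard deviation and the seed are assumptions made for illustration; only n, the use of curve, and the true regression function y = x^2 come from the text.

```r
set.seed(42)   # assumption: fixed seed for reproducibility; omit for a new plot each run

n <- 200       # number of points, the only knob here

x <- rnorm(n)
# y follows the curve x^2 plus normal noise (noise sd of 0.5 is an assumption).
y <- x^2 + rnorm(n, sd = 0.5)

plot(x, y)
# add = TRUE draws on the existing plot instead of starting a new one.
curve(x^2, add = TRUE, col = "red")

cor(x, y)      # does not reflect the strong curved association
```

Comparing the y values at the ends of the plot with those in the middle, for example mean(y[abs(x) > 1]) versus mean(y[abs(x) < 0.5]), shows the association that the correlation misses.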