Statistics 3011 (Geyer and Jones, Spring 2006) Examples: Scatter Plots and Correlation

Scatter Plots

For our example we will use the data from Example 4.3 in the textbook (Moore), which are available at the URL

http://www.stat.umn.edu/geyer/3011/mdata/chap04/eg04-03.dat

so we can use the URL external data entry method.

Comments

The R function plot (on-line help) draws scatter plots and many other kinds of plots.

When given just two arguments, which are numeric vectors of the same length, as in the first line of the example, it makes a scatter plot.

Looking at the data file linked above, we see that the variables named PctS and SATV appear to be the x and y coordinates of the plot made in the book, so those are the variables we use. (Our plot does match the one in the book, which confirms the choice.)

The next command adds descriptive labels (copied from the figure in the book) to the scatter plot using the optional arguments xlab and ylab. The plot is the same; only the labels differ. (The default is to use the variable names themselves, which are not very informative.)
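The two plot commands described above might look like the following sketch. The numbers are made-up stand-in values, not the textbook data, and the label text is an assumption (the book's exact labels are not reproduced here); only the variable names PctS and SATV come from the data set.

```r
# Made-up stand-in values (NOT the textbook data), used only so the sketch runs.
PctS <- c(5, 10, 20, 45, 60, 70, 80)
SATV <- c(590, 570, 550, 520, 505, 500, 495)

# Two numeric vectors of the same length: plot makes a scatter plot.
plot(PctS, SATV)

# Same scatter plot, with axis labels supplied through xlab and ylab.
# The label text here is an assumption, chosen for illustration.
plot(PctS, SATV, xlab = "Percent taking SAT", ylab = "Mean SAT verbal score")
```

With the real data, the two vectors would come from the data file rather than being typed in.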

Colorful Scatter Plots

We will redo the example of the preceding section, this time painting some of the points red, as the book does in Example 4.6 (Figure 4.3, p. 87).

Comments

The R function points (on-line help) adds points to an already existing plot (drawn by plot).

The optional argument pch = 19 given to both plot and points says to make solid dots rather than hollow dots (which is the default). The red and black hollow dots were too hard to distinguish in class.

The optional argument col = "red" given to points makes those points red. Many more colors are available.

The commands

PctS.South <- PctS[Southern]
SATV.South <- SATV[Southern]

make the variables that are the x and y coordinates of the points we want to paint red. (In the homework problem in which you need to do this, those variables are already made. So you don't need statements like this.)

The variable Southern is logical (values TRUE and FALSE). Hence PctS[Southern] and SATV[Southern] are instances of Logical Vector Indexing. They pick out the cases for which Southern is TRUE.
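Putting these pieces together, a minimal sketch of the red-point plot could look like this. The values, including which cases are TRUE in Southern, are made up for illustration and are not the textbook data.

```r
# Made-up stand-in values (NOT the textbook data), used only so the sketch runs.
PctS <- c(5, 10, 20, 45, 60, 70, 80)
SATV <- c(590, 570, 550, 520, 505, 500, 495)
Southern <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

# Logical vector indexing: keep only the cases for which Southern is TRUE.
PctS.South <- PctS[Southern]
SATV.South <- SATV[Southern]

# Draw all points as solid black dots, then redraw the selected ones in red.
plot(PctS, SATV, pch = 19)
points(PctS.South, SATV.South, pch = 19, col = "red")
```

Because points draws on top of the existing plot, the red dots simply cover the black dots at the same coordinates.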

Irrelevant Comment

Not that it has anything to do with scatter plots, but clearly the variable Southern is irrelevant to anything about this scatter plot.

Careful study of the scatter plot, or knowledge about the ACT and SAT, tells you that there are two groups of states. The commands

State[PctS < 30]
State[PctS > 50]
State[30 <= PctS & PctS <= 50]

should tell you all about that.

Correlation

The R function cor (on-line help) calculates correlations. Using the same data as we used for the scatter plot, we also calculate the correlation as follows.
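As a sketch of what this calculation looks like, the following uses made-up stand-in values (not the textbook data) and also checks cor against the usual definition of the correlation as the sum of products of standardized values divided by n - 1. The names r and r.by.hand are introduced here only for illustration.

```r
# Made-up stand-in values (NOT the textbook data), used only so the sketch runs.
PctS <- c(5, 10, 20, 45, 60, 70, 80)
SATV <- c(590, 570, 550, 520, 505, 500, 495)

# Correlation computed by R.
r <- cor(PctS, SATV)

# The same quantity from the definition: standardize each variable,
# multiply the standardized values case by case, sum, and divide by n - 1.
n <- length(PctS)
zx <- (PctS - mean(PctS)) / sd(PctS)
zy <- (SATV - mean(SATV)) / sd(SATV)
r.by.hand <- sum(zx * zy) / (n - 1)

print(r)
print(r.by.hand)
```

The two printed numbers should agree up to rounding error, and the order of the arguments to cor does not matter.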

Correlation Simulations

In this section, we combine the two things we have already learned (the R functions plot and cor) with another trick, simulation, which is a fancy name for making up data (with certain properties).

Here we make up data following the bivariate normal distribution (which we haven't learned anything about yet, but will).

Comments

The R function rnorm (on-line help) simulates normally distributed data, by default standard normal.

Because the data are simulated, the plot will be different each time you click the Submit button.

The first two lines assign a sample size n and a simulation truth correlation cor.true.

The latter is not the actual correlation but rather the correlation in an ideal probability model for the data. Later, when we learn the distinction between populations and samples and between parameters and statistics (Chapter 10 in the textbook), we will say that cor.true is the parameter and the value calculated by cor(x, y) is the statistic.

For now, you can think of n and cor.true as just two knobs that you can fiddle with to get different simulations.

To get more points in the plot, increase n. (Just edit the text in the web form and resubmit.)

To get a different shaped plot, change cor.true. Remember that a correlation must be between −1 and 1.

Redo with different values of cor.true to get an idea how correlation works.
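A simulation of this kind can be sketched as follows. The values of n and cor.true are arbitrary choices, and building y from x plus independent noise is one standard way to get bivariate normal data with a given correlation; it is an assumption that the example on this page does it the same way.

```r
n <- 100          # sample size (arbitrary choice)
cor.true <- 0.7   # simulation truth correlation (arbitrary choice)

# x is standard normal. y mixes x with independent standard normal noise,
# weighted so that y is also standard normal and the true correlation
# of x and y is cor.true.
x <- rnorm(n)
y <- cor.true * x + sqrt(1 - cor.true^2) * rnorm(n)

plot(x, y)

# The sample correlation: near cor.true, but not exactly equal to it.
print(cor(x, y))
```

No random seed is set, so each run gives a different plot and a different value of cor(x, y).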

Correlation Measures Linear Association

More precisely, this section is about how correlation does not measure non-linear association.

We redo our simulation example with non-linear association.

Comments

Now correlation makes no sense, so we have no cor.true (if we did have it, it would be zero).

The only thing to fiddle with is the number of points n.

The R function curve (on-line help) adds curves to plots, in this case the curve with equation y = x^2, which is the red line in the plot and which we will eventually (perhaps, this is not in the textbook) learn to say is the true regression function for the simulated data.
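A minimal sketch of such a simulation is given here. The sample size and the use of standard normal x and standard normal noise around the curve are assumptions made for illustration.

```r
n <- 100   # number of points (arbitrary choice)

# x is standard normal; y scatters around the curve y = x^2
# with standard normal noise (the noise scale is an assumption).
x <- rnorm(n)
y <- x^2 + rnorm(n)

plot(x, y)

# Add the true regression function y = x^2 to the existing plot in red.
curve(x^2, add = TRUE, col = "red")

# The sample correlation is typically near zero despite the clear pattern.
print(cor(x, y))
```

Inside curve, the x in the expression x^2 is a grid of plotting positions chosen by curve itself, not the simulated data vector x.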

The Moral of the Story

There is a clear association between the two variables. Y values are high at each end of the plot and low in the middle, that is, high for X values that are large positive or large negative and low for X values near zero.

But since the association is non-linear, correlation doesn't measure it.