Final Exam, Stat 8061, Fall 2001

This exam is due on Thursday, December 20, at 2:00PM to Dana in 313 Ford, or by 4:30 the same day to me in 146H Classroom Office Building. If you cannot finish the exam on time, please contact me; late exams may not be graded until January. Do not leave your exam in my mailbox. Be sure you keep a copy of the exam. Graded exams, solutions, and course grades will be ready on Friday, December 21 by the end of the day. Good luck!

You may not discuss this exam with anyone except the instructor. I will have office hours in 362 Ford on: Fri Dec. 14, 1:15-2:15, Mon Dec. 17, 3:30-4:30, and Wed Dec. 18, 11-12 and 1-3. I can be reached by email any time. If there are any changes or corrections to the exam, I will send email.

Problem #1

The number of crustacean zoöplankton species present in a lake can be different, even for two nearby lakes. Species diversity (crudely measured by the number of species present in the lake) is of interest to ecologists. The data in the file lakes.lsp on Linux, and on the class Web page, give the number of known crustacean zoöplankton species for 69 Northern Hemisphere lakes. The data were compiled from many sources, and include the Great Lakes, as well as small ponds located near universities (e.g., Cornell Pond 209 or Lechemere near Harvard) that are easily studied. Also included are a number of characteristics of each lake. There are some missing values, indicated with a "?" in the data file (be sure to reread the sections of the book that describe how Arc works with missing data). The overall goal of the analysis is to understand how the number of species present depends on the other characteristics of the lake.

Description of variables

These data give the number of known crustacean zooplankton species for 69 world lakes. Also included are a number of characteristics of each lake. There are missing values.

Name      Type    n    Info
Area      Variate 69   Lake area, in hectares
Cond      Variate 50   Specific conductance, microsiemens
Dist      Variate 69   Distance to nearest lake, km
Elev      Variate 69   Elevation, m
Lat       Variate 69   N latitude, degrees
Long      Variate 69   W longitude, degrees
MaxDepth  Variate 69   Maximum lake depth, m
MeanDepth Variate 69   Mean lake depth, m
NLakes    Variate 69   Number of lakes within 20 km
Photo     Variate 47   Rate of photosynthesis, mostly by the 14C method
Species   Variate 69   Number of zooplankton species
Lake      Text    69   Name of Lake
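
The n column above shows that Cond (50) and Photo (47) are not observed for all 69 lakes, so any fit that uses them rests on fewer cases. The following is a minimal sketch, in Python rather than Arc, of how the "?" marker might be handled when counting observed values and complete cases. The three rows, the lake names, and the column subset are invented for illustration and are not taken from lakes.lsp; the actual file layout and Arc's own treatment of missing data may differ.

```python
# Hypothetical rows in the order: Lake, Cond, Photo, Species.
# These values are made up; they are NOT from lakes.lsp.
raw_rows = [
    "LakeA 120 ? 14",
    "LakeB ? 35.5 9",
    "LakeC 80 12.0 21",
]

def parse_value(token):
    """Return None for the missing-value marker '?', else the token as a float."""
    return None if token == "?" else float(token)

columns = ["Cond", "Photo", "Species"]
records = []
for line in raw_rows:
    name, *tokens = line.split()
    records.append((name, dict(zip(columns, map(parse_value, tokens)))))

# Observed (non-missing) count per variable, analogous to the n column.
n_observed = {c: sum(1 for _, r in records if r[c] is not None) for c in columns}

# Lakes with every listed variable observed (complete cases).
complete_cases = [name for name, r in records if all(r[c] is not None for c in columns)]

print(n_observed)
print(complete_cases)
```

Comparing the complete-case count with the per-variable counts shows how many lakes are lost when a predictor with missing values enters the model.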

Several questions are of interest to the ecologists, and your assignment is to provide specific answers to these specific questions. Each answer should consist of two parts: (a) your answer to the question, and (b) how you got that answer. Short answers are generally adequate. Some of the questions may not be answerable from these data. If there is no answer, state why no answer is possible. As is true in the real world, some questions are poorly worded and vague, and will require interpretation on your part, so many different answers may be possible. Some of the questions require only a little work, while others require lots of work.

  1. Justify in a sentence the use of Poisson regression models in this problem. If you choose not to use Poisson regression, justify not using Poisson regression.
  2. Ignoring all other predictors, how does the number of species depend on the depth of the lake? Which is a more important predictor, MeanDepth or MaxDepth, or are both required?
  3. The variables Lat and Long describe the location of the lake on the earth. Is it reasonable to infer from these data that the number of species is independent of Lat and Long?
  4. Which predictors can be ignored by the ecologists without important loss of information?
  5. Which lakes, if any, appear to be different from the others? How are they different? What is the impact of these lakes on overall conclusions?
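
For readers who want a reminder of what a Poisson regression fit involves, here is a minimal sketch, in Python rather than Arc, of a log-linear Poisson model with one predictor fitted by Newton-Raphson. The function name `fit_poisson_loglinear` and the four data points are invented for illustration; this is not the course's method of computation and not an answer to any question above.

```python
import math

def fit_poisson_loglinear(x, y, iterations=25):
    """Fit log(mu_i) = b0 + b1 * x_i to counts y by Newton-Raphson."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: first derivatives of the Poisson log-likelihood.
        s0 = sum(yi - mi for yi, mi in zip(y, mu))
        s1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2 x 2 Fisher information matrix.
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: add (information inverse) times (score).
        b0 += (i11 * s0 - i01 * s1) / det
        b1 += (-i01 * s0 + i00 * s1) / det
    return b0, b1

# Made-up counts that double with each unit increase in x,
# so the fitted slope should be close to log(2).
x = [0.0, 1.0, 2.0, 3.0]
y = [1, 2, 4, 8]
b0, b1 = fit_poisson_loglinear(x, y)
print(b0, b1, math.exp(b1))
```

On the log scale, exp(b1) is the multiplicative change in the expected count per unit change in the predictor, which is the usual way such coefficients are read.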

Problem #2

Health care plans attempt to control their costs in a number of different ways. These data are to be used to understand the effectiveness of several strategies for controlling the costs of prescription drugs. Plans can determine co-payments, which are the fixed amount that the patient must pay for each prescription. Some plans use generic drugs more than others; generic drugs are supposed to be equivalent to name-brand drugs but cheaper. Another tool to control costs is in restricting the drugs that can be prescribed by physicians. For example, if three nearly equivalent drugs are available, the plan may require all physicians to prescribe only one of them, so the health plan can buy that drug in higher quantity for lower cost. The three variables CoPay, GS and RI, defined below, measure these three tools available to the health plan. The goal is to determine how these three strategies are related to cost.

The data file drugcost.lsp can be obtained from the class web page. On Linux, type (load "drugcost"). The data include the response, three primary predictors of interest, and a few other variables that characterize the health plan:

COST = Ave. cost to plan for 1 prescription for 1 day, THE RESPONSE
RXPM = Number of prescriptions per member per year
GS = % generic substitution
RI = Restrictiveness index (0=none, 100=total)
OC = 1 if oral contraceptives are covered, 0 else
COPAY = Average prescription co-payment
AGE = Average age
F = % female members
MM = Member months = number of members * average number of months per member
ID = Plan name
The variable RXPM measures the overall use of drugs by the health plan; larger values mean more drug use. MM is a measure of the size of the plan.

What to turn in

The question you are to answer is: How do CoPay, GS and RI impact prescription drug costs? Your solution should consist of two parts, a "Summary" and "Supporting Evidence." The summary will consist of: (1) a statement of your conclusions, with relevant summary statistics and probability statements. This should be at most 300 words. Your conclusions may be equivocal: for example, they might depend on whether or not a specific case is treated as an outlier. (2) AT MOST two graphical or numerical displays that are designed to convince someone familiar with statistical analysis that your analysis is sound, and that your conclusions are justified. Just giving a graph is NOT enough: you must explain what the graph shows and why it is interesting.

Your supporting evidence will consist of: (1) AT MOST 500 words explaining how you got your answer, with up to five figures/tables that support your text. Unlabeled or unreferenced computer output will count against you. Word limits will be strictly enforced.


S Weisberg
2001-12-14