A Dummies' Guide to Smart Prediction
====================================
This document describes the "why" of the smartpred library and should
be read by all its users. It is in the form of a series of questions
and answers. The technical level has been kept purposely low; another
document (TechnicalGuide.txt) gives technical information about the
smartpred library, and is aimed at writers of smart functions. Most
users of the smartpred library need only read this document.
Here are the questions:
1. When is the smartpred library useful?
2. Why is there a need for the smartpred library?
3. Can you illustrate the problem?
4. Which functions are faulty?
5. How can I get smartpred running?
6. Explain why the problem is present.
7. How does smartpred fix up the problem?
8. I notice that lm() and glm() have smart arguments, e.g.,
lm(..., smart=T). Why is this so?
9. What about the S-PLUS gam() function?
10. In R, why do I have to type library(splines) before library(smartpred)
before I use bs() or ns()?
In S-PLUS, why do I have to type library(smartpred, first=T)?
11. Are there any dangers with smart prediction?
12. How can I know if a function is smart?
13. Doesn't R 1.6.0 remove the need for smart prediction?
----------------------------------------------------------------
1. Question.
When is the smartpred library useful?
Answer. When using S-PLUS and:
(a) you fit models with functions such as lm() and/or glm(),
(b) you predict from the fitted model, and
(c) the formula contains data-dependent terms.
For example,
fit <- lm(y ~ I((x-mean(x))/sqrt(var(x))))
p <- predict(fit, newdat)
fits a simple linear regression to a standardized version of x, but p
will generally be wrong.
Question 3 provides a full example.
----------------------------------------------------------------
2. Question.
Why is there a need for the smartpred library?
Answer.
Prediction from the results of the lm() function can be wrong.
This occurs when the formula contains data-dependent terms. When lm()
was written about 10 years ago, it was very difficult to decide which
was better for prediction - a general solution that works reasonably
well all the time, or a specific solution (like the one presented here)
which works perfectly most of the time. The former option was chosen.
We believe that it would be a good idea for the latter option to be
made available too.
The smart prediction library is a proposed method of solution. The
solution presented here is not foolproof; see Question 11 for what can
go wrong. We believe the smartpred library is likely to be welcomed by
all users who know enough about formulas and programming to use it
correctly.
----------------------------------------------------------------
3. Question.
Can you illustrate the problem?
Answer.
Try the following in S-PLUS:
n <- 20
set.seed(86) # For reproducibility of the random numbers
x <- sort(runif(n))
y <- sort(runif(n))
fit <- lm(y ~ ns(x, df=5))
plot(x, y)
lines(x, fitted(fit))
newx <- seq(0, 1, len=n)
points(newx, predict(fit, data.frame(x=newx)), type="b", col=2, err=-1)
The two curves won't be the same! You can repeat the above with "ns"
replaced by the "bs" and "poly" functions; the results will also show
the problem. scale(x) won't work either.
Unfortunately, glm() does not fare any better:
y <- round(sort(runif(n)) * 10) # Make sure they are integers
fit <- glm(y ~ poly(x, 2), poisson)
plot(x, log(y))
lines(x, predict(fit))
points(newx, predict(fit, data.frame(x=newx)), type="b", col=2, err=-1)
----------------------------------------------------------------
4. Question.
Which functions are faulty?
Answer. S-PLUS's lm() and glm() have the problem. This was also true
for R 1.5.1 and earlier. R 1.6.0 and later provide a solution, but it
is not a complete one (see Question 13).
The smartpred library fixes the problem. The result is that, with the
smartpred library, prediction with both lm() and glm() should now be
correct.
The specific functions that need modification are found in files
such as smart151.R (R 1.5.1)
or smart60.q (S-PLUS 6.0).
They include lm(), glm(), predict.lm(), predict.mlm(), bs(),
ns(), poly() [via poly.raw()], scale() and/or scale.default().
----------------------------------------------------------------
5. Question.
How can I get smartpred running?
Answer.
If you are running Linux or Unix, and using R or S-PLUS, then provision
is made to install it as a library. Otherwise, you can simply source
the appropriate file, depending on your version. These are
smart60.q S-PLUS 6.0
smart151.R R 1.5.1 or higher
For example, if you use S-PLUS 6.0, typing something like
source("smart60.q")
will read in the modified functions. You will get warning statements
which you can ignore. We haven't taken the time to provide scripts to
install smartpred as a library for Windows. Sorry!
You should run the code in Question 3 above to check that you get the
right answers with smartpred.
----------------------------------------------------------------
6. Question.
Explain why the problem is present.
Answer.
Let's suppose you fit a model using lm() or glm(), e.g.,
fit <- lm(y ~ poly(x, 3))
Then if you predict using the model on new data, e.g.,
p <- predict(fit, newdat)
then poly(x, 3) gets re-evaluated the second time on newdat$x. The
result p is wrong because "poly" is a data-dependent function (see
Chambers and Hastie, pp. 288, 108): its results depend in a
complicated way on the specific data passed into it. Other
functions to watch out for are:
"bs" (B-splines),
"ns" (natural splines), and
"scale" (Scale a vector or columns of a matrix).
This is because, for "bs" and "ns", the knot locations are
data-dependent, and for "scale", the center and scale parameters are
data-dependent.
In general, incorrect predictions will occur with any formula
containing a data-dependent term. Here are some other examples where
prediction from the fitted models will be wrong:
fit1 <- lm(y ~ I((x-mean(x))/sqrt(var(x))))
fit2 <- lm(y ~ I((x-min(x))^2))
In fact, any function such as min(), max(), var(), range(), or a
function computing quantiles etc. will give incorrect predictions.
This is because their values, when predicting on new data, will differ
from their values on the original data. Consequently, the fitted
regression coefficients (e.g., fit$coefficients) no longer have meaning
for the new data set being used for prediction.
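The effect can be illustrated numerically. The sketch below is in
Python rather than S, purely to show the arithmetic; the coefficients
are made up, and predict_right()/predict_wrong() are invented names,
not part of any library. It contrasts reusing the original
standardization parameters with recomputing them on the new data:

```python
# Illustration only: why re-evaluating a data-dependent transform
# (here, standardization) on new data breaks prediction.
from statistics import mean, stdev

x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
# Suppose a model was fitted on z = (x - mean(x)) / sd(x), with
# (made-up) fitted coefficients:
intercept, slope = 10.0, 2.0
mu, sd = mean(x_train), stdev(x_train)  # the data-dependent parameters

def predict_right(x_new):
    # Correct: reuse the TRAINING mean and sd (what smart prediction saves).
    return [intercept + slope * (x - mu) / sd for x in x_new]

def predict_wrong(x_new):
    # Wrong: recompute mean and sd on the NEW data, which is what a
    # naive re-evaluation of the formula does.
    mu2, sd2 = mean(x_new), stdev(x_new)
    return [intercept + slope * (x - mu2) / sd2 for x in x_new]

x_new = [10.0, 20.0, 30.0]
print(predict_right(x_new))  # uses the original parameters
print(predict_wrong(x_new))  # different: the coefficients no longer apply
```

The two sets of predictions differ because the recomputed mean and
standard deviation bear no relation to those the coefficients were
fitted against.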
----------------------------------------------------------------
7. Question.
How does smartpred fix up the problem?
Answer. When a data-dependent function such as poly() is first used
in lm() on the original data, its data-dependent "parameters" are
saved somewhere. Just before lm() finishes, the parameters are
retrieved and attached to the object (in, e.g., fit$smart.prediction).
Thus the parameters are saved on the object, much like contrasts. They
are re-used when predictions are made from the model: at prediction
time, the parameters from the object are copied somewhere, and when
poly() is run the second time, it reads in the original parameters from
the somewhere and then makes use of them. The "somewhere" is a
specified location, and specially reserved variables are used
throughout by smartpred.
We call the whole scheme "smart" prediction. It requires the
data-dependent functions (such as poly(), scale(), bs(), and ns()) to be
"smart" functions. They must be smart enough to know about writing to
and reading from the "somewhere". Furthermore, they must behave
normally when used outside functions such as lm(). Additionally, modelling
functions such as lm() and glm() must be able to set things up and
attach the parameters to the object. Also, prediction functions such
as predict.lm() and predict.glm() must be able to write the parameters
to the somewhere before actually doing the prediction.
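The scheme above can be sketched in miniature. The code below is in
Python rather than S, purely to illustrate the mechanism; the names
SMART_STORE, PREDICT_MODE and smart_scale() are invented for this
sketch and are not the smartpred API:

```python
# A minimal sketch of smart prediction: a data-dependent function
# writes its parameters to a reserved place at fit time and reads
# them back at predict time.
from statistics import mean, stdev

SMART_STORE = {}      # the "somewhere": a specially reserved location
PREDICT_MODE = False  # set by the prediction function

def smart_scale(x, key="scale.1"):
    if PREDICT_MODE:
        # Second evaluation: reuse the saved parameters.
        mu, sd = SMART_STORE[key]
    else:
        # First evaluation: compute the parameters and save them.
        mu, sd = mean(x), stdev(x)
        SMART_STORE[key] = (mu, sd)
    return [(v - mu) / sd for v in x]

# Fit time: the modelling function evaluates the formula ...
z_fit = smart_scale([1.0, 2.0, 3.0, 4.0, 5.0])
saved = dict(SMART_STORE)   # ... and attaches the parameters to the
                            # fitted object, much like contrasts.

# Predict time: the prediction function restores the parameters
# before re-evaluating the formula on the new data.
SMART_STORE.update(saved)
PREDICT_MODE = True
z_new = smart_scale([10.0, 20.0, 30.0])  # uses the ORIGINAL mu and sd
```

Outside a modelling function (PREDICT_MODE unset, nothing restored),
smart_scale() simply standardizes its argument, mirroring the
requirement that smart functions act normally elsewhere.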
----------------------------------------------------------------
8. Question.
I notice that lm() and glm() have smart arguments, e.g.,
lm(..., smart=T). Why is this so?
Answer.
There may be times when you are certain that smart prediction is
not needed or wanted. For example, you write a function that calls
lm(). In such cases, having the argument gives you some control on
this feature. In general, users should not use this argument.
----------------------------------------------------------------
9. Question.
What about the S-PLUS gam() function?
Answer.
We have not made any modifications to gam(). The function
predict.gam() uses the method of "safe" prediction, which is a little
difficult to explain (see Chambers and Hastie, pp.288,289).
Note that predict.gam() also works on non-GAM objects.
Smart prediction has the following advantages over safe prediction:
a) the prediction is faithful to the original model, and
b) it is faster and more direct because a second model need not be
fitted.
On the other hand, safe prediction will never fail, while smart
prediction will only work if one remembers to "smarten" one's
functions. Some functions will be more difficult to smarten than
others, such as a function for generating some weird new basis set
that a friend gave you: you have no idea how it works (nor want to
know) but would like to try it out. See the Technical Guide for
details about smartening up a function.
It should be noted that the functions vglm() and vgam(), which are
multivariate versions of glm() and gam(), are currently being written.
They currently run under S-PLUS 6.0 and R 1.5.1, and the latest
version is available at http://www.stat.auckland.ac.nz/~yee. They
feature smart prediction.
----------------------------------------------------------------
10. Question.
In R, why do I have to type library(splines) before library(smartpred)
before I use bs() or ns()?
In S-PLUS, why do I have to type library(smartpred, first=T)?
Answer.
In R, bs() and ns() call compiled C code, which has to be loaded in.
Typing library(splines) does that. Because one must use smart versions
of bs() and ns(), typing library(splines) before library(smartpred)
ensures the smart versions are used.
In S-PLUS, using first=T ensures that the smartpred functions mask
the built-in versions.
----------------------------------------------------------------
11. Question.
Are there any dangers with smart prediction?
Answer.
Yes. Smart prediction has a lurking danger: users can get lulled into
complacency, let an unsmart function slip in unnoticed, and make
errors without being aware of it. For example, a formula in lm()
involving ns() will predict correctly only if the smart versions of
lm(), predict.lm() and ns() are used. If one of these functions is
accidentally deleted, or the smartpred library is not read in, then
users will unknowingly use the non-smart versions. A careful
examination of fit$smart.prediction is
one way of checking, but this is error-prone if there are many ns(),
bs() and poly() terms in the formula. Another way of checking is to
predict on several subsets of the original data, and then compare
these to the original fitted values. However, this can be a little
laborious.
----------------------------------------------------------------
12. Question.
How can I know if a function is smart?
Answer.
The function is.smart() is provided for this purpose.
Its usage is, e.g., is.smart(bs), which returns TRUE or FALSE.
To check that smart prediction works, it is sufficient to check
that the modelling function, prediction function and the data-dependent
functions are all smart.
For example, is.smart(glm), is.smart(predict.glm), is.smart(poly).
If all three are true then smart prediction should work with
glm(y ~ poly(x, 3), binomial), say.
Be careful, however: in R 1.5.1, scale() is a generic function that
calls scale.default(). Hence scale.default() should be smart, but
is.smart(scale) cannot be TRUE in general. In S-PLUS 6.0, poly()
eventually calls poly.raw(), so both are assigned to be smart.
--------------------------------------------------------------------
13. Question.
Doesn't R 1.6.0 remove the need for smart prediction?
Answer.
No. R 1.6.0, which was released on 1 October 2002, implements an elegant
and intelligent version of safe prediction. But it doesn't always work.
It will work for individual terms like
bs(x, 3), scale(x), ns(x, 3), poly(x, 3)
but it will fail when such a term is used inside any other function.
For example, the terms
cos(scale(x)), bs(scale(x), 3), ns(scale(x), 4), scale(scale(x))
will fail. It will even fail for I(scale(x))! Smart prediction is probably
the only complete solution.
--------------------------------------------------------------------
Acknowledgements:
The smartpred library implements an algorithm devised by John Chambers
and Trevor Hastie at the time lm() and glm() etc. were written. We
also thank Drs W.N. Venables and B.D. Ripley for helpful comments.
----------------------------------------------------------------
Authors:
Thomas Yee (yee@stat.auckland.ac.nz) and Trevor Hastie
(hastie@stat.Stanford.EDU). Please send any suggestions and bug
reports to Thomas. The smart prediction library is available at
http://www.stat.auckland.ac.nz/~yee
This document last modified: 9 October 2002
References.
===========
Chambers, J. M. and Hastie, T. J. (eds.) (1992)
Statistical Models in S.
Wadsworth and Brooks/Cole, Pacific Grove, CA.