A Dummies Guide to Smart Prediction
===================================

This document describes the "why" of the smartpred library and should
be read by all its users. It is in the form of a series of questions
and answers. The technical level has been kept purposely low; another
document (TechnicalGuide.txt) gives technical information about the
smartpred library and is aimed at writers of smart functions. Most
users of the smartpred library need only read this document.

Here are the questions:

 1. When is the smartpred library useful?
 2. Why is there a need for the smartpred library?
 3. Can you illustrate the problem?
 4. Which functions are faulty?
 5. How can I get smartpred running?
 6. Explain why the problem is present.
 7. How does smartpred fix up the problem?
 8. I notice that lm() and glm() have smart arguments, e.g.,
    lm(..., smart=T). Why is this so?
 9. What about the S-PLUS gam() function?
10. In R, why do I have to type library(splines) before
    library(smartpred) before I use bs() or ns()? In S-PLUS, why do
    I have to type library(smartpred, first=T)?
11. Are there any dangers with smart prediction?
12. How can I know if a function is smart?
13. Doesn't R 1.6.0 remove the need for smart prediction?

----------------------------------------------------------------

1. Question. When is the smartpred library useful?

Answer. When using S-PLUS and:
(a) you use modelling functions such as lm() and/or glm(),
(b) you predict from the fitted object, and
(c) the formula contains data-dependent terms.
For example,

    fit <- lm(y ~ I((x-mean(x))/sqrt(var(x))))
    p <- predict(fit, newdat)

fits a simple linear regression to a standardized variable x, but p
will generally be wrong. Question 3 provides a full example.

----------------------------------------------------------------

2. Question. Why is there a need for the smartpred library?

Answer. Prediction from the result of the lm() function can be wrong
because predict.lm() cannot always re-evaluate the model formula
correctly.
This occurs when the formula contains data-dependent terms. When lm()
was written about 10 years ago, it was very difficult to decide which
was better for prediction: a general solution that works reasonably
well all the time, or a specific solution (such as the one presented
here) that works perfectly well most of the time. The former option
was chosen. We believe it would be a good idea for the latter option
to be made available too; the smartpred library is a proposed
solution. The solution presented here is not foolproof; see
Question 11 for what can go wrong. We believe the smartpred library
is likely to be welcomed by all users who know enough about formulas
and programming to use it correctly.

----------------------------------------------------------------

3. Question. Can you illustrate the problem?

Answer. Try the following in S-PLUS:

    n <- 20
    set.seed(86)   # For reproducibility of the random numbers
    x <- sort(runif(n))
    y <- sort(runif(n))
    fit <- lm(y ~ ns(x, df=5))
    plot(x, y)
    lines(x, fitted(fit))
    newx <- seq(0, 1, len=n)
    points(newx, predict(fit, data.frame(x=newx)), type="b", col=2, err=-1)

The two curves won't be the same! You can repeat the above with "ns"
replaced by the "bs" and "poly" functions; the results will show the
same problem. scale(x) won't work either.

Unfortunately, glm() does not fare any better:

    y <- round(sort(runif(n)) * 10)   # Make sure they are integers
    fit <- glm(y ~ poly(x, 2), poisson)
    plot(x, log(y))
    lines(x, predict(fit))
    points(newx, predict(fit, data.frame(x=newx)), type="b", col=2, err=-1)

----------------------------------------------------------------

4. Question. Which functions are faulty?

Answer. S-PLUS's lm() and glm() have the problem. This was also true
for R 1.5.1 and earlier. R 1.6.0 and later provides a solution, but
it is not a complete one (see Question 13). The smartpred library
fixes the problem.
The result is that, with the smartpred library, prediction with both
lm() and glm() should now be correct. The specific functions that
need modification are found in files such as smart151.R (R 1.5.1) or
smart60.q (S-PLUS 6.0). They include lm(), glm(), predict.lm(),
predict.mlm(), bs(), ns(), poly() [via poly.raw()], scale() and/or
scale.default().

----------------------------------------------------------------

5. Question. How can I get smartpred running?

Answer. If you are running Linux or Unix, and using R or S-PLUS,
then provision is made to install it as a library. Otherwise, you
can simply source the appropriate file, depending on your version:

    smart60.q      S-PLUS 6.0
    smart151.R     R 1.5.1 or higher

For example, if you use S-PLUS 6.0, typing something like

    source("smart60.q")

will read in the modified functions. You will get warning statements,
which you can ignore. We haven't taken the time to provide scripts to
install smartpred as a library for Windows. Sorry!

You should run the code in Question 3 above to check that you get
the right answers with smartpred.

----------------------------------------------------------------

6. Question. Explain why the problem is present.

Answer. Let's suppose you fit a model using lm() or glm(), e.g.,

    fit <- lm(y ~ poly(x, 3))

If you then predict from the model on new data, e.g.,

    p <- predict(fit, newdat)

then poly(x, 3) gets re-evaluated the second time on newdat$x. The
result p is wrong because "poly" is a data-dependent function (see
Chambers and Hastie, pp. 108, 288): its results depend in a
complicated way on the specific data input to it. Other functions to
watch out for are "bs" (B-splines), "ns" (natural splines), and
"scale" (scale a vector or the columns of a matrix). For "bs" and
"ns", the choice of knot locations is data-dependent; for "scale",
the center and scale parameters are data-dependent.
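The data-dependence of scale() is easy to see in a few lines of base
R. (This is a small stand-alone demonstration, not part of smartpred.)
Its centering parameter is recomputed from whatever data it is given:

```r
# scale() recomputes its parameters from its input, so the "same"
# term evaluates differently on the original and the new data.
x    <- c(1, 2, 3, 4, 5)   # original data
newx <- c(4, 5, 6)         # new data, as at prediction time

attr(scale(x),    "scaled:center")   # 3
attr(scale(newx), "scaled:center")   # 5
```

The two calls use different centers (and likewise different scales),
so a model matrix built from scale() at prediction time does not match
the one used at fitting time.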
In general, incorrect predictions will occur with any formula
containing a data-dependent term. Here are some other examples of
models for which prediction will be wrong:

    fit1 <- lm(y ~ I((x-mean(x))/sqrt(var(x))))
    fit2 <- lm(y ~ I((x-min(x))^2))

In fact, any function such as min(), max(), var(), range(), or
functions computing quantiles etc. will give incorrect predictions.
This is because their values, when predicting on new data, will
differ from their values on the original data. Consequently, the
fitted regression coefficients (e.g., fit$coefficients) no longer
have meaning for the new data set being used for prediction.

----------------------------------------------------------------

7. Question. How does smartpred fix up the problem?

Answer. When a data-dependent function such as poly() is first used
in lm() on the original data, its data-dependent "parameters" are
saved somewhere. Just before lm() finishes, the parameters are
retrieved and attached to the object (in, e.g.,
fit$smart.prediction). Thus the parameters are saved on the object,
much like contrasts. They are re-used when predictions are made from
the model: at prediction time, the parameters from the object are
copied to the somewhere, and when poly() is run the second time, it
reads the original parameters from the somewhere and makes use of
them. The "somewhere" is a specified location, and specially
reserved variables are used throughout by smartpred.

We call the whole scheme "smart" prediction. It requires the
data-dependent functions (such as poly(), scale(), bs(), and ns())
to be "smart" functions. They must be smart enough to know about
writing to and reading from the "somewhere". Furthermore, they must
act 'normally' when used outside functions such as lm().
Additionally, modelling functions such as lm() and glm() must be
able to set things up and attach the parameters to the object.
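The write/read scheme just described can be sketched in a few lines
of R. This is only an illustration of the idea, not the smartpred
implementation: the store .smart.store and the function
smart.standardize() below are made up for this example.

```r
# Sketch of smart prediction: at fit time, a data-dependent function
# writes its parameters to a reserved place (the "somewhere"); at
# prediction time, it reads them back instead of recomputing them.
.smart.store <- new.env()   # hypothetical stand-in for the "somewhere"

smart.standardize <- function(x, mode = c("fit", "predict")) {
  mode <- match.arg(mode)
  if (mode == "fit") {
    parms <- list(center = mean(x), scale = sd(x))
    assign("parms", parms, envir = .smart.store)   # write at fit time
  } else {
    parms <- get("parms", envir = .smart.store)    # read at predict time
  }
  (x - parms$center) / parms$scale
}

x    <- c(1, 2, 3, 4, 5)
newx <- c(4, 5, 6)
z.fit <- smart.standardize(x, "fit")
z.new <- smart.standardize(newx, "predict")  # uses mean(x) and sd(x),
                                             # not mean(newx) and sd(newx)
```

Because the prediction-time call re-uses the fit-time parameters, the
new data are placed on the same scale as the original data, which is
exactly what correct prediction requires.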
Also, prediction functions such as predict.lm() and predict.glm()
must be able to write the parameters to the somewhere before actually
doing the prediction.

----------------------------------------------------------------

8. Question. I notice that lm() and glm() have smart arguments,
e.g., lm(..., smart=T). Why is this so?

Answer. There may be times when you are certain that smart prediction
is not needed or wanted; for example, when you write a function that
calls lm(). In such cases, the argument gives you some control over
this feature. In general, users should not need to use this argument.

----------------------------------------------------------------

9. Question. What about the S-PLUS gam() function?

Answer. We have not made any modifications to gam(). The function
predict.gam() uses the method of "safe" prediction, which is a little
difficult to explain (see Chambers and Hastie, pp. 288-289). The
function predict.gam() works on non-GAM objects too.

Smart prediction has the following advantages over safe prediction:
(a) the prediction is faithful to the original model, and
(b) it is faster and more direct, because a second model need not be
    fitted.
On the other hand, safe prediction will never fail, while smart
prediction will only work if one remembers to "smarten" one's
functions. Some functions will be more difficult to smarten than
others, such as a function for generating some weird new basis set
which a friend gave you, whose workings you have no clue about (nor
want to know) but would like to test out. See the Technical Guide
for details about smartening up a function.

It should be noted that the functions vglm() and vgam(), which are
multivariate versions of glm() and gam(), are currently being
written. They run under S-PLUS 6.0 and R 1.5.1, and the latest
version is available starting at
http://www.stat.auckland.ac.nz/~yee. They feature smart prediction.

----------------------------------------------------------------

10. Question.
In R, why do I have to type library(splines) before
library(smartpred) before I use bs() or ns()? In S-PLUS, why do I
have to type library(smartpred, first=T)?

Answer. In R, bs() and ns() call compiled C code, which has to be
loaded in; typing library(splines) does that. Because one must use
the smart versions of bs() and ns(), typing library(splines) before
library(smartpred) ensures the smart versions are used. In S-PLUS,
using first=T ensures that all smartpred functions are used instead
of the previous ones.

----------------------------------------------------------------

11. Question. Are there any dangers with smart prediction?

Answer. Yes. Smart prediction has a lurking danger: users can get
lulled into complacency, let an unsmart function slip in unnoticed,
and make errors without being aware of it. For example, a formula in
lm() involving ns() will predict correctly only if the smart
versions of lm(), predict.lm() and ns() are used. If one of these
functions is accidentally deleted, or the smartpred library is not
read in, then users will unwittingly use the non-smart versions.

A careful examination of the fitted object's smart.prediction
component is one way of checking, but this is error-prone if there
are many ns(), bs() and poly() terms in the formula. Another way of
checking is to predict on several subsets of the original data and
compare the results to the original fitted values. However, this can
be a little laborious.

----------------------------------------------------------------

12. Question. How can I know if a function is smart?

Answer. The function is.smart() is provided for this. Its usage is,
e.g., is.smart(bs), which returns TRUE or FALSE. To check that smart
prediction works, it is sufficient to check that the modelling
function, the prediction function and the data-dependent functions
are all smart; for example, is.smart(glm), is.smart(predict.glm) and
is.smart(poly). If all three are TRUE then smart prediction should
work with glm(y ~ poly(x, 3), binomial), say.
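The subset check suggested in Question 11 can be sketched in a few
lines of R. (In R 1.6.0 and later, the simple poly() term below is
reconstructed correctly, so the check passes; in S-PLUS or R 1.5.1
without smartpred it would expose the problem.)

```r
# Empirical smartness check: predict on a subset of the original
# data and compare with the corresponding fitted values.
set.seed(86)
n <- 20
x <- sort(runif(n))
y <- x + rnorm(n, sd = 0.1)
fit <- lm(y ~ poly(x, 2))

idx <- 1:5
p  <- predict(fit, data.frame(x = x[idx]))
ok <- isTRUE(all.equal(as.vector(p), as.vector(fitted(fit)[idx])))
ok   # TRUE only if the data-dependent term was reconstructed correctly
```

A mismatch here means some function in the chain was not smart, so
the data-dependent term was re-evaluated on the subset instead of
being reconstructed from the fit-time parameters.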
Be careful, however: in R 1.5.1, scale() is a generic function that
calls scale.default(), so it is scale.default() that must be smart,
and is.smart(scale) cannot be TRUE in general. In S-PLUS 6.0, poly()
eventually calls poly.raw(), so both are assigned to be smart.

----------------------------------------------------------------

13. Question. Doesn't R 1.6.0 remove the need for smart prediction?

Answer. No. R 1.6.0, which was released on 1 October 2002, implements
an elegant and intelligent version of safe prediction, but it doesn't
always work. It works for individual terms such as bs(x, 3),
scale(x), ns(x, 3) and poly(x, 3), but it fails when such a term is
used inside any other function. For example, the terms
cos(scale(x)), bs(scale(x), 3), ns(scale(x), 4) and scale(scale(x))
will fail. It will even fail for I(scale(x))! Smart prediction is
probably the only complete solution.

----------------------------------------------------------------

Acknowledgements:
The smartpred library implements an algorithm devised by John
Chambers and Trevor Hastie at the time lm() and glm() etc. were
written. We also thank Drs W. N. Venables and B. D. Ripley for
helpful comments.

----------------------------------------------------------------

Authors: Thomas Yee (yee@stat.auckland.ac.nz) and
Trevor Hastie (hastie@stat.Stanford.EDU).
Please send any suggestions and bug reports to Thomas.
The smart prediction library is available at
http://www.stat.auckland.ac.nz/~yee

This document last modified: 9 October 2002

References.
===========

Chambers, J. M. and Hastie, T. J. (eds.) (1992) Statistical Models
in S. Wadsworth and Brooks/Cole, Pacific Grove, CA.