Estimating dispersion and testing goodness-of-fit from past data

The functions est.dispersion and test.fit use past data to calibrate the PROJMAN system: est.dispersion estimates the dispersion of a chosen probability distribution, and test.fit checks whether that distribution adequately fits the past data.

Colour coding:
                              # comments
                              > user commands
                              computer output

# First follow the procedure outlined here to make the functions visible from within R.

# Source the functions and example data into R :

> source("est.dispersion")
> source("test.fit")
> pastdata <- read.table("pastdata", header=T)

# view the example past data:

> pastdata

         name  estimated  observed
  interpreter         40        36
    photoedit         20        23
   insurecalc         10        16
 fingerprints         20        15
 graphologist         20        35
    archivist         40        50

etc.

# Interpretation of the columns:
# name: the name of project i.
# estimated: estimated total time, in person-days, required for completion of project i from start to finish.
# observed: observed total time, in person-days, required for completion of project i from start to finish.

# The estimated times could be obtained either from guesswork before the projects started,
# or as the output from a predictive model.
# The projects have been completed, so the observed times are known.
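# As a quick sanity check on the past data, one can look at the ratio of observed to estimated
# times; values spread around 1 suggest the estimates are roughly unbiased.
# (An illustrative command only, not part of the PROJMAN functions:)

> summary(pastdata$observed / pastdata$estimated)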



# The EST.DISPERSION function is most appropriate when the estimated efforts were
# obtained by guesswork before the projects started. If they were obtained as the output
# from a regression model, the dispersion would normally be estimated as part of the model,
# rendering separate estimation unnecessary.

# Apply the EST.DISPERSION function to the past projects in pastdata.
# The syntax is:
#
#                                est.dispersion(obs, est, distname)
#
# obs is the set of observed times (here extracted as pastdata$observed);
# est is the set of estimated times (here extracted as pastdata$estimated);
# distname is the selected distribution name (must be in quotes).

# distname should be one of: "norm" (Normal), "gamma" (Gamma), "lnorm" (Lognormal),
# or "lnorm.med" (Lognormal, median specification).
# If the Poisson or Chisquare distributions are to be used, no estimate of dispersion is necessary.
# dispersion is the variance for the Normal option, the variance/(mean squared) for the Gamma
# and Lognormal options, and the variance/(median squared) for the Lognormal-median option.
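# For intuition, crude moment-based versions of these dispersion quantities can be written
# directly in R. This is only a sketch of the quantities being estimated; est.dispersion itself
# may use a different (e.g. maximum likelihood) estimator, so the numbers need not agree exactly.
# rough.disp.norm approximates the variance; rough.disp.gamma approximates variance/(mean squared).

> rough.disp.norm  <- function(obs, est) mean((obs - est)^2)          # illustrative sketch only
> rough.disp.gamma <- function(obs, est) mean(((obs - est) / est)^2)  # illustrative sketch only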

# EXAMPLE:
# estimate the dispersion for the example data pastdata,
# assuming that the effort of each project is Gamma with common unknown dispersion,
# and mean given by the estimated column in pastdata.
# Store the result with name disp.

> disp <- est.dispersion(obs=pastdata$observed, est=pastdata$estimated, distname="gamma")

Estimated dispersion: 0.0459748



# Now apply the TEST.FIT function to check whether this distribution is an adequate
# fit to the past data. The function performs Chisquare and Kolmogorov-Smirnov
# goodness-of-fit tests.

# The hypothesized distribution for project i is distname with mean est[i] and dispersion as specified.
# The observation is obs[i].

# For the Chisquare test, observations are grouped according to certain specified boundaries,
# and the number of observations in each grouping is compared with that expected if the
# hypothesized distribution is correct.
# For the Kolmogorov-Smirnov (K-S) test, each observation is transformed by its hypothesized
# distribution function, so that under the hypothesis the transformed values are Uniform(0, 1);
# their empirical distribution function is then compared against the Uniform(0, 1) distribution
# function.
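# For illustration, this transformation can be sketched in plain R for the Gamma case with mean
# est[i] and dispersion disp. The shape/rate parameterization below is an assumption about how
# the Gamma distribution is set up, not the internal code of test.fit:

> # a Gamma distribution with mean m and dispersion d has shape = 1/d and rate = 1/(d*m)
> u <- pgamma(pastdata$observed, shape = 1/disp, rate = 1/(disp * pastdata$estimated))
> ks.test(u, "punif")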

# The syntax is:
#
#                                test.fit(obs, est, probs, distname, dispersion, n.param.est, central)
#
# obs, est, and distname are as above.
# The options "pois" (Poisson) and "chisq" (Chisquare) are now allowed for distname,
# in addition to "norm", "gamma", "lnorm", and "lnorm.med".
# dispersion is the dispersion, if appropriate, usually estimated from the data as above.

# probs specifies the probability regions by which the observations obs are grouped for the Chisquare test:
# e.g. if probs=c(0.25, 0.5, 0.75) (the default), then the first grouping comprises observations
# that lie in the first 25% probability region of their hypothesized distribution;
# the second grouping comprises those outside the first 25% region but inside the first 50% region,
# and so on.
# If central=TRUE, the probability regions are centered on the distribution median, so
# the first 25% probability region is the 25% of the distribution closest to the median.
# If central=FALSE (the default), the probability regions are not centered, so that the first 25%
# probability region is the lower 25% tail, and so on.
# It is generally better to use central=FALSE, because centralizing means that a surplus in the
# upper tail can cancel out a deficit in the lower tail, so a bad fit might not be detected.
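# Continuing the sketch above, the non-centralized grouping with the default probs can be
# mimicked by cutting the transformed values u at those probabilities (again an illustration
# only, not the internal code of test.fit):

> table(cut(u, breaks = c(0, 0.25, 0.5, 0.75, 1)))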

# n.param.est specifies the number of parameters that have been estimated from the data,
# and it must be supplied even if 0.
# If the estimated values est were derived by guesswork only, then n.param.est should be 0 for the
# Poisson and Chisquare distributions, and 1 for the Normal, Lognormal, and Gamma distributions
# if the dispersion was estimated from the data as above.
# If the estimated values est were derived from the data, add one to n.param.est for each parameter involved:
# e.g. add one to n.param.est if est = mean(obs),
# or add two to n.param.est if est is the output from a regression model with two parameters, as in
# est = alpha + beta * (some measurement).     (Here alpha and beta are the parameters).

# THEORETICAL DETAILS:
# Reference: Kendall, M.G. and Stuart, A. (1979) Advanced Theory of Statistics, Vol 2, 4th Edition,
# sections 30.1 to 30.34.

# Function TEST.FIT uses the likelihood ratio test statistic rather than the Pearson X^2:
# the likelihood ratio statistic should be more accurate although it is less commonly used (see ref).

# The asymptotic distribution of the test statistic is bounded between the Chisquare distribution
# with df = (ngroups - n.param.est - 1) and the Chisquare distribution with df = (ngroups - 1)
# (see ref).
# This means the p-value must be calculated for both of these distributions, so the test reports
# a range of p-values rather than a single value.
# In addition, a reasonably large sample is needed for the asymptotic distribution to apply:
# the function rather arbitrarily warns when the sample size is less than 25.
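# For grouped counts, the likelihood ratio statistic is
#      G^2 = 2 * sum( Actual.count * log( Actual.count / Expected.count ) ),
# and with ngroups groups the two bounding p-values are
#      pchisq(G^2, df = ngroups - n.param.est - 1, lower.tail = FALSE)   and
#      pchisq(G^2, df = ngroups - 1, lower.tail = FALSE).
# As an illustrative sketch (not the internal code of TEST.FIT, and assuming all actual counts
# are positive), the statistic could be computed as:

> lr.stat <- function(actual, expected) 2 * sum(actual * log(actual / expected))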

# The interpretation of the p-values given by the function is intentionally vague, and the tests
# should be regarded as a guideline, not a definite answer.
# The smaller the p-values, the less likely it is that the hypothesized distribution is correct;
# but exactly how small is small is arguable.

# EXAMPLE:
# Test the Gamma distribution, with dispersion estimated by est.dispersion above and stored in disp,
# on the pastdata data set. The estimated efforts were gained by guesswork, so only the dispersion
# has been estimated from the data and n.param.est=1.

> test.fit(obs=pastdata$observed, est=pastdata$estimated, probs=c(0.25, 0.5, 0.75), distname="gamma",
+ dispersion=disp, n.param.est=1, central=F)

         ===============================
         Chisquare Test: Non-centralized Regions
         ===============================

Lower and Upper denote the probability region bounding the category.

Actual.count is the number of observations that lie outside the
lower% region of their distribution, but inside the upper% region.

Category  Lower  Upper  Expected.count  Actual.count
       1      0     25             7.5             4
       2     25     50             7.5             6
       3     50     75             7.5            10
       4     75    100             7.5            10

Likelihood ratio chisquare test statistic: 3.800691

Compare test statistic against both Chisquare( df = 2 ) and Chisquare( df = 3 ) distributions:
p-value for asymptotic test is between 0.1495170 and 0.2838058

----------------------------------------------------------------------------------------------------

         =====================
         Kolmogorov-Smirnov Test
         =====================

Alternative hypothesis: two.sided
Statistic: D = 0.1951866
p-value: p = 0.2031577

----------------------------------------------------------------------------------------------------

Interpretation: Chisquare Test
=======================
The Gamma distribution specified may be appropriate for these data.

Interpretation: K-S Test
==================
The Gamma distribution specified may be appropriate for these data.

# Interpretation:
# There is no evidence from these tests that the Gamma distribution with this dispersion
# disp=0.0459748 is an inappropriate model.
# The p-values all lie between about 0.15 and 0.28. Visual inspection of the observed and expected
# counts in the table indicates a possible bias towards the upper tails, but there is not enough
# evidence for this to be statistically significant.
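# As a check, the test statistic and p-value bounds reported above can be reproduced with the
# sketch given under THEORETICAL DETAILS:

> lr.stat(c(4, 6, 10, 10), rep(7.5, 4))          # 3.800691
> pchisq(3.800691, df = 2, lower.tail = FALSE)   # 0.1495170
> pchisq(3.800691, df = 3, lower.tail = FALSE)   # 0.2838058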



# Now try the same experiments with the Normal distribution.
# First estimate dispersion:

> disp2 <- est.dispersion(obs=pastdata$observed, est=pastdata$estimated, distname="norm")

Estimated dispersion: 44.6

# Now test the fit:

> test.fit(obs=pastdata$observed, est=pastdata$estimated, probs=c(0.25, 0.5, 0.75), distname="norm",
+ dispersion=disp2, n.param.est=1, central=F)

         ===============================
         Chisquare Test: Non-centralized Regions
         ===============================

Lower and Upper denote the probability region bounding the category.

Actual.count is the number of observations that lie outside the
lower% region of their distribution, but inside the upper% region.

Category  Lower  Upper  Expected.count  Actual.count
       1      0     25             7.5             3
       2     25     50             7.5             9
       3     50     75             7.5             9
       4     75    100             7.5             9

Likelihood ratio chisquare test statistic: 4.34762

Compare test statistic against both Chisquare( df = 2 ) and Chisquare( df = 3 ) distributions:
p-value for asymptotic test is between 0.1137434 and 0.2262917

----------------------------------------------------------------------------------------------------

         =====================
         Kolmogorov-Smirnov Test
         =====================

Alternative hypothesis: two.sided
Statistic: D = 0.2270218
p-value: p = 0.09078339

----------------------------------------------------------------------------------------------------

Interpretation: Chisquare Test
=======================
The Normal distribution specified may be appropriate for these data.

Interpretation: K-S Test
==================
The Normal distribution specified may be appropriate for these data.

# Interpretation:
# Once again, there is no evidence that the Normal distribution with dispersion disp2=44.6
# is an inappropriate model.
# Again, inspection of the printed table suggests that the data might be under-represented in the lower
# tails of the Normal distribution, and over-represented in the upper tails; but the evidence is not
# statistically significant.



# Finally try the same experiments with the Chisquare distribution.
# No estimate of dispersion is needed, so we go straight into test.fit with n.param.est=0.
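# Since a Chisquare distribution's mean equals its degrees of freedom, the hypothesized
# distribution for project i is presumably Chisquare with df = est[i]. Purely as an illustration
# (an assumption about how test.fit sets up the distribution, not its internal code), the
# transformed values used in the earlier sketches would then be:

> u.chisq <- pchisq(pastdata$observed, df = pastdata$estimated)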

> test.fit(obs=pastdata$observed, est=pastdata$estimated, probs=c(0.25, 0.5, 0.75), distname="chisq",
+ n.param.est=0, central=F)

         ===============================
         Chisquare Test: Non-centralized Regions
         ===============================

Lower and Upper denote the probability region bounding the category.

Actual.count is the number of observations that lie outside the
lower% region of their distribution, but inside the upper% region.

Category  Lower  Upper  Expected.count  Actual.count
       1      0     25             7.5             1
       2     25     50             7.5             9
       3     50     75             7.5            13
       4     75    100             7.5             7

Likelihood ratio chisquare test statistic: 12.58729

Compare test statistic against Chisquare( df = 3 ) distribution:
p-value for asymptotic test is 0.005619703

----------------------------------------------------------------------------------------------------

         =====================
         Kolmogorov-Smirnov Test
         =====================

Alternative hypothesis: two.sided
Statistic: D = 0.2416348
p-value: p = 0.06019764

----------------------------------------------------------------------------------------------------

Interpretation: Chisquare Test
=======================
The Chisquare distribution specified is probably not appropriate for these data.

Interpretation: K-S Test
==================
The Chisquare distribution specified may be appropriate for these data.

# Interpretation:
# This time, it looks unlikely that the specified distribution is correct.
# The Chisquare test gives a p-value of about 0.005, and the table shows a clear
# discrepancy between expected and actual counts.
# The K-S test did not detect a statistically significant departure, but the Chisquare
# test provides enough evidence to make it clear that we would be unwise to proceed
# with the Chisquare distribution as our model.


Last updated:  14th February 2003