Colour coding:

# comments

> user commands

computer output

# First follow the procedure outlined here to make the functions visible from within *R*.

# Source the functions and example data into *R* :

> source("est.dispersion")

> source("test.fit")

> pastdata <- read.table("pastdata", header=T)

# view the example past data:

> pastdata

name         | estimated | observed
interpreter  |        40 |       36
photoedit    |        20 |       23
insurecalc   |        10 |       16
fingerprints |        20 |       15
graphologist |        20 |       35
archivist    |        40 |       50
etc.

# Interpretation of the columns:

# *name:* the name of project *i*.

# *estimated:* estimated total time, in person-days, required for completion of project *i* from start to finish.

# *observed:* observed total time, in person-days, required for completion of project *i* from start to finish.

# The *estimated* times could be obtained either from guesswork before the projects started,

# or as the output from a predictive model.

# The projects have been completed, so the observed times are known.

# The EST.DISPERSION function is most appropriate when the estimated efforts were

# obtained by guesswork before the projects started. If they were obtained as the output

# from a regression model, the dispersion would normally be estimated as part of the model,

# rendering separate estimation unnecessary.

# Apply the EST.DISPERSION function to the past projects in *pastdata*.

# The syntax is:

#

#     est.dispersion(obs, est, distname)

#

# *obs* is the set of observed times (here extracted as *pastdata$observed*);

# *est* is the set of estimated times (here extracted as *pastdata$estimated*);

# *distname* is the selected distribution name (must be in quotes).

# *distname* should be one of: "norm" (Normal), "gamma" (Gamma), "lnorm" (Lognormal),

# or "lnorm.med" (Lognormal, median specification).

# If the Poisson or Chisquare distributions are to be used, no estimate of dispersion is necessary.

# The returned *dispersion* is the variance for the Normal option, the variance/(mean squared) for the Gamma

# and Lognormal options, and the variance/(median squared) for the Lognormal-median option.
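# As a rough illustration of what these dispersion quantities mean (this is a naive
# moment-based sketch, NOT the estimator that EST.DISPERSION actually uses), the
# Normal-option dispersion is the mean squared deviation of *obs* about *est*, and the
# Gamma/Lognormal-option dispersion is the same quantity on the relative scale:

```r
# Moment-based sketches of the dispersion quantities described above.
# Illustration only; est.dispersion may use a different (e.g. likelihood-based)
# estimator, so these values will not match its output exactly.
obs <- c(36, 23, 16, 15, 35, 50)   # observed times (first rows of pastdata)
est <- c(40, 20, 10, 20, 20, 40)   # estimated times (first rows of pastdata)

disp.norm  <- mean((obs - est)^2)          # variance about the mean (Normal option)
disp.gamma <- mean(((obs - est) / est)^2)  # variance / mean^2 (Gamma, Lognormal options)
```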

# EXAMPLE:

# estimate the dispersion for the example data *pastdata*,

# assuming that the effort of each project is Gamma with common unknown dispersion,

# and mean given by the *estimated* column in *pastdata*.

# Store the result with name *disp*.

> disp <- est.dispersion(obs=pastdata$observed, est=pastdata$estimated, distname="gamma")

Estimated dispersion: 0.0459748

# Now apply the TEST.FIT function to check whether this distribution is an adequate

# fit to the past data. The function performs Chisquare and Kolmogorov-Smirnov

# goodness-of-fit tests.

# The hypothesized distribution for project *i* is *distname* with mean *est[i]* and dispersion as specified.

# The observation is *obs[i]*.

# For the Chisquare test, observations are grouped according to certain specified boundaries,

# and the number of observations in each grouping is compared with that expected if the

# hypothesized distribution is correct.

# For the Kolmogorov-Smirnov (K-S) test, the observations are transformed so that, under the

# hypothesized distribution, they follow a Uniform(0, 1) distribution; the empirical distribution

# function of the transformed values is then compared against the Uniform(0, 1) distribution function.
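# The probability integral transform behind the K-S test can be sketched as follows.
# The Gamma parameterization here (mean *est[i]*, dispersion = 1/shape) is an assumption
# about the internals, not a statement of how TEST.FIT is actually coded:

```r
# Sketch of the K-S transformation: under the hypothesized Gamma distribution
# with mean est[i] and dispersion d (= variance / mean^2), the values
# u[i] = F(obs[i]) should be Uniform(0, 1) if the model is correct.
obs <- c(36, 23, 16, 15, 35, 50)
est <- c(40, 20, 10, 20, 20, 40)
d   <- 0.0459748                 # dispersion, e.g. from est.dispersion

shape <- 1 / d                   # dispersion = 1/shape for this Gamma parameterization
u <- pgamma(obs, shape = shape, rate = shape / est)  # mean = shape/rate = est
ks.test(u, "punif")              # compare against the Uniform(0, 1) distribution
```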

# The syntax is:

#

#     test.fit(obs, est, probs, distname, dispersion, n.param.est, central)

#

# *obs*, *est*, and *distname* are as above.

# The options "pois" (Poisson) and "chisq" (Chisquare) are now allowed for *distname*,

# in addition to "norm", "gamma", "lnorm", and "lnorm.med".

# *dispersion* is the dispersion, if appropriate, usually estimated from the data as above.

# *probs* specifies probability regions by which the observations *obs* are grouped for the Chisquare test:

# e.g. if *probs*=c(0.25, 0.5, 0.75) (the default), then the first grouping comprises observations

# that lie in the first 25% probability region of their hypothesized distribution;

# the second grouping comprises those outside the first 25% region but inside the first 50% region,

# and so on.

# If *central*=TRUE, the probability regions are centered on the distribution median, so

# the first 25% probability region is the 25% of the distribution closest to the median.

# If *central*=FALSE (the default), the probability regions are not centered, so that the first 25%

# probability region is the lower 25% tail, and so on.

# It is generally better to use *central*=FALSE, because centralizing means that a surplus in the

# upper tail can cancel out a deficit in the lower tail, so a bad fit might not be detected.
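# The non-centralized grouping can be sketched as follows (using the same hypothetical
# Gamma parameterization as in the K-S sketch; the function's internals may differ):

```r
# Sketch of non-centralized Chisquare grouping: transform each observation to its
# probability position u[i] under its hypothesized distribution, then count how
# many fall in each probability region defined by probs.
obs   <- c(36, 23, 16, 15, 35, 50)
est   <- c(40, 20, 10, 20, 20, 40)
d     <- 0.0459748
probs <- c(0.25, 0.5, 0.75)

shape <- 1 / d
u <- pgamma(obs, shape = shape, rate = shape / est)

# Non-centralized regions: (0, 0.25], (0.25, 0.5], (0.5, 0.75], (0.75, 1)
counts <- table(cut(u, breaks = c(0, probs, 1)))
counts
```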

# *n.param.est* specifies the number of parameters that have been estimated from the data,

# and it must be supplied even if 0.

# If the estimated values *est* were derived by guesswork only, then *n.param.est* may be 0 for the

# Poisson and Chisquare distributions, and 1 for the Normal, Lognormal, and Gamma distributions if

# dispersion was estimated from the data.

# If the estimated values *est* were derived from the data, add one to *n.param.est* for each parameter involved:

# e.g. add one to *n.param.est* if *est* = mean(*obs*),

# or add two to *n.param.est* if *est* is the output from a regression model with two parameters, as in

# *est* = *alpha* + *beta* * (some measurement). (Here *alpha* and *beta* are the parameters.)

# THEORETICAL DETAILS:

# Reference: Kendall, M.G. and Stuart, A. (1979) Advanced Theory of Statistics, Vol. 2, 4th Edition,

# sections 30.1 to 30.34.

# Function TEST.FIT uses the likelihood ratio test statistic rather than the Pearson X^2:

# the likelihood ratio statistic should be more accurate, although it is less commonly used (see ref).

# The asymptotic distribution of the test statistic is bounded between the Chisquare distribution

# with df = (ngroups - n.param.est - 1) and the Chisquare distribution with df = (ngroups - 1) (see ref).

# This means we need to find the p-value for both of these distributions, and the output from the

# test is often vague.
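# The bounding p-values are just upper-tail Chisquare probabilities, so they can be
# reproduced with standard R. For instance, with 4 groupings, 1 estimated parameter,
# and a hypothetical test statistic of 3.800691:

```r
# Bounding p-values for the asymptotic test: the statistic is compared against
# Chisquare(df = ngroups - n.param.est - 1) and Chisquare(df = ngroups - 1).
stat        <- 3.800691  # example likelihood ratio statistic
ngroups     <- 4         # e.g. probs = c(0.25, 0.5, 0.75) gives 4 groupings
n.param.est <- 1         # e.g. dispersion estimated from the data

p.lower <- pchisq(stat, df = ngroups - n.param.est - 1, lower.tail = FALSE)
p.upper <- pchisq(stat, df = ngroups - 1, lower.tail = FALSE)
c(p.lower, p.upper)      # the asymptotic p-value lies between these two bounds
```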

# In addition, a reasonably large sample is needed for the asymptotic distribution to apply:

# this function rather arbitrarily gives a warning about sample sizes less than 25.

# The interpretation of the p-values given by the function is intentionally vague, and the tests

# should be regarded as a guideline, not a definite answer.

# The smaller the p-values, the less likely it is that the hypothesized distribution is correct;

# but exactly how small is small is arguable.

# EXAMPLE:

# Test the Gamma distribution, with dispersion estimated by *est.dispersion* above and stored in *disp*,

# on the *pastdata* data set. The estimated efforts were gained by guesswork, so only the dispersion

# has been estimated from the data and *n.param.est*=1.

> test.fit(obs=pastdata$observed, est=pastdata$estimated, probs=c(0.25, 0.5, 0.75), distname="gamma",

  dispersion=disp, n.param.est=1, central=F)

===============================

Chisquare Test: Non-centralized Regions

===============================

Lower and Upper denote the probability region bounding the category.

Actual.count is the number of observations that lie outside the

lower% region of their distribution, but inside the upper% region.

Category | Lower | Upper | Expected.count | Actual.count
       1 |     0 |    25 |            7.5 |            4
       2 |    25 |    50 |            7.5 |            6
       3 |    50 |    75 |            7.5 |           10
       4 |    75 |   100 |            7.5 |           10

Likelihood ratio chisquare test statistic: 3.800691

Compare test statistic against both Chisquare( df = 2 ) and Chisquare( df = 3 ) distributions:

p-value for asymptotic test is between 0.1495170 and 0.2838058

----------------------------------------------------------------------------------------------------

=====================

Kolmogorov-Smirnov Test

=====================

Alternative hypothesis: two.sided

Statistic: D = 0.1951866

p-value: p = 0.2031577

----------------------------------------------------------------------------------------------------

Interpretation: Chisquare Test

=======================

The Gamma distribution specified may be appropriate for these data.

Interpretation: K-S Test

==================

The Gamma distribution specified may be appropriate for these data.

# Interpretation:

# There is no evidence from these tests that the Gamma distribution with this dispersion

# *disp*=0.0459748 is an inappropriate model.

# The p-values all lie between about 0.14 and 0.28. Visual inspection of the observed and

# expected counts in the table indicates a possible bias in the upper tail, but there is not enough

# evidence for this to be statistically significant.

# Now try the same experiments with the Normal distribution.

# First estimate dispersion:

> disp2 <- est.dispersion(obs=pastdata$observed, est=pastdata$estimated, distname="norm")

Estimated dispersion: 44.6

# Now test the fit:

> test.fit(obs=pastdata$observed, est=pastdata$estimated, probs=c(0.25, 0.5, 0.75), distname="norm",

  dispersion=disp2, n.param.est=1, central=F)

===============================

Chisquare Test: Non-centralized Regions

===============================

Lower and Upper denote the probability region bounding the category.

Actual.count is the number of observations that lie outside the

lower% region of their distribution, but inside the upper% region.

Category | Lower | Upper | Expected.count | Actual.count
       1 |     0 |    25 |            7.5 |            3
       2 |    25 |    50 |            7.5 |            9
       3 |    50 |    75 |            7.5 |            9
       4 |    75 |   100 |            7.5 |            9

Likelihood ratio chisquare test statistic: 4.34762

Compare test statistic against both Chisquare( df = 2 ) and Chisquare( df = 3 ) distributions:

p-value for asymptotic test is between 0.1137434 and 0.2262917

----------------------------------------------------------------------------------------------------

=====================

Kolmogorov-Smirnov Test

=====================

Alternative hypothesis: two.sided

Statistic: D = 0.2270218

p-value: p = 0.09078339

----------------------------------------------------------------------------------------------------

Interpretation: Chisquare Test

=======================

The Normal distribution specified may be appropriate for these data.

Interpretation: K-S Test

==================

The Normal distribution specified may be appropriate for these data.

# Interpretation:

# Once again, there is no evidence that the Normal distribution with dispersion *disp2*=44.6

# is an inappropriate model.

# Again, inspection of the printed table suggests that the data might be under-represented in the

# lower tail of the Normal distribution, and over-represented in the upper tail; but the evidence

# is not statistically significant.

# Finally try the same experiments with the Chisquare distribution.

# No estimate of dispersion is needed, so we go straight into *test.fit* with *n.param.est*=0.

> test.fit(obs=pastdata$observed, est=pastdata$estimated, probs=c(0.25, 0.5, 0.75), distname="chisq",

  n.param.est=0, central=F)

===============================

Chisquare Test: Non-centralized Regions

===============================

Lower and Upper denote the probability region bounding the category.

Actual.count is the number of observations that lie outside the

lower% region of their distribution, but inside the upper% region.

Category | Lower | Upper | Expected.count | Actual.count
       1 |     0 |    25 |            7.5 |            1
       2 |    25 |    50 |            7.5 |            9
       3 |    50 |    75 |            7.5 |           13
       4 |    75 |   100 |            7.5 |            7

Likelihood ratio chisquare test statistic: 12.58729

Compare test statistic against Chisquare( df = 3 ) distribution:

p-value for asymptotic test is 0.005619703

----------------------------------------------------------------------------------------------------

=====================

Kolmogorov-Smirnov Test

=====================

Alternative hypothesis: two.sided

Statistic: D = 0.2416348

p-value: p = 0.06019764

----------------------------------------------------------------------------------------------------

Interpretation: Chisquare Test

=======================

The Chisquare distribution specified is probably not appropriate for these data.

Interpretation: K-S Test

==================

The Chisquare distribution specified may be appropriate for these data.

# Interpretation:

# This time, it looks unlikely that the specified distribution is correct.

# The Chisquare test gives a p-value of about 0.006, and the table shows a clear

# discrepancy between expected and actual counts.

# The K-S test did not detect a statistically significant departure, but the Chisquare

# test provides enough evidence to make it clear that we would be unwise to proceed

# with the Chisquare distribution as our model.