# GAITD Regression Homepage

Welcome to the GAITD regression homepage. Some resources are placed here to help users explore and become familiar with the methodology.

### Notices

• Please be patient as more material is being added all the time!

### Resources

• The main paper is "Generally altered, inflated, truncated and deflated regression" by Yee, T. W. and Ma, C., Statistical Science (in press). The software is available in the VGAM R package on CRAN. Please see GAITD_Tips.pdf for what parts of the package need to be mastered. Some example files: gaitdexample1.R, gaitdexample2.R.
• Looking for some tips and a strategy for GAITD regression? Here is a short PDF summary. It was excerpted from the supplementary files of a manuscript currently in preparation.
• Here is the shiny app for you to interactively explore the GAITD 'combo' probability mass function. Note that VGAM 1.1-10 on CRAN fixes some bugs in the functions and this version is used by the app. Thanks to Simon Urbanek for setting things up in RCloud.

### FAQ

• What is GAITD regression? The method combines alteration (hurdle model), inflation, deflation and truncation into one super mixture model. Models such as the zero-inflated Poisson (ZIP), zero-altered Poisson (ZAP) and zero-truncated binomial (ZTB) are well-known special cases; they treat 0 as the sole special value. GAITD regression allows many special values: some may be altered, others inflated or deflated, and others truncated. Furthermore, all but truncation come in two forms: parametric and nonparametric. Nonparametric means the special values follow no pattern or structure; they are simply there as they are. Parametric means they come from the same distribution as the parent (base) distribution.

Here is an example. The Flamingo Hotel is located on the beach in southern Sardinia, about 40 kilometers from Cagliari. Of interest is the length of stay (LOS). The spikeplot on the right shows spikes at 7 and 14 days. Two reasons for this are obvious: people book in units of weeks for convenience, and they take advantage of weekly specials such as staying 7 nights for the cost of 6. A long-tailed distribution such as the logarithmic or zeta could be used as the parent, with the spikes modelled by parametric inflation. Probably one of the values 1 and 2 needs to be adjusted relative to the other.
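To make parametric inflation concrete, here is a minimal Python sketch of a logarithmic parent with extra mass at 7 and 14, in the spirit of the length-of-stay example. It is not the VGAM implementation. The function names, the `shape` value of 0.9 and the inflation probability of 0.2 are illustrative assumptions, not estimates from the hotel data. The spike component is assumed to be the parent distribution restricted to the inflated values and renormalized.

```python
import math


def log_pmf(y, shape):
    """Logarithmic (log-series) PMF on y = 1, 2, ...; requires 0 < shape < 1."""
    if y < 1:
        return 0.0
    return -(shape ** y) / (y * math.log1p(-shape))


def gi_log_pmf(y, shape, inflate=(7, 14), pstr=0.2):
    """Assumed mixture: (1 - pstr) * parent + pstr * (parent restricted to `inflate`)."""
    parent = log_pmf(y, shape)
    # Normalizing constant of the parent restricted to the inflated values.
    norm = sum(log_pmf(s, shape) for s in inflate)
    spike = parent / norm if y in inflate else 0.0
    return (1.0 - pstr) * parent + pstr * spike


# Compare parent and inflated probabilities around the two spikes.
for y in (6, 7, 8, 13, 14, 15):
    print(y, round(log_pmf(y, 0.9), 4), round(gi_log_pmf(y, 0.9), 4))
```

Values outside the inflated set are scaled down by `1 - pstr`, while 7 and 14 receive the extra mass, so the result is still a valid PMF.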

• GAITD regression is for count responses. What about continuous responses? Extensions to some continuous distributions are currently being developed. I hope to report back with some theory and software soon.
• How can I choose between alteration, inflation and deflation? Two suggestions: firstly, what exactly is the research question? While generally-altered regression explains why observations are there, generally-inflated regression accounts for why they are there in excess, and generally-deflated regression explains why observations are not there. Secondly, can the spikes/dips be explained? Sometimes the analyst knows the reason for the feature, for example, the dip is because certain types of people are missing. In this case, deflation would be better than alteration. Please see GAITD_Tips.pdf for more information.
• Why is GAITD regression useful for heaped data? Heaping is a form of measurement error. For example, ask a smoker 'how many cigarettes did you smoke last week?' and multiples of 5 are often the more common responses. If the spikes have a similar distribution to the other values, one might assume they come from the same distribution with possibly different parameter values. They are 'added' on top of the main distribution by inflation. GAITD regression allows estimation of the underlying distribution(s). And if the values immediately adjacent to the heaping points are examined, they are often noticeably less frequent than the main distribution would suggest. They are called seeped values, and deflation may be the best way to handle these. Instead of spikes, they appear as dips (or holes).

Here is an example. The figure on the left is a plot of the self-reported ages at which smokers quit their habit. The data come from a large cross-sectional health study from the mid-1990s in New Zealand. Many quitters report an age that is a multiple of 5, such as 30, 35, 40, 45, 50. Values such as 29 and 31 are seeped. Many of the spikes can be treated as coming from a (second) negative binomial distribution, which is also a suitable parent. It transpires that a GI-NBD can give a reasonable fit. In general, heaping is a common problem in surveys where self-reported data are collected. There are many examples from the literature, e.g., income, age, household expenditure, working hours. Most respondents know the true value only approximately, so the response is contaminated by measurement error. GAITD regression holds promise for heaped and seeped data.
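The heaping and seeping mechanism described above can be mimicked with a small Python sketch. It is a toy data-generating model, not a GAITD fit. True values are assumed to follow a negative binomial distribution, and a respondent whose true value is one unit away from a multiple of 5 is assumed to report that multiple with probability `round_prob`. The function names and the parameter values (`mu`, `size`, `round_prob`) are hypothetical and are not estimated from the New Zealand data.

```python
import math


def nb_pmf(y, mu, size):
    """Negative binomial PMF with mean `mu` and size (dispersion) `size`."""
    if y < 0:
        return 0.0
    log_p = (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
             + size * math.log(size / (size + mu))
             + y * math.log(mu / (size + mu)))
    return math.exp(log_p)


def reported_pmf(y, mu, size, round_prob):
    """PMF of the reported value under the assumed rounding mechanism."""
    true = nb_pmf(y, mu, size)
    if y % 5 == 0:
        # Heaped value: gains mass from its two immediate neighbours.
        neighbours = nb_pmf(y - 1, mu, size) + nb_pmf(y + 1, mu, size)
        return true + round_prob * neighbours
    if y % 5 in (1, 4):
        # Seeped value: loses the mass that was rounded away.
        return (1.0 - round_prob) * true
    return true  # values two units from a multiple of 5 are unaffected


for y in range(37, 44):
    print(y, round(nb_pmf(y, 40.0, 10.0), 4),
          round(reported_pmf(y, 40.0, 10.0, 0.5), 4))
```

Under these assumptions the reported distribution shows a spike at 40 and dips at 39 and 41, which is the pattern that inflation and deflation are meant to capture.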

• Are there any dangers in fitting a GAITD regression model? Because the combo PMF is extremely flexible, overfitting is the biggest problem to avoid. It is the most common mistake novices make, so a suggestion is: don't sweat the small stuff. A slightly underfitted model is better than a slightly overfitted model.
• What parent distributions are implemented in software? Currently four: the Poisson, logarithmic, zeta and negative binomial. Once the bugs have been ironed out, it is planned that more distributions will be added.
• Why allow general truncation? The zero-truncated (positive) Poisson distribution is well known, and is applied to data such as crowd sizes and the number of items purchased at a supermarket. On the right is a photo inside a lift in Shanghai. If you were sampling the floors that people exited from, certain values would be structurally missing. Can you see which ones? There are good reasons for at least two of the values: fear of 4 and fear of 13 are called tetraphobia and triskaidekaphobia, respectively. You never know when you might encounter a distribution whose support excludes certain values other than 0. With GAITD regression it is theoretically possible to truncate any set of values between 0 and infinity in unit steps. Of course, this includes truncating the upper and lower tails of the distribution.
• I emailed Thomas Yee for help but did not get a reply. Is there a reason? Because of the large volume of email, please only send in bug reports. A short reproducible example is good for these. Thanks.
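Returning to the general truncation question above, here is a minimal Python sketch of a generally truncated Poisson PMF: the parent mass on the truncated values is removed and the remainder is renormalized. The truncated set {4, 13} covers only the two values named in the lift example, the rate 8.0 is hypothetical, and the function and argument names are illustrative rather than VGAM's API.

```python
import math


def pois_pmf(y, lam):
    """Poisson PMF on y = 0, 1, 2, ...; requires lam > 0."""
    if y < 0:
        return 0.0
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def gt_pois_pmf(y, lam, truncate=(), max_support=None):
    """Poisson PMF with the values in `truncate` (and any above `max_support`) removed."""
    if y < 0 or y in truncate:
        return 0.0
    if max_support is not None and y > max_support:
        return 0.0
    if max_support is None:
        # Mass kept = 1 minus the parent mass on the truncated values.
        kept = 1.0 - sum(pois_pmf(t, lam) for t in set(truncate))
    else:
        # Upper tail truncated as well: sum the parent over the retained support.
        kept = sum(pois_pmf(v, lam) for v in range(max_support + 1)
                   if v not in truncate)
    return pois_pmf(y, lam) / kept


for y in (3, 4, 5, 12, 13, 14):
    print(y, round(gt_pois_pmf(y, 8.0, truncate=(4, 13)), 4))
```

Lower-tail truncation fits the same pattern, for example `truncate=range(0, 3)`, and the relative probabilities of the retained values are unchanged by truncation.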