As a life actuary working mainly in valuations and financial reporting in a very underdeveloped insurance market, I never get to practice data analysis, yet I hear about it more than I would like. Buzzwords like “Big Data” and “R programming” only add to that sense of inferiority.
As my response to this growing skillset deficiency, I have resolved to revisit statistics through Edward W. Frees's book on regression applications in actuarial science and finance, with the aid of R. My first impression of R is that it's extremely convenient and quick to use, with just a few lines of code needed for all kinds of statistical work. As for the book, it's extremely relatable to the usual problems in actuarial science.
Although I am no stranger to regression analysis, I am out of practice, and so I found myself fascinated by things as simple as scatter plots and QQ plots. The book teaches these concepts through a dataset used to study what influences lottery sales in the state of Wisconsin. The dataset consists of average lottery sales (SALES) over a 40-week period, April 1998 through January 1999, from fifty randomly selected areas identified by postal (ZIP) code within the state of Wisconsin. The example draws conclusions from a regression of sales on the population of each area.
A scatter plot is the natural first step towards visualizing correlations, and adding a regression line is just as straightforward. The R² statistic of 78.5% indicates a strong fit: 78.5% of the total variation in sales is explained by the regression line.
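To make the R² figure less of a black box, here is a minimal sketch of what the statistic measures. It is written in Python rather than R, and the `population` and `sales` numbers are made up for illustration; they are not the Wisconsin data, and the helper names `fit_line` and `r_squared` are my own.

```python
# Synthetic (population, sales) pairs; NOT the Wisconsin lottery data.
population = [1000, 2500, 4000, 6000, 9000, 12000, 15000, 20000]
sales = [900, 1700, 3100, 3900, 6500, 7800, 10400, 12900]


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def r_squared(x, y, b0, b1):
    """Share of the total variation in y explained by the fitted line."""
    y_bar = sum(y) / len(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_resid / ss_total


b0, b1 = fit_line(population, sales)
r2 = r_squared(population, sales, b0, b1)
print(f"intercept={b0:.2f}, slope={b1:.4f}, R^2={r2:.3f}")
```

In R, the built-in `lm()` function performs the same fit, and `summary()` on the fitted model reports R² directly.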
However, it was upon reaching the section on residual analysis that I really began to learn something. Residual analysis is the exercise of checking the residuals (observed values minus fitted values) for patterns.
Figure: Scatter plot with regression line, lottery sales vs population
The book defines five types of model misspecification:
- Lack of Independence;
- Heteroscedasticity: When variability varies by observation;
- Relationships between Model Deviations and Explanatory Variables;
- Non-normal Distributions;
- Unusual Points.
“An observation that is unusual in the vertical direction is called an outlier. An observation that is unusual in the horizontal direction is called a high leverage point. An observation may be both an outlier and a high leverage point.”
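The distinction in that quote can be checked numerically. The sketch below, in Python on made-up data (not the lottery data), computes the leverage and the standardized residual of each point in a one-variable regression. The cutoffs used, a standardized residual above 2 in absolute value and a leverage above three times the average, are common rules of thumb that I am assuming here, and the function name `diagnostics` is my own.

```python
import math

# Synthetic data, NOT the lottery data: index 4 is off the line vertically,
# index 9 sits far to the right of the other x values.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
y = [2.1, 3.9, 6.2, 7.8, 18.0, 12.1, 13.9, 16.2, 17.8, 60.0]


def diagnostics(x, y):
    """Return (leverages, standardized residuals) for a simple linear fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
    # Leverage in simple regression: 1/n + (x_i - x_bar)^2 / Sxx.
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    std_resid = [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, lev)]
    return lev, std_resid


leverage, std_resid = diagnostics(x, y)
n, k = len(x), 1                # k = number of explanatory variables
lev_cutoff = 3 * (k + 1) / n    # assumed rule of thumb: 3 x average leverage
outliers = [i for i, r in enumerate(std_resid) if abs(r) > 2]
high_leverage = [i for i, h in enumerate(leverage) if h > lev_cutoff]
print("outliers (vertical):", outliers)
print("high leverage (horizontal):", high_leverage)
```

The point at index 9 lies almost exactly on the line, so its residual is small even though its leverage is large, which is why the two diagnostics have to be looked at separately.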
The book introduces the concept of sensitivity testing in regression analysis: checking how much the results change when an unusual point is left out. Here the unusual point is an outlier in the data. The outlier ZIP code represents the village of Bristol in Kenosha County, a very small village indeed, with an estimated population of a little over 5,000 in 2016. The writer believes that residents of neighboring areas bought lottery tickets in this village, pushing sales unusually high.
Figure: QQ plots of the residuals before and after removal of the outlier
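The before-and-after comparison can be sketched in a few lines. The Python example below uses made-up data with one deliberate outlier (not the lottery data), refits the line with and without that point, and computes the coordinates a QQ plot would show. The plotting positions `(i + 0.5) / n` are one common convention that I am assuming; other software may use slightly different ones. The names `fit` and `qq_points` are my own.

```python
from statistics import NormalDist

# Synthetic data, NOT the lottery data; index 5 is a deliberate outlier.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.2, 3.9, 6.1, 8.0, 9.8, 25.0, 14.1, 15.9, 18.2, 20.0]


def fit(x, y):
    """Return (intercept, slope, R^2, residuals) for a least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    r2 = 1 - sum(e ** 2 for e in resid) / ss_total
    return b0, b1, r2, resid


def qq_points(resid):
    """Pair standard normal quantiles with the sorted residuals."""
    n = len(resid)
    z = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(z, sorted(resid)))


# Sensitivity test: fit once on all points, once with the outlier removed.
full = fit(x, y)
keep = [i for i in range(len(x)) if i != 5]
reduced = fit([x[i] for i in keep], [y[i] for i in keep])

print(f"with outlier:    slope={full[1]:.3f}, R^2={full[2]:.3f}")
print(f"without outlier: slope={reduced[1]:.3f}, R^2={reduced[2]:.3f}")
for z, e in qq_points(reduced[3]):
    print(f"normal quantile {z:+.3f}  residual {e:+.3f}")
```

If the residuals are roughly normal, the printed pairs fall close to a straight line when plotted. A single point far from that line in the full fit is the kind of departure the outlier produces.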
There is a long way to go and I have barely scratched the surface. However, my experience with R so far has been a very pleasant one, and in the process I am closing some very embarrassing gaps in my knowledge through hands-on practice.