Visual Statistics with Multimedia, by David J. Krus
Part 2: Principles of Visual Statistics
Correlation, Regression, and Tests of Statistical Significance
Historically, the concepts of correlation and regression preceded the tests of statistical significance. The transition from the deductive models of data description, typified by the correlation and regression models, to inferential models, typified by the tests of significance, was facilitated by the part of regression theory that underlies the conceptualization of the point biserial coefficient of correlation. The point biserial correlation is conceptually important because it helps in understanding the main principles of the tests of statistical significance, especially how the coefficient of correlation can be used to measure a difference between two means.
The knowledge gained from regression on two categories can be applied directly to the test of the difference between two means. Consider the specification equation for the regression analysis,

Y = Y' + Y^,

which decomposes each observed score Y into a predicted component Y' and an error component Y^ = Y − Y', so that the total variance partitions into predictable and error parts:

s²(Y) = s²(Y') + s²(Y^).
Forming the ratio of the predictable variance to the total variance defines the coefficient of determination,

r² = s²(Y') / s²(Y),

and its one's complement defines the coefficient of alienation,

1 − r² = s²(Y^) / s²(Y).
The Gamma Square Ratio
One may also form the ratio of the predicted to the error variance, which is called the gamma square ratio:

γ² = s²(Y') / s²(Y^).

In correlational notation, using its point-biserial form, the gamma square ratio can also be written as

γ² = r²pb / (1 − r²pb).

In terms of the coefficients of determination and alienation, the gamma square ratio is simply their quotient, r² / (1 − r²). Put most simply, the gamma square ratio can be conceptualized as an information-error ratio.
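As a minimal sketch (Python, not part of the original text), the information-error ratio can be computed directly from a coefficient of determination; the value r² = .60 is taken from the worked body-temperature example later in this section.

```python
# Gamma square (information-error) ratio: predicted variance over
# error variance, expressed through the coefficient of determination.

def gamma_square(r_squared):
    """gamma^2 = r^2 / (1 - r^2): ratio of determined to alienated variance."""
    return r_squared / (1.0 - r_squared)

# r^2 = .60 as in the body-temperature example later in the section
print(round(gamma_square(0.60), 2))  # 1.5
```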
The information-error ratio is primarily a theoretical concept that serves to highlight the fundamental dichotomy between the notion of statistical significance and the strength of the corresponding relationship.
Now we shift our focus from the correlational models, which contrast the determined variance with the alienated variance, to the models testing for statistical significance, which are content with the assertion that a result is unlikely to be random if a sufficient number of subjects responds in the hypothesized direction.
The information-error ratio can be written as

γ² = r² / (1 − r²).

Dividing the denominator of the above equation by N (the total sample size) forms the z square ratio:

z² = r² / ((1 − r²) / N) = N r² / (1 − r²).

Note that the gamma square ratio is independent of N, whereas the z square ratio is weighted by N. In terms of the coefficients of determination and alienation, the z square ratio is thus the information-error ratio multiplied by the total sample size.
The above formula is the prototypical formula for all tests of statistical significance. The square root of the obtained z square ratio can be associated with its corresponding area under the normal distribution, and its significance can be interpreted in terms of the associated probabilities.
The formula for the z test using group means and variances is

z = (M₁ − M₀) / √(s₁²/n₁ + s₀²/n₀).

It provides a link between the correlational methods (r) and statistical methods estimating the probability that the difference between two means (M₁ − M₀) is large enough to be statistically significant.
Randomly select two groups of patients, diagnosed with different diseases. Their body temperatures (in degrees Fahrenheit) are recorded as follows:

Disease 0: 99, 100, 101
Disease 1: 101, 102, 103
The primary function of the general linear model is to predict outcomes if variables related to these outcomes are known and can be quantified. For the example, our task is to predict body temperatures from the type of disease patients are suffering from.
First, create a coded predictor variable X to index the type of disease (Disease 0 and Disease 1). Second, combine body temperatures from two groups into one total group.
The results of the regression analysis can be summarized by the prediction equation

Y' = 100 + 2X,

which returns the group mean of 100 when X = 0 (Disease 0) and 102 when X = 1 (Disease 1).
Examine the Variances of Y, Y' and Y^
In the absence of any relevant information, the best prediction of the outcome is the mean. The overall mean of the total group is 101 and the total variance is 1.667.
Add a predictor variable X. With group membership information, we are able to compute the predicted group means: the predicted value Y' is 100 for every patient with Disease 0 and 102 for every patient with Disease 1.
The variance due to different group means (type of disease) is equal to 1.00.
The error variable Y^ is computed as Y − Y'. The variance which cannot be explained by the predictor X is called the error variance; here the error variance is equal to .67.
Percentage of Variance
Divide the variance components by the total variance of 1.67. Approximately 60 percent (1/1.67 = .598) of the variance in body temperature was accounted for by knowing the type of disease. Approximately 40 percent (.67/1.67=.40) of the variance in body temperature was likely due to other, unidentifiable factors.
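The variance partition above can be verified with a short Python sketch. The raw temperatures are an assumption (the original data table did not survive extraction); 99, 100, 101 and 101, 102, 103 are one set of values consistent with the group means, the overall mean of 101, and the total variance of 1.667 reported in the text.

```python
# Variance partition for the two-group regression example.
# Variances use N (not N - 1) in the denominator, matching the
# figures reported in the text (total variance 1.667, error .67).

disease_0 = [99.0, 100.0, 101.0]   # assumed temperatures, Disease 0
disease_1 = [101.0, 102.0, 103.0]  # assumed temperatures, Disease 1
y = disease_0 + disease_1

def var_n(values):
    """Variance with N in the denominator."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

grand_mean = sum(y) / len(y)              # 101.0
total_var = var_n(y)                      # 1.667

# Predicted values: each score replaced by its group mean (Y')
predicted = [100.0] * 3 + [102.0] * 3
explained_var = var_n(predicted)          # 1.00

# Errors (Y^ = Y - Y') and their variance
errors = [yi - pi for yi, pi in zip(y, predicted)]
error_var = var_n(errors)                 # 0.67

r_squared = explained_var / total_var     # about .60
```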
The Two-Sample z Test
We have learned that the mean body temperature was 100 degrees Fahrenheit for patients diagnosed with Disease 0 and 102 degrees Fahrenheit for Disease 1. Is there a significant difference in body temperature between the two groups of patients?
Note that our example is just that: an example used to explain the basic principles of significance testing. In most real-life studies, the t-test is generally preferred over the z-test; the t-test is an analogue of the z-test in which degrees of freedom replace N and the t-distribution replaces the normal distribution.
1. Form an information-error ratio
The coefficient of determination for variables X and Y is .60 and the coefficient of alienation is .40. Thus, the information-error ratio is .60/.40 = 1.5
2. Form a z square ratio
(1) The z Square Ratio
For the example, the total sample size is N = 3 + 3 = 6. The z square ratio can be computed as (.60/.40)(6), which equals 9.
(2) The z Ratio
Take the square root of the z square ratio. z = 3. The z value can then be associated with its corresponding area under the normal distribution.
3. Probability statement
Locate the position of the obtained z value in the standard normal distribution. The probability associated with z values below −3 together with z values above 3 is .0027 (.00135 + .00135 = .0027).
The observed probability is less than .05. The researcher would declare the result to be statistically significant.
4. Report the results
A two-sample z test was conducted to determine whether there was a significant difference in body temperature between the group of patients diagnosed with Disease 0 and the group of patients diagnosed with Disease 1. The mean body temperature was 100 degrees Fahrenheit (SD = 1.00) for patients diagnosed with Disease 0 and 102 degrees Fahrenheit (SD = 1.00) for Disease 1. The difference between the means was statistically significant at the .05 significance level, z = 3, p < .05. About 60 percent of the variance in body temperature was accounted for by knowing the type of disease.
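The agreement between the correlational route and the group-means route to z can be checked numerically. This Python sketch is illustrative only: the temperatures are assumed values consistent with the statistics reported above, and the variances use N (not N − 1) in the denominator, matching the text's figures.

```python
import math

# Two routes to the same z: (a) from the information-error ratio
# weighted by N, (b) from the group means and (N-denominator) variances.

def var_n(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

g0 = [99.0, 100.0, 101.0]   # assumed data, Disease 0
g1 = [101.0, 102.0, 103.0]  # assumed data, Disease 1
n0, n1 = len(g0), len(g1)

# (a) z from r^2: z^2 = N * r^2 / (1 - r^2)
r2 = 0.60
z_from_r = math.sqrt((n0 + n1) * r2 / (1 - r2))   # 3.0

# (b) z from group means and variances
m0, m1 = sum(g0) / n0, sum(g1) / n1
se = math.sqrt(var_n(g0) / n0 + var_n(g1) / n1)
z_from_means = (m1 - m0) / se                     # 3.0
```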
Examine the z square ratio.
The range of the coefficients of determination and alienation is from zero to one. The numerator of this fraction, however, is multiplied by N, with a theoretical range from 3 to infinity. This is the best-known statistical fallacy in a nutshell: given a sufficiently large N, it is possible to provide support for nearly anything, because a larger N produces a larger z value and thus makes even a weak relationship appear statistically significant.
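The point can be demonstrated numerically: hold a trivially small coefficient of determination fixed and let N grow. A hedged Python sketch, where r² = .01 (one percent of variance explained) is an arbitrary illustrative value:

```python
import math

# z grows with the square root of N even when the strength of the
# relationship (r^2) is held fixed at a trivial 1 percent.

def z_ratio(r_squared, n):
    return math.sqrt(n * r_squared / (1.0 - r_squared))

for n in (100, 400, 1600, 6400):
    print(n, round(z_ratio(0.01, n), 2))
# z crosses the conventional 1.96 cutoff between N = 100 and N = 400
```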
Strength vs. Significance
This property of the z ratio lends credence to Hays' 'testmanship' dictum, which states that with a sufficiently large N virtually any difference between compared means becomes statistically significant. Hays' observations in this respect are worth a verbatim quote.
"There is surely nothing on earth that is completely independent of anything else. The strength of an association may approach zero, but it should seldom or never be exactly zero. If one applies a large enough sample to the study of any relation, trivial or meaningless as it may be, sooner or later a significant result will almost certainly be achieved. This kind of problem occurs when too much attention is paid to the significance tests and too little to the degree of statistical association the finding represents. This clutters up the literature with findings that are often not worth pursuing and which serve only to obscure the really important predictive relations that occasionally appear."
The conclusion favored by Fred Kerlinger when addressing this topic is that 'the strength of a relationship is of primary importance; the significance of this relationship is ancillary to the question of how much of the variance has been accounted for.'
The strength of the relationship underlying the z square ratio, often called the effect size, can be obtained by algebraically reversing the formula

z² = N r² / (1 − r²).

Multiplying both sides of the above equation by the term (1 − r²) and at the same time expanding the left-hand expression gives

z² − z²r² = N r²,

moving the expressions containing r toward the left side of the equation and the z square expression toward the right,

N r² + z²r² = z²,

and by factoring the r² on the left side of the equation,

r²(N + z²) = z²,

the formula for the effect size of the z square ratio can be obtained as

r² = z² / (z² + N).
For the example of the elevated temperature study, the strength of the
relationship can be obtained from the z-square ratio as 9/(9 + 6) = .60.
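A quick numerical check of the reversed formula in Python. The second line uses the parole figures from the case discussed next (N is taken as 20,000, an approximation, since the text says only "over 20,000 cases"):

```python
# Effect size recovered from a z square ratio: r^2 = z^2 / (z^2 + N).

def effect_size(z_squared, n):
    return z_squared / (z_squared + n)

print(round(effect_size(9, 6), 2))              # 0.6  (temperature example)
print(round(effect_size(41.7 ** 2, 20000), 2))  # 0.08 (parole example)
```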
Some time ago, a girl was thrown into a well and stoned to death by a newly paroled convict. In the chain of events leading up to this tragedy, the question of the priority of the strength of a relationship versus its significance played a role. The parole, which ended after only two days in rape and murder, had been granted on the basis of a recommendation made by a computerized testing program.
Based on over 20,000 cases, the relationship between test results favoring parole and successful rehabilitation beyond a five-year period was statistically significant (z = 41.7, p < .001). Recomputing the strength of this relationship by using the formula

r² = z² / (z² + N)

returned a coefficient of determination equal to .08, indicating that the test accounted for only 8 percent of the variance and left 92 percent of the variance in parole outcomes unaccounted for. Many social science researchers are opposed to the use of the coefficient of determination, and indices based on it, to judge the relevance of research in the social sciences.
Who, then, will speak for the girl in the well?
The formulae for the z square ratio and the index expressing the corresponding strength of relationship are summarized below:

z² = N r² / (1 − r²)
r² = z² / (z² + N)
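A short Python sketch confirming that the two summary formulas are algebraic inverses of one another (the sample values are arbitrary):

```python
# Converting an effect size to a z square ratio and back returns the
# original value, confirming that the two formulas invert each other.

def z_square(r_squared, n):
    return n * r_squared / (1.0 - r_squared)

def r_square(z_squared, n):
    return z_squared / (z_squared + n)

for r2 in (0.1, 0.3, 0.6, 0.9):
    assert abs(r_square(z_square(r2, 50), 50) - r2) < 1e-12
print("round trip ok")
```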