Visual Statistics with David J. Krus
Part 2: Principles of Visual Statistics

Chapter 19: The z Square Ratio


Correlation, Regression, and Tests of Statistical Significance

Historically, the concepts of correlation and regression preceded the tests of statistical significance. The transition from the deductive models of data description, typified by the correlation and regression models, to inferential models, typified by the tests of significance, was facilitated by the part of regression theory that underlies the conceptualization of the point biserial coefficient of correlation. The point biserial correlation is conceptually important because it helps to explain the main principles of the tests of statistical significance, especially how the coefficient of correlation can be used to measure a difference between two means.

Coefficients of Determination and Alienation

The knowledge we have gained from regression on two categories can be readily applied to the test of the difference between two means. Consider the specification equation for the regression analysis,

Y = Y' + Y^

which decomposes each observed score into a predicted component Y' and an error component Y^, so that the total variance partitions as

s²(Y) = s²(Y') + s²(Y^)

Forming a ratio of the predictable to the total variance defines the coefficient of determination,

r² = s²(Y') / s²(Y)

and its one's complement defines the coefficient of alienation,

k² = 1 − r² = s²(Y^) / s²(Y)

The Gamma Square Ratio

One may also form a ratio of predicted to error variance, which is called the gamma square ratio:

γ² = s²(Y') / s²(Y^)

  • In formal notation, using its point-biserial form, the gamma square ratio can also be written as

γ² = r_pb² / (1 − r_pb²)

  • In terms of the coefficients of determination and alienation, the gamma square ratio can be formed as

γ² = r² / k²

However, the gamma square ratio can be conceptualized most simply as an information-error ratio.

The information-error ratio is primarily a theoretical concept that serves to highlight the fundamental dichotomy between the statistical significance of a relationship and its strength.
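As an illustrative sketch (Python, not part of the original text; the data values are hypothetical), the gamma square ratio can be computed directly from the predicted and error variances of a two-group data set:

```python
# Sketch (hypothetical data): the coefficients of determination and
# alienation, and the gamma square (information-error) ratio, for a
# two-group outcome coded with a 0/1 predictor.

def mean(v):
    return sum(v) / len(v)

def pvar(v):  # variance with divisor n, as used throughout the chapter
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

X = [0, 0, 0, 0, 1, 1, 1, 1]      # hypothetical group codes
Y = [5, 6, 7, 8, 8, 9, 10, 11]    # hypothetical outcome scores

# Predicted values Y' are the group means; errors are Y^ = Y - Y'.
group0 = [y for x, y in zip(X, Y) if x == 0]
group1 = [y for x, y in zip(X, Y) if x == 1]
Y_pred = [mean(group0) if x == 0 else mean(group1) for x in X]
Y_err = [y - yp for y, yp in zip(Y, Y_pred)]

r2 = pvar(Y_pred) / pvar(Y)   # coefficient of determination
k2 = 1 - r2                   # coefficient of alienation
gamma2 = r2 / k2              # gamma square ratio = s²(Y')/s²(Y^)
print(round(r2, 3), round(k2, 3), round(gamma2, 2))  # -> 0.643 0.357 1.8
```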


The z Square Ratio

Now we will shift our focus from the correlational models, which contrast the determined variance with the alienated variance, to models testing for statistical significance, which are content with the assertion that the result is unlikely to be random if a sufficient number of subjects tends to respond in the hypothesized direction.

The information-error ratio can be written as

γ² = r² / k²

Dividing the denominator of the above equation by N (the total sample size) forms the z square ratio:

z² = r² / (k²/N) = r²N / (1 − r²)

Note that the gamma square ratio is independent of N, whereas the z square ratio is weighted by N.

We can also express the z square ratio in terms of the coefficients of determination and alienation:

z² = (r² / k²) N

The above formula is the prototypical formula for all tests of statistical significance. The square root of the obtained z square ratio can be associated with its corresponding area under the normal distribution, and its significance can be interpreted in terms of tail probabilities.

The formula for the z test using group means and variances is

z = (M1 − M0) / √(s1²/n1 + s0²/n0)

It provides a link between the correlational methods and the statistical methods that estimate the probability that the difference between two means is large enough to be statistically significant.
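A small sketch (hypothetical data, not from the original text) can verify this link: when variances are computed with divisor n, the z obtained from the group means agrees with the z obtained from the N-weighted information-error ratio.

```python
# Sketch (hypothetical two-group data): z from group means and variances
# matches the z square ratio formed from the coefficient of determination,
# provided variances use divisor n.
import math

def mean(v):
    return sum(v) / len(v)

def pvar(v):  # variance with divisor n
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

group0 = [99, 100, 101]    # hypothetical scores, group 0
group1 = [101, 102, 103]   # hypothetical scores, group 1
total = group0 + group1
N = len(total)

# z from group means and variances
z_means = (mean(group1) - mean(group0)) / math.sqrt(
    pvar(group1) / len(group1) + pvar(group0) / len(group0))

# z from the N-weighted information-error ratio
pred = [mean(group0)] * len(group0) + [mean(group1)] * len(group1)
r2 = pvar(pred) / pvar(total)
z_ratio = math.sqrt(N * r2 / (1 - r2))

print(z_means, z_ratio)  # both approximately 3
```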

The Differential Diagnosis Study

Randomly select two groups of patients, diagnosed with different diseases. Their body temperatures are recorded as shown below.

Regression Framework for the Differential Diagnosis Study

The primary function of the general linear model is to predict outcomes when variables related to those outcomes are known and can be quantified. In our example, the task is to predict body temperature from the type of disease a patient is suffering from.

Data Set

First, create a coded predictor variable X to index the type of disease (Disease 0 and Disease 1). Second, combine the body temperatures from the two groups into one total group.

The results of the regression analysis are presented below.

Examine the Variances of Y, Y' and Y^

Total Variance

In the absence of any relevant information, the best prediction of the outcome is the mean. The overall mean of the total group is 101 and the total variance is 1.667. 

Predictable Variance

Add a predictor variable X. With group membership information, we are able to compute the predicted group means. The predicted values are shown below

The variance due to different group means (type of disease) is equal to 1.00.  

Error Variance

The error variable Y^ is computed as Y − Y'. The variance that cannot be explained by the predictor X is called error variance; here the error variance is equal to .67.

Percentage of Variance

Divide the variance components by the total variance of 1.67. Approximately 60 percent (1/1.67 = .598) of the variance in body temperature is accounted for by knowing the type of disease. Approximately 40 percent (.67/1.67 = .40) of the variance in body temperature is due to other, unidentified factors.
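The whole decomposition can be reproduced with a short sketch. The individual temperature readings below are hypothetical, chosen only to be consistent with the chapter's reported summary statistics (group means 100 and 102, total mean 101, total variance 1.667):

```python
# Sketch: variance decomposition for hypothetical temperatures consistent
# with the chapter's reported means and variances.

def mean(v):
    return sum(v) / len(v)

def pvar(v):  # variance with divisor n, matching the chapter's values
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

Y = [99, 100, 101, 101, 102, 103]          # hypothetical temperatures
Y_pred = [100, 100, 100, 102, 102, 102]    # predicted values: group means
Y_err = [y - yp for y, yp in zip(Y, Y_pred)]

total = pvar(Y)               # total variance, approx 1.667
predictable = pvar(Y_pred)    # variance due to group means, 1.00
error = pvar(Y_err)           # error variance, approx 0.67
print(round(total, 3), predictable, round(error, 2))
print(round(predictable / total, 3), round(error / total, 3))  # -> 0.6 0.4
```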

The Two-Sample z Test 

We have learned that the mean body temperature was 100 degrees Fahrenheit for patients diagnosed with Disease 0  and 102 degrees Fahrenheit for Disease 1. Is there a significant difference in body temperature between the two groups of patients?

Note that our example is a simplified one, used to explain the basic principles of significance testing. In most real-life studies, the t test is preferred over the z test. The t test is an analogue of the z test in which degrees of freedom replace N and the t distribution replaces the normal distribution.

The z Test

1. Form an information-error ratio

The coefficient of determination for variables X and Y is .60 and the coefficient of alienation is .40. Thus, the information-error ratio is .60/.40 = 1.5.

2. Form a z square ratio


(1) The z Square Ratio

For the example, the total sample size (N) is 3 + 3 = 6. The z square ratio can be computed as (.60/.40)(6), which equals 9.

(2) The z Ratio

Take the square root of the z square ratio, giving z = 3. The z value can then be associated with its corresponding area under the normal distribution.

3. Probability statement

Locate the position of the obtained z value in the standard normal distribution. The probability associated with z values below −3 together with z values above 3 is .0027 (.00135 + .00135 = .0027).

The observed probability is less than .05, so the researcher would declare the result to be statistically significant.
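The probability statement can be checked with the standard library alone; this is a sketch, not part of the original study:

```python
# Sketch: two-tailed p value for z = 3 under the standard normal
# distribution, using only the math module.
import math

z = 3.0
# P(Z > z) for a standard normal equals erfc(z / sqrt(2)) / 2
one_tail = math.erfc(z / math.sqrt(2)) / 2
two_tail = 2 * one_tail
print(round(one_tail, 5), round(two_tail, 4))  # -> 0.00135 0.0027
```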

4. Report the results

A two-sample z test was conducted to determine whether there was a significant difference in body temperature between the group of patients diagnosed with Disease 0 and the group of patients diagnosed with Disease 1. The mean body temperature was 100 degrees Fahrenheit (SD = 1.00) for patients diagnosed with Disease 0  and 102 degrees Fahrenheit  (SD = 1.00)  for Disease 1. The difference between the means was statistically significant at the .05 significance level, z = 3, p < .05. About 60 percent of the variance in body temperature was accounted for by knowing the type of disease.

Best Known Statistical Fallacy in a Nutshell

Examine the z square ratio. 

The range of the coefficients of determination and alienation is from zero to one. The numerator of this fraction is multiplied by N, with its theoretical range from 3 to infinity. This is the best known statistical fallacy in a nutshell: given a sufficiently large N, it is possible to provide support for nearly anything, because larger N produces larger z values and thus more relationships declared statistically significant.
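A brief sketch illustrates the point: holding the strength of the relationship fixed at a trivial r² = .01 and letting N grow makes z arbitrarily large.

```python
# Sketch: a fixed, weak relationship (r-square = .01) becomes
# "significant" once N is large enough, because z-square = N*r2/(1 - r2)
# grows linearly with N.
import math

r2 = 0.01                        # only 1 percent of variance explained
for N in (10, 100, 1000, 10000):
    z = math.sqrt(N * r2 / (1 - r2))
    print(N, round(z, 2))
# z passes 1.96 (the two-tailed .05 cutoff) near N = 380,
# even though the strength of the relationship never changes.
```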

Strength vs. Significance

This property of the z ratio lends credence to Hays' 'testmanship' dictum, which states that with a sufficiently large N virtually any difference between compared means becomes statistically significant. Hays' observations in this respect are worth a verbatim quote.

"There is surely noting on earth that is completely independent of anything else. The strength of an association may approach zero, but it should seldom or never be exactly zero. If one applies a large enough sample of the study of any relation, trivial or meaningless as it may be, sooner or later a significant result will almost certainly be achieved. This kind of problem occurs when too much attention is paid to the significance tests and too little to the degree of statistical association the finding represents. This clutters up the literature with findings that are often not worth pursuing and which serve only to obscure the really important predictive relations that occasionally appear."  

The conclusion favored by Fred Kerlinger when addressing this topic is that 'the strength of a relationship is of primary importance; the significance of this relationship is ancillary to the question of how much of the variance has been accounted for.'

Effect Size For The z Square Ratio

The strength of the relationship underlying the z square ratio, often called the effect size, can be obtained by algebraically reversing the formula

z² = r²N / (1 − r²)

Multiplying both sides of the above equation by the (1 − r²) term and at the same time expanding the left hand expression gives

z² − z²r² = r²N

Moving the expressions containing r² toward the left side of the equation and the z square expression toward the right,

z²r² + r²N = z²

and by factoring the r² on the left side of the equation,

r²(z² + N) = z²

the formula for the effect size of the z square ratio can be obtained as

r² = z² / (z² + N)

For the example of the elevated temperature study, the strength of the relationship can be obtained from the z-square ratio as 9/(9 + 6) = .60.  
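As a quick check (a sketch, not from the original text):

```python
# Sketch: recovering the effect size r-square = z2/(z2 + N)
# from the z square ratio of the temperature example.
z2 = 9      # z square ratio from the example
N = 6       # total sample size (3 + 3)
r2 = z2 / (z2 + N)
print(r2)   # -> 0.6
```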

Girl in the Well

Some time ago, a girl was thrown into a well and stoned to death by a newly paroled convict. In the chain of events leading up to this finale, the question of the priority of the strength of a relationship versus its significance played a role. The parole, which ended after only two days in a rape and murder, was granted on the basis of a recommendation made by a computerized testing program.

Based on over 20,000 cases, the relationship between test results favoring parole and successful rehabilitation beyond a five year period was statistically significant (z = 41.7, p < .001). Recomputing the strength of this relationship by using the formula

r² = z² / (z² + N)

returned a coefficient of determination equal to .08, indicating that the test accounted for only 8 percent of the variance, leaving 92 percent of the variance determining the outcome of parole unaccounted for. Many social science researchers oppose the use of the coefficient of determination, and of indices based on it, to judge the relevance of research in the social sciences. Who, then, will speak for the girl in the well?
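As a check on this figure (a sketch; the exact N = 20,000 is an assumption, since the text says only "over 20,000 cases"):

```python
# Sketch: the coefficient of determination implied by z = 41.7 with
# roughly N = 20000 cases (the exact N is an assumption).
z = 41.7
N = 20000
r2 = z ** 2 / (z ** 2 + N)
print(round(r2, 2))  # -> 0.08
```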


The formulae for the z square ratio and the index expressing the corresponding strength of relationship are summarized below:

z² = r²N / (1 − r²)          r² = z² / (z² + N)

