Correlation: Tetrachoric model

In the model underlying the tetrachoric correlation, it is assumed that the frequency data in a 2 × 2 table stem from dichotomizing two continuous random variables X and Y that are bivariate normally distributed with mean m = (0, 0) and covariance matrix:

correlations-tetrachoric-correlation-e01.png

For instance, one could be interested in correlating the responses to two yes-no questionnaire items, each of which relates to an underlying normally distributed variable (e.g., product preferences).

The value ρ in the above (normalized) covariance matrix is the tetrachoric correlation coefficient. The frequency data in the table depend on the criterion used to dichotomize the marginal distributions of X and Y and the tetrachoric correlation coefficient.

From a theoretical perspective, a specific scenario is completely characterized by a 2 × 2 probability matrix:

correlations-tetrachoric-correlation-e02.png

where the marginal probabilities are regarded as fixed. The whole matrix is completely specified if the marginal probabilities p⋆2 = Pr(X = 1), p2⋆ = Pr(Y = 1) and the table probability p11 are given. If zx and zy denote the quantiles of the standard normal distribution corresponding to the marginal probabilities p⋆1 and p1⋆, that is, Φ(zx) = p⋆1 and Φ(zy) = p1⋆, then p11 is the value of the CDF of the bivariate normal distribution described above at the upper limits zx and zy:

correlations-tetrachoric-correlation-e03.png

where Φ(x, y, r) denotes the density of the bivariate normal distribution. The upper limits zx and zy are the values at which the variables X and Y are dichotomized.
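As a rough numerical illustration of this relationship, the following Python sketch computes p11 (and the remaining cells) from ρ and the two marginal probabilities. It is not G*Power's routine: it assumes the identity Φ2(zx, zy; ρ) = Φ(zx)Φ(zy) + ∫ φ(zx, zy, t) dt (integrated from 0 to ρ) and a plain Simpson rule, and all function names are invented for this example.

```python
import math
from statistics import NormalDist


def bvn_density(x, y, rho):
    """Density of the standard bivariate normal distribution with correlation rho."""
    one_minus_r2 = 1.0 - rho * rho
    q = (x * x - 2.0 * rho * x * y + y * y) / (2.0 * one_minus_r2)
    return math.exp(-q) / (2.0 * math.pi * math.sqrt(one_minus_r2))


def bvn_cdf(zx, zy, rho, steps=400):
    """Pr(X <= zx, Y <= zy), assuming |rho| < 1 and an even number of steps."""
    std = NormalDist()
    base = std.cdf(zx) * std.cdf(zy)  # value for rho = 0
    if rho == 0.0:
        return base
    # Composite Simpson rule for the integral of the density over [0, rho].
    h = rho / steps
    total = bvn_density(zx, zy, 0.0) + bvn_density(zx, zy, rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * bvn_density(zx, zy, i * h)
    return base + total * h / 3.0


def cell_probabilities(rho, p_col1, p_row1):
    """Table implied by rho, p*1 = Pr(X <= zx) and p1* = Pr(Y <= zy)."""
    std = NormalDist()
    zx, zy = std.inv_cdf(p_col1), std.inv_cdf(p_row1)
    p11 = bvn_cdf(zx, zy, rho)
    p12 = p_row1 - p11  # row 1, column 2
    p21 = p_col1 - p11  # row 2, column 1
    p22 = 1.0 - p11 - p12 - p21
    return p11, p12, p21, p22
```

With both variables dichotomized at the median (zx = zy = 0), the result can be checked against the closed form p11 = 1/4 + arcsin(ρ)/(2π). The Simpson rule becomes less reliable as |ρ| approaches 1.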

Observed frequency data are assumed to be random samples from this theoretical distribution. Thus, it is assumed that random vectors (xi, yi) have been drawn from the bivariate normal distribution described above and are afterwards assigned to one of the four cells according to the 'column' criterion xi ≤ zx versus xi > zx and the 'row' criterion yi ≤ zy versus yi > zy. Given a frequency table

correlations-tetrachoric-correlation-e04.png

the central problem in testing specific hypotheses about tetrachoric correlation coefficients is to estimate the correlation coefficient and its standard error from the frequency data.
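The sampling model just described can be mimicked with a few lines of Python. This is only a sketch of the assumed data-generating process (draw correlated standard normal pairs, then dichotomize at zx and zy); the function name and the row/column layout of the returned list are choices made for this example.

```python
import math
import random


def simulate_table(n, rho, zx, zy, seed=None):
    """Draw n pairs from the standard bivariate normal and cross-tabulate them.

    Returns [[f11, f12], [f21, f22]], where the row index follows the
    criterion y <= zy versus y > zy and the column index follows
    x <= zx versus x > zx.
    """
    rng = random.Random(seed)
    table = [[0, 0], [0, 0]]
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # y has unit variance and correlation rho with x.
        y = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        row = 0 if y <= zy else 1
        col = 0 if x <= zx else 1
        table[row][col] += 1
    return table
```

For example, simulate_table(930, 0.33, 0.0, 0.0, seed=1) produces one random table with N = 930; repeating the call with different seeds shows how much the cell frequencies vary from sample to sample.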

Two approaches to solving this problem have been proposed. One approach is to estimate the exact correlation coefficient (e.g. Brown & Benedetti, 1977). The other approach is to use simple approximations ρ* of ρ that are easier to compute (e.g. Bonett & Price, 2005). G*Power provides power analyses for both approaches. (See, however, the implementation notes for a qualification of the term 'exact' used to distinguish between both approaches.)

The exact computation of the tetrachoric correlation coefficient is difficult. One reason is computational in nature (see the implementation notes below). A more fundamental problem, however, is that frequency data are discrete, which implies that the estimation of a cell probability can be no more accurate than 1/(2N). The inaccuracies in estimating the true correlation ρ are especially severe when there are cell frequencies smaller than 5. In these cases, caution is warranted when interpreting the estimated r. For a more thorough discussion of these issues see Brown and Benedetti (1977) and Bonett and Price (2005).

Testing the tetrachoric correlation coefficient

The implemented routines estimate the power of a test that the tetrachoric correlation ρ has a fixed value ρ0. That is, the null and alternative hypothesis for a two-sided test are

H0 : ρ − ρ0 = 0
H1 : ρ − ρ0 ≠ 0.
The hypotheses are identical for both the exact and the approximation mode.

The power procedures assume that the Wald test statistic W = (r − ρ0)/se0(r) is used, where se0(r) is the standard error computed at ρ = ρ0.

As will be illustrated in the example section, the outputs of G*Power may also be used to perform the statistical test.

Effect size index

The correlation coefficient assumed under H1 (H1 corr ρ) is used as effect size. The following additional inputs are needed to fully specify the effect size:

H0 corr ρ

This is the tetrachoric correlation coefficient assumed under H0. An input of the type
H1 corr ρ = H0 corr ρ
corresponds to "no effect" and must not be used in a priori power calculations.

Marginal prob x

This is the marginal probability that X > zx (i.e. p*2).

Marginal prob y

This is the marginal probability that Y > zy (i.e. p2*).

The correlations must be within the interval ]− 1, 1[. The probabilities must be within the interval ]0, 1[.

Effect size calculation

The effect size drawer may be used to determine H1 corr ρ in two different ways.

A first possibility is to specify, for each cell of the 2 × 2 table, the probability of this event assumed under H1. Pressing the Calculate button calculates the exact (Correlation ρ) and approximate (Approx. correlation ρ∗) tetrachoric correlation coefficient, and the marginal probabilities Marginal prob x = p12 + p22, and Marginal prob y = p21 + p22. The exact correlation coefficient is used as H1 corr ρ (see below).

Note that the four cell probabilities must sum to 1. It therefore suffices to specify three of them explicitly. If you leave one of the four cells empty, G*Power computes the fourth value as: (1 - sum of three p).
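The bookkeeping described here is easy to reproduce by hand. The following Python sketch fills in one missing cell and computes the two marginal probabilities as defined above; the function names and the dictionary layout are assumptions made for illustration and do not correspond to anything in G*Power itself.

```python
def complete_table(p11=None, p12=None, p21=None, p22=None):
    """Return all four cell probabilities, filling in at most one missing cell."""
    cells = {"p11": p11, "p12": p12, "p21": p21, "p22": p22}
    missing = [name for name, value in cells.items() if value is None]
    if len(missing) > 1:
        raise ValueError("at least three cell probabilities are required")
    if missing:
        known = sum(value for value in cells.values() if value is not None)
        cells[missing[0]] = 1.0 - known  # (1 - sum of three p)
    if abs(sum(cells.values()) - 1.0) > 1e-9:
        raise ValueError("cell probabilities must sum to 1")
    if any(not 0.0 < value < 1.0 for value in cells.values()):
        raise ValueError("each cell probability must lie strictly between 0 and 1")
    return cells


def marginal_probs(cells):
    """Marginal prob x = p12 + p22 and Marginal prob y = p21 + p22."""
    return cells["p12"] + cells["p22"], cells["p21"] + cells["p22"]
```

For instance, complete_table(p11=0.2, p12=0.2, p21=0.2) fills in p22 = 0.4, and marginal_probs then returns 0.6 for both margins.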


correlations-tetrachoric-correlation-effect-size-drawer-1.png
A second possibility is to compute a confidence interval for the tetrachoric correlation in the population from the results of a previous investigation, and to choose a value from this interval as H1 corr ρ. In this case you specify four observed frequencies, the relative position 0 ≤ k ≤ 1 inside the confidence interval (0, 0.5, 1 corresponding to the left, central, and right position, respectively), and the confidence level (1 − α) of the confidence interval (see below).

From these data G*Power computes the total sample size N = f11 + f12 + f21 + f22 and estimates the cell probabilities pij by: pij = (fij + 0.5)/(N + 2). These are used to compute the sample correlation coefficient r, the estimated marginal probabilities, the borders (L, R) of the (1 − α) confidence interval for the population correlation coefficient ρ, and the standard error of r. The value L + (R − L) · k is used as H1 corr ρ.
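The two simple steps in this paragraph (smoothing the frequencies and picking a point inside the confidence interval) can be written out as follows. This is a minimal sketch with invented function names; the confidence borders themselves are taken as given, since their computation depends on the chosen mode.

```python
def estimate_cells(f11, f12, f21, f22):
    """Total sample size and cell estimates pij = (fij + 0.5)/(N + 2)."""
    n = f11 + f12 + f21 + f22
    p11, p12, p21, p22 = [(f + 0.5) / (n + 2) for f in (f11, f12, f21, f22)]
    return n, (p11, p12, p21, p22)


def h1_from_interval(lower, upper, k):
    """Value at relative position k (0 = left border, 1 = right border)."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie between 0 and 1")
    return lower + (upper - lower) * k


# Frequencies from Example 1 in Bonett and Price (2005), as used in the
# Examples section of this page.
n, (p11, p12, p21, p22) = estimate_cells(203, 186, 167, 374)
marginal_x = p12 + p22  # about 0.6019313
marginal_y = p21 + p22  # about 0.5815451
```

The two marginal values agree with those that G*Power transfers to the main window in the example further down this page.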

The computed correlation coefficient, the confidence interval, and the standard error of r depend on whether the exact (Brown & Benedetti, 1977) or the approximate (Bonett & Price, 2005) computation method was chosen in the Options dialog in the main window. In the exact mode, the labels of the output fields are Correlation r, C.I. ρ lwr, C.I. ρ upr, and Std. error of r. In the approximate mode an asterisk ∗ is appended after r and ρ.

correlations-tetrachoric-correlation-effect-size-drawer-2.png
Clicking on the button Calculate and transfer to main window copies the values given in H1 corr ρ, Marginal prob x, Marginal prob y, and (in frequency mode) Total sample size to the corresponding input fields in the main window.

Options

You can choose between the exact approach in which the procedure proposed by Brown and Benedetti (1977) is used and the approximation suggested by Bonett and Price (2005).

Examples

To illustrate the application procedure we refer to Example 1 in Bonett and Price (2005). The Yes or No answers of 930 respondents to two questions in a personality inventory are recorded in a 2 × 2 table with the following result: f11 = 203, f12 = 186, f21 = 167, f22 = 374.

First we use the effect size dialog to compute from these data the confidence interval for the tetrachoric correlation in the population. We choose, in the effect size drawer, From C.I. calculated from observed freq. Next we insert the above values in the corresponding fields and press Calculate. Using the exact computation mode (selected in the Options dialog in the main window), we get an estimated correlation of r = 0.334, a standard error of r = 0.0482, and a 95% confidence interval of [0.240, 0.429] for the population ρ. We choose the left border of the C.I. (i.e. relative position 0, corresponding to 0.240) as the value of the tetrachoric correlation coefficient ρ under H1.

We now want to know how many subjects we need to achieve a power of 0.95 in a one-sided test of the H0 that ρ = 0 vs. the H1 that ρ = 0.24, given the same marginal probabilities and α = 0.05.

Clicking on Calculate and transfer to main window copies the computed H1 corr ρ = 0.2399846 and the marginal probabilities px = 0.6019313 and py = 0.5815451 to the corresponding input fields in the main window. The complete input and output is as follows:

Select

Type of power analysis: A priori

Input

Tail(s): One
H1 corr ρ:  0.2399846
α err prob: 0.05
Power (1-β err prob): 0.95
H0 corr ρ: 0
Marginal prob x: 0.6019313
Marginal prob y: 0.5815451

Output

Critical z: 1.644854
Total sample size: 463
Actual power: 0.950370
H1 corr ρ: 0.239985
H0 corr ρ: 0.0
Critical r lwr: 0.122484
Critical r upr: 0.122484
Std err r: 0.074465

This shows that we need at least a sample size of 463 in this case (the Actual power output field shows the power for a sample size rounded to an integer value).

The output also contains the values for ρ under H0 and H1 used in the internal computation procedure. In the exact computation mode a deviation from the input values would indicate that the internal estimation procedure did not work correctly for the input values (this should only occur for extreme values of r or marginal probabilities). In the approximate mode, the output values correspond to the r values resulting from the approximation formula.

The remaining outputs show the critical value(s) for r under H0: In the Wald test assumed here, z = (r − ρ0)/se0(r) is approximately standard normally distributed under H0. The critical values of r under H0 are given
  1. as a quantile z1 − α/2 of the standard normal distribution, and
  2. in the form of critical correlation coefficients r and standard error se0(r). (In one-sided tests, the single critical value is reported twice in Critical r lwr and Critical r upr). In the example given above, the standard error of r under H0 is 0.074465, and the critical value for r is 0.122484. Thus, (r − ρ0)/se0(r) = (0.122484 − 0)/0.074465 = 1.64485 = z1−α, as expected.
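The relation between the two forms of the critical value can be checked with a short Python sketch. It assumes only the Wald statistic given above and uses the standard error reported in the G*Power output; the function name and the tails argument are invented for this example.

```python
from statistics import NormalDist


def critical_values(rho0, se0, alpha, tails=1):
    """Critical z and the corresponding critical r values rho0 -/+ z * se0.

    tails=1 uses the quantile z(1 - alpha), tails=2 uses z(1 - alpha/2).
    In a one-sided test only one of the two r values is relevant.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / tails)
    return z, rho0 - z * se0, rho0 + z * se0


# One-sided a priori example: H0 corr rho = 0, Std err r = 0.074465.
z, r_lwr, r_upr = critical_values(0.0, 0.074465, 0.05, tails=1)
```

Here z is about 1.644854 and r_upr is about 0.122484, the values shown in the Critical z and Critical r upr output fields.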

Using G*Power to perform the statistical test of H0

G*Power may also be used to perform the statistical test of H0. Assume that we want to test the
H0: ρ = ρ0 = 0.4 against
H1: ρ ≠ 0.4
for α = 0.05. Assume further that we observed the following frequencies:
f11 = 120,
f12 = 45,
f21 = 56, and
f22 = 89.
To perform the test we first open the effect size drawer and select the From C.I. calculated from observed freq option. Here we compute from the observed frequencies the correlation coefficient r and the estimated marginal probabilities. In the exact mode we find
Correlation r = 0.512751,
Est. marginal prob x = 0.4326923, and
Est. marginal prob y = 0.4679487.
In the main window we then choose a Post hoc type of power analysis. Clicking on Calculate and transfer to main window in the effect size drawer copies the values for marginal x, marginal y, and the sample size 310 to the main window. We now set
Tail(s) = Two
H0 corr ρ = 0.4 and
α err prob = 0.05.
After clicking on Calculate in the main window, the output section shows the critical values for the correlation coefficient ([0.244446, 0.555554]) and the standard error under H0 (0.079366). These values show that the test is not significant for the chosen α-level, because the observed r = 0.512751 lies inside the interval [0.244446, 0.555554]. We then use the G*Power calculator to compute the associated p value. Inserting
z = (0.513-0.4)/0.0794; 1-normcdf(z,0,1)
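and clicking on the Calculate button yields p = 0.077. Note that 1-normcdf(z,0,1) is the one-tailed probability; for the two-sided test specified above, the p value is twice this value (approximately 0.155).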

If we instead want to use the approximate mode, we would choose the Options dialog in the main window and then choose Use approximation (Bonett and Price, 2005). We may then proceed in essentially the same way as described above. In this case we find a very similar value for the correlation coefficient r∗ = 0.5093278. The critical values for r∗ given in the output section of the main window are [0.233365, 0.540709] and the standard error for r∗ is 0.078882.

Note: To compute the p value in the Use approximation (Bonett and Price, 2005) mode, we should use H0 corr ρ∗ given in the output and not H0 corr ρ specified in the input. Accordingly, in the G*Power calculator we enter
z = (0.509-0.397)/0.0788; 1-normcdf(z,0,1)
which yields p = 0.0776, a value very close to that given above for the exact mode.
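The two calculator entries can also be reproduced outside G*Power. The sketch below assumes nothing beyond the Wald statistic and the standard normal distribution; the function name is hypothetical.

```python
from statistics import NormalDist


def wald_p_values(r, rho0, se0):
    """Wald z, upper-tail p (as in 1-normcdf(z,0,1)) and two-sided p."""
    std = NormalDist()
    z = (r - rho0) / se0
    p_upper = 1.0 - std.cdf(z)
    p_two_sided = 2.0 * (1.0 - std.cdf(abs(z)))
    return z, p_upper, p_two_sided


# Exact mode: rounded values as entered in the G*Power calculator.
z_exact, p_exact, p2_exact = wald_p_values(0.513, 0.4, 0.0794)

# Approximate mode: uses H0 corr rho* from the output (0.397).
z_approx, p_approx, p2_approx = wald_p_values(0.509, 0.397, 0.0788)
```

The upper-tail values come out at roughly 0.077 and 0.078, in line with the calculator results; the two-sided values are simply twice as large.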

Related tests

Correlation: Bivariate normal model
Correlation: Point biserial model

Implementation notes

Given ρ and the marginal probabilities px and py, the following procedures are used to calculate the value of ρ (in the exact mode) or ρ* (in the approximate mode) and to estimate the standard error of r and r*.

Exact mode

In the exact mode the algorithms proposed by Brown and Benedetti (1977) are used to calculate r and to estimate the standard error s(r). Note that the latter is not the expected standard error σ(r)! Computing σ(r) would require enumerating all possible tables Ti for the given N. If p(Ti) and ri denote the probability and the correlation coefficient of table i, then σ²(r) = ∑i (ri − ρ)² p(Ti) (see Brown and Benedetti, 1977, p. 349, for details). The number of possible tables increases rapidly with N, so computing this exact value is in general too expensive. Thus, 'exact' does not mean that the exact standard error is used in the power calculations.

In the exact mode it is not necessary to estimate r in order to calculate power, because it is already given in the input. We nevertheless report the value of r calculated by the routine in the output to indicate possible limitations in the precision of the routine for |r| near 1. Thus, if the r's reported in the output section deviate markedly from those given in the input, all results should be interpreted with caution.
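To make concrete what calculating r from a table involves, the following Python sketch solves for the correlation at which the bivariate normal CDF at (zx, zy) equals p11. It is explicitly not the Brown and Benedetti (1977) algorithm used by G*Power: it substitutes a plain bisection search combined with a Simpson-rule integral of the bivariate normal density, and all names are invented for this example.

```python
import math
from statistics import NormalDist


def bvn_cdf(zx, zy, rho, steps=400):
    """Pr(X <= zx, Y <= zy) as Phi(zx)*Phi(zy) plus the integral of the density over [0, rho]."""
    std = NormalDist()
    base = std.cdf(zx) * std.cdf(zy)
    if rho == 0.0:
        return base

    def density(t):
        one_minus_t2 = 1.0 - t * t
        q = (zx * zx - 2.0 * t * zx * zy + zy * zy) / (2.0 * one_minus_t2)
        return math.exp(-q) / (2.0 * math.pi * math.sqrt(one_minus_t2))

    h = rho / steps  # steps must be even for the Simpson rule
    total = density(0.0) + density(rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return base + total * h / 3.0


def tetrachoric_by_bisection(p11, p12, p21, p22, tol=1e-8):
    """Find r such that bvn_cdf(zx, zy, r) = p11 (the CDF is increasing in r)."""
    std = NormalDist()
    zx = std.inv_cdf(p11 + p21)  # Phi(zx) = p*1
    zy = std.inv_cdf(p11 + p12)  # Phi(zy) = p1*
    lo, hi = -0.999, 0.999       # search range; extreme tables are not handled well
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bvn_cdf(zx, zy, mid) < p11:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Second example on this page: f11 = 120, f12 = 45, f21 = 56, f22 = 89.
freqs = (120, 45, 56, 89)
n = sum(freqs)
r = tetrachoric_by_bisection(*[(f + 0.5) / (n + 2) for f in freqs])
```

For this table the search should land close to the Correlation r = 0.512751 reported in the example above. The numerical integration loses accuracy for |r| near 1, which is the kind of limitation the preceding paragraph warns about.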

To estimate s(r) the formula based on asymptotic theory proposed by Pearson in 1901 is used:

correlations-tetrachoric-correlation-e05.png

or, with respect to cell probabilities,

correlations-tetrachoric-correlation-e06.png

where

correlations-tetrachoric-correlation-e07.png


Brown and Benedetti (1977) show that this approximation is quite good if the minimal cell frequency is at least 5 (see their Tables 1 and 2).

Approximation mode

Bonett and Price (2005) propose the following approximations.

Correlation coefficient

Their approximation ρ∗ of the tetrachoric correlation coefficient is:

ρ* = cos(π/(1 + ω^c)),

where c = (1 − |p1* − p*1|/5 − (1/2 − pm)²)/2, with p1* = p11 + p12, p*1 = p11 + p21, pm = the smallest marginal proportion, and ω = p11 p22/(p12 p21).

The same formulae are used to compute an estimate r* from frequency data fij. The only difference is that estimates pij* = (fij + 0.5)/(N + 2) of the true probabilities are used.
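The approximation is simple enough to be written out directly. The Python sketch below follows the formula for ρ* given above, reading the exponent as ω raised to the power c and the last term of c as a square; the function name is an assumption for this example.

```python
import math


def bonett_price_rho(p11, p12, p21, p22):
    """Approximate tetrachoric correlation cos(pi / (1 + omega**c))."""
    row1 = p11 + p12                      # p1*
    col1 = p11 + p21                      # p*1
    p_min = min(row1, 1.0 - row1, col1, 1.0 - col1)  # smallest marginal proportion
    c = (1.0 - abs(row1 - col1) / 5.0 - (0.5 - p_min) ** 2) / 2.0
    omega = (p11 * p22) / (p12 * p21)     # odds ratio
    return math.cos(math.pi / (1.0 + omega ** c))


# Estimate r* from the frequencies f11 = 120, f12 = 45, f21 = 56, f22 = 89,
# using cell estimates (fij + 0.5)/(N + 2).
freqs = (120, 45, 56, 89)
n = sum(freqs)
r_star = bonett_price_rho(*[(f + 0.5) / (n + 2) for f in freqs])
```

For these frequencies the function gives a value of about 0.5093, which agrees with the r∗ = 0.5093278 reported in the approximate-mode example on this page.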

Confidence Interval

The 100 · (1 − α)% confidence interval for r* is computed as follows:

CI = [cos(π/(1 + L^(c*))), cos(π/(1 + U^(c*)))],

where

correlations-tetrachoric-correlation-e08.png


and zα/2 is the α/2 quantile of the standard normal distribution.

Asymptotic standard error for r*

The standard error is given by:

correlations-tetrachoric-correlation-e09.png

with

correlations-tetrachoric-correlation-e10.png
where

correlations-tetrachoric-correlation-e11.png

Power calculation

The H0 distribution is the standard normal distribution N(0, 1). The H1 distribution is the normal distribution N(m1, s1) with mean m1 and standard deviation s1, where

m1 = (ρ − ρ0)/sr0
s1 = sr1/sr0.

The values sr0 and sr1 denote the standard error under H0 and H1, that is, s(ρ0) and s(ρ) in the exact mode, and s(ρ0*) and s(ρ*) in the approximation mode.
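Given the two standard errors, the power follows from the two normal distributions described above. The following Python sketch assumes that s1 is the standard deviation of the H1 distribution and that the rejection region is defined by the usual critical z values; the standard errors must be supplied by the caller, and the function name is invented for this example.

```python
from statistics import NormalDist


def wald_power(rho1, rho0, se0, se1, alpha, tails=2):
    """Power of the Wald test when the statistic is N(m1, s1) under H1."""
    m1 = (rho1 - rho0) / se0   # mean of the H1 distribution
    s1 = se1 / se0             # standard deviation of the H1 distribution
    h1 = NormalDist(m1, s1)
    std = NormalDist()
    if tails == 1:
        z_crit = std.inv_cdf(1.0 - alpha)
        # Reject in the direction of the assumed effect.
        return 1.0 - h1.cdf(z_crit) if m1 >= 0 else h1.cdf(-z_crit)
    z_crit = std.inv_cdf(1.0 - alpha / 2.0)
    return h1.cdf(-z_crit) + (1.0 - h1.cdf(z_crit))
```

Two sanity checks follow from the definitions: with rho1 equal to rho0 and se1 equal to se0, the two-sided power equals α, and in a one-sided test with se1 equal to se0 the power is 0.5 when m1 coincides with the critical z value.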

Validation

The correctness of the procedures used to calculate r, s(r) and r*, s(r*) was checked by reproducing the examples in Brown and Benedetti (1977) and Bonett and Price (2005), respectively. The soundness of the power routines was checked by Monte Carlo simulations in which we found good agreement between simulated and predicted power.

References

Bonett, D. G., & Price, R. M. (2005). Inferential methods for the tetrachoric correlation coefficient. Journal of Educational and Behavioral Statistics, 30, 213-225.

Brown, M. B., & Benedetti, J. K. (1977). On the mean and variance of the tetrachoric correlation coefficient. Psychometrika, 42, 347-355.

Last modified: 29.06.2009, 16:43