In regard to my prior "temp" post, I wrote so much in my NASP listserv response that I have decided to take my email response, correct a few typos, and post it now as a blog post. I may return to this later to write a lengthier IAP 101 Research Brief or report.
Psychologists who engage in intelligence testing frequently compare subtest scores to determine if they are statistically and practically different...as part of the clinical interpretation process. Most IQ test publishers provide sound statistical procedures (tables or software for evaluating the statistical difference of two test scores; confidence band comparison rules-of-thumb).
However, tradition and clinical lore have produced a common "rule-of-thumb" that is problematic. The typical scenario is when a clinician subtracts two test SS's (M=100; SD=15) and invokes the rule-of-thumb that the difference needs to be 15 SS points (1 SD) or 22/23 points (1.5 SD). This is not correct.
SS difference scores do NOT have an SD scale of 15! When you subtract two SS's (with mean=100; SD=15) the resultant score distribution has a mean of zero and an SD that is NOT 15 (unless you transform/rescale the distribution to this scale). The size of the difference SD is a function of the correlation between the two measures compared.
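A short derivation shows where the dependence on the correlation comes from. It is only a sketch, assuming both scores are on the same scale (SD = σ) and correlate r in the reference population. The symbols X, Y, σ, and r are notation introduced here for illustration and are not taken from any test manual:

```latex
\begin{aligned}
\operatorname{Var}(X - Y) &= \sigma^2 + \sigma^2 - 2r\sigma^2 = \sigma^2 (2 - 2r) \\
SD_{\text{diff}} &= \sigma \sqrt{2 - 2r}
\end{aligned}
```

Under these assumptions, with σ = 15 the difference SD equals 15 only when r = .50. It shrinks as r rises above .50 and grows as r falls below it.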
The SD(diff) is the statistic that should be used, and there are a number of different formulas for computing this metric. The different SD(diff)'s differ based on the underlying question or assumptions that are the basis for making the comparison.
One way to evaluate score differences is the SEM band overlap approach. This is simple and is based on underlying statistical calculations (averaged across different scenarios to allow for a simple rule of thumb) that incorporate information about the reliability of the difference score. Test publishers also provide tables to evaluate the statistical significance of differences of a certain magnitude for subtests, such as in the various Wechsler manuals and software. These are all psychometrically sound and defensible procedures...let me say that again...these are all psychometrically sound and defensible procedures. (I repeat this phrase because the point I make below was recently misinterpreted at a state SP workshop as me saying there was something wrong with the tables in the WISC-IV...which is NOT what I said and is NOT what I am saying here.)
However, it is my opinion that in these situations we must do better and there is a more appropriate and better metric for evaluating differences between two different test scores, ESPECIALLY when the underlying assumption is that the two measures should be similar because they form a composite or cluster. This implies "correlation"...and not simple comparison of any two tests.
When one is attempting to evaluate the "unity" of a cluster or composite, an SD(diff) metric should be used that is consistent with the underlying assumption of the question. Namely, one is expecting the scores to be similar because they form a factor. This implies "correlation" between the measures. There is an SD(diff) calculation that incorporates the correlation between the measures being compared. When one uses this approach, the proper SD(diff) can vary from as small as approximately 10 points (for "tight" or highly correlated Gc tests) to as high as approximately 27 pts (for "loose" or weakly correlated tests in a cluster).
The information for this SD(diff) metric comes from a classic 1957 article by Payne and Jones (click here) (thanks to Joel S. for bringing it to my attention recently). Also, below are two tables that show the different, and IMHO, more appropriate SD(diff) values that should be used when making some example test comparisons on the WISC-IV and WJ-III.
As you see in the tables, the 15 (3 if using scaled scores) and 22 (4.5 if scaled scores) rules-of-thumb will only be correct when the correlation between the two tests being compared is of a moderate magnitude. When the correlation between tests being compared is high (when you have a "tight" ability domain) the appropriate SDdiff metric to evaluate differences can be as low as 9.9 points (for 1 SDdiff) and 14.8 (for 1.5 SDdiff) for the Verbal Comp/Gen Info test from the WJ-III Gc cluster or 2.2 scaled score (1 SDdiff) and 3.3 (1.5 SDdiff) when comparing WISC-IV Sim/Vocab.
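For readers who want to try the arithmetic themselves, here is a minimal Python sketch. It assumes the SD(diff) takes the form SD x sqrt(2 - 2r), which is my assumption about the calculation and may not be the exact variant behind the tables. The function name `sd_diff` and the correlations plugged in are hypothetical illustrations, not values copied from the tables or the test manuals.

```python
import math


def sd_diff(r, sd=15.0):
    """SD of the difference between two scores that share the same SD
    and correlate r (assumed form: sd * sqrt(2 - 2r))."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and 1")
    return sd * math.sqrt(2.0 - 2.0 * r)


# Illustrative correlations only (assumed, not taken from the tables).
for r in (0.50, 0.78):
    one_sd = sd_diff(r)
    print(f"r = {r:.2f}: 1 SD(diff) = {one_sd:.1f}, "
          f"1.5 SD(diff) = {1.5 * one_sd:.1f}")

# The same function works for scaled scores (M=10; SD=3) via the sd argument.
print(f"scaled scores, r = 0.50: 1 SD(diff) = {sd_diff(0.50, sd=3.0):.1f}")
```

Under this assumed formula, a correlation of .50 reproduces the familiar 15 and 22.5 point values, which is why the rule-of-thumb only holds for moderately correlated tests.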
In contrast, when the ability domain is very wide or "loose", one would expect more variability since traits/tests are not as correlated. In reviewing the above tables one concludes that the very low test correlations for the tests that comprise the WJ-III Gv and Glr clusters produce a 1 SDdiff that is nearly TWICE the 15 point rule of thumb (27-28 points).
I have argued this point with a number of quants (and some have agreed with me), and I believe that the proper SD(diff) to be used is not "one size fits all situations." The confidence band and traditional tables of subtest significant difference approaches are psychometrically sound and work when comparing any two tests. However, when the question becomes one of comparing tests where the fundamental issue revolves around the assumption that the test scores should be similar because they share a common ability (are correlated), then IMHO, we can do better...there is a better way for these situations. We can improve our practice. We can move forward.
This point is analogous to doing simple t-tests of group means. When one has two independent samples the t-test formula includes a standard error term (in the denominator) that does NOT include any correlation/covariance parameter. However, when one is calculating a dependent sample t-test (which means there is a correlation between the scores), the error term incorporates information about the correlation. It is the same concept.....just applied to group vs individual score comparisons.
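The t-test analogy can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions (equal n in both groups, textbook standard error formulas), and the function names `se_independent` and `se_dependent` are made up for this example.

```python
import math


def se_independent(s1, s2, n):
    """Standard error of a mean difference for two independent samples
    of equal size n: no correlation term appears."""
    return math.sqrt((s1 ** 2 + s2 ** 2) / n)


def se_dependent(s1, s2, r, n):
    """Standard error of a mean difference for n paired (dependent)
    scores: the correlation r between the pairs enters the error term."""
    return math.sqrt((s1 ** 2 + s2 ** 2 - 2.0 * r * s1 * s2) / n)


# Hypothetical values: both SDs = 15, n = 25.
print(se_independent(15.0, 15.0, 25))
print(se_dependent(15.0, 15.0, 0.5, 25))
```

When r is zero the two error terms are identical. As r grows, the dependent-sample error term shrinks, just as the SD(diff) for an individual's two scores shrinks when the tests are highly correlated.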
I urge people to read the 1957 article, review the tables I have provided above, and chew on the issue. There is a better way. The 15/22 SS rule of thumb is only accurate when a certain moderate level of correlation exists between the two tests being compared and when the comparison implies a common factor or ability. Practitioners who use this simplistic rule of thumb are likely applying a much too stringent rule in the case of highly correlated tests (e.g., Gc) and an overly liberal one when evaluating tests from a cluster/composite that are low in correlation (what I call ability domain cohesion--click here for prior post that explains/illustrates this concept). The 15/22 SS rule of thumb results in inaccurate decisions regarding the unusualness of test differences because it fails to incorporate information about the correlation between the compared measures. And, even when such differences are found via this method (or the simple score difference method), this does not necessarily indicate that something is "wrong" and that the cluster can't be computed or interpreted. This point was recently made clear in an instructional video by Dr. Joel Schneider on sources of variance in test scores that form composites.
If using the SDdiff metric recommended here is too much work, I would recommend that practitioners steer clear of the 15/22 (1/1.5 SD) rule-of-thumb and instead use the tables provided by the test publishers or use the simple SEM confidence band overlap rule-of-thumb. Sometimes simpler may be better.
- iPost using BlogPress from Kevin McGrew's iPad