Showing posts with label difference scores.

Friday, May 06, 2016

"Intelligent" Intelligence Testing with the WJ IV Tests of Cognitive Ability #5: Why do some individuals obtain markedly different scores on different Gwm tests?



This is the second PPT module (from my "Intelligent" Intelligence Testing with the WJ IV Tests of Cognitive Ability series) dealing with evaluating score differences between different tests of working memory (Gwm) on the WJ IV.  It is recommended that one view the brief first installment ("Intelligent" intelligence testing with the WJ IV Tests of Cognitive Ability #4: Evaluating within-CHC domain test score differences) before viewing this second installment.

"Intelligent" intelligence testing: Why do some individuals obtain markedly different scores on different Gwm tests? from Kevin McGrew

By clicking on the "in" icon in the lower right corner of the control panel, one can reach a site to download the slides.  For those who want a crisp, clear set of the slides in PDF, click on this link.

Wednesday, May 04, 2016

"Intelligent" intelligence testing with the WJ IV Tests of Cognitive Ability #4: Evaluating within-CHC domain test score differences


Recently the following was posted to the CHC listserv.

"While working on an assessment this morning, I found another example of Verbal Attention as a measure of Gwm being discordant from other measures of short-term and working memory: 
18 year old with high functioning autism and LD:
WAIS IV FSIQ = 84; WJ IV GIA = 81
WJ IV Verbal Attn - 91, Story Recall - 91, Vis Aud Learning - 76;
(Gc-73, Gf-94, Gv-121, Ga- 61, Gs-82)
WAIS-IV DSF-3, DSB - 7, DSS - 6, Arith - 4
CTOPP 2 Phonol Memory - 70; Rapid Naming - 52  
Thoughts on the Verbal Attention score of 91?"

A number of members provided some good responses.  I have been working on a response, and this is installment #1.  The rest of the response, which gets into substantive hypotheses about why Verbal Attention may differ from other Gwm test scores, is almost done and will be posted (hopefully) by the end of this week.

I apologize for not continuing my "Intelligent" intelligence testing series...it has been a long time since the last post in that series.  Just busy busy busy.

Enjoy

PS.   By clicking on the "in" icon in the lower right corner of the control panel, one can reach a site to download the slides.  For those who want a crisp, clear set of the slides in PDF, click on this link.



Thursday, February 18, 2016

How to evaluate the unusualness (base rate) of WJ IV cluster or test score differences: It is a pleasure to use the correct measure - A SlideShare presentation

The WJ IV provides two primary methods for comparing test or cluster scores.  One is based on a predictive model (the variation and comparison procedures); the other allows comparison of SEM confidence bands, which takes into account each measure's reliability.  A third method for comparing scores, one that takes into account the correlation between the compared measures (the ability cohesion model), is not provided but is frequently used by assessment professionals.  The three types of score comparison methods are described, and new information, via a "rule of thumb" summary slide and nomograph, is provided to allow WJ IV users to evaluate scores via all three methods.

A PDF copy of the key WJ IV base rate rule-of-thumb slide can be found here.

Tuesday, July 19, 2011

More on the problem with the 1 SD [15 SS (3 ss)] IQ subtest discrepancy rule-of-thumb

In a prior post I raised concerns about the use of the 1 SD (15 SS/3 ss) rule-of-thumb for evaluating differences between two IQ subtest scores that are part of the same composite or cluster. My central point was that this simplistic rule-of-thumb fails to incorporate information regarding the cohesiveness or inter-correlation of the tests within a cluster. More importantly, some human ability domains are more cohesive/tight (e.g., Gc) than others (e.g., Gv), so evaluating a difference between two tests within a domain requires an SD (diff) formula that incorporates the correlation between the compared tests.

I presented estimated SD (diff) values for select subtest comparisons within the WISC-IV and WJ III in different construct domains. The estimates used the SD (diff) formula that includes the correlation between the measures to be compared.

Knowing that some folks don't like formulas and estimates, I decided to make the point more concrete with real data. A picture is worth a thousand words (or equations).

In the prior post I reported an estimated SD (diff) for the comparison of the WJ III Verbal Comprehension and General Information Gc tests of 9.9, based on their average correlation (across all norm subjects) of .78.
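For those who want to verify the arithmetic, the correlation-based computation (the standard variance-of-a-difference identity; the .78 value is the average correlation noted above) works out as follows:

\[
SD_{diff} = \sqrt{SD_x^2 + SD_y^2 - 2\,r\,SD_x SD_y} = 15\sqrt{2(1-.78)} = 15\sqrt{.44} \approx 9.9
\]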

Today I went to the WJ III NU norm data and subtracted each norm subject's General Information SS from his or her Verbal Comprehension SS. I then calculated summary stats and generated the histogram below.

[Figure: histogram of Verbal Comprehension minus General Information SS differences across the WJ III NU norm sample]

Beautiful...don't you think? A normal distribution centered essentially on zero (mean = -0.5) and with an actual data-based SD of 9.8 (almost identical to the 9.9 value resulting from the equation method).
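For readers who would like to replicate the gist of this demonstration without access to the norm data, below is a minimal simulation sketch (Python with NumPy; the 100/15 score metric and the .78 correlation come from this post, while the sample size and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(seed=1)

# Simulate standard scores (M = 100, SD = 15) on two tests correlated .78,
# mimicking the Verbal Comprehension / General Information comparison.
n, sd, r = 100_000, 15.0, 0.78
cov = [[sd**2, r * sd**2],
       [r * sd**2, sd**2]]
scores = rng.multivariate_normal([100.0, 100.0], cov, size=n)

diff = scores[:, 0] - scores[:, 1]
print(round(diff.mean(), 2))  # ~0.0: difference scores center on zero
print(round(diff.std(), 2))   # ~9.95 = 15 * sqrt(2 * (1 - .78)), NOT 15

A histogram of diff reproduces the same bell shape shown above.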

Study the graph. It clearly shows that if clinicians want to determine whether the WJ III Verbal Comprehension and General Information SS's are 1 SD apart (1 SD (diff), technically), a difference of approximately 10 points is what an examiner should look for...not 15! An examiner who uses the inaccurate rule-of-thumb (i.e., a difference of 15 points is 1 SD) is, in the case of these two WJ III Gc tests, actually requiring a difference of approximately 1.5 SD (diff).

See the prior post for a lengthier discussion of the logic, equations, and danger in invoking a subtest difference rule-of-thumb of 1 SD = 15 SS points (or 1 SD = 3 for scaled scores).





Friday, July 15, 2011

Intelligent IQ testing: Joel Schneider on proper interpretation of composite/cluster scores


Dr. Joel Schneider has (again) posted an amazing and elegant video tutorial to help individuals who engage in intelligence test interpretation understand whether composite/cluster scores should be interpreted as valid when the individual subtests comprising the composite are significantly different or discrepant (according to Dr. Schneider--"short answer: not very often"). It is simply AWESOME...and makes me envious that I don't have the time or skills to develop similar media content.

His prior and related video can be found here.

Clearly the message is that the interpretation of test scores is not simple and is clearly a mixture of art and science. As Tim Keith once said in a journal article title (1997)...."Intelligence is important, intelligence is complex." This should be modified to read "intelligence is important, intelligence is complex, and intelligent intelligence test interpretation is also complex."




Monday, June 20, 2011

IAP 101 Psychometric Brief # 9: The problem with the 1/1.5 SD SS (15/22) subtest comparison "rule-of-thumb"

In regard to my prior "temp" post, I wrote so much in my NASP listserv response that I have decided to take my email response, correct a few typos, and post it now as a blog post. I may return to this later to write a lengthier IAP 101 Research Brief or report.

Psychologists who engage in intelligence testing frequently compare subtest scores, as part of the clinical interpretation process, to determine if they are statistically and practically different. Most IQ test publishers provide sound statistical procedures for doing so (tables or software for evaluating the statistical significance of the difference between two test scores, as well as confidence band comparison rules-of-thumb).

However, tradition and clinical lore have produced a common "rule-of-thumb" that is problematic. The typical scenario is a clinician who subtracts two test SS's (M = 100; SD = 15) and invokes the rule-of-thumb that the difference needs to be 15 SS points (1 SD) or 22/23 points (1.5 SD). This is not correct.

SS difference scores do NOT have an SD scale of 15! When you subtract two SS's (each with M = 100; SD = 15), the resultant score distribution has a mean of zero and an SD that is NOT 15 (unless you transform/rescale the distribution to this scale). The size of the difference-score SD is a function of the correlation between the two measures being compared.
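In equation form (the standard variance-of-a-difference identity, added here for readers who want the algebra):

\[
SD_{diff} = \sqrt{SD_x^2 + SD_y^2 - 2\,r_{xy}\,SD_x SD_y}
\]

which, when both scores are on the same scale (SD_x = SD_y = 15), reduces to

\[
SD_{diff} = 15\sqrt{2(1 - r_{xy})}
\]

Note that this equals 15 only when r_xy = .50; higher correlations shrink the SD(diff), and lower correlations inflate it.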

The SD(diff) is the statistic that should be used, and there are a number of different formulas for computing this metric. The different SD(diff)'s differ based on the underlying question or assumptions that form the basis for making the comparison.

One way to evaluate score differences is the SEM band overlap approach. This is simple and is based on underlying statistical calculations (averaged across different scenarios to allow for a simple rule of thumb) that incorporate information about the reliability of the difference score. Test publishers also provide tables to evaluate the statistical significance of differences of a certain magnitude for subtests, such as in the various Wechsler manuals and software. These are all psychometrically sound and defensible procedures.......let me say that again...these are all psychometrically sound and defensible procedures. (I repeat this phrase because the point I make below was recently misinterpreted at a state SP workshop as my saying there was something wrong with the tables in the WISC-IV...which is NOT what I said and is NOT what I am saying here.)

However, it is my opinion that in these situations we can do better: there is a more appropriate metric for evaluating differences between two test scores, ESPECIALLY when the underlying assumption is that the two measures should be similar because they form a composite or cluster. This implies "correlation"...and not a simple comparison of any two tests.

When one is attempting to evaluate the "unity" of a cluster or composite, an SD(diff) metric should be used that is consistent with the underlying assumption of the question. Namely, one is expecting the scores to be similar because they form a factor. This implies "correlation" between the measures. There is an SD(diff) calculation that incorporates the correlation between the measures being compared. When one uses this approach, the proper SD(diff) can vary from as small as approximately 10 points (for "tight" or highly correlated Gc tests) to as high as approximately 27 points (for "loose" or weakly correlated tests in a cluster).
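As a concrete illustration, here is a minimal sketch (Python; the .78 value is the Gc correlation discussed in these posts, the other correlations are illustrative, and the exact values in the tables below depend on the specific correlations and formulas used) of how the correlation-based SD(diff) moves with r:

import math

def sd_diff(r, sd=15.0):
    """Correlation-based SD of the difference of two equally scaled scores."""
    return sd * math.sqrt(2.0 * (1.0 - r))

for r in (0.78, 0.50, 0.20):
    print(f"r = {r:.2f} -> 1 SD(diff) = {sd_diff(r):.1f}, "
          f"1.5 SD(diff) = {1.5 * sd_diff(r):.1f}")
# r = 0.78 -> 1 SD(diff) = 9.9, 1.5 SD(diff) = 14.9
# r = 0.50 -> 1 SD(diff) = 15.0, 1.5 SD(diff) = 22.5
# r = 0.20 -> 1 SD(diff) = 19.0, 1.5 SD(diff) = 28.5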

The information for this SD(diff) metric comes from a classic 1957 article by Payne and Jones (click here) (thanks to Joel S. for bringing it to my attention recently). Also, below are two tables that show the different, and IMHO more appropriate, SD(diff) values that should be used when making some example test comparisons on the WISC-IV and WJ-III.

[Tables: estimated SD(diff) values for selected WISC-IV and WJ-III test comparisons]

As you can see in the tables, the 15-point (3 if using scaled scores) and 22-point (4.5 if using scaled scores) rules-of-thumb will only be correct when the correlation between the two tests being compared is of a moderate magnitude. When the correlation between the compared tests is high (when you have a "tight" ability domain), the appropriate SD(diff) metric for evaluating differences can be as low as 9.9 points (for 1 SD(diff)) and 14.8 points (for 1.5 SD(diff)) for the Verbal Comprehension/General Information tests from the WJ-III Gc cluster, or 2.2 scaled score points (1 SD(diff)) and 3.3 (1.5 SD(diff)) when comparing WISC-IV Similarities/Vocabulary.

In contrast, when the ability domain is very wide or "loose," one would expect more variability since the traits/tests are not as highly correlated. In reviewing the above tables, one concludes that the very low correlations among the tests that comprise the WJ-III Gv and Glr clusters produce a 1 SD(diff) that is nearly TWICE the 15-point rule of thumb (27-28 points).

I have argued this point with a number of quants (and some have agreed with me), and I believe that the proper SD(diff) is not "one size fits all situations." The confidence band and traditional tables-of-subtest-significant-differences approaches are psychometrically sound and work when comparing any two tests. However, when the question becomes one of comparing tests where the fundamental issue revolves around the assumption that the test scores should be similar because they share a common ability (are correlated), then, IMHO, we can do better...there is a better way for these situations. We can improve our practice. We can move forward.

This point is analogous to doing simple t-tests of group means. When one has two independent samples the t-test formula includes a standard error term (in the denominator) that does NOT include any correlation/covariance parameter. However, when one is calculating a dependent sample t-test (which means there is a correlation between the scores), the error term incorporates information about the correlation. It is the same concept.....just applied to group vs individual score comparisons.
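In symbols (textbook forms of the two error terms, included to make the parallel explicit):

\[
SE_{independent} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
\qquad
SE_{paired} = \frac{s_D}{\sqrt{n}}, \quad \text{where } s_D^2 = s_1^2 + s_2^2 - 2\,r\,s_1 s_2
\]

The -2rs_1s_2 term in the paired case is the same correlation adjustment that drives the SD(diff) metric discussed above.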

I urge people to read the 1957 article, review the tables I have provided above, and chew on the issue. There is a better way. The 15/22 SS rule of thumb is only accurate when a certain moderate level of correlation exists between the two tests being compared and when the comparison implies a common factor or ability. Practitioners who use this simplistic rule of thumb are likely applying a much too stringent criterion in the case of highly correlated tests (e.g., Gc) and an overly liberal one when evaluating tests from a cluster/composite that are low in correlation (what I call ability domain cohesion--click here for a prior post that explains/illustrates this concept). The 15/22 SS rule of thumb results in inaccurate decisions regarding the unusualness of test differences when we fail to incorporate information about the correlation between the compared measures. And, even when such differences are found via this method (or the simple score difference method), this does not necessarily indicate that something is "wrong" and that the cluster can't be computed or interpreted. This point was recently made clear in an instructional video by Dr. Joel Schneider on sources of variance in test scores that form composites.

If using the SD(diff) metric recommended here is too much work, I recommend that practitioners steer clear of the 15/22 (1/1.5 SD) rule-of-thumb and instead use the tables provided by the test publishers or the simple SEM confidence band overlap rule-of-thumb. Sometimes simpler may be better.

