
Monday, September 10, 2012

AP101 Brief # 16: Beyond CHC: Within-CHC Domain Complexity Optimized Measures

[Note:  This is a working draft of a larger paper (Implications of 20 years of CHC Cognitive-Achievement Research:  Back-to-the-Future and Beyond CHC) that will be presented at the Inaugural Session of the Richard Woodcock Institute for Advancement of Contemporary Cognitive Assessment at Tufts University (Sept. 29, 2012):  The Evolution of CHC Theory and Cognitive Assessment.]   Working knowledge of the WJ III test battery will make this brief easier to understand, but is not necessary.

Beyond CHC:  ITD—Within-CHC Domain Complexity Optimized Measures
            Optimizing Cognitive Complexity of CHC measures
I have recently begun to recognize the contribution that the Brunswick Symmetry-derived Berlin Intelligence Structure (BIS) model can make in applied intelligence research, especially for increasing predictor-criterion relations by matching the predictor and criterion spaces on the dimension of cognitive complexity.  What is cognitive complexity?  Why is it important?  More importantly, what role should it play in designing intelligence batteries to optimize CHC COG-ACH relations?
Cognitive complexity is often operationalized by inspecting individual test loadings on the first principal component from a principal component analysis (Jensen, 1998).  The high-g-test rationale is that performance on tests that are more cognitively complex "invoke a wider range of elementary cognitive processes (Jensen, 1998; Stankov, 2000, 2005)" (McGrew, 2010b, p. 452).  High g-loading tests are often at the center of MDS (multidimensional scaling) radex models (see AP101 Brief Report #15:  Cognitive-Aptitude-Achievement Trait Complexes)—but this isomorphism does not always hold.   David Lohman, a student of Richard Snow's, has made extensive use of MDS methods to study intelligence and has one of the best grasps of what cognitive complexity, as represented in the hyperspace of MDS figures, contributes to understanding intelligence and intelligence tests.  According to Lohman (2011), tests closer to the center are more cognitively complex due to five possible factors: a larger number of cognitive component processes; accumulation of speed component differences; more important component processes (e.g., inference); increased demands on attentional control and working memory; and/or more demands on adaptive functions (assembly, control, and monitoring).  Schneider's (in press) level-of-abstraction description of broad CHC factors is similar to cognitive complexity.  He uses the simple example of performance in the 100 meter hurdles.  According to Schneider (in press), one could independently measure 100 meter sprinting speed and, from a standing start, the ability to jump over a hurdle (both examples of narrow abilities).  However, running a 100 meter hurdles race is not the mere sum of the two narrow abilities; it is a non-additive combination and integration of narrow abilities.  This analogy captures the essence of cognitive complexity—which, in the realm of cognitive measures, characterizes tasks that involve more of the five factors listed by Lohman during successful task performance.
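To make the principal component operationalization mentioned above concrete, here is a minimal sketch (in Python) of how first-principal-component loadings can be computed from a correlation matrix.  The test names and correlation values are hypothetical, invented purely for illustration; they are not WJ III or any published data.

```python
import numpy as np

# Hypothetical correlation matrix for five cognitive tests (illustrative
# values only; not WJ III or any published data).
tests = ["Test A", "Test B", "Test C", "Test D", "Test E"]
R = np.array([
    [1.00, 0.55, 0.48, 0.40, 0.35],
    [0.55, 1.00, 0.50, 0.42, 0.38],
    [0.48, 0.50, 1.00, 0.45, 0.33],
    [0.40, 0.42, 0.45, 1.00, 0.30],
    [0.35, 0.38, 0.33, 0.30, 1.00],
])

# First principal component of the correlation matrix: each test's loading
# (eigenvector element scaled by the square root of the eigenvalue) serves
# as a rough index of its g-ness/cognitive complexity.
eigenvalues, eigenvectors = np.linalg.eigh(R)
largest = np.argmax(eigenvalues)
loadings = np.abs(eigenvectors[:, largest] * np.sqrt(eigenvalues[largest]))

for name, loading in zip(tests, loadings):
    print(f"{name}: PC1 loading = {loading:.2f}")
```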
Of critical importance is the recognition that factor or ability domain breadth (i.e., broad or narrow) is not synonymous with cognitive complexity.  More importantly, cognitive complexity (as defined by the Brunswick Symmetry and the BIS model) has not always been a test design concept explicitly incorporated into "intelligent" intelligence test design (ITD).  A number of tests have incorporated the notion of cognitive complexity in their design plans, but I believe this type of cognitive complexity is different from the within-CHC domain cognitive complexity discussed here.
For example, according to Kaufman and Kaufman (2004), "in developing the KABC-II, the authors did not strive to develop 'pure' tasks for measuring the five CHC broad abilities.  In theory, Gv tasks should exclude Gf or Gs, for example, and tests of other broad abilities, like Gc or Glr, should only measure that ability and none other.  In practice, however, the goal of comprehensive tests of cognitive ability like the KABC-II is to measure problem solving in different contexts and under different conditions, with complexity being necessary to assess high-level functioning" (p. 16; italics emphasis added).  Although the Kaufmans address the importance of cognitively complex measures in intelligence test batteries, their CHC-grounded description defines complex measures as those that are factorially complex or mixed measures of abilities from more than one broad CHC domain.  The Kaufmans also address cognitive complexity from the non-CHC three-block functional Luria neurocognitive model when they indicate that it is important to provide measurement that evaluates the "dynamic integration of the three blocks" (Kaufman & Kaufman, 2004, p. 13).   This emphasis on neurocognitive integration (and thus, complexity) is also an explicit design goal of the latest Wechsler batteries.  As stated in the WAIS-IV manual (Wechsler, 2008), "although there are distinct advantages to the assessment and division of more narrow domains of cognitive functioning, several issues deserve note.  First, cognitive functions are interrelated, functionally and neurologically, making it difficult to measure a pure domain of cognitive functioning" (p. 2).  Furthermore, "measuring psychometrically pure factors of discrete domains may be useful for research, but it does not necessarily result in information that is clinically rich or practical in real world applications (Zachary, 1990)" (Wechsler, 2008, p. 3).   Finally, Elliott (2007) similarly argues for the importance of recognizing neurocognitive-based "complex information processing" (p. 15; italics emphasis added) in the design of the DAS-II, which results in tests or composites that measure across CHC-described domains.
The ITD principle explicated and proposed here is that of striving to develop cognitively complex measures within broad CHC domains—that is, not attaining complexity via the blending of abilities across CHC broad domains and not attempting to directly link to neurocognitive network integration.[1]   The Brunswick Symmetry based BIS model provides a framework for attaining this goal via the development and analysis of test complexity by paying attention to cognitive content and operations facets. 
Figure 12 presents the results of a 2-D MDS radex model of nearly all key WJ III broad and narrow CHC cognitive and achievement clusters (for all norm subjects from approximately 6 years of age through late adulthood). [2]   The current focus of the interpretation of the results in Figure 12 is only on the degree of cognitive complexity (proximity to the center of the figure) of the broad and narrow WJ III clusters within the same domain (interpretations of the content and operations facets are not a focus of this current material).  Within a domain, the broadest three-test parent clusters are designated by black circles.[3]  Two-test broad clusters are designated by gray circles.  Two-test narrow offspring clusters within broad domains are designated by white circles.  All clusters within a domain are connected to the broadest parent cluster by lines.  The critically important information is the within-domain cognitive complexity of the respective parent and sibling clusters as represented by their relative distances from the center of the figure.  A number of interesting conclusions are apparent.
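Before turning to those conclusions, here is a minimal sketch of how this kind of radex complexity analysis can be approximated computationally.  The cluster labels echo a few of those discussed below, but the correlation values and the distance-from-centroid complexity index are illustrative assumptions only; this is not the norm-sample analysis behind Figure 12.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical correlations among a few WJ III-like clusters (illustrative
# values only; not the norm-sample data used to produce Figure 12).
clusters = ["GIA-Ext", "Gf", "Gc", "Gs", "Gv", "RDGCMP"]
R = np.array([
    [1.00, 0.80, 0.75, 0.60, 0.50, 0.70],
    [0.80, 1.00, 0.60, 0.45, 0.45, 0.55],
    [0.75, 0.60, 1.00, 0.40, 0.35, 0.65],
    [0.60, 0.45, 0.40, 1.00, 0.30, 0.40],
    [0.50, 0.45, 0.35, 0.30, 1.00, 0.30],
    [0.70, 0.55, 0.65, 0.40, 0.30, 1.00],
])

# Convert correlations to dissimilarities and fit a 2-D MDS configuration.
dissimilarity = 1.0 - R
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coordinates = mds.fit_transform(dissimilarity)

# Distance from the centroid of the configuration is used here as a crude
# proxy for cognitive complexity (closer to the center = more complex).
centroid = coordinates.mean(axis=0)
distances = np.linalg.norm(coordinates - centroid, axis=1)
for name, d in sorted(zip(clusters, distances), key=lambda pair: pair[1]):
    print(f"{name}: distance from center = {d:.2f}")
```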

First, as expected, the WJ III GIA-Ext cluster is almost perfectly centered in the figure—it is clearly the most cognitively complex WJ III cluster.   In comparison, the three WJ III Gv clusters are much weaker in cognitive complexity than all other cognitive clusters, with no particular Gv cluster demonstrating a clear cognitive complexity advantage.    As expected, the reading and math achievement clusters are primarily cognitively complex measures.  However, the achievement clusters that deal more with basic skills (Math Calculation—MTHCAL; Basic Reading Skills—RDGBS) are less complex than the application clusters (Reading Comprehension-RDGCMP; Math Reasoning-MTHREA).
The most intriguing findings in Figure 12 are the differential cognitive complexity patterns within CHC domains (those with at least one parent and at least one offspring cluster).  For example, the narrow Perceptual Speed (Gs-P) offspring cluster is more cognitively complex than the broad parent Gs cluster.  The broad Gs cluster is comprised of the Visual Matching (Gs-P) and Decision Speed (Gs-R9; Glr-NA) tests, tests that measure different narrow abilities.  In contrast, the Perceptual Speed cluster (Gs-P) is comprised of two tests that are both classified as measuring the same narrow ability (perceptual speed).  This finding appears, at first blush, counterintuitive, as one would expect that a cluster comprised of tests that measure different content and operations (the Gs cluster) would be more complex (as per the above definition and discussion) than one comprised of two measures of the same narrow ability (Gs-P).  However, one must task analyze the two Perceptual Speed tests to realize that although both are classified as measuring the same narrow ability (perceptual speed), they differ in both stimulus content and cognitive operations.  Visual Matching requires processing of numeric stimuli.  Cross Out requires the processing of visual-figural stimuli.  These are two different content facets in the BIS model.  The Cross Out visual-figural stimuli are much more spatially challenging than the simple numerals in Visual Matching.  Furthermore, the Visual Matching test requires the examinee to quickly scan a row of numbers and mark the two that are identical.  In contrast, in the Cross Out test the subject is provided a target visual-figural shape and must then quickly scan a row of complex visual images and mark two that are identical to the target.  Interestingly, in other unpublished analyses I have completed, the Visual Matching test often loads on or groups with quantitative achievement tests, while Cross Out has frequently been shown to load on a Gv factor.  Thus, task analysis of the content and cognitive operations of the WJ III Perceptual Speed tests suggests that although both are classified as narrow indicators of Gs-P, they differ markedly in task requirements.  More importantly, the Perceptual Speed cluster tests, when combined, appear to require more cognitively complex processing than the broad Gs cluster.  This finding is consistent with Ackerman, Beier and Boyle's (2002) research suggesting that perceptual speed has another level of factor breadth, via the identification of four subtypes of perceptual speed (i.e., pattern recognition, scanning, memory and complexity; see McGrew, 2005, and Schneider & McGrew, 2012, for discussion of a hierarchically organized model of speed abilities).  Based on Brunswick Symmetry/BIS cognitive complexity principles, one would predict that a Gs-P cluster comprised of two parallel forms of the same task (e.g., two Visual Matching or two Cross Out tests) would be less cognitively complex than broad Gs.  A hint of the possible correctness of this hypothesis is present in the inspection of the Gsm-MS-MW domain results.
The WJ III Gsm cluster is the combination of the Numbers Reversed (MW) and Memory for Words (MS) tests.  In contrast, the WJ III Auditory Memory Span (AUDMS; Gsm-MS) cluster is much less cognitively complex than Gsm (see Figure 12).  Like the Perceptual Speed (Gs-P) cluster described in the context of the processing speed family of clusters, the Auditory Memory Span cluster is comprised of two tests with the same memory span (MS) narrow ability classification (Memory for Words; Memory for Sentences).  Why is this narrow cluster less complex than its broad parent Gsm cluster, while the opposite held true for Gs-P and Gs?  Task analysis suggests that the two memory span tests are more alike than the two perceptual speed tests.  The Memory for Words and Memory for Sentences tests require the same cognitive operation—simply repeating back, in order, words or sentences spoken to the subject.  This differs from the WJ III Perceptual Speed cluster, as the similarly classified narrow Gs-P tests most likely invoke both common and different cognitive component operations.  Also, the Memory Span cluster tests are comprised of stimuli from the same BIS content facet (i.e., words and sentences; auditory-linguistic/verbal).  In contrast, the Gs-P Visual Matching and Cross Out tests involve two different content facets (numeric and visual-figural).
In contrast, the WJ III Working Memory cluster (Gsm-MW) is more cognitively complex than the parent Gsm cluster.  This finding is consistent with the prior WJ III Gs/Perceptual Speed and WJ III Gsm/Auditory Memory Span discussion.  The WJ III Working Memory cluster is comprised of the Numbers Reversed and Auditory Working Memory tests.  Numbers Reversed requires the processing of stimuli from one BIS content facet—numeric stimuli.  In contrast, Auditory Working Memory requires the processing of stimuli from two BIS content facets (numeric and auditory-linguistic/verbal; i.e., numbers and words).  The cognitive operations of the two tests also differ.  Both require holding the presented stimuli in active working memory space.  Numbers Reversed then requires the simple reproduction of the numbers in reverse order.  In contrast, the Auditory Working Memory test requires the storage of the numbers and words in separate chunks, and then the production of the forward sequence of each respective chunk (numbers or words), one chunk before the other.  Greater reliance on divided attention is most likely occurring during the Auditory Working Memory test.
In summary, the results presented in Figure 12 suggest that it is possible to develop cluster scores that vary in degree of cognitive complexity within the same broad CHC domain.  More important is the finding that the classification of clusters as broad or narrow does not provide information on a measure's cognitive complexity.  Cognitive complexity, in the Lohman sense, can be achieved within CHC domains without resorting to mixing abilities across CHC domains.  Finally, narrow clusters can be more cognitively complex, and thus likely better predictors of complex school achievement, than broad clusters or other narrow clusters.

Implications for Test Battery Design and Assessment Strategies
The recognition of cognitive complexity as an important ITD principle suggests that the push to feature broad CHC clusters in contemporary test batteries, or in the construction of cross-battery assessments, fails to recognize the importance of cognitive complexity.  I plead guilty to contributing to this focus via my role in the design of the WJ III, which focused extensively on broad CHC domain construct representation—most WJ III narrow CHC clusters require the use of the third WJ III cognitive book (the Diagnostic Supplement; Woodcock, McGrew, Mather & Schrank, 2003).  Similarly, I am guilty as charged regarding the dominance of broad CHC factor representation in the development of the original cross-battery assessment principles (Flanagan & McGrew, 1997; McGrew & Flanagan, 1998).
It is also my conclusion that the "narrow is better" conclusion of McGrew and Wendling (2010) may need modification.   Revisiting the McGrew and Wendling (2010) results suggests that the narrow CHC clusters that were more predictive of academic achievement may have been so not necessarily because they are narrow, but because they are more cognitively complex.  I offer the hypothesis that a more correct principle is that cognitively complex measures are better.   I welcome new research focused on testing this principle.
In retrospect, given the universe of WJ III clusters, a broad+narrow hybrid approach to intelligence battery configuration (or cross-battery assessment) may be more appropriate.  Based exclusively on the results presented in Figure 12, the following clusters appear to be those that might best be featured in the "front end" of the WJ III or a selective-testing constructed assessment—those clusters that examiners should consider first within each CHC broad domain:  Fluid Reasoning (Gf),[4] Comprehension-Knowledge (Gc), Long-term Retrieval (Glr), Working Memory (Gsm-MW), Phonemic Awareness 3 (Ga-PC), and Perceptual Speed (Gs-P).  No clear winner is apparent for Gv, although the narrow Visualization cluster is slightly more cognitively complex than the Gv and Gv3 clusters.  The above suggests that if broad clusters are desired for the domains of Gs, Gsm and Gv, then additional testing beyond the "front end" or featured tests and clusters would require administration of the necessary Gs (Decision Speed), Gsm (Memory for Words) and Gv (Picture Recognition) tests.

Utilization of the ITD principle of optimizing within-CHC cognitive complexity of clusters suggests that a different emphasis and configuration of WJ III tests might be more appropriate.  It is proposed that the above WJ III cluster complexity priority or feature model would likely allow practitioners to administer the best predictors of school achievement.  I further hypothesize that this cognitive complexity based broad+narrow test design principle most likely applies to other intelligence test batteries that have adhered to a primary focus on featuring tests that are the purest indicators of two or more narrow abilities within the provided broad CHC interpretation scheme.  Of course, this is an empirical question that begs research with other batteries.  More useful would be similar MDS radex cognitive complexity analyses of cross-battery intelligence data sets.[5]

References (not included in this post.  The complete paper will be announced and made available for reading and download in the near future)



[1] This does not mean that cognitive complexity may not be related to the integrity of the human connectome or different brain networks. I am excited about contemporary brain network research (Bressler & Menon, 2010; Cole, Yarkoni, Repovs, Anticevic & Braver, 2012; Toga, Clark, Thompson, Shattuck, & Van Horn, 2012; van den Heuvel & Sporns, 2011), particularly that which has demonstrated links between neural network efficiency and working memory, controlled attention and clinical disorders such as ADHD (Brewer, Worhunsky, Gray, Tang, Weber & Kober, 2011; Lutz, Slagter, Dunne, & Davidson, 2008; McVay & Kane, 2012). The Parieto-Frontal Integration Theory (P-FIT) of intelligence is particularly intriguing as it has been linked to CHC psychometric measures (Colom, Haier, Head, Álvarez-Linera, Quiroga, Shih, & Jung, 2009; Deary, Penke, & Johnson, 2010; Haier, 2009; Jung & Haier, 2007) and could be linked to CHC cognitively-optimized psychometric measures.
[2] Only reading and math achievement clusters were included, both to simplify the presentation of the results and because, as reported previously, reading and writing measures typically do not differentiate well in multivariate analyses—hence the combined Grw domain in CHC theory.
[3] GIA-Ext is also represented by a black circle.
[4] Although the WJ III Fluid Reasoning 3 cluster (Gf3) is slightly closer to the center of the figure, the difference from Fluid Reasoning (Gf) is not large and time efficiency would argue for the two-test Gf cluster.
[5] It is important to note that the cognitive complexity analysis and interpretation discussed here is specific to within the WJ III battery only. The degree of cognitive complexity in the WJ III cognitive clusters in comparison to composite scores from other intelligence batteries can only be ascertained by cross-battery MDS complexity analysis.

Thursday, March 01, 2012

IAP101 Brief #12: Use of IQ component part scores as indicators of general intelligence in SLD and MR/ID diagnosis

   
            Historically, the concept of general intelligence (g), as operationalized by intelligence test battery global full scale IQ scores, has been central to the definition and classification of individuals with a specific learning disability (SLD) as well as individuals with an intellectual disability (ID).  Contemporary definitions and operational criteria, however, have elevated intelligence test battery composite or part scores to a more prominent role in the diagnosis and classification of SLD and, more recently, ID.
            In the case of SLD, third-method consistency definitions prominently feature component or part scores in (a) the identification of consistency between low achievement and relevant cognitive abilities or processing disorders and (b) the requirement that an individual demonstrate relative cognitive and achievement strengths (see Flanagan, Fiorello & Ortiz, 2010).  The global IQ score is de-emphasized in these third-method SLD approaches.
            In contrast, the 11th edition of the AAIDD Intellectual Disability: Definition, Classification, and Systems of Supports manual (AAIDD, 2010) placed general intelligence, and thus global composite IQ scores, as central to the definition of intellectual functioning.  This has not been without challenge.  For example, the AAIDD ID definition has been criticized for an over-reliance on the construct of general intelligence and for ignoring contemporary psychometric theoretical and empirical research that has converged on a multidimensional hierarchical model of intelligence (viz., Cattell-Horn-Carroll or CHC theory).
The potential constraints of the "ID-as-a-general-intelligence-disability" definition were anticipated by the Committee on Disability Determination for Mental Retardation in its National Research Council report, Mental Retardation:  Determining Eligibility for Social Security Benefits (Reschly, Meyers & Hartel, 2002).  This national committee of experts concluded that "during the next decade, even greater alignment of intelligence tests and the IQ scores derived from them and the Horn-Cattell and Carroll models is likely.  As a result, the future will almost certainly see greater reliance on part scores, such as IQ scores for Gc and Gf, in addition to the traditional composite IQ.  That is, the traditional composite IQ may not be dropped, but greater emphasis will be placed on part scores than has been the case in the past" (Reschly et al., 2002, p. 94).  The committee stated that "whenever the validity of one or more part scores (subtests, scales) is questioned, examiners must also question whether the test's total score is appropriate for guiding diagnostic decision making.  The total test score is usually considered the best estimate of a client's overall intellectual functioning.  However, there are instances in which, and individuals for whom, the total test score may not be the best representation of overall cognitive functioning" (pp. 106-107).
            The increased emphasis on intelligence test battery composite part scores in SLD and ID diagnosis and classification raises a number of measurement and conceptual issues (Reschly et al., 2002).  For example, what is a statistically significant difference?  What is a meaningful difference?  Which cognitive abilities should serve as appropriate proxies of general intelligence when the global IQ is questioned?  What should be the magnitude of the total test score?
Only the issue of appropriate cognitive abilities will be discussed here.  This issue addresses which component or part scores are most correlated with general intelligence (g)—that is, which component part scores are high g-loaders?  The traditional consensus has been that measures of Gc (crystallized intelligence; comprehension-knowledge) and Gf (fluid intelligence or reasoning) are the highest g-loading measures and constructs and are the most likely candidates for elevated status when diagnosing ID (Reschly et al., 2002).  Although not always stated explicitly, the third-method consistency SLD definitions specify that an individual must demonstrate "at least an average level of general cognitive ability or intelligence" (Flanagan et al., 2010, p. 745), a statement that implicitly suggests cognitive abilities and component scores with high g-ness.
Table 1 is intended to provide guidance when using component part scores in the diagnosis and classification of SLD and ID.  Table 1 presents a summary of the comprehensive, nationally normed, individually administered intelligence batteries that possess satisfactory psychometric characteristics (i.e., national norm samples, adequate reliability and validity for the composite g-score) for use in the diagnosis of ID and SLD.



The Composite g-score column lists the global general intelligence score provided by each intelligence battery.  This score is the best estimate of a person's general intellectual ability, which currently is most relevant to the diagnosis of ID as per AAIDD.  All composite g-scores listed in Table 1 meet Jensen's (1998) psychometric sampling error criteria as valid estimates of general intelligence.  As per Jensen's number-of-tests criterion, all of the intelligence batteries' g-composites are based on a minimum of nine tests that sample at least three primary cognitive ability domains.  As per Jensen's variety-of-tests criterion (i.e., information content, skills and demands for a variety of mental operations), the batteries, when viewed from the perspective of CHC theory, vary in ability domain coverage: four (CAS, SB5), five (KABC-II, WISC-IV, WAIS-IV), six (DAS-II) and seven (WJ III) domains (Flanagan, Ortiz & Alfonso, 2007; Keith & Reynolds, 2010).   As recommended by Jensen (1998), "the particular collection of tests used to estimate g should come as close as possible, with some limited number of tests, to being a representative sample of all types of mental tests, and the various kinds of test should be represented as equally as possible" (p. 85).  Users should consult sources such as Flanagan et al. (2007) and Keith and Reynolds (2010) to determine how each intelligence battery approximates Jensen's optimal design criterion, the specific CHC domains measured, and the proportional representation of the CHC domains in each battery's composite g-score.
Also included in Table 1 are the component part scales provided by each battery (e.g., WAIS-IV Verbal Comprehension Index, Perceptual Reasoning Index, Working Memory Index, and Processing Speed Index), followed by their respective within-battery g-loadings.[1]  Examination of the g-ness of composite scores from existing batteries (see last three columns in Table 1) suggests the traditional assumption that measures of Gf and Gc are the best proxies of general intelligence may not hold across all intelligence batteries.[2] 
In the case of the SB5, all five composite part scores are very similar in g-loadings (h2 = .72 to .79).  No single SB5 composite part score appears better than the others for suggesting average general intelligence (when the global IQ score is not used for this purpose).  At the other extreme is the WJ III, where the Fluid Reasoning, Comprehension-Knowledge, and Long-term Storage and Retrieval cluster scores are the best g-proxies for part-score based interpretation within the WJ III.  The WJ III Visual Processing and Processing Speed clusters are not composite part scores that should be emphasized as indicators of general intelligence.  Across all batteries that include a processing speed component part score (DAS-II, WAIS-IV, WISC-IV, WJ III), the respective processing speed scale is always the weakest proxy for general intelligence and thus would not be viewed as a good estimate of general intelligence.
            It is also clear that one cannot assume that composites with similar-sounding names should have similar relative g-ness status within different batteries.  For example, the Gv (visual-spatial or visual processing) clusters in the DAS-II (Spatial Ability) and SB5 (Visual-Spatial Processing) are relatively strong g-measures within their respective batteries, but the same cannot be said for the WJ III Visual Processing cluster.  Even more interesting are the differences in the WAIS-IV and WISC-IV relative g-loadings for similarly named index scores.
For example, the Working Memory Index is the highest g-loading component part score (tied with the Perceptual Reasoning Index) in the WAIS-IV but is only third (out of four) in the WISC-IV.   The Working Memory Index is comprised of the Digit Span and Arithmetic subtests in the WAIS-IV and the Digit Span and Letter-Number Sequencing subtests in the WISC-IV.  The Arithmetic subtest has been reported to be a factorially complex test that may tap fluid intelligence (Gf-RQ; quantitative reasoning), quantitative knowledge (Gq), working memory (Gsm), and possibly processing speed (Gs; Keith & Reynolds, 2010; Phelps, McGrew, Knopik & Ford, 2005).   The factorially complex characteristics of the Arithmetic subtest (which, in essence, make it function like a mini-g proxy) would explain why the Working Memory Index is a good proxy for g in the WAIS-IV but not in the WISC-IV.  The WAIS-IV and WISC-IV Working Memory Index scales, although named the same, are not measuring identical constructs.

A critical caveat is that the g-loadings cannot be compared across different batteries.  g-loadings may change when the mixture of measures included in the analyses change.  Different "flavors" of g can result (Carroll, 1993; Jensen, 1998). The only way to compare the g-ness across batteries is with appropriately designed cross- or joint-battery analysis (e.g., WAIS-IV, SB5 and WJ III analyzed in a common sample).
The above within- and across-battery examples illustrate that those who use component part scores as an estimate of a person's general intelligence must be aware of the composition and psychometric g-ness of the component scores within each intelligence battery.  Not all component part scores in different intelligence batteries are created equal with regard to g-ness.  Also, similarly named factor-based composite scores may not measure the identical construct and may vary in degree of within-battery g-ness.  This is not a new problem in the context of naming factors in factor analysis and, by extension, factor-based intelligence test composite scores.  Cliff (1983) described this nominalistic fallacy in simple language: "if we name something, this does not mean we understand it" (p. 120).




[1] As noted in the footnotes in Table 1, all composite score g-loadings were computed by Kevin McGrew by entering the smallest number (and largest age ranges covered) of the published correlation matrices within each intelligence battery's technical manual (note the exception for the WJ III) in order to obtain an average g-loading estimate.  It would have been possible to calculate and report these values for each age-differentiated correlation matrix for each intelligence battery.  However, the purpose of this table is to provide the best possible average value across the entire age range of each intelligence battery.  Floyd and colleagues have published age-differentiated g-loadings for the DAS-II and WJ III.  Those values were not used as they are based on the principal common factor analysis method, a method that analyzes the reliable shared variance among tests.  Although principal factor and principal component loadings typically will order measures in the same relative position, the principal factor loadings typically will be lower.  Given that the imperfect manifest composite scale scores are those that are utilized in practice, and to also allow uniformity in the calculation of the g-loadings reported in Table 1, principal component analysis was used in this work.  The same rationale was used for not using the latent factor loadings on a higher-order g-factor from SEM/CFA analyses of each test battery.  Loadings from CFA analyses represent the relations between the underlying theoretical ability constructs and g purged of measurement error.  Also, the final CFA solutions reported in a battery's technical manual (or independent journal articles) frequently allow tests to be factorially complex (load on more than one latent factor), a measurement model that does not resemble the real-world reality of the manifest/observed composite scores used in practice.  Latent factor loadings on a higher-order g-factor will often differ significantly from principal component loadings based on the manifest measures, both in absolute magnitude and relative size (e.g., see the high Ga loading on g in the WJ III technical manual, which is at variance with the manifest-variable-based Ga loading reported in Table 1).
[2] The h2 values are the values that should be used to compare the relative amount of g-variance present in the component part scores within each intelligence battery.
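To illustrate the general procedure described in footnotes 1 and 2, here is a minimal sketch of averaging first-principal-component loadings across two correlation matrices and squaring them to obtain h2 values.  The scale labels borrow the WAIS-IV index names mentioned in the text, but the correlation values are hypothetical assumptions and are not taken from any technical manual.

```python
import numpy as np

def pc1_loadings(R):
    """First principal component loadings of a correlation matrix."""
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    largest = np.argmax(eigenvalues)
    return np.abs(eigenvectors[:, largest] * np.sqrt(eigenvalues[largest]))

# Two hypothetical composite-score correlation matrices (e.g., for two broad
# age groups); illustrative values only, not from any technical manual.
scales = ["VCI", "PRI", "WMI", "PSI"]
R_group_1 = np.array([
    [1.00, 0.60, 0.55, 0.40],
    [0.60, 1.00, 0.50, 0.45],
    [0.55, 0.50, 1.00, 0.42],
    [0.40, 0.45, 0.42, 1.00],
])
R_group_2 = np.array([
    [1.00, 0.62, 0.50, 0.38],
    [0.62, 1.00, 0.52, 0.44],
    [0.50, 0.52, 1.00, 0.40],
    [0.38, 0.44, 0.40, 1.00],
])

# Average the loadings across matrices; h^2 (the squared loading) is the
# proportion of each composite's variance associated with the first component.
mean_loadings = np.mean([pc1_loadings(R_group_1), pc1_loadings(R_group_2)], axis=0)
for name, g in zip(scales, mean_loadings):
    print(f"{name}: mean g-loading = {g:.2f}, h^2 = {g ** 2:.2f}")
```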

Tuesday, February 07, 2012

IAP Applied Psychometrics 101 Brief Report # 11: What is the typical IQ and adaptive behavior correlation?


What is the typical relation (correlation) between standardized measures of adaptive behavior (AB) and measures of intelligence (IQ)?  This is an important question given the role both play in the definition and diagnosis of mental retardation (MR) / intellectual disability (ID).

During the late 1970's and 1980's this was an active area of research.  Numerous studies were published that reported correlations between a wide variety of adaptive behavior scales and intelligence tests.  Probably the best synthesis of this research was provided by Harrison (1987); her review included a table of more than 40 correlations (Table 2 in the referenced article).  Harrison concluded, as have most others who have reviewed the literature, that "the majority of correlations fall in the moderate range" (p. 39).  When the correlations with maladaptive measures are excluded from Harrison's table, the correlations range from .03 to .91, a wide range.  Harrison could not identify a specific explanation for this variability, but speculated that the variables that might impact the magnitude of the correlations were the specific adaptive behavior scale or measure of intelligence used and differences in sample variability.

Subsequently, the Committee on Disability Determination for Mental Retardation published a National Research Council report (Mental Retardation:  Determining Eligibility for Social Security Benefits; Reschly, Meyers & Hartel, 2002) that also addressed the AB/IQ relation. The report concluded that AB/IQ studies report correlations "ranging from 0 (indicating no relationship) to almost +1 (indicating a perfect relationship).  Data also suggest that the relationship between IQ and adaptive behavior varies significantly by age and levels of retardation, being strongest in the severe and moderate ranges and weakest in the mild range.  There is a dearth of data on the relationship of IQ and adaptive behavior functioning at the mild level of retardation" (p. 8).  Factors identified as moderating the AB/IQ correlation were scale content, measurement of competences versus perceptions, sample variability, ceiling and floor problems of the scales, and level of mental retardation.

Given the above, it is hard to render an objective statement on the approximate typical AB/IQ correlation.  With this in mind, an informal research synthesis was completed and is reported here.

First, only the AB/IQ correlations (IQ/maladaptive correlations were excluded) from Harrison's 1987 table were extracted (n = 43 correlations).  Then, the technical manuals for the current editions of the three most frequently used contemporary adaptive behavior scales were reviewed for additional correlations.  These included the Vineland Adaptive Behavior Scales (Sparrow, Cicchetti & Balla, 2005; n = 2 correlations of .12, .20) and the Adaptive Behavior Assessment System-II (Harrison & Oakland, 2008; n = 10 correlations ranging from .39 to .67; median = .51).

Although six different correlations were reported in the Scales of Independent Behavior-Revised manual (SIB-R; Bruininks, Woodcock, Weatherman & Hill, 1996), these values were not used, as they are inflated estimates when compared to the type of correlations typically reported.  For example, very high correlations of .79, .82 and .91 are reported for certain groups.  A close reading of the tables reveals that the SIB-R correlations with either the WJ or WJ-R intelligence test were calculated on the basis of the W-score growth metric.   By definition, a growth metric includes age variance.  If correlations are reported across wide age groups, the correlations convey variance related to the correlation between the AB and IQ constructs but also contain shared variance due to the influence of general age-based development.  Thus, the SIB and SIB-R correlations with IQ, although not wrong and providing different information, are not comparable to the other reported correlations, where age variance has been removed (typically by correlating age-based standard scores).  Clear evidence for this point comes from McGrew and Bruininks (1990), who used the same SIB/WJ subject data reported in the SIB and SIB-R manuals, but who removed the W-score-confounded age variance prior to the calculation of latent factor correlations (via confirmatory factor analysis) between latent practical intelligence (SIB adaptive behavior) and conceptual intelligence (WJ IQ) factors.  The resulting AB/IQ correlations for three different age groups were .38, .56 and .58—far below the values in the .79 to .91 range.  Thus, the values from McGrew and Bruininks (1990) were used as the estimates of the SIB/SIB-R IQ correlations in the current synthesis.

Finally, latent AB/IQ correlations (as estimated from confirmatory factor analysis models) of .27 and .39 were included from Ittenbach, Spiegel, McGrew and Bruininks (1992) and Keith, Fehrmann, Harrison and Pottebaum (1987), respectively.  This process resulted in the addition of 17 AB/IQ correlations to the 43 from Harrison, for a total of 60 correlations.

Descriptive statistics for this collection of 60 AB/IQ correlations are as follows: a range of .12 to .90, a mean of .51, a median of .48, and a standard deviation of .20.  Below is a figure that includes a frequency polygon (with a smoothed normal curve overlay) and a box-whisker plot of the data set.  A review of the box-and-whisker plot (at the bottom) shows the median correlation (.48) as a vertical line within the rectangle.  The rectangle includes the middle 50% of the distribution of correlations and shows an approximate range of just below .40 to just above .65.  Of particular note is the shape of the frequency polygon and smoothed normal curve.  The shape of the frequency polygon is consistent with a normal curve.  In quantitative research synthesis, this type of normal distribution suggests that the total data set included in the review is not biased--both studies that are likely underestimates and studies that are likely overestimates of the "true" population correlation (due to method or sampling factors) are included.  More importantly, the "bunching up" of the majority of the correlations in the middle provides confidence that the median of this distribution is a reasonable, unbiased estimate of the population correlation.  This type of relatively normal distribution suggests that the current collection of 60 AB/IQ correlations is likely a reasonable approximation of the complete set of population AB/IQ correlations.
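A minimal sketch of the descriptive analysis and plots described above is shown below.  Because the 60 individual coefficients are not listed in this post, the values here are simulated stand-ins with a similar center and spread, used purely to illustrate the calculations.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated stand-ins for the 60 AB/IQ correlations (the actual coefficients
# are not listed in this post); drawn to have a similar center and spread.
rng = np.random.default_rng(seed=1)
correlations = np.clip(rng.normal(loc=0.50, scale=0.20, size=60), 0.12, 0.90)

print(f"n      = {correlations.size}")
print(f"mean   = {correlations.mean():.2f}")
print(f"median = {np.median(correlations):.2f}")
print(f"SD     = {correlations.std(ddof=1):.2f}")

# Frequency histogram and box-whisker plot, analogous to the figure described.
fig, (ax_hist, ax_box) = plt.subplots(2, 1, figsize=(6, 5), sharex=True)
ax_hist.hist(correlations, bins=10, edgecolor="black")
ax_hist.set_ylabel("Frequency")
ax_box.boxplot(correlations, vert=False)
ax_box.set_xlabel("AB/IQ correlation")
plt.tight_layout()
plt.show()
```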


Based on this informal (and admittedly incomplete) review of the AB/IQ correlation research, one can conclude that a reasonable estimate of the typical AB/IQ correlation is approximately .50 (mean = .51; median = .48), with most correlations ranging from approximately .40 to .65.  This finding is consistent with Harrison's 1987 conclusion of a "moderate" correlation.  The current analysis continues to reinforce Harrison's (and others') conclusion that adaptive behavior and intelligence are statistically related constructs, but they are still independent.   An average correlation of .50 indicates that AB and IQ share approximately 25% common variance (approximately 15% to 40% common variance if one looks at the range of the middle 50% of the distribution of values).  In practical terms this means that, for any individual, standard scores from AB and IQ tests will frequently diverge and not always be consistent.

Harrison (1987) provides a nice explanation of the primary reasons for the moderate correlation between AB and IQ; her quote is reproduced below.
Numerous caveats need to be applied to this analysis and report.  The most important are:
  • A comprehensive review of all possible published and unpublished AB/IQ research studies was not completed.  Clearly there are more studies "out there" that could be added to the synthesis. 
  • The analysis makes no attempt to determine if there are moderator effects.  That is, is the typical correlation likely to systematically vary as a function of AB measures, IQ measures, variability in the sample's level of functioning, manifest/measured versus latent variable correlations, level of ability, etc.? 
  •  This has not been peer reviewed.


 It is hoped that this ad hoc update of Harrison's (1987) review, augmented by quantitative organizational methods, will serve to stimulate a formal meta-analysis by others (hint: a nice study or thesis for someone?).




Wednesday, April 07, 2010

Psychometric PS to Johnston v Florida (2010) denied appeal re: new WAIS-IV scores

This is a follow-up to my brief comments yesterday regarding the Johnston v Florida (2010) denied MR/ID appeal of two days ago.

As mentioned in the decision and my blog comment, the WAIS-III and WAIS-IV correlated .94 in a study reported in the WAIS-IV technical manual.  This is a very high correlation, but it does NOT mean that the two tests should be expected to provide identical IQ scores.  I discuss these issues in a prior IAP AP101 report.

The tests have different norm dates and thus the later version (WAIS-IV) would be expected to provide a lower score based on the Flynn effect.  More importantly, as reported in the IAP AP101 report, when one calculates the standard deviation of the difference score (see page 6 of that report) for a correlation of .94, the resulting value is 5.2 (rounded to 5 for ease of discussion).  This means that, even with the two tests correlated at the .94 level, roughly 68% of the general population would be expected to show WAIS-III/WAIS-IV difference scores ranging from -5 to +5, a band of about 10 IQ points.  Please review that prior report for further explanation and discussion.
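A minimal sketch of that calculation, assuming the standard formula for the standard deviation of a difference between two scores with equal SDs of 15, is shown below (see the prior AP101 report for the full derivation).

```python
import math

# Standard deviation of the difference between two IQ scores (SD = 15 each)
# correlated at r, assuming equal variances: SD_diff = SD * sqrt(2 * (1 - r)).
sd, r = 15.0, 0.94
sd_diff = sd * math.sqrt(2 * (1 - r))
print(f"SD of WAIS-III/WAIS-IV difference scores: {sd_diff:.1f}")  # ~5.2

# Roughly 68% of the general population falls within +/- 1 SD_diff, i.e.,
# a band of about 10 IQ points, even with tests correlated at .94.
```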


Friday, February 05, 2010

AP101 Brief #6: Understanding Wechsler IQ score differences--the CHC evolution of the Wechsler FS IQ score

[Note.  A typo in the original tables used to construct the WAIS figure below has been fixed.  Visual Puzzles on the WAIS-IV had been incorrectly designated as a measure of Gf; it should have been classified as Gv.  This has now been changed and the corresponding text has also been corrected accordingly.  Sorry for this error.]

Why do the IQ scores for the same individual often differ?

This question often perplexes both users and recipients of psychological reports. In a previous IAP Applied Psychometrics 101 report (AP101 #1:  Understanding IQ score differences) I discussed general statistical information related to the magnitude and frequency of expected IQ score differences for different tests (as a function of the correlation between tests).  In that report I mentioned the following general categories of possible reasons for IQ score differences/discrepancies.
Factors contributing to significant IQ differences are many, and include: (a) procedural or test administration issues (e.g., scoring errors; improper test administration; malingering; age vs grade norms), (b) test norm or standardization differences (e.g., possible errors in the norms; sampling plan for selecting subjects for developing the test norms; publication date of test), (c) content differences, and/or, (d) in the case of group research, research methodology issues (e.g., sample pre-selection effects on reported mean IQs) (McGrew, 1994).
At this time I return to one of these factors: content differences. This brief report does not focus on content differences between different IQ tests but, instead, focuses on the changing content across the various editions of the two primary Wechsler intelligence batteries (WISC/WAIS). This information should be useful when individuals are comparing IQ scores (for the same person) based on different versions of the Wechslers.

Of course, content differences will not be the only reason for possible IQ score differences across editions of the Wechslers for an individual. Other possible reasons may include real changes in intelligence, serious scoring errors in either of the two test administrations, the Flynn effect, and other factors.   This post focuses only on the changing CHC content of the WISC and WAIS series of intelligence batteries.

As discussed previously in numerous posts, contemporary CHC theory is currently considered the consensus psychometric taxonomy of human cognitive abilities (see prior posts for information regarding the theory).  For this current brief report, I reviewed the extant CHC-organized factor analysis literature on the various Wechsler intelligence batteries. I then used this information as per the following steps:

1.  I identified the individual subtests in all editions of the WISC and WAIS batteries that contributed to the respective Full Scale (FS) IQ score for each battery.

2.  Using the accepted authoritative sources regarding the CHC analysis of the Wechsler intelligence batteries (Flanagan, McGrew and Ortiz, 2000; Flanagan, Ortiz, and Alfonso, 2007; McGrew and Flanagan, 1998; Woodcock, 1990), I classified each of the above identified subtests as per the broad CHC ability (or abilities) measured by each subtest.  A very brief CHC overview (and ability definition cheat sheet) is available in a prior post.

3.  I calculated the percentage of each broad CHC ability represented in each battery's respective FS IQ. For example, for the 1974 WISC-R, the FS IQ is calculated by summing the WISC-R scaled scores from 10 of the individual subtests. Four of these 10 subtests (Information, Comprehension, Similarities, and Vocabulary) have been consistently classified as indicators of broad Gc. Since each of the individual subtests contributes equally to the FS IQ score, Gc represents at least 40% (4 of 10) of the WISC-R FS IQ.
  • However, the extant CHC Wechsler research has consistently identified a few tests with dual CHC factor loadings. In particular, both Picture Completion and Picture Arrangement have been consistently reported to load on both the Gv (performance scale) and Gc (verbal scale) on the WISC-R. For tests that demonstrated consistent dual CHC factor loadings, I assigned each broad CHC ability measured as representing 1/2 (0.5) of the test. More precise proportional calculation might have been possible (via the calculation of the average factor loadings across all studies), but for the current purpose I used this  simple and (IMHO) reasonably approximate method.
  • As a result, both the Picture Completion and Picture Arrangement subtests were each assigned 1/2 (0.5) Gc and 1/2 (0.5) Gv ability classifications. When added together, these two 0.5 Gc test classifications sum to 1.0. When combined with the other four clear Gc tests mentioned above, the final Gc test indicator total is 5.  As a result, the total Gc proportional percentage of the WISC-R FS IQ was calculated as 50%.
4.  Although the Wechsler CHC classifications were based on the primary sources noted above, I did revise some commonly accepted classifications based upon my professional opinion (when supported by empirical research). For example, the Arithmetic subtest has frequently been classified as a measure of Gf, Gsm, and sometimes Gs.   However, when valid factor indicators of Quantitative Knowledge (Gq) have been included in analyses, the Arithmetic subtest consistently displays a robust loading on the Gq factor and only minor loadings on other CHC abilities. I placed greater stock in these studies (e.g., Phelps et al., 2005; Woodcock, 1990) as I deem them to be better designed CHC studies (they included a broader array of CHC ability indicators).  My final determination for Arithmetic was that it is a test that measures both Gq and Gsm.
  • In addition, where appropriate and consistent with published research, I modified a few other commonly accepted CHC Wechsler test classifications to reflect recent research (e.g., Kaufman et al., 2001; Keith et al., 2006; Keith & Reynolds, in press, CHC abilities and cognitive tests: What we've learned from 20 years of research, Psychology in the Schools; Lichtenberger & Kaufman, 2001; McGrew, 2009; Tulsky & Price, 2003; plus the factor studies reported in the respective technical manuals of each battery). Referring to the mixed measures of Picture Completion and Picture Arrangement mentioned above, research with the WISC-IV has suggested that Picture Completion is primarily a measure of Gv (its Gc factor loading is minimal or nonexistent), while Picture Arrangement continues to show significant loadings on both Gv and Gc. Thus, Picture Arrangement was classified as a mixed measure of Gc and Gv for all editions of the WISC. In contrast, in the case of the WISC-IV, Picture Completion was classified as a measure of Gv.
  • It is not possible to describe in detail all of the minor "fine tunings" I did for select Wechsler CHC test classifications. The bases for all are included in the various reference sources cited above. In the final analysis, the Wechsler CHC test classifications used in this brief report are those made by me (Kevin McGrew), based on my integration and understanding of the extant empirical research regarding the CHC abilities measured by individual tests in both the WISC and WAIS series of intelligence batteries.
5.  Finally, I calculated the proportion of CHC abilities represented in the FS IQ scores for all editions of the WISC and WAIS.  These values were tabled and plotted on graphs; a minimal computational sketch of this tallying procedure is shown below, and the summary graphs follow.
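The sketch below follows the weighting rules described in steps 3 and 4 (equal subtest weights, with dual-loading tests split 0.5/0.5) using the WISC-R examples discussed above.  Treat the specific classifications and weights as illustrative of the method rather than as the exact final table behind the graphs.

```python
# Illustrative tally of broad CHC representation in the WISC-R FS IQ, following
# the weighting rules in steps 3 and 4 above (dual-loading tests split 0.5/0.5).
# Classifications here are for illustration of the procedure only.
wisc_r_fsiq_subtests = {
    "Information":         {"Gc": 1.0},
    "Comprehension":       {"Gc": 1.0},
    "Similarities":        {"Gc": 1.0},
    "Vocabulary":          {"Gc": 1.0},
    "Arithmetic":          {"Gq": 0.5, "Gsm": 0.5},
    "Picture Completion":  {"Gv": 0.5, "Gc": 0.5},
    "Picture Arrangement": {"Gv": 0.5, "Gc": 0.5},
    "Block Design":        {"Gv": 1.0},
    "Object Assembly":     {"Gv": 1.0},
    "Coding":              {"Gs": 1.0},
}

# Sum the weights for each broad CHC ability and express them as a percentage
# of the 10 equally weighted subtests contributing to the FS IQ.
totals = {}
for weights in wisc_r_fsiq_subtests.values():
    for ability, weight in weights.items():
        totals[ability] = totals.get(ability, 0.0) + weight

n_subtests = len(wisc_r_fsiq_subtests)
for ability, total in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{ability}: {100 * total / n_subtests:.0f}% of FS IQ")
# Prints: Gc 50%, Gv 30%, Gs 10%, Gq 5%, Gsm 5%
```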





Conclusions/observations:  A review of all information presented (in and across both graphs) produces a number of interesting conclusions and hypotheses. I only present a few at this time. I encourage others to review the documents and provide additional insights or commentary via the comment feature of the blog or on the various listservs where I have posted an FYI message regarding this set of analyses.

1.  Historically, the FS IQ score from the Wechsler batteries, which is typically interpreted as a measure of general intelligence (g), has been heavily weighted towards the measurement of Gc and Gv abilities. This should not be surprising given the original design blueprint specified by David Wechsler (the measurement of intelligence vis-a-vis two different modes of expression).

2.  The WISC series remained constant in the CHC FS IQ composition from 1949 to 1991. Although tests may have been revised or replaced, the differential CHC proportional contribution to the FS IQ was relatively equal across all three editions. Following the 80% combined contribution of Gc and Gv, much smaller contributions to the FS IQ came from measures of Gs (10%) and Gq and Gsm (5% respectively).

3.  The WISC-IV represents a significant change in the general intelligence FS IQ score provided. Gc representation decreased by approximately 20%, Gv representation was cut in half (30% to 15%), Gs representation increased slightly (5%), and Gq was eliminated. More importantly, there was a fourfold increase in the contribution of Gsm (from 5% to 20%) and a 20% increase in Gf representation (from 0 to 20%)! Clearly, different FS IQ scores may be obtained by the same individual when comparing the WISC-IV FS IQ to WISC-R/WISC-III scores.  More importantly, the difference may be a function of the different mixture of CHC abilities represented in the different editions of the WISC series.

4.  The first two editions of the WAIS (WAIS and WAIS-R) were identical in differential CHC ability contribution to the FS IQ score. However, starting with the WAIS-III, significant changes in the adult Wechsler battery commenced and were later amplified in the WAIS-IV. Both the WAIS-III and WAIS-IV FS IQs reduced the amount of Gc representation by approximately 14% to 15%. The contribution of Gv decreased only slightly (27.3% to 22.7%) from the WAIS-R to the WAIS-III, and then decreased slightly again (22.7% to 20%) from the WAIS-III to the WAIS-IV. Offsetting the reductions in Gc and Gv over these two editions was a trend toward greater measurement of Gs (which has doubled, from around 9% in the first two editions to approximately 18% to 20% in the last two editions). The Gq contribution to the FS IQ has remained relatively similar throughout all editions. The most dramatic change, which is also consistent with the WISC series, is an approximate tenfold increase (0% to 9.1%) in Gf from the WAIS-R to the WAIS-III, a level maintained (10%) with the publication of the WAIS-IV. In general, similar to the WISC series, the adult WAIS series FS IQ has slowly evolved in the CHC abilities represented: both Gc and Gv abilities have been systematically reduced, concurrent with significant increases in the contributions of Gs and Gf.

Implications of the CHC evolution of the WISC and WAIS FS IQ scores are many if one attempts to compare a current IQ score from one battery to an older score from an earlier edition of the same battery (or to compare an older score from the children's version to the latest edition of the adult version). Before one can assume that significant changes from a childhood WISC-based IQ to a WAIS-III or WAIS-IV IQ are due to certain factors (neurological insult, malingering, the Flynn effect, etc.), one should review the above graphs and consider the possibility that the different FS IQ scores may both be valid indicators of functioning but may represent different CHC mixes (flavors) of general intelligence.

The potential implications and hypotheses that can be generated with the aid of the above graphs are numerous. For example, Flynn (2006) has suggested that there are problems with the WAIS-III standardization norms, given that studies comparing WAIS-R/WAIS-III scores are not consistent with Flynn effect expectations.  According to Weiss (2007), Flynn is ignoring data that do not fit his theory and instead is using theory to question data (and the integrity of a test's norms). According to Weiss (2007), "the only evidence Flynn provides for this statement is that WAIS-III scores do not fit expectations made based on the Flynn effect. However, the progress of science demands that theories be modified based on new data. Adjusting data to fit theory is an inappropriate scientific method, regardless of how well supported the theory may have been in previous studies" (p. 1, abstract).

I tend to concur with Weiss's argument that the mere finding that the WAIS-III results were inconsistent with Flynn effect expectations is insufficient evidence to claim that the test's norms are wrong. If the data don't fit, one may need to retrofit (your theory or hypothesis).  By inspecting the second graph above, one can see that a viable explanation for the apparent lack of the WAIS-R-to-WAIS-III Flynn effect is that the WAIS-III FS IQ score represents a different proportional composite of CHC abilities. More specifically, the WAIS-III reduced the proportional representation of Gc from 45.5% to 31.8%, decreased the Gv representation by approximately 5%, doubled the impact of Gs, and for the first time introduced close to 10% Gf representation. CHC content changes in the FS IQ scores between batteries may be at play.   Can anyone say "comparing apples to apples+oranges"?

And so on.................more comments may be forthcoming.

PS - Additional information not included in this original post has now been posted.




Friday, January 15, 2010

Weiss & Daniel respond to "Wechsler-like IQ scaled score metric..." post

Below is a response to my prior post regarding Wechsler-like scaled score issues.  The response was posted on the NASP listserv, and the authors gave me permission to reproduce it "as is" below.  I'm pleased that they concur with the recommendations at the end of the paper.

Kevin McGrew's argument can be turned around to show that using subtest score metrics with larger SDs also may lead to misinterpretation if a change of 1 raw score point leads to a change of many standard score points. So, the issue is not as simple as which subtest metric is better (e.g., the Wechsler/Kaufman metric or the WJ metric). The issue is better framed in terms of making the right choice of metric based on how it fits with the underlying RS distribution. Appropriate fit between the RS and SS distributions is necessary to avoid misinterpretation due to SS metrics that are either too large or too small.

We agree with his suggested guidelines at the end of the full paper.


Larry Weiss
Mark Daniel
Pearson


Tuesday, January 12, 2010

IAP Applied Psychometric 101 Brief reports section added to blog



A new section has been added to the IQ's Corner blog: IAP Applied Psychometrics 101 Briefs. It can be found on the blog sidebar. These are brief reports that are posted at IQ's Corner's sister blog, ICDP. Clicking on a link in that section will take you to the ICDP blog page that contains the link to the brief report.
