
Monday, September 10, 2012

AP101 Brief # 16: Beyond CHC: Within-CHC Domain Complexity Optimized Measures

[Note:  This is a working draft of a larger paper (Implications of 20 Years of CHC Cognitive-Achievement Research: Back-to-the-Future and Beyond CHC) that will be presented at the inaugural session of the Richard Woodcock Institute for Advancement of Contemporary Cognitive Assessment at Tufts University (Sept. 29, 2012): The Evolution of CHC Theory and Cognitive Assessment.]  Working knowledge of the WJ III test battery will make this brief easier to understand, but is not necessary.

Beyond CHC:  ITD—Within-CHC Domain Complexity Optimized Measures
            Optimizing Cognitive Complexity of CHC measures
I have recently begun to recognize the contribution that the Brunswick Symmetry-derived Berlin Intelligence Structure (BIS) model can make in applied intelligence research, especially for increasing predictor-criterion relations by matching the predictor and criterion spaces on the dimension of cognitive complexity.  What is cognitive complexity?  Why is it important?  More important, what role should it play in designing intelligence batteries to optimize CHC COG-ACH relations?
Cognitive complexity is often operationalized by inspecting individual test loadings on the first principal component from principal component analysis (Jensen, 1998).  The high-g test rationale is that tests that are more cognitively complex “invoke a wider range of elementary cognitive processes (Jensen, 1998; Stankov, 2000, 2005)” (McGrew, 2010b, p. 452).  High g-loading tests are often at the center of MDS (multidimensional scaling) radex models (click here for AP101 Brief Report #15:  Cognitive-Aptitude-Achievement Trait Complexes example)—but this isomorphism does not always hold.  David Lohman, a student of Richard Snow’s, has made extensive use of MDS methods to study intelligence and has one of the best grasps of what cognitive complexity, as represented in the hyperspace of MDS figures, contributes to understanding intelligence and intelligence tests.  According to Lohman (2011), tests closer to the center are more cognitively complex due to five possible factors—a larger number of cognitive component processes; accumulation of speed component differences; more important component processes (e.g., inference); increased demands on attentional control and working memory; and/or more demands on adaptive functions (assembly, control, and monitoring).  Schneider’s (in press) level-of-abstraction description of broad CHC factors is similar to cognitive complexity.  He uses the simple example of 100 meter hurdle performance.  According to Schneider (in press), one could independently measure 100 meter sprinting speed and the ability to jump over a hurdle from a standing position (both examples of narrow abilities).  However, running a 100 meter hurdle race is not the mere sum of the two narrow abilities; it is a non-additive combination and integration of narrow abilities.
This analogy captures the essence of cognitive complexity—in the realm of cognitive measures, cognitively complex tasks are those that involve more of the five factors listed by Lohman during successful task performance.
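The first-principal-component operationalization of cognitive complexity described above can be sketched in a few lines.  The correlation matrix below is purely hypothetical (it is not actual WJ III or other battery data); the point is only the mechanics of extracting first-component loadings:

```python
import numpy as np

# Hypothetical correlation matrix for five tests (illustrative values only).
R = np.array([
    [1.00, 0.62, 0.55, 0.48, 0.30],
    [0.62, 1.00, 0.58, 0.45, 0.28],
    [0.55, 0.58, 1.00, 0.50, 0.33],
    [0.48, 0.45, 0.50, 1.00, 0.25],
    [0.30, 0.28, 0.33, 0.25, 1.00],
])

# First principal component = eigenvector of R with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns eigenvalues in ascending order
v = eigvecs[:, -1]
if v.sum() < 0:                        # orient the component positively
    v = -v

# PC loadings = eigenvector scaled by the square root of its eigenvalue.
loadings = v * np.sqrt(eigvals[-1])
print(np.round(loadings, 2))
```

Tests with higher loadings on this first component would, under the Jensen rationale, be interpreted as more cognitively complex; here the fifth test, which correlates weakly with the others, receives the smallest loading.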
Of critical importance is the recognition that factor or ability domain breadth (i.e., broad or narrow) is not synonymous with cognitive complexity.  More important, cognitive complexity has not always been a test design concept (as defined by the Brunswick Symmetry and BIS model) explicitly incorporated into "intelligent" intelligence test design (ITD).  A number of tests have incorporated the notion of cognitive complexity in their design plans, but I believe this type of cognitive complexity is different from the within-CHC domain cognitive complexity discussed here.
For example, according to Kaufman and Kaufman (2004), “in developing the KABC-II, the authors did not strive to develop ‘pure’ tasks for measuring the five CHC broad abilities.  In theory, Gv tasks should exclude Gf or Gs, for example, and tests of other broad abilities, like Gc or Glr, should only measure that ability and none other.  In practice, however, the goal of comprehensive tests of cognitive ability like the KABC-II is to measure problem solving in different contexts and under different conditions, with complexity being necessary to assess high-level functioning” (p. 16; italics emphasis added).  Although the Kaufmans address the importance of cognitively complex measures in intelligence test batteries, their CHC-grounded description defines complex measures as those that are factorially complex or mixed measures of abilities from more than one broad CHC domain.  The Kaufmans also address cognitive complexity from the non-CHC three-block functional Luria neurocognitive model when they indicate that it is important to provide measurement that evaluates the “dynamic integration of the three blocks” (Kaufman & Kaufman, 2004, p. 13).  This emphasis on neurocognitive integration (and thus, complexity) is also an explicit design goal of the latest Wechsler batteries.  As stated in the WAIS-IV manual (Wechsler, 2008), “although there are distinct advantages to the assessment and division of more narrow domains of cognitive functioning, several issues deserve note.  First, cognitive functions are interrelated, functionally and neurologically, making it difficult to measure a pure domain of cognitive functioning” (p. 2).  Furthermore, “measuring psychometrically pure factors of discrete domains may be useful for research, but it does not necessarily result in information that is clinically rich or practical in real world applications (Zachary, 1990)” (Wechsler, p. 3).
Finally, Elliott (2007) similarly argues for the importance of recognizing neurocognitive-based “complex information processing” (p. 15; italics emphasis added) in the design of the DAS-II, which results in tests or composites that measure across CHC-described domains.
The ITD principle explicated and proposed here is that of striving to develop cognitively complex measures within broad CHC domains—that is, not attaining complexity via the blending of abilities across CHC broad domains and not attempting to directly link to neurocognitive network integration.[1]   The Brunswick Symmetry based BIS model provides a framework for attaining this goal via the development and analysis of test complexity by paying attention to cognitive content and operations facets. 
Figure 12 presents the results of a 2-D MDS radex model of most of the key WJ III broad and narrow CHC cognitive and achievement clusters (for all norm subjects from approximately 6 years of age through late adulthood). [2]   The current focus of the interpretation of the results in Figure 12 is only on the degree of cognitive complexity (proximity to the center of the figure) of the broad and narrow WJ III clusters within the same domain (interpretations of the content and operations facets are not a focus of this current material).  Within a domain the broadest three-test parent clusters are designated by black circles.[3]  Two-test broad clusters are designated by gray circles.  Two-test narrow offspring clusters within broad domains are designated by white circles.  All clusters within a domain are connected to the broadest parent cluster by lines.  The critically important information is the within-domain cognitive complexity of the respective parent and sibling clusters as represented by their relative distances from the center of the figure.  A number of interesting conclusions are apparent.
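The kind of 2-D MDS configuration shown in Figure 12 can be sketched from a cluster intercorrelation matrix by treating 1 - r as a dissimilarity, fitting metric MDS, and then using distance from the centroid as the complexity index.  The labels and correlation values below are assumptions for illustration only, not the actual WJ III values behind Figure 12:

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical cluster intercorrelations (illustrative values, not WJ III data).
labels = ["GIA", "Gf", "Gc", "Gs", "Gv"]
R = np.array([
    [1.00, 0.80, 0.75, 0.60, 0.50],
    [0.80, 1.00, 0.60, 0.50, 0.45],
    [0.75, 0.60, 1.00, 0.40, 0.35],
    [0.60, 0.50, 0.40, 1.00, 0.30],
    [0.50, 0.45, 0.35, 0.30, 1.00],
])

D = 1.0 - R  # higher correlation -> smaller distance in the radex

# Metric MDS on the precomputed dissimilarities.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)

# Complexity proxy: distance of each cluster from the configuration centroid.
centroid = coords.mean(axis=0)
dist = np.linalg.norm(coords - centroid, axis=1)
for name, d in sorted(zip(labels, dist), key=lambda t: t[1]):
    print(f"{name}: {d:.2f}")
```

With these illustrative values the general composite, which correlates most highly with everything else, lands nearest the center—mirroring the GIA-Ext result described below for Figure 12.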

First, as expected, the WJ III GIA-Ext cluster is almost perfectly centered in the figure—it is clearly the most cognitively complex WJ III cluster.   In comparison, the three WJ III Gv clusters are much weaker in cognitive complexity than all other cognitive clusters, with no particular Gv cluster demonstrating a clear cognitive complexity advantage.    As expected, the measured reading and math achievement clusters are primarily cognitively complex measures.  However, those achievement clusters that deal more with basic skills (Math Calculation—MTHCAL; Basic Reading Skills—RDGBS) are less complex than the application clusters (Reading Comprehension—RDGCMP; Math Reasoning—MTHREA).
The most intriguing findings in Figure 12 are the differential cognitive complexity patterns within CHC domains (with at least one parent and at least one offspring cluster).  For example, the narrow Perceptual Speed (Gs-P) offspring cluster is more cognitively complex than the broad parent Gs cluster.  The broad Gs cluster is comprised of the Visual Matching (Gs-P) and Decision Speed (Gs-R9; Glr-NA) tests, tests that measure different narrow abilities.  In contrast, the Perceptual Speed cluster (Gs-P) is comprised of two tests that are classified as measuring the same narrow ability (perceptual speed).  This finding appears, on first blush, counterintuitive, as one would expect that a cluster comprised of tests that measure different content and operations (the Gs cluster) would be more complex (as per the above definition and discussion) than one comprised of two measures of the same narrow ability (Gs-P).  However, one must task analyze the two Perceptual Speed tests to realize that although both are classified as measuring the same narrow ability (perceptual speed), they differ in both stimulus content and cognitive operations.  Visual Matching requires processing of numeric stimuli.  Cross Out requires processing of visual-figural stimuli.  These are two different content facets in the BIS model.  The Cross Out visual-figural stimuli are much more spatially challenging than the simple numerals in Visual Matching.  Furthermore, the Visual Matching test requires the examinee to quickly seek out and mark the two identical numbers in each row.  In contrast, in the Cross Out test the subject is provided a target visual-figural shape and must then quickly scan a row of complex visual images and mark the two that are identical to the target.  Interestingly, in other unpublished analyses I have completed, the Visual Matching test often loads on or groups with quantitative achievement tests while Cross Out has frequently been shown to load on a Gv factor.
Thus, task analysis of the content and cognitive operations of the WJ III Perceptual Speed tests suggests that although both are classified as narrow indicators of Gs-P, they differ markedly in task requirements.  More important, the Perceptual Speed cluster tests, when combined, appear to require more cognitively complex processing than the broad Gs cluster.  This finding is consistent with Ackerman, Beier and Boyle’s (2002) research suggesting that perceptual speed has another level of factor breadth via the identification of four subtypes of perceptual speed (i.e., pattern recognition, scanning, memory and complexity; see McGrew, 2005, and Schneider & McGrew, 2012, for discussion of a hierarchically organized model of speed abilities).  Based on Brunswick Symmetry/BIS cognitive complexity principles, one would predict that a Gs-P cluster comprised of two parallel forms of the same task (e.g., two Visual Matching or two Cross Out tests) would be less cognitively complex than broad Gs.  A hint of the possible correctness of this hypothesis is present in inspection of the Gsm-MS-MW domain results.
The WJ III Gsm cluster is the combination of the Numbers Reversed (MW) and Memory for Words (MS) tests.  In contrast, the WJ III Auditory Memory Span cluster (AUDMS; Gsm-MS) is much less cognitively complex when compared to Gsm (see Figure 12).  Like the Perceptual Speed (Gs-P) cluster described above in the processing speed family of clusters, the Auditory Memory Span cluster is comprised of two tests with the same memory span (MS) narrow ability classification (Memory for Words; Memory for Sentences).  Why is this narrow cluster less complex than its broad parent Gsm cluster while the opposite held true for Gs-P and Gs?  Task analysis suggests that the two memory span tests are more alike than the two perceptual speed tests.  The Memory for Words and Memory for Sentences tests require the same cognitive operation—simply repeating back, in order, words or sentences spoken to the subject.  This differs from the WJ III Perceptual Speed cluster, as the similarly classified narrow Gs-P tests most likely invoke both common and different cognitive component operations.  Also, the Memory Span cluster tests are comprised of stimuli from the same BIS content facet (i.e., words and sentences; auditory-linguistic/verbal).  In contrast, the Gs-P Visual Matching and Cross Out tests involve two different content facets (numeric and visual-figural).
In contrast, the WJ III Working Memory cluster (Gsm-MW) is more cognitively complex than the parent Gsm cluster.  This finding is consistent with the prior WJ III Gs/Perceptual Speed and WJ III Gsm/Auditory Memory Span discussion.  The WJ III Working Memory cluster is comprised of the Numbers Reversed and Auditory Working Memory tests.  Numbers Reversed requires the processing of stimuli from one BIS content facet—numeric stimuli.  In contrast, Auditory Working Memory requires the processing of stimuli from two BIS content facets (numeric and auditory-linguistic/verbal; numbers and words).  The cognitive operations of the two tests also differ.  Both require the holding of the presented stimuli in active working memory space.  Numbers Reversed then requires the simple reproduction of the numbers in reverse order.  In contrast, the Auditory Working Memory test requires the storage of the numbers and words in separate chunks, and then the production of the forward sequence of each respective chunk (numbers or words), one chunk before the other.  Greater reliance on divided attention is most likely occurring during the Auditory Working Memory test.
In summary, the results presented in Figure 12 suggest that it is possible to develop cluster scores that vary by degree of cognitive complexity within the same broad CHC domain.  More important is the finding that the classification of clusters as broad or narrow does not provide information on a measure's cognitive complexity.  Cognitive complexity, in the Lohman sense, can be achieved within CHC domains without resorting to mixing abilities across CHC domains.  Finally, narrow clusters can be more cognitively complex, and thus likely better predictors of complex school achievement, than broad clusters or other narrow clusters.

Implications for Test Battery Design and Assessment Strategies
The recognition of cognitive complexity as an important ITD principle suggests that the push to feature broad CHC clusters in contemporary test batteries, or in the construction of cross-battery assessments, fails to account for cognitive complexity.  I plead guilty to contributing to this focus via my role in the design of the WJ III, which focused extensively on broad CHC domain construct representation—most WJ III narrow CHC clusters require the use of the third WJ III cognitive book (the Diagnostic Supplement; Woodcock, McGrew, Mather & Schrank, 2003).  I am similarly guilty as charged regarding the dominance of broad CHC factor representation in the development of the original cross-battery assessment principles (Flanagan & McGrew, 1997; McGrew & Flanagan, 1998).
It is also my conclusion that the “narrow is better” conclusion of McGrew and Wendling (2010) may need modification.   Revisiting the McGrew and Wendling (2010) results suggests that the narrow CHC clusters that were more predictive of academic achievement may have been so not necessarily because they are narrow, but because they are more cognitively complex.  I offer the hypothesis that a more correct principle is that cognitively complex measures are better.   I welcome new research focused on testing this principle.
In retrospect, given the universe of WJ III clusters, a broad+narrow hybrid approach to intelligence battery configuration (or cross-battery assessment) may be more appropriate.  Based exclusively on the results presented in Figure 12, the following clusters appear to be those that might best be featured in the “front end” of the WJ III or a selective-testing constructed assessment—those clusters that examiners should consider first within each CHC broad domain:  Fluid Reasoning (Gf)[4], Comprehension-Knowledge (Gc), Long-term Retrieval (Glr), Working Memory (Gsm-MW), Phonemic Awareness 3 (Ga-PC), and Perceptual Speed (Gs-P).  No clear winner is apparent for Gv, although the narrow Visualization cluster is slightly more cognitively complex than the Gv and Gv3 clusters.  The above suggests that if broad clusters are desired for the domains of Gs, Gsm and Gv, then additional testing beyond the “front end” or featured tests and clusters would require administration of the necessary Gs (Decision Speed), Gsm (Memory for Words) and Gv (Picture Recognition) tests.

Utilization of the ITD principle of optimizing within-CHC cognitive complexity of clusters suggests that a different emphasis and configuration of WJ III tests might be more appropriate.  It is proposed that the above WJ III cluster complexity priority or feature model would likely allow practitioners to administer the best predictors of school achievement.  I further hypothesize that this cognitive complexity based broad+narrow test design principle most likely applies to other intelligence test batteries that have adhered to a primary focus on featuring tests that are the purest indicators of two or more narrow abilities within the provided broad CHC interpretation scheme.  Of course, this is an empirical question that begs research with other batteries.  More useful would be similar MDS radex cognitive complexity analyses of cross-battery intelligence data sets.[5]

References (not included in this post.  The complete paper will be announced and made available for reading and download in the near future)



[1] This does not mean that cognitive complexity may not be related to the integrity of the human connectome or different brain networks. I am excited about contemporary brain network research (Bressler & Menon, 2010; Cole, Yarkoni, Repovs, Anticevic & Braver, 2012; Toga, Clark, Thompson, Shattuck, & Van Horn, 2012; van den Heuvel & Sporns, 2011), particularly that which has demonstrated links between neural network efficiency and working memory, controlled attention and clinical disorders such as ADHD (Brewer, Worunsky, Gray, Tang, Weber & Kober, 2011; Lutz, Slagter, Dunne, & Davidson, 2008; McVay & Kane, 2012). The Parietal-Frontal Integration (P-FIT) theory of intelligence is particularly intriguing as it has been linked to CHC psychometric measures (Colom, Haier, Head, Álvarez-Linera, Quiroga, Shih, & Jung, 2009; Deary, Penke, & Johnson, 2010; Haier, 2009; Jung & Haier, 2007) and could be linked to CHC cognitively-optimized psychometric measures.
[2] Only reading and math clusters were included to simplify the presentation of the results and because, as reported previously, reading and writing measures typically do not differentiate well in multivariate analysis—and thus form the combined Grw domain in CHC theory.
[3] GIA-Ext is also represented by a black circle.
[4] Although the WJ III Fluid Reasoning 3 cluster (Gf3) is slightly closer to the center of the figure, the difference from Fluid Reasoning (Gf) is not large and time efficiency would argue for the two-test Gf cluster.
[5] It is important to note that the cognitive complexity analysis and interpretation discussed here is specific to within the WJ III battery only. The degree of cognitive complexity in the WJ III cognitive clusters in comparison to composite scores from other intelligence batteries can only be ascertained by cross-battery MDS complexity analysis.

Thursday, March 01, 2012

IAP101 Brief #12: Use of IQ component part scores as indicators of general intelligence in SLD and MR/ID diagnosis

   
            Historically the concept of general intelligence (g), as operationalized by intelligence test battery global full scale IQ scores, has been central to the definition and classification of individuals with a specific learning disability (SLD) as well as individuals with an intellectual disability (ID).  More recently, contemporary definitions and operational criteria have elevated intelligence test battery composite or part scores to a more prominent role in diagnosis and classification of SLD and more recently in ID.
            In the case of SLD, third-method consistency definitions prominently feature component or part scores in (a) the identification of consistency between low achievement and relevant cognitive abilities or processing disorders and (b) the requirement that an individual demonstrate relative cognitive and achievement strengths (see Flanagan, Fiorello & Ortiz, 2010).  The global IQ score is de-emphasized in these third-method SLD approaches.
            In contrast, the 11th edition of the AAIDD Intellectual Disability: Definition, Classification, and Systems of Supports manual (AAIDD, 2010) placed general intelligence, and thus global composite IQ scores, as central to the definition of intellectual functioning.  This has not been without challenge.  For example, the AAIDD ID definition has been criticized for an over-reliance on the construct of general intelligence and for ignoring contemporary psychometric theoretical and empirical research that has converged on a multidimensional hierarchical model of intelligence (viz., Cattell-Horn-Carroll or CHC theory).
The potential constraints of the “ID-as-a-general-intelligence-disability” definition were anticipated by the Committee on Disability Determination for Mental Retardation in its National Research Council report, “Mental Retardation:  Determining Eligibility for Social Security Benefits” (Reschly, Meyers & Hartel, 2002).  This national committee of experts concluded that “during the next decade, even greater alignment of intelligence tests and the IQ scores derived from them and the Horn-Cattell and Carroll models is likely.  As a result, the future will almost certainly see greater reliance on part scores, such as IQ scores for Gc and Gf, in addition to the traditional composite IQ.  That is, the traditional composite IQ may not be dropped, but greater emphasis will be placed on part scores than has been the case in the past” (Reschly et al., 2002, p. 94).  The committee stated that “whenever the validity of one or more part scores (subtests, scales) is questioned, examiners must also question whether the test’s total score is appropriate for guiding diagnostic decision making.  The total test score is usually considered the best estimate of a client’s overall intellectual functioning.  However, there are instances in which, and individuals for whom, the total test score may not be the best representation of overall cognitive functioning” (pp. 106-107).
            The increased emphasis on intelligence test battery composite part scores in SLD and ID diagnosis and classification raises a number of measurement and conceptual issues (Reschly et al., 2002).  For example, what are statistically significant differences?  What is a meaningful difference?  What appropriate cognitive abilities should serve as proxies of general intelligence when the global IQ is questioned?  What should be the magnitude of the total test score? 
The appropriate cognitive abilities question is the only issue discussed here.  This issue addresses which component or part scores are most correlated with general intelligence (g)—that is, which component part scores are high g-loaders?  The traditional consensus has been that measures of Gc (crystallized intelligence; comprehension-knowledge) and Gf (fluid intelligence or reasoning) are the highest g-loading measures and constructs and are the most likely candidates for elevated status when diagnosing ID (Reschly et al., 2002).  Although not always stated explicitly, the third-method consistency SLD definitions specify that an individual must demonstrate “at least an average level of general cognitive ability or intelligence” (Flanagan et al., 2010, p. 745), a statement that implicitly suggests cognitive abilities and component scores with high g-ness.
Table 1 is intended to provide guidance when using component part scores in the diagnosis and classification of SLD and ID.  Table 1 presents a summary of the comprehensive, nationally normed, individually administered intelligence batteries that possess satisfactory psychometric characteristics (i.e., national norm samples, adequate reliability and validity for the composite g-score) for use in the diagnosis of ID and SLD.



The Composite g-score column lists the global general intelligence score provided by each intelligence battery.  This score is the best estimate of a person's general intellectual ability, which currently is most relevant to the diagnosis of ID as per AAIDD.  All composite g-scores listed in Table 1 meet Jensen's (1998) psychometric sampling error criteria as valid estimates of general intelligence.  As per Jensen's number of tests criterion, all intelligence batteries' g-composites are based on a minimum of nine tests that sample at least three primary cognitive ability domains.  As per Jensen's variety of tests criterion (i.e., information content, skills and demands for a variety of mental operations), the batteries, when viewed from the perspective of CHC theory, vary in ability domain coverage—four (CAS, SB5), five (KABC-II, WISC-IV, WAIS-IV), six (DAS-II) and seven (WJ III) broad domains (Flanagan, Ortiz & Alfonso, 2007; Keith & Reynolds, 2010).   As recommended by Jensen (1998), the particular collection of tests used to estimate g “should come as close as possible, with some limited number of tests, to being a representative sample of all types of mental tests, and the various kinds of test should be represented as equally as possible” (p. 85).  Users should consult sources such as Flanagan et al. (2007) and Keith and Reynolds (2010) to determine how each intelligence battery approximates Jensen's optimal design criterion, the specific CHC domains measured, and the proportional representation of the CHC domains in each battery's composite g-score.
Also included in Table 1 are the component part scales provided by each battery (e.g., WAIS-IV Verbal Comprehension Index, Perceptual Reasoning Index, Working Memory Index, and Processing Speed Index), followed by their respective within-battery g-loadings.[1]  Examination of the g-ness of composite scores from existing batteries (see last three columns in Table 1) suggests the traditional assumption that measures of Gf and Gc are the best proxies of general intelligence may not hold across all intelligence batteries.[2] 
In the case of the SB5, all five composite part scores are very similar in g-loadings (h2 = .72 to .79).  No single SB5 composite part score appears better than the other SB5 scores for suggesting average general intelligence (when the global IQ score is not used for this purpose).  At the other extreme is the WJ III where the Fluid Reasoning, Comprehension-Knowledge, Long-term Storage and Retrieval cluster scores are the best g-proxies for part-score based interpretation within the WJ III.  The WJ III Visual Processing and Processing Speed clusters are not composite part scores that should be emphasized as indicators of general intelligence.  Across all batteries that include a processing speed component part score (DAS-II, WAIS-IV, WISC-IV, WJ III) the respective processing speed scale is always the weakest proxy for general intelligence and thus, would not be viewed as a good estimate of general intelligence. 
            It is also clear that one cannot assume that composites with similar sounding names of measured abilities will have similar relative g-ness status within different batteries.  For example, the Gv (visual-spatial or visual processing) clusters in the DAS-II (Spatial Ability) and SB5 (Visual-Spatial Processing) are relatively strong g-measures within their respective batteries, but the same cannot be said for the WJ III Visual Processing cluster.  Even more interesting are the differences in the WAIS-IV and WISC-IV relative g-loadings for similarly named index scores.
For example, the Working Memory Index is the highest g-loading component part score (tied with the Perceptual Reasoning Index) in the WAIS-IV but is only third (out of four) in the WISC-IV.   The Working Memory Index is comprised of the Digit Span and Arithmetic subtests in the WAIS-IV and the Digit Span and Letter-Number Sequencing subtests in the WISC-IV.  The Arithmetic subtest has been reported to be a factorially complex test which may tap fluid intelligence (Gf-RQ—quantitative reasoning), quantitative knowledge (Gq), working memory (Gsm), and possibly processing speed (Gs; Keith & Reynolds, 2010; Phelps, McGrew, Knopik & Ford, 2005).   The factorially complex characteristics of the Arithmetic subtest (which, in essence, make it function like a mini-g proxy) would explain why the Working Memory Index is a good proxy for g in the WAIS-IV but not in the WISC-IV.  The WAIS-IV and WISC-IV Working Memory Index scales, although named the same, are not measuring identical constructs.

A critical caveat is that the g-loadings cannot be compared across different batteries.  g-loadings may change when the mixture of measures included in the analyses change.  Different "flavors" of g can result (Carroll, 1993; Jensen, 1998). The only way to compare the g-ness across batteries is with appropriately designed cross- or joint-battery analysis (e.g., WAIS-IV, SB5 and WJ III analyzed in a common sample).
The above within- and across-battery examples illustrate that those who use component part scores as an estimate of a person’s general intelligence must be aware of the composition and psychometric g-ness of the component scores within each intelligence battery.  Not all component part scores in different intelligence batteries are created equal (with regard to g-ness).  Also, similarly named factor-based composite scores may not measure the identical construct and may vary in degree of within-battery g-ness.  This is not a new problem in the context of naming factors in factor analysis and, by extension, factor-based intelligence test composite scores.  Cliff (1983) described this nominalistic fallacy in simple language—“if we name something, this does not mean we understand it” (p. 120).




[1] As noted in the footnotes in Table 1, all composite score g-loadings were computed by Kevin McGrew by entering the smallest number (and largest age ranges covered) of the published correlation matrices within each intelligence battery's technical manual (note the exception for the WJ III) in order to obtain an average g-loading estimate.  It would have been possible to calculate and report these values for each age-differentiated correlation matrix for each intelligence battery.  However, the purpose of this table is to provide the best possible average value across the entire age range of each intelligence battery.  Floyd and colleagues have published age-differentiated g-loadings for the DAS-II and WJ III.  Those values were not used as they are based on the principal common factor analysis method, a method that analyzes the reliable shared variance among tests.  Although principal factor and principal component loadings typically will order measures in the same relative position, the principal factor loadings typically will be lower.  Given that the imperfect manifest composite scale scores are those that are utilized in practice, and to allow uniformity in the calculation of the g-loadings reported in Table 1, principal component analysis was used in this work.  The same rationale was used for not using the latent factor loadings on a higher-order g-factor in SEM/CFA analyses of each test battery.  Loadings from CFA analyses represent the relations between the underlying theoretical ability constructs and g purged of measurement error.  Also, the final CFA solutions reported in a battery's technical manual (or independent journal articles) frequently allow tests to be factorially complex (load on more than one latent factor), a measurement model that does not resemble the real-world reality of the manifest/observed composite scores used in practice.
Latent factor loadings on a higher-order g-factor will often differ significantly from principal component loadings based on the manifest measures, both in absolute magnitude and relative size (e.g., see the high Ga loading on g in the WJ III technical manual, which is at variance with the manifest-variable-based Ga loading reported in Table 1). 
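For readers who want to see the mechanics, the g-loading calculation described above can be sketched in a few lines of Python.  This is an illustrative sketch only: the correlation matrix below is hypothetical, not taken from any battery's technical manual.  It shows how first-principal-component loadings (the g-loadings reported in Table 1) are obtained from a correlation matrix among composite scores.

```python
import numpy as np

# Hypothetical correlation matrix among four composite scores
# (illustrative values only -- not from any battery's manual).
R = np.array([
    [1.00, 0.55, 0.48, 0.40],
    [0.55, 1.00, 0.52, 0.45],
    [0.48, 0.52, 1.00, 0.42],
    [0.40, 0.45, 0.42, 1.00],
])

# Eigendecomposition of R.  The loading of each measure on the first
# (unrotated) principal component is eigenvector * sqrt(eigenvalue).
eigvals, eigvecs = np.linalg.eigh(R)
i = np.argmax(eigvals)  # index of the first principal component
g_loadings = np.abs(eigvecs[:, i] * np.sqrt(eigvals[i]))  # sign is arbitrary

print(np.round(g_loadings, 2))
```

Note that, consistent with the footnote, this analyzes the full (total) variance of each measure; a principal *factor* analysis of the same matrix would yield lower loadings because it analyzes only the shared variance.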
[2] The h2 values are the values that should be used to compare the relative amount of g-variance present in the component part scores within each intelligence battery.

Tuesday, February 07, 2012

IAP Applied Psychometrics 101 Brief Report # 11: What is the typical IQ and adaptive behavior correlation?


What is the typical relation (correlation) between standardized measures of adaptive behavior (AB) and measures of intelligence (IQ)?  This is an important question given the role both play in the definition and diagnosis of mental retardation (MR) / intellectual disability (ID). 

During the late 1970's and 1980's this was an active area of research.  Numerous studies were published that reported correlations between a wide variety of adaptive behavior scales and intelligence tests.  Probably the best synthesis of this research was provided by Harrison (1987).  Harrison's review included a table of more than 40 correlations.  This is Table 2 in the above referenced and linked article.  Harrison concluded, as have most others who have reviewed the literature, that "the majority of correlations fall in the moderate range" (p. 39).  When the correlations with maladaptive measures are excluded from Harrison's table, the correlations range from .03 to .91.  This is a wide range.  Harrison could not identify a specific explanation for the variability or range of correlations.  Harrison speculated that variables that might impact the magnitude of the correlations were the specific adaptive behavior scale or measure of intelligence used and differences in sample variability.

Subsequently the Committee on Disability Determination for Mental Retardation published a National Research Council report (Mental Retardation:  Determining Eligibility for Social Security Benefits; Reschly, Meyers & Hartel, 2001) that also addressed the AB/IQ relation. The report concluded that AB/IQ studies report correlations "ranging from 0 (indicating no relationship) to almost +1 (indicating a perfect relationship).  Data also suggest that the relationship between IQ and adaptive behavior varies significantly by age and levels of retardation, being strongest in the severe and moderate ranges and weakest in the mild range.  There is a dearth of data on the relationship of IQ and adaptive behavior functioning at the mild level of retardation" (p. 8).  Factors identified as moderating the AB/IQ correlation were scale content, measurement of competences versus perceptions, sample variability, ceiling and floor problems of the scales, and level of mental retardation.

Given the above, it is hard to render an objective statement on the approximate typical AB/IQ correlation.  With this in mind, an informal research synthesis was completed and is reported here.

First, only the AB/IQ correlations (IQ/maladaptive correlations were excluded) from Harrison's 1987 table were extracted (n = 43 correlations).  Then, the technical manuals for the current editions of the three most frequently used contemporary adaptive behavior scales were reviewed for additional correlations.  These included the Vineland Adaptive Behavior Scales (Sparrow, Cicchetti & Balla, 2005; n = 2 correlations of .12 and .20) and the Adaptive Behavior Assessment System--Second Edition (ABAS-II; Harrison & Oakland, 2008; n = 10 correlations ranging from .39 to .67; median = .51).

Although six different correlations were reported in the Scales of Independent Behavior-Revised manual (SIB-R; Bruininks, Woodcock, Weatherman & Hill, 1996), those values were not used because they are inflated estimates when compared to the type of correlations typically reported.  For example, very high correlations of .79, .82 and .91 are reported for certain groups.  A close reading of the tables reveals that the SIB-R correlations with either the WJ or WJ-R intelligence test were calculated on the basis of the W-score growth metric.  By definition, a growth metric includes age variance.  When correlations are reported across wide age groups, they reflect not only the relation between the AB and IQ constructs but also shared variance due to the influence of general age-based development.  Thus, the SIB and SIB-R correlations with IQ, although not wrong (they provide different information), are not comparable to the other reported correlations, where age variance has been removed (typically by correlating age-based standard scores).  Clear evidence for this point comes from McGrew and Bruininks (1990), who used the same SIB/WJ subject data reported in the SIB and SIB-R manuals but removed the W-score-confounded age variance prior to calculating latent factor correlations (via confirmatory factor analysis) between latent practical intelligence (SIB adaptive behavior) and conceptual intelligence (WJ IQ) factors.  The resulting AB/IQ correlations for three different age groups were .38, .56 and .58--far below the inflated values in the .79 to .91 range.  Thus, the values from McGrew and Bruininks (1990) were used as estimates of the SIB/SIB-R IQ correlations in the current synthesis. 
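The age-confound argument above can be sketched with simulated data.  This is an illustration under stated assumptions, not the SIB/WJ data: the coefficients below are made up.  Two growth-metric scores that both increase with age correlate very highly across a wide age range, but the correlation drops to a moderate value once age variance is partialled out (by residualizing each score on age), mirroring the drop from the .79-.91 range to the .38-.58 range.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(3, 18, n)  # wide age range, as in the SIB-R samples

# Hypothetical growth-metric scores: both rise with age, plus a modestly
# shared individual-differences component (all coefficients illustrative).
trait = rng.normal(0, 1, n)
ab = 5 * age + 3 * trait + rng.normal(0, 4, n)
iq = 5 * age + 3 * trait + rng.normal(0, 4, n)

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

# Raw growth-metric correlation: inflated by the shared age variance.
r_raw = corr(ab, iq)

# Residualize each score on age (removes age variance), then correlate.
def residualize(y, x):
    b = np.polyfit(x, y, 1)   # simple linear regression on age
    return y - np.polyval(b, x)

r_partial = corr(residualize(ab, age), residualize(iq, age))
print(round(r_raw, 2), round(r_partial, 2))  # r_partial is far smaller
```

The same logic explains why age-based standard scores (which remove age variance by construction) yield the moderate correlations reported elsewhere in this synthesis.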

Finally, latent AB/IQ correlations (as estimated from confirmatory factor analysis models) of .27 and .39 were included from Ittenbach, Spiegel, McGrew and Bruininks (1992) and Keith, Fehrmann, Harrison and Pottebaum (1987), respectively.  This process resulted in the addition of 17 AB/IQ correlations to the 43 from Harrison, for a total of 60 correlations.

Descriptive statistics for this collection of 60 AB/IQ correlations are as follows: the correlations range from .12 to .90, with a mean of .51, a median of .48, and a standard deviation of .20.  Below is a figure that includes a frequency polygon (with smoothed normal curve overlay) and a box-whisker plot of the data set.  A review of the box-and-whisker plot (at the bottom) shows the median correlation (.48) as a vertical line within the rectangle.  The rectangle includes the middle 50% of the distribution of correlations and shows an approximate range of just below .40 to just above .65.  Of particular note is the shape of the frequency polygon and smoothed normal curve.  The shape of the frequency polygon is consistent with a normal curve.  In quantitative research synthesis this type of normal distribution suggests that the total data set included in the review is not biased--both studies that are likely underestimates and studies that are likely overestimates of the "true" population correlation (due to method or sampling factors) are included.  More importantly, the "bunching" of the majority of the correlations in the middle provides confidence that the median of this distribution is a reasonable unbiased estimate of the population correlation.  This relatively normal distribution suggests that the current collection of 60 AB/IQ correlations is likely a reasonable approximation of the complete set of population AB/IQ correlations.


Based on this informal (and admittedly incomplete) review of the AB/IQ correlation research, one can conclude that a reasonable estimate of the typical AB/IQ correlation is approximately .50 (mean = .51; median = .48), with most values ranging from approximately .40 to .65.  This finding is consistent with Harrison's 1987 conclusion of a "moderate" correlation.  The current analysis continues to reinforce Harrison's (and others') conclusion that adaptive behavior and intelligence are statistically related, but still distinct, constructs.  An average correlation of .50 indicates that AB and IQ share approximately 25% common variance (approximately 15% to 40% common variance if one looks at the middle 50% of the distribution of values).  In practical terms this means that, for any individual, standard scores from AB and IQ tests will frequently diverge and will not always be consistent.  
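The shared-variance figures above are simply the squared correlations.  The quick check below uses only the summary values reported in this post (mean .51, median .48, and the approximate .40/.65 boundaries of the middle 50%); the "15% to 40%" in the text is a rounding of the exact r-squared values.

```python
# Shared variance (r squared) implied by the summary correlations
# reported in this post.
summary = {
    "mean": 0.51,
    "median": 0.48,
    "approx. 25th percentile": 0.40,
    "approx. 75th percentile": 0.65,
}
for label, r in summary.items():
    print(f"{label}: r = {r:.2f} -> shared variance = {r**2:.0%}")
```

For example, .51 squared is about 26% shared variance, which leaves roughly three quarters of the variance in each construct unshared -- the statistical basis for expecting AB and IQ standard scores to diverge for individuals.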

Harrison (1987) provides a nice explanation of the primary reasons for the moderate correlation between AB and IQ.  Her quote is reproduced below.

Numerous caveats need to be applied to this analysis and report.  The most important are:
  • A comprehensive review of all possible published and unpublished AB/IQ research studies was not completed.  Clearly there are more studies "out there" that could be added to the synthesis. 
  • The analysis makes no attempt to determine if there are moderator effects.  That is, is the typical correlation likely to systematically vary as a function of AB measures, IQ measures, variability in the sample's level of functioning, manifest/measured versus latent variable correlations, level of ability, etc.? 
  • This has not been peer reviewed.


It is hoped that this ad hoc update of Harrison's (1987) review, augmented by quantitative organizational methods, will serve to stimulate a formal meta-analysis by others (hint---a nice study or thesis for someone?).