Survey Measurement of Work Disability
Nancy A. Mathiowetz
Joint Program in Survey Methodology, University of Maryland, College Park
The collection of information about persons with disabilities presents a particularly complex measurement issue because of the variety of conceptual paradigms that exist, the complexity of the various paradigms, and the numerous means by which alternative paradigms have been operationalized in different survey instruments (see Chapter 2 by Jette and Badley for a review). For example, disability is often defined in terms of environmental accommodation of an impairment; hence, two individuals with the same impairment may not be similarly disabled or share the same perception of their impairment. For an individual with mobility limitations who lives in an assisted-living environment that accommodates the impairment, the environmental adaptations may result in little or no disability. The same individual living on the second floor of an apartment building with no elevator may have a very different perception of the impairment and may see him- or herself as disabled because of the environmental barriers that exist within his or her immediate environment.
The Social Security Administration (SSA) is currently reengineering its disability claims process for providing benefits to blind and disabled persons under the Social Security Disability Insurance (SSDI) and Supplemental Security Income (SSI) programs. As part of the effort to redesign the claims process, SSA has initiated a research effort designed to address the growth in disability programs, including the design and conduct of the Disability Evaluation Study (DES). The DES will provide SSA with comprehensive information concerning the number and characteristics of persons with impairments severe enough to meet SSA's statutory definition of disability, as well as the number and characteristics of people who are not currently eligible but who could be eligible as a result of changes in the disability decision process. For those years in which the DES is not conducted, SSA will need to monitor the potential pool of applicants. One means by which SSA can monitor the size and characteristics of potential beneficiaries is through other ongoing federal data collection efforts. For both the conduct of the DES and monitoring of the pool of potential beneficiaries through the use of various data collection efforts, it is critical to understand the measurement error properties associated with the identification of persons with disabilities as a function of the essential survey conditions under which the data have been and will be collected. The extent to which alternative instruments designed to measure persons with disabilities map to various eligibility criteria under consideration by SSA is also important.
The collection of disability data is an evolving field. Although a large and growing number of scales attempt to measure functional status and work disability, little is known about the measurement error properties of various questions and composite scales. The empirical literature provides clear evidence of variation in the estimates of the number of persons with disabilities in the United States, depending upon the conceptual paradigm of interest, the analytic objectives of the particular measurement process, and the essential survey conditions under which the information is collected (e.g., Haber, 1990; McNeil, 1993; Sampson, 1997). This literature suggests that estimates of the disabled population not only are related to the conceptual framework underlying the measurement construct but are also a function of the essential survey conditions under which the measurement occurred, including the specific questions used to measure disability, the context of the questions, the source of the information (self- versus proxy response), variations in the mode and method of data collection, and the sponsor of the data collection effort. Furthermore, terms such as impairment, disability, functional limitation, and participation are often inconsistently used, resulting in different and conflicting estimates of prevalence. Attempts to measure not only the prevalence but also the severity of an impairment or disability further complicate the measurement process.
Recent shifts in the conceptual paradigm of disability, in which disability is viewed as a dynamic process rather than a static measure and as an interaction between an individual with an impairment and the environment rather than as a characteristic only of the individual, imply that those responsible for the development of disability measures must separate the measurement of the impact of environmental factors in the enablement-disablement process from the measurement of ability. Viewing disability as a dynamic state resulting from an interaction between a person's impairment and a particular environmental context further complicates the assessment of the quality of various survey measures of disability, specifically, the reliability of a measure. As a dynamic characteristic, one would anticipate changes in the reports of disability as a function of changes in the individual as well as changes in the social and environmental contexts. The challenge for the measurement process is to disentangle true change from unreliability.
This workshop comes at a time when the federal government is undertaking several initiatives with respect to the measurement of disability in federal data collection efforts. The Americans with Disabilities Act of 1990 (ADA) defines disability as (1) a physical or mental impairment that substantially limits one or more of the major life activities of the individual, (2) a record of a substantially limiting impairment, or (3) being regarded as having a substantially limiting impairment. Although the measurement of disability within household surveys is not bound by the ADA definition, the passage of the ADA provides a socioenvironmental framework for how society comprehends and uses terms such as disability and impairment (e.g., the popular press and court rulings on ADA-related litigation). These definitions will evolve as a function of litigation related to ADA legislation and presentation of that litigation in the press. Hence, society is entering a period in which potential dynamic shifts in the comprehension and interpretation of the language associated with the measurement of persons with disabilities can be anticipated.
The paper presented in this chapter is intended to serve as a means of facilitating discussion among individuals from diverse theoretical and empirical disciplines concerning the methodological issues related to the measurement of persons with disabilities. As a first step to achieving this goal, a common language and framework needs to be established for the enumeration and assessment of the various sources of error that affect the survey measurement process. The chapter draws from several empirical investigations to provide evidence as to the extent of knowledge concerning the error properties associated with various approaches to the measurement of functional limitations and work disability.
For the purpose of defining a framework that can be used to examine error associated with the measurement of persons with disabilities, I draw upon the conceptual structure and language used by Groves (1989), based on earlier work of Kish (1965) and used by Andersen et al. (1979). Suchman and Jordan (1990) have described errors in surveys as the discrepancy between the concept of interest to the researcher and the quantity actually measured in the survey. Bias, according to Kish (1965, p. 509), refers to systematic errors in a statistic that affect any sample taken under a specified survey design with the same constant error or, as stated by Groves (1989), is the type of error that affects the statistic in all implementations of a survey. Variable errors are those errors that are specific to a particular implementation of a design, that is, specific to the particular trial. The concept of variable error requires the possibility of repeating the survey, with changes in the units of replication, that is, the particular set of respondents, interviewers, supervisors, coding, editing, and data entry staff.
Within the framework of survey methodology, both variable error and bias are further characterized in terms of errors of nonobservation and errors of observation. As one would expect from the term, errors of nonobservation reflect failure to obtain observations for some segment of the population or for all elements to be measured. Errors of nonobservation are most often classified as arising from three sources: sampling, coverage, and nonresponse.
Sampling error represents one type of nonobservation variable error; it arises from the fact that measurements (observations) are taken for only a subset of the population. Sampling variance refers to changes in the value of some statistic over possible replications of a survey in which the sample design is fixed but different individuals are selected for the sample. Estimates based on a particular sample will not be identical to estimates based on a different subset of the population (selected in the same manner) or to estimates based on the full population.
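The replication logic behind sampling variance can be sketched in a few lines of Python. The earnings population, sample size, and number of replications below are hypothetical, chosen only to make the simulation concrete: the spread of the sample means across repeated draws under a fixed design is the sampling variance.

```python
import random
import statistics

random.seed(1)

# Hypothetical finite population of 10,000 earnings values.
population = [random.lognormvariate(10, 0.6) for _ in range(10_000)]
true_mean = statistics.mean(population)

# Draw many independent simple random samples of n = 200 under the same
# design; each replication yields a different estimate of the mean.
sample_means = [
    statistics.mean(random.sample(population, 200)) for _ in range(1_000)
]

# The variability of the estimate over replications is the sampling variance.
sampling_variance = statistics.variance(sample_means)
print(f"population mean:            {true_mean:,.0f}")
print(f"mean of sample means:       {statistics.mean(sample_means):,.0f}")
print(f"sampling variance of mean:  {sampling_variance:,.0f}")
```

No single sample reproduces the population mean exactly, but the estimates cluster around it, and their spread shrinks as the sample size grows.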
Coverage error defines the failure to include all eligible population members on the list or frame used to identify the population of interest. Those members not identified on the frame have a zero probability of selection and are never measured. For example, in the United States, approximately 5 percent of the population live in households without telephone service; any survey that is conducted by telephone and that attempts to describe the entire household-based population of the United States therefore suffers from coverage error. To the extent that those without telephones differ from those with telephones for the construct of interest, the resulting estimates will be biased.
Nonresponse error can arise from failure to obtain any information from the persons selected to be measured (unit nonresponse) or from failure to obtain complete information from all respondents to a particular question (item nonresponse). The extent to which nonresponse affects survey statistics is a function of both the rate of nonresponse and the difference between respondents and nonrespondents, as illustrated in the following formula:

yr = yn + nr(yr − ynr)

where:

yr = the statistic estimated from the r respondents,
yn = the statistic estimated from all n sample cases,
ynr = the statistic estimated from the nr nonrespondents, and
nr = the proportion of nonrespondents.
Knowing the response rate is not sufficient to determine the level of nonresponse bias; studies with both high and low rates of nonresponse can suffer from nonresponse bias.
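The interplay of nonresponse rate and respondent-nonrespondent differences can be illustrated numerically. The earnings values below are hypothetical; the identity they demonstrate is that the respondent-based estimate equals the full-sample estimate plus a bias term that is the product of the nonresponse rate and the respondent-nonrespondent difference.

```python
# Hypothetical mean earnings for a sample of 10 persons, 4 of whom
# did not respond to the survey.
respondents = [32_000, 45_000, 51_000, 38_000, 60_000, 42_000]
nonrespondents = [18_000, 22_000, 25_000, 20_000]

full_sample = respondents + nonrespondents
nr = len(nonrespondents) / len(full_sample)       # proportion of nonrespondents

y_r = sum(respondents) / len(respondents)         # respondent mean
y_nr = sum(nonrespondents) / len(nonrespondents)  # nonrespondent mean
y_n = sum(full_sample) / len(full_sample)         # full-sample mean

bias = nr * (y_r - y_nr)                          # nonresponse bias of y_r
print(f"respondent mean:  {y_r:,.0f}")
print(f"full-sample mean: {y_n:,.0f}")
print(f"nonresponse bias: {bias:,.0f}")
```

With a 40 percent nonresponse rate and nonrespondents earning far less than respondents, the respondent-based mean overstates the full-sample mean substantially; had nonrespondents resembled respondents, the same 40 percent nonresponse rate would have produced little or no bias.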
As noted by Groves and Couper (1998), it is useful to further distinguish among the types of unit nonresponse, each of which may be related to the failure to measure different types of persons. For most household data collection efforts involving interviewers, the final outcome of an interview attempt is often classified into one of the following four categories: completed or partial interview, refusal, noncontact, and other noninterview.1 Survey design features can affect the distribution of cases across the various categories. Noncontact rates are affected by the length of the field period (in which short field periods result in higher noncontact rates than longer field periods). Surveys that place greater demands on the respondent may suffer from higher refusal rates than less burdensome instruments. The choice of respondent rule affects the rate of nonresponse; designs that permit any knowledgeable adult within the household to serve as the respondent provide an interviewer with some flexibility, should one adult within the household refuse or be unable to participate. Field efforts that fail to accommodate non-English-speaking respondents or that focus their attention on frail subpopulations tend to experience higher rates of other noninterviews.
Observational errors can arise from any of the elements directly engaged in the measurement process, including the questionnaire, the respondent, and the interviewer, as well as the characteristics that define the measurement process (e.g., the mode and method of data collection). This section briefly reviews the theoretical framework and empirical findings related to the various sources of measurement error in surveys.
Tourangeau (1984) and others (see Sudman et al. for a review) have categorized the survey question-and-answer process as a four-step process involving comprehension of the question, retrieval of information from memory, assessment of the correspondence between the retrieved information and the requested information, and communication of the response. In addition, the encoding of information, a process outside the control of the survey interview, determines a priori whether the information of interest is available for the respondent to retrieve.
Comprehension of the question involves the assignment of meaning to the question by the respondent. Ideally, the question will convey the meaning of interest to the researcher. However, several linguistic, structural, and environmental factors affect the interpretation of the question by the respondent. These factors include the specific wording of the question, the structure of the question, the order in which the questions are presented, the overall topic of the questionnaire, whether the question is read by the respondent (self-administration) or is presented to the respondent by an interviewer, and the mode of communication used by the interviewer (that is, telephone versus face-to-face presentation). The wording of a question is often seen as one of the major problems in survey research: although one can standardize the language read by the respondent or the interviewer, standardization of the language does not imply standardization of the meaning. For example, "Do you own a car?" appears to be a simple question from the perspective of semantics and structure. However, several of the words in the question are subject to variation in interpretation, including "you" (just the respondent or the respondent and his or her family), "own" (completely paid for, purchased as opposed to rented), and even the word "car" (does this include vans and trucks?). The goal for the questionnaire designer is to develop questions that exhaust the range of possible interpretations, making sure that the particular concept of interest is the concept that the respondent has in mind when responding to the item.
One source of variation in a respondent's comprehension of survey questions is due to differences in the perceived intent or meaning of the question. Perceived intent can be shaped by the sponsorship of the survey, the overall topic of the questionnaire, or the environment more immediate to the question of interest, such as the context of the previous question or set of questions or the specific response options associated with the question.
Once the respondent comprehends the question, he or she must retrieve the relevant information from memory, make a judgment as to whether the retrieved information matches the requested information, and communicate a response. Much of the measurement error literature has focused on the retrieval stage of the question-answering process, classifying the lack of reporting of an event as retrieval failure on the part of the respondent and comparing the characteristics of events that are reported with those that are not reported. Several factors have been found to be related to the quality of reporting, including the length of the reference period of interest and the salience of the information. For example, the literature suggests that the greater the length of the recall period, the greater the expected bias in the reporting of episodic information (e.g., Cannell et al., 1965; Sudman and Bradburn, 1973). Salience is hypothesized to affect the strength of the memory trace and, subsequently, the effort involved in retrieving the information from long-term memory. The weaker the trace, the greater the effort needed to locate and retrieve the information.
As part of the communication of the response, the respondent must determine whether he or she wishes to reveal the information as part of the survey process. Survey instruments often ask questions about socially and personally sensitive topics. It is widely believed and well documented that such questions elicit patterns of underreporting (for socially undesirable behavior and attitudes), as well as overreporting (for socially desirable behaviors and attitudes). The determination of social desirability is a dynamic process and is a function of the topic of the question, the immediate social context, and the broader social environment at the time the question is asked. Even if the respondent is able to retrieve accurate information, he or she may choose to edit this information at the response formation stage as a means of reducing the costs associated with revealing the information.
The use of proxy reporters, that is, asking individuals within sampled households to provide information about other members of the household, is a design decision that is often framed as a trade-off among costs, sampling errors, and nonsampling errors. The use of proxy informants to collect information about all members of a household can increase the sample size (and hence reduce the sampling error) at a lower marginal data collection cost than increasing the number of households. The use of proxy respondents also facilitates the provision of information for those who would otherwise be lost to nonresponse because of an unwillingness or inability to participate in the survey interview. However, the cost associated with the use of proxy reporting may be an increase in the rate of errors of observation associated with poorer-quality reporting for others compared with the quality that would have been obtained under a rule of all self-response.
Most of the evaluations of the quality of proxy responses compared with the quality of self reports have focused on the reporting of autobiographical information (e.g., Mathiowetz and Groves, 1985; Moore, 1988) with some recent investigations examining the convergence of self and proxy reports of attitudes (Schwarz and Wellens, 1997). The literature is, however, for the most part silent with respect to the quality of proxy reports for personal characteristics, the exception being a small body of literature that addresses self-reporting versus proxy reporting effects in the reporting of race/ethnicity (Hahn et al., 1996) and the reporting of activities of daily living (e.g., Mathiowetz and Lair, 1994; Rodgers and Miller, 1997). The findings suggest that proxy reports of functional limitations tend to be higher than self-reports; the research is inconclusive as to whether the discrepancy is a function of overreporting on the part of proxy informants, underreporting on the part of self-respondents, or both.
For interviewer-administered questionnaires, interviewers may affect the measurement processes in one of several ways, including:
The first two factors contribute to measurement error from a cognitive or psycholinguistic perspective in that different respondents are exposed to different stimuli; thus, variation in responses is, in part, a function of the variation in stimuli. All three factors suggest that the interviewer effect contributes to an increase in variable error across interviewers. If all interviewers erred in the same direction (or their characteristics resulted in errors of the same direction and magnitude), interviewer bias would result. For the most part, the literature indicates that among well-trained interview staff, interviewer error contributes to the overall variance of estimates as opposed to resulting in biased estimates (Lyberg and Kasprzyk, 1991).
Any data collection effort involves decisions concerning the features that define the overall design of the survey, referred to here as the "essential survey conditions." In addition to the sample design and the wording of individual questions and response options, these decisions include the following:
No single design feature is clearly superior with respect to overall data quality. For example, as noted above, interviewer variance is one source of variability that can be eliminated through the use of a self-administered questionnaire. However, the use of an interviewer may aid in the measurement process by providing the respondent with clarifying information or by probing insufficient responses. The use of a panel survey design, with repeated measurements with the same individuals, facilitates more efficient estimation of change over time (compared with the use of multiple cross-sectional samples); however, panel designs may be subject to higher rates of nonresponse (as a result of nonresponse at every round of data collection) or panel conditioning bias, an effect in which respondents alter their reporting behavior as a result of exposure to a set of questions during an earlier interview.
The following scenario is an illustration of statistical measures of error used by survey methodologists. Assume that the measure of interest is personal earnings among all adults in the United States. A "true value" exists if the construct of interest is carefully defined. The data will be collected as part of a household-based health survey being conducted by telephone. The decision to use the telephone for data collection implies that approximately 5 percent of the adults will not be eligible for selection. To the extent that the personal earnings of adults without telephones differ significantly from those with telephones, population-based estimates for the entire adult population will suffer from coverage bias. Similarly, not all eligible sample persons will participate in the interview because of refusal to cooperate, an inability on the part of the survey organization to contact the respondent, or other reasons, such as language barriers or poor health that limits participation. Once again, to the extent that the earnings of those who participate differ significantly from those who do not participate, population-based estimates of earnings will suffer from nonresponse bias.
If all respondents misreport their earnings, underreporting their earnings by 10 percent, and they consistently do so in response to repeated measures, the measure will be reliable but not valid and population estimates based on the question (e.g., population means) would be biased. However, multivariate model-based estimates that examine the relationship between earnings and human capital investment would not be biased, since all respondents erred in the same direction and relative magnitude. Differential response error, for example, the overreporting of earnings by low-income individuals and the underreporting of earnings by high-income individuals, may produce unbiased population estimates (e.g., mean earnings per person) but biased model-based estimates related to individual behavior.
The language and concepts of measurement error in psychometrics are different from the language and concepts used within the fields of survey methodology and statistics. The focus for psychometrics is on variable errors; from the perspective of classical true score theory, all questions produce unbiased estimates, but not necessarily valid estimates, of the construct of interest. The confusion arises in that both statistics and psychometrics use the terms validity and reliability, sometimes to refer to very similar concepts and sometimes to refer to concepts that are quite different. Within psychometrics, the terms validity and reliability are used to describe two types of variable error. Validity refers to "the correlation between the true score and the respondent's answer over trials" (Groves, 1991, p. 8). In psychometrics, the validity of a measure can be assessed only for a population, whereas the survey methodological literature assesses the validity of both population estimates and individuals' responses.
Reliability refers to the ratio of the true score variance to the observed variance, where variance refers to variability over persons in the population and over trials within a person (Bohrnstedt, 1983). Once again, the measurement of reliability from this perspective does not facilitate measurement for a person but produces a measure of reliability specific to the particular set of individuals for whom the measurement was taken.
The psychometric literature identifies several means by which validity can be assessed; the choice of measures is, in part, a function of the purpose of the measurement. These measures of validity include content, construct, concurrent, predictive, and criterion. If one considers that the questions included in a particular instrument represent a sampling of all questions that could have been included to measure the construct of interest, content validity refers to the comprehensiveness as well as the relevance of those questions. Content validity refers to the extent to which the question or questions reflect the domain or domains reflected in the conceptual definition. Face validity refers to the extent to which each item appears to measure that which it purports to measure. Cognitive interviewing techniques that focus on the comprehension of items by respondents are, to some extent, a test of face validity.
Criterion-related validity evaluates the extent to which the measure of interest correlates highly with a "gold standard." The gold standard could consist of a different self-reported measure, a behavioral measure, or an observation or evaluation outside the measurement process (e.g., clinical evaluation). Criterion-related validity is further categorized as concurrent validity or predictive validity. Concurrent validity refers to the correlation between the item of interest and some other item, event, or behavior measured at the same point in time, whereas predictive validity refers to the correlation between an indicator measured at time t and some other measure, event, or behavior measured at time t + 1.
When no gold standard exists, validity is evaluated in terms of the correlation between the measure of interest and other measures, according to theory-based hypotheses. As noted by McDowell and Newall (1996), "construct validation begins with a conceptual definition of the topic or construct to be measured, indicating the internal structure of its components and the theoretical relationship of scale scores to external criteria" (p. 33).
Measures of reliability include internal consistency (often referred to as coefficient alpha or Cronbach's alpha), test-retest, and interrater reliability. Internal consistency measures the extent to which all items in a scale measure the same underlying concept; it is only applicable for multi-item Likert scales. The reliability coefficient is a function of both the extent to which the items are homogeneous and the number of items in the scale; the coefficient increases with an increase in either the homogeneity of the items or an increase in the number of items. Test-retest reliability involves the measurement of the same person under the same measurement conditions at two points in time and can be used for single-item measures, as well as multi-item scales.2 Interrater reliability refers to the consistency with which different raters or observers rating the same person agree with one another.
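The internal-consistency coefficient can be computed directly from its standard definition: alpha = k/(k-1) × (1 − Σ item variances / variance of the total score). The four-item Likert scale responses below are hypothetical, included only to make the computation concrete.

```python
import statistics

def cronbach_alpha(items):
    """Coefficient alpha for a scale.

    items: list of k lists, each holding one item's scores across respondents.
    """
    k = len(items)
    # Each respondent's total score across the k items.
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(statistics.variance(item) for item in items)
    return k / (k - 1) * (1 - item_var / statistics.variance(totals))

# Hypothetical 4-item scale (scored 1-5) answered by six respondents;
# rows are items, columns are respondents.
scale = [
    [4, 2, 5, 3, 4, 1],
    [5, 2, 4, 3, 5, 2],
    [4, 3, 5, 2, 4, 1],
    [3, 2, 4, 3, 5, 2],
]
print(f"alpha = {cronbach_alpha(scale):.2f}")
```

Because the six respondents answer the four items in a consistent pattern, the item scores covary strongly and alpha is high; adding more items of similar homogeneity would push the coefficient higher still, as noted above.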
Returning to the example of the measurement of earnings to illustrate the measurement error properties of the construct in terms of psychometrics, assume that the question or questions designed to measure earnings are both comprehensive and relevant. Therefore, the questions would be assessed as having content validity (face validity). If, as noted above, all respondents underreported their earnings by 10 percent, the construct would have a lower score with respect to criterion validity, but since all respondents erred in the same direction and the same magnitude, the indicator would have construct validity. If repeated measurement resulted in consistent reports by all respondents, test-retest measures would indicate a high degree of reliability, not dissimilar to the conclusion drawn by statisticians.
Similar to any other measurement of persons via the survey process, the identification of persons with disabilities is subject to the various sources of error discussed above. The measurement of persons with disabilities raises particular challenges, in light of the complexity of the phenomenon of interest and the demands of the measurement process. Some of the various sources that may be of particular importance are highlighted.
The interactive nature of the survey interview places great demands on the sensory and physical resources of respondents. A face-to-face interview requires that the respondent have the capacity to hear the questions, respond orally, understand individual questions and response categories, and be able to maintain cognitive focus. In addition, the respondent must tolerate the physical demands of the interview, a task that may take up to an hour or two. Impairments or disabilities may limit a person's ability to participate in the survey process or limit access to the individual. The essential survey design features of a data collection effort can facilitate or limit access and participation of persons with disabilities. This is not unique to the measurement of persons with impairments or disabilities. The use of the telephone for data collection restricts the sample to those households with telephones; if the data collection by telephone does not accommodate the use of TTY technology, hearing-impaired individuals will also not be measured. Similarly, the use of self-administered paper-and-pencil questionnaires limits participation to those who are literate and whose vision permits the reading of the font size used on the questionnaire. The implementation of a self-response rule eliminates from measurement those for whom gatekeepers deny access, those who, although willing to participate, are unable to do so because of physical, mental, or emotional impairments, and those for whom the barrier to participation is language, whether a different spoken language or sign language.
From a cognitive perspective, the measurement of persons with disabilities offers particular challenges. First, one needs to understand how individuals encode information about impairments and disabilities. In addition, effective questionnaire design requires an understanding of how the encoding of the information varies according to perceptual perspective (self-response versus other response, nature of the relationship between the respondent and the person for whom they are reporting). Second, little is known about how ability (capacity) is measured independent of environmental context (participation).
Many of the questions and sets of questions used to measure impairments and disability are plagued by comprehension problems related to both semantic and lexical complexity. For example, questions concerning work disability are subject to comprehension problems with respect to the shared meaning of "work." As noted earlier, the respondent must infer whether limitations in the kind or amount of work include factors related to transportation and access to the workplace. The desire for parsimonious means by which an individual's status can be assessed with respect to impairments or particular functional limitations has led to the creation of "composite" screening questions that nevertheless represent a single question and that may therefore be cost-effective, even though they press against the limits of working memory.3
The response task requires the respondent to retrieve information, determine the relevance of that information to the posed question, and formulate a response. Often the respondent is limited in the form of the response to a simple classification (e.g., yes, limited in the kind or amount of work versus not limited) that fails to capture the full spectrum of the enablement-disablement process and the complexity of the phenomenon of interest. The mapping of this complex phenomenon to a limited number of response categories is most likely fraught with error.
The integration of theories of cognitive psychology with survey methodology has given rise to new methods of questionnaire design and evaluation. Many of the current measures of disability used in federal data collection efforts have not been subjected to testing methods common to new questions and questionnaires, for example, cognitive interviewing and behavior coding. Cognitive interviewing encompasses several techniques designed to elicit information about the respondent's comprehension of the question, the strategies by which the respondent attempts to retrieve information from memory, judgments as to whether the retrieved information meets the perceived goals of the question, and the formulation of responses. These techniques include the use of "think-aloud" protocols, follow-up probes, vignettes, and "sort-order" tasks (Forsyth and Lessler, 1991; Willis et al., 1991).
A small body of literature has attempted to address problems in the comprehension of functional limitation questions in community-based survey interviews through the use of cognitive interviewing techniques (Jobe and Mingay, 1990; Keller et al., 1993). The findings from these investigations of functional limitation questions by use of cognitive interviewing techniques suggest that respondents varied in their interpretation of terms, tended to emphasize capacity rather than actual performance, overlooked qualifying statements within the question, failed to remember the use of human assistance, or failed to remember help with specific activities.4
What is meant when an individual is asked to classify him- or herself or someone else with respect to disability? Although reliable measurement may call for the use of clear, unambiguous, and objective definitions, it is questionable whether these goals are achievable with respect to the measurement of disability. Disability is a dynamic concept related to an underlying interface between an individual, societal accommodations and barriers, cultural norms and expectations, and behavioral norms. The use of "fuzzy logic" in which attributes apply only partially to given individuals may be more appropriate than standard survey techniques for the classification of disability (Hahn et al., 1996).
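The contrast between the standard crisp classification and the graded membership that Hahn et al. (1996) suggest can be sketched in a few lines of Python. The underlying severity scale, the cutoff, and the breakpoints of the membership function below are entirely hypothetical and serve only to illustrate the distinction:

```python
# Illustrative contrast between a crisp yes/no survey classification and a
# graded ("fuzzy") membership function for the category "disabled."
# The 0-1 severity scale and all breakpoints are hypothetical.

def crisp_disabled(severity: float) -> bool:
    # Conventional survey item: a single cutoff forces a yes/no answer.
    return severity >= 0.5

def fuzzy_membership(severity: float) -> float:
    # Graded membership: 0 = clearly not disabled, 1 = clearly disabled,
    # with partial membership in between (a simple linear ramp).
    return min(1.0, max(0.0, (severity - 0.25) / 0.5))

# Cases near the cutoff receive an all-or-nothing crisp label but only
# partial fuzzy membership.
for s in (0.2, 0.45, 0.55, 0.9):
    print(s, crisp_disabled(s), round(fuzzy_membership(s), 2))
```

The point of the sketch is that two individuals just on either side of the crisp cutoff receive opposite classifications, whereas the graded function assigns them nearly identical, intermediate degrees of membership.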
Although theories from cognitive psychology can provide information about the different cognitive processes by which self and proxy reporters engage in the response formulation process, one can turn to theories from social cognition to understand how individuals classify themselves and each other with respect to social categories. Although social cognition draws heavily from the theory and methods of cognitive psychology, as a subfield its focal point is on social objects, specifically, individuals or groups of individuals.
As noted by Brewer,
In comparison to object categories, social categories have been postulated to be overlapping rather than hierarchically organized . . ., disjunctively rather than conjunctively defined . . . and more susceptible to accessibility effects. (Brewer, 1988, p. 1)
She further states that "social categories are assumed to be 'fuzzy sets' represented in the form of prototypical images rather than verbal trait lists" (Brewer, 1988, p. 10).
Social cognition also offers a theoretical perspective that provides information about the divergent perspectives of actors and observers. The actor-observer difference suggests that actors draw on situational information to explain behavior at any given time, whereas observers use stable dispositional properties of the actor to understand behavior (Jones and Nisbett, 1971). To the extent that proxy reporters view disabilities as stable rather than dynamic characteristics, one would anticipate discrepancies between self-reports and proxy reports.
Two sets of concepts drawn from social psychology are also useful for consideration with respect to the measurement of disability. The first is the concept of self; from a sociological perspective, self-conceptions involve three components: (1) how an individual sees him- or herself, (2) how other people actually see the individual, and (3) how the individual believes others see him or her (Rosenberg, 1990). The National Health Interview Survey-Disability Supplement (NHIS-D) and the National Organization on Disability/Harris Survey of Americans with Disabilities included questions that asked whether the respondent perceived that he or she had a disability and whether others perceived that the respondent had a disability. The second concept of interest involves the notion of social identity and the groups, statuses, and social categories to which the members of society are recognized as belonging. If the social identity category is ambiguous, the self-concept related to the social identity will also be ambiguous.
As noted by Jette and Badley in Chapter 2, the measurement of disability is often presented in surveys as an "all or nothing phenomenon." This approach to the measurement of disability assumes that (1) the respondent recognizes and identifies with the socially defined label and (2) the respondent is willing to reveal membership in the group. If disability were an "all-or-nothing" phenomenon, identification with the classification would be less ambiguous; however, as already noted, the enablement-disablement process is a dynamic one, subject to variation as a function of both self and society. To the extent that identification or affiliation with group membership carries with it any type of social stigma, willingness to reveal membership in the group also carries with it a social cost, not unlike other phenomena subject to social desirability bias.
Ambiguous social classification categories are also more likely to be subject to context effects; respondents use the specific wording of questions, the immediately prior questions, or the overall focus of the question as a means for interpreting questions on disability. From a theoretical perspective, it is not surprising to find that estimates of the number of persons with disabilities vary as a function of differences in the specific wording of the question, the number of questions used to determine the prevalence and severity of impairments and disabilities, the context of the questions immediately proximate to the question of interest, and the overall focus of the questionnaire (health versus employment versus program participation).
To date, most investigations with respect to the error properties associated with the measurement of persons with disabilities or the measurement of persons with work disabilities have focused on errors of observation, ignoring differences in estimates due to coverage error and nonresponse error. This review of the empirical literature is therefore focused on errors of observation. As an illustration of the type of empirical investigations concerning error in the measurement of disability, this section begins by examining the work that has been done to date with respect to measures of activities of daily living (ADL). The intent is to provide an illustration of the type of work that has been done (and not done) with respect to a frequently used measure of functional limitation. The focus is then turned to the measurement of persons with work disabilities.
Although there are several different measurement methods for the assessment of physical disability, one of the most often used (within the context of survey measurement) is the Index of Activities of Daily Living, often referred to as the Index of ADL (Katz et al., 1963). The index was originally developed to measure the physical functioning of elderly and chronically ill patients, but several national surveys of the general population administer the index to adults of all ages. The index assesses independence in six activities: bathing, dressing, toileting, transferring from a bed or chair, continence, and feeding. Despite its wide acceptance and use, the psychometric properties of the index have not been well documented. Brorsson and Asberg (1984) reported reliability scores of 0.74 to 0.88 (based on 100 patients). Katz et al. (1970) applied the Index of ADLs as well as other indexes to a sample of patients discharged from hospitals for the chronically ill and reported correlations between the index and a mobility scale and between the index and a confinement measure of 0.50 and 0.39, respectively. Most assessments of the Index of ADLs have examined the predictive validity of the index with respect to independent living (e.g., Katz and Akpom, 1976) or the length of hospitalization and discharge to home or death (e.g., Asberg, 1987). These studies indicate relatively high levels of predictive validity.
Despite the psychometric findings, a growing body of survey literature suggests that the measurement of functional limitations via the use of ADL scales is subject to substantial amounts of measurement error and that measurement error is a significant factor in the apparent improvement or decline in functional health observed in longitudinal data. Jette (1994) found that minor changes in the wording of the questions resulted in significant differences in the percentage of the population identified as being limited. Rodgers and Miller (1997) directly compared responses by the same respondents (or more specifically, for the same target individuals) by using different sets of ADL items and across different modes.5 They conclude that the measurements of functional limitations with respect to counts of ADLs, indications of the use of assistive devices or personal help, and indications of any difficulty are all subject to large amounts of measurement error, of which a substantial portion is random error. Similar to other empirical work (e.g., Mathiowetz and Lair, 1994), their findings indicate that the use of proxy respondents results in higher levels of reporting, of which only 25 to 33 percent can be explained by demographic characteristics and health variables of the target individual. The finding suggests that higher levels of functional limitations reported by proxy respondents are not simply a result of selection bias, in which those with the most severe limitations are reported by proxy.6 Their analyses also suggest that there was no clear effect of mode of data collection on estimates of functional limitations.
As illustrative of the variability and lack of reliability that is evident in survey estimates of functional limitations, Tables 3-1 and 3-2 present findings from the 1990 decennial census and the Content Reinterview Survey (CRS) (U.S. Bureau of the Census, 1993; McNeil, 1993). The CRS was conducted approximately 5 to 9 months following the 1990 decennial census, with a sample of 15,000 housing units selected from among those housing units assigned to complete the long form of the census. With respect to mobility limitations, estimates from the two surveys appear to be similar (e.g., 2.03 versus 2.05 percent), but examination of the responses for individuals indicates a low rate of consistent responses (less than 50 percent) among those who reply affirmatively in either survey. With respect to personal care limitations, once again, a high rate of inconsistency in the responses is seen among individuals who respond affirmatively to the question in either survey. For example, among those 16 to 64 years of age, the large majority (83.4 percent) of those who report a self-care limitation at the time of the census fail to report a self-care limitation in the CRS.
NOTE: The prevalence rate based on the census was 2.03 percent, of which 49.0 percent were consistent responses. The prevalence rate based on the Content Reinterview Survey was 2.05 percent, of which 48.5 percent were consistent responses.
SOURCE: McNeil, 1993.
NOTE: The prevalence rate based on the census was 2.9 percent, of which 16.6 percent were consistent responses. The prevalence rate based on the Content Reinterview Survey was 1.3 percent, of which 36.5 percent were consistent responses.
SOURCE: McNeil, 1993.
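The consistency figures reported in the table notes can be derived mechanically from a two-by-two cross-classification of original and reinterview responses. The Python sketch below uses illustrative cell counts chosen to mimic the mobility-limitation pattern (near-identical aggregate prevalence, yet barely half of affirmative reports repeated); the counts are hypothetical, not the actual census/CRS tabulations:

```python
# Hypothetical 2x2 cross-classification of responses to the same yes/no
# limitation item in an original interview and a reinterview.

def agreement_stats(yes_yes, yes_no, no_yes, no_no):
    """Prevalence in each interview and consistency among affirmatives.

    yes_yes: reported a limitation in both interviews
    yes_no : limitation reported in interview 1 only
    no_yes : limitation reported in interview 2 only
    no_no  : limitation reported in neither interview
    """
    n = yes_yes + yes_no + no_yes + no_no
    prev1 = (yes_yes + yes_no) / n      # prevalence, interview 1
    prev2 = (yes_yes + no_yes) / n      # prevalence, interview 2
    # Share of each interview's "yes" cases repeated in the other interview
    consistent1 = yes_yes / (yes_yes + yes_no)
    consistent2 = yes_yes / (yes_yes + no_yes)
    return prev1, prev2, consistent1, consistent2

# Illustrative counts: aggregate prevalence is nearly identical across the
# two interviews even though only about half of affirmative reports repeat.
p1, p2, c1, c2 = agreement_stats(yes_yes=100, yes_no=103, no_yes=105, no_no=9692)
print(f"prevalence: {p1:.2%} vs {p2:.2%}; consistency: {c1:.1%} / {c2:.1%}")
# prints: prevalence: 2.03% vs 2.05%; consistency: 49.3% / 48.8%
```

The example makes the chapter's point concrete: agreement of the marginal totals says nothing about agreement at the level of individual respondents.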
Comparison of the percentage of persons with mobility and self-care limitations from the two surveys is confounded by differences in the essential survey conditions under which the data were collected and that most likely contribute to the discrepancies evident in the data. These differences include:
Finally, the possibility that the lack of reliability is indicative of the occurrence of real change between the time of the census and the time of the CRS must also be considered.
Although one can enumerate possible sources that explain the low rate of consistency between the two surveys, the lack of experimental design does not permit the identification of the relative contributions of the various design features to the overall lack of stability of these estimates.
Empirical evidence shows that even when questions are administered under the same essential survey conditions, responses are subject to a high rate of inconsistency. This evidence comes from the administration of the same topical module on functional limitations and disability to respondents in the 1992-1993 panel of the Survey of Income and Program Participation. The module was administered between October 1993 and January 1994 (Time 1) and then again between October 1994 and January 1995 (Time 2). The context of the questionnaire is the same in both waves; the topical module is preceded by the core interview, which focuses on earnings, transfer income, program participation, and other forms of income. Information is collected for all members of the household, usually by having one person report for himself or herself and all other family members. In addition, information as to who served as the respondent is recorded; thus one can examine consistency in the reporting of information across time among all self-responses. Table 3-3 presents selected comparisons of functional limitations and sensory impairments reported at Time 1 with those reported at Time 2. The comparisons clearly reveal high levels of inconsistency, even among self-respondents. For example, among those who report an inability to walk at Time 1, only 70.3 percent report the same status at Time 2. Limiting the comparison to self-reports only does not greatly improve the consistency. Among self-reporters, 76.7 percent of those reporting inability to walk at Time 1 report the same status in the subsequent interview.
SOURCE: McNeil, 1998.
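For a rare attribute such as a specific functional limitation, raw agreement can look reassuringly high while reliability is poor, because most of the agreement comes from the large "no/no" cell. A chance-corrected index such as Cohen's kappa makes this visible; the sketch below implements kappa from first principles, and the cell counts are again illustrative rather than SIPP tabulations:

```python
# Cohen's kappa for a yes/no item asked twice of the same respondents.
# Cell counts are illustrative only.

def cohens_kappa(yes_yes, yes_no, no_yes, no_no):
    """Chance-corrected agreement between two administrations of an item."""
    n = yes_yes + yes_no + no_yes + no_no
    observed = (yes_yes + no_no) / n          # raw agreement rate
    p1 = (yes_yes + yes_no) / n               # "yes" rate, wave 1
    p2 = (yes_yes + no_yes) / n               # "yes" rate, wave 2
    expected = p1 * p2 + (1 - p1) * (1 - p2)  # agreement expected by chance
    return (observed - expected) / (1 - expected)

# Raw agreement here exceeds 97 percent, yet kappa is only about 0.48,
# signalling substantial random response error for a rare attribute.
k = cohens_kappa(yes_yes=100, yes_no=103, no_yes=105, no_no=9692)
print(round(k, 2))  # prints: 0.48
```

This is one reason test-retest comparisons in the disability literature focus on consistency among affirmative reporters rather than on overall percent agreement.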
These empirical findings illustrate some of the error properties associated with the measurement of functional limitations and sensory impairments. The research indicates that despite psychometric measures that indicate a relatively high degree of reliability, survey applications offer several examples of low levels of reliability, even under conditions in which the essential survey conditions are held constant. Subtle changes in the wording of questions, the order of questions, or the immediate prior context offer further illustration of the lack of robustness of these items. Although one can enumerate all of the factors that may contribute to this volatility, the relative contributions of the various factors have not been experimentally determined.
The assessment of work disability in federal surveys has focused on variants of a limited number of questions, most of which concern whether the individual is limited in the kind or amount of work he or she is able to do or is unable to work at all because of a physical, mental, or emotional problem. Not dissimilar to the assessment of functional limitations, work disability is measured in data collection efforts that vary with respect to the essential survey conditions, the specific wording of questions, the number of questions asked, and the determination of severity, duration, and the use of assistive devices or environmental barriers. As McNeil (1993) points out, one of the problems with the current set of indicators designed to measure work disability is that many fail to acknowledge the role of environmental barriers and accommodations. He states:
Questions can be raised about the validity of data on persons who are "limited in kind or amount of work they can do" or are "prevented from working." The work disability questions make no mention of environmental factors, even though it is obvious that a person's ability to work cannot be meaningfully separated from his or her environment. Work may be difficult or impossible under one set of environmental factors but productive and rewarding under another. It would certainly be logical for a respondent to answer "no" to the question, "Do you have a condition that prevents you from working?" if the real reason he or she is not working is the inaccessibility of the transportation system or the lack of accommodations at the workplace. (pp. 3-4)
As noted in Chapter 2, the "fundamental conceptual issue of concern is that health-related restriction in work participation may not be solely or even primarily related to the health condition. . . ." One of the challenges facing questionnaire designers is the development of questions that match the conceptual framework of interest with respect to work disability, specifically, whether the focus is on the health condition that limits the individual's ability to perform specific tasks related to a specific job, the external factors related to the performance of work, other factors that affect participation in the work environment (e.g., transportation), or all three sets of factors.
Although McNeil (1993) raises questions concerning the validity of the work disability measures currently in use, several empirical investigations raise questions about the reliability of these measures, not unlike the findings with respect to the measurement of functional limitations and sensory impairments. Once again, it can be seen that differences in the wording of the questions, the context in which they are asked, the nature of the respondent, and other essential survey conditions, including the data collection organization and the sponsorship of the survey, may contribute to differences in estimates of the working-age disabled population.
Haber (1990, as revised from Haber and McNeil) examined work disability from selected surveys between 1966 and 1988. He notes that "despite a high degree of consistency in the social and economic composition of the disabled population over a variety of studies, the overall level of disability prevalence has varied considerably" (p. 43). Haber's findings are reproduced in Table 3-4. The estimates from the various surveys represent differences in the year of administration, the wording of the questions, the overall content of the survey, the mode of administration, the organization collecting the information, and the organization sponsoring the study. Although the wording of the questions is quite similar across the various surveys, there are some minor differences in specific wording (e.g., differences with respect to the emphasis on a health condition) and the order of the questions (e.g., whether the questions begin, as in the NHIS, by asking about whether a health condition keeps the person from working or begin, as in the SSA surveys, by asking whether the person's health limits the kind or amount of work that the person can do). As is evident from Table 3-4, the survey's content appears to be related to the overall estimate; the lowest rates of work disability prevalence come from the census and the March Supplement to the Current Population Survey (8.5 to 9.4 percent), and the highest rates come from the surveys sponsored by SSA (14.3 to 17.2 percent).
NOTE: SSA = Social Security Administration Disability Survey; SEO = Survey of Economic Opportunity; NHIS = National Health Interview Survey; SIE = Survey of Income and Education; March CPS = Annual March Supplement (Income Supplement) to the Current Population Survey; SIPP = Survey of Income and Program Participation.
SOURCE: Haber, 1990.
The lack of stability that was evident for estimates of mobility and self-care limitations between the 1990 census and the CRS is also evident for estimates of work disability. Table 3-5 presents the comparison of responses between the 1990 census and the CRS with respect to whether the person is limited in the kind of work, or the amount of work, or is prevented from working at a job because of physical, mental, or other health conditions. Once again, it can be seen that between one-third and almost one-half of the respondents are inconsistent in their responses.
NOTE: The prevalence rate based on the census was 7.7 percent, of which 68 percent were consistent responses. The prevalence rate based on the Content Reinterview Survey was 9.7 percent, of which 54.5 percent were consistent responses.
SOURCE: McNeil, 1993.
More recent investigations have used the extensive data from NHIS-D to investigate alternative estimates of the population with work disabilities. The data also provide an opportunity to examine inconsistencies in the reporting of work disability and receipt of SSI or SSDI benefits. For example, LaPlante (1999) found that, based on the data from the NHIS-D, 9.5 million adults 18 to 64 years of age report being unable to work because of a health problem. Among these 9.5 million adults, 5.3 million (or 56 percent) do not report receipt of SSI or SSDI benefits. If one looks at those who report receiving SSI or SSDI benefits, 75 percent report that they are unable to work and 13 percent report that they are limited in the kind or amount of work that they can perform, but 12.3 percent who report receipt of benefits do not report any limitation with respect to work.
Although these variations in estimates derived from different surveys suggest instability in the estimates of the proportion of persons with work disabilities as a function of the wording of the question, the nature of the respondent, and the essential survey conditions under which the measurement was taken, they provide little information about measurement error within the framework of either survey statistics or psychometrics. Little is known about the validity of these items or the reliability of these items, whether one views validity from the perspective of survey statistics as deviations from the true value or from the perspective of psychometrics as criterion-related or construct validity. The relative contributions of various sources of error are, for the most part, unknown; it is only known that various combinations of design features produce different estimates. None of the studies address errors of nonobservation.
Jette and Badley point out in Chapter 2 the conceptual problems inherent in many questions designed to measure persons with work disabilities, including the failure of most questions to enumerate the separate elements related to the role of work. That failure is evident in most work disability screening questions designed to be administered to the general adult population. The gap between the conceptual framework and the questions used to screen for work disability is illustrated by using questions from several federal data collection efforts.
The long form of the decennial census for the year 2000 includes the following questions:
Because of a physical, mental, or emotional condition lasting 6 months or more, does this person have any difficulty in doing any of the following activities: . . .
d. (Answer if this person is 16 years old or over.) Working at a job or business?
The respondent is to check a box corresponding to "Yes" or "No."
The question is complex for several reasons:
As with many single screening items, the question fails to address accommodations that facilitate participation or barriers that prohibit participation. For example, if an individual is currently employed in an environment that accommodates a health condition, the respondent must determine whether the person should be considered as having difficulty working, even though the present employment situation presents no difficulty to the person.
The NHIS asks two questions concerning work limitations:
Does any impairment or health problem NOW keep _______ from working at a job or business?
Is _______ limited in the kind OR amount of work _______ can do because of any impairment or health problem?
In contrast to the questions in the census long form, the NHIS questions do not enumerate the various areas of health for consideration, nor does either question include a qualifying statement with respect to duration. The two questions are more specific in addressing the impact on working; compared with the term "difficulty" used in the census questionnaire, the NHIS probes whether a condition prevents the person from working or limits the kind or amount of work. Once again, note the lack of distinction between the ability to perform the activities associated with the actual performance of the job and those activities related to the role of work. For those who retire early because of a health condition or impairment, would the respondent consider that health problem as keeping the person from working?
The point of the examples presented above is not to criticize the questionnaires in which they appear but rather to illustrate the problem of attempting to measure a complex, multidimensional, dynamic construct with a single question or a set of two questions. No single question, or even a pair of questions, can possibly tap into the various components of work disabilities. Clearly the first step toward a robust set of screening items is the acceptance of a shared conceptual framework and understanding of the dimensions of the construct of interest. That framework must consider the social environment in which the measurement of interest will be taken, understanding that the comprehension of the question is shaped not only by the specific words used in the question and the context of the question but also by the perceived intent of the question. The use of cognitive laboratory techniques can aid in the identification of problems of comprehension due to the use of inherently vague terms and differential perceptions of the intent of the question. Such techniques will aid in the understanding of the validity of the questions and, through the refinement of question wording, should improve the reliability of the items.
Simply documenting that variation in the essential survey conditions of the measurement process contributes to different estimates of persons with work disabilities is not sufficient; the marginal effects of various factors need to be measured and the impact needs to be reduced through the use of alternative design features. Both of these can be accomplished only through a program of experimentation. Similarly, the psychometric properties of these measures need to be assessed. Without undertaking a thorough program of development and evaluation, the discrepant estimates evident in the empirical literature will persist.
1 Other noninterview is used to classify cases in which contact was made with the members of the household in which the sample person resides, but for reasons such as physical or mental health, language difficulties, or other reasons not related to reluctance to participate, the interviewer was unable to conduct the interview.
2 Within survey research, the conduct of a reinterview under the same essential survey conditions as the original interview is an example of a test-retest assessment of reliability.
3 For example: "Because of a physical, mental or emotional problem does anyone in the family have any difficulty with activities such as bathing, dressing, eating, getting in or out of a chair or bed, or walking across a room?"
4 See also Beatty and Davis (1998) for a cognitive evaluation of questions from the Survey of Income and Program Participation and the National Health Interview Survey concerning discrepancies in print reading disability statistics.
5 Note, however, that the allocation across modes was not experimentally varied but rather was an artifact in the design in which older respondents (80 years and older) were assigned to the face-to-face mode of data collection and those less than 80 years of age were assigned to the telephone mode of data collection. However, a substantial number of respondents were interviewed in the mode other than that to which they were originally assigned; the crossover permits determination of both main and interaction effects related to the mode of data collection.
6 In comparisons of self-reports and proxy reports with clinical evaluations, Rubenstein et al. (1984) found self-responses to be more "optimistic" and responses obtained by proxy report to be more pessimistic, findings that suggest that both self- and proxy responses are subject to measurement error, albeit in different directions.