Assessment Glossary


The NCME Assessment Glossary provides definitions of frequently used terms in educational measurement. For many of the terms, multiple definitions can be found in the literature; also, technical usage may differ from common usage.



ability parameter: In item response theory (IRT), a theoretical value indicating the level of a test taker on the ability or trait measured by the test; analogous to the concept of true score in classical test theory.

ability testing: The use of tests to evaluate the current performance of a person in some defined domain of cognitive, psychomotor, or physical functioning.

accessible/accessibility: Degree to which the items or tasks on a test enable as many test takers as possible to demonstrate their standing on the target construct without being impeded by characteristics of the item that are irrelevant to the construct being measured.

accommodations/accommodated tests or assessments: Adjustments to test presentation, environment, content, format (including response format), or administration conditions that do not alter the assessed construct; they may be embedded within an assessment or applied after the assessment is designed. Accommodated scores should be sufficiently comparable to unaccommodated scores that they can be aggregated together.

accountability index: A number or label that reflects a set of rules for combining scores and other information to form conclusions and inform decision making in an accountability system.

accountability system: A system that imposes student performance-based rewards or sanctions on institutions such as schools or school systems or on individuals such as teachers or mental health care providers.

acculturation: A process related to the acquisition of cultural knowledge and artifacts that is developmental in nature and dependent upon time of exposure and opportunity for learning.

achievement levels/proficiency levels: Descriptions of a test taker's level of competency in a particular area of knowledge or skill, usually defined as ordered categories on a continuum, often labeled from "basic" to "advanced," or "novice" to "expert," that constitute broad ranges for classifying performance. See cut score.

achievement standards: See performance standards.

achievement test: A test to measure the extent of knowledge or skill attained by a test taker in a content domain in which the test taker has received instruction.

adaptation/test adaptation: 1. Any change in test content, format (including response format), or administration conditions that is made to increase test accessibility for individuals who otherwise would face construct-irrelevant barriers on the original test. An adaptation may or may not change the meaning of the construct being measured or alter score interpretations. An adaptation that changes score meaning is referred to as a modification; an adaptation that does not change score meaning is referred to as an accommodation. See accommodations and modifications. 2. Changes made to a test that has been translated into the language of a target group, taking into account the nuances of the language and culture of that group.

adaptive test: A sequential form of individual testing in which successive items, or sets of items, in the test are selected for administration based primarily on their psychometric properties and content, in relation to the test taker's responses to previous items.

adjusted validity/reliability coefficient: A validity or reliability coefficient—most often, a product-moment correlation—that has been adjusted to offset the effects of differences in score variability, criterion variability, or the unreliability of test and/or criterion scores. See restriction of range or variability.

aggregate score: A total score formed by combining scores on the same test or across test components. The scores may be raw or standardized. The components of the aggregate score may be weighted or not depending on the interpretation to be given to the aggregate score.

alignment: Degree to which the content and cognitive demands of test questions match targeted content and cognitive demands described in the test specifications.

alternate assessments or alternate tests: Assessments used to evaluate the performance of students in educational settings who are unable to participate in standardized accountability assessments even with accommodations.

alternate forms: Two or more versions of a test that are considered interchangeable, in that they measure the same constructs in the same ways, are built to the same content and statistical specifications, and are administered under the same conditions using the same directions. See equivalent forms, parallel forms.

alternate or alternative standards: Terms used in educational assessment to denote content and performance standards for students with significant cognitive disabilities.

analytic scoring: A method of scoring constructed responses (such as essays) in which each critical dimension of a particular performance is judged and scored separately, and the resultant values are combined for an overall score. In some instances, scores on the separate dimensions may also be used in interpreting performance. Contrast with holistic scoring.

anchor items: Items administered with each of two or more alternate forms of a test for the purpose of equating the scores obtained on these alternate forms.

anchor test: A set of anchor items used for equating.

assessment: Any systematic method of obtaining information from tests and other sources, used to draw inferences about characteristics of people, objects, or programs; a process designed to systematically measure or evaluate the characteristics or performance of individuals, programs, or other entities, for purposes of drawing inferences; sometimes used synonymously with test.

assessment literacy: Knowledge about testing that supports valid interpretations of test scores for their intended purposes, such as knowledge about test development practices, test score interpretations, threats to valid score interpretations, score reliability and precision, test administration, and use.

automated scoring: A procedure by which constructed response items are scored by computer using a rules-based approach.

battery: A set of tests usually administered as a unit. The scores on the tests usually are scaled so that they can readily be compared or used in combination for decision making.

behavioral science: A scientific discipline, such as sociology, anthropology, or psychology, in which the actions and reactions of humans and animals are studied through observational and experimental methods.

benchmark assessments: Assessments administered in educational settings at specified times during a curriculum sequence, to evaluate students' knowledge and skills relative to an explicit set of longer-term learning goals. See interim assessments.

bias: 1. In test fairness, construct underrepresentation or construct-irrelevant components of test scores that differentially affect the performance of different groups of test takers and, consequently, the reliability/precision and validity of interpretations and uses of their test scores. 2. In statistics or measurement, systematic error in a test score. See predictive bias, construct underrepresentation, construct irrelevance, fairness.

bilingual/multilingual: Having a degree of proficiency in two or more languages.

calibration: 1. In linking test scores, the process of relating scores on one test to scores on another that differs in reliability/precision from the first test, so that scores have the same relative meaning for a group of test takers. 2. In item response theory, the process of estimating the parameters of the item response function. 3. In scoring constructed-response tasks, procedures used during training and scoring to achieve a desired level of scorer agreement.

certification: A process by which individuals are recognized (or certified) as having demonstrated some level of knowledge and skill in some domain. See licensing, credentialing.

classical test theory: A psychometric theory based on the view that an individual's observed score on a test is the sum of a true score component for the test taker and an independent random error component.
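The model behind this definition can be sketched with a short simulation. The means, standard deviations, and sample size below are illustrative assumptions, not values from any particular test; under the classical model, score reliability equals the ratio of true-score variance to observed-score variance.

```python
import random

random.seed(1)

# Classical test theory: observed score X = true score T + random error E.
# Simulate 1,000 test takers with true scores around 50 (SD 10) and
# independent random error with SD 5 (all values illustrative).
true_scores = [random.gauss(50, 10) for _ in range(1000)]
errors = [random.gauss(0, 5) for _ in range(1000)]
observed = [t + e for t, e in zip(true_scores, errors)]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Under this model, reliability = var(T) / var(X); with these
# assumed SDs the expected value is 100 / (100 + 25) = 0.8.
reliability = variance(true_scores) / variance(observed)
print(round(reliability, 2))
```

Because the error component is random and independent of the true score, the simulated reliability will fall close to, but not exactly at, the theoretical 0.8.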

classification accuracy: Degree to which the assignment of test takers to specific categories is accurate; the degree to which false positive and false negative classifications are avoided. See sensitivity and specificity.

coaching: Planned short-term instructional activities for prospective test takers provided prior to the test administration for the primary purpose of improving their test scores. Activities that approximate the instruction provided by regular school curricula or training programs are not typically referred to as coaching.

coefficient alpha: An internal consistency reliability coefficient based on the number of parts into which the test is partitioned (e.g., items, subtests, or raters), the interrelationships of the parts, and the total test score variance. Also called Cronbach's alpha and, for dichotomous items, KR 20. See Internal consistency reliability.
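As an illustration of this definition, here is a minimal sketch of the coefficient alpha computation. The function name and the toy data are ours, invented for illustration; alpha is k/(k-1) times one minus the ratio of summed item variances to total-score variance.

```python
def cronbach_alpha(item_scores):
    """Coefficient alpha for rows of test takers' item scores.

    alpha = k/(k-1) * (1 - sum of item variances / total-score variance),
    where k is the number of parts (here, items). Sample (n-1) variances.
    """
    k = len(item_scores[0])  # number of items (parts)

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [var([row[i] for row in item_scores]) for i in range(k)]
    total_var = var([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Toy data: 4 test takers, 3 dichotomous (0/1) items. For dichotomous
# items, coefficient alpha is equivalent to KR-20.
scores = [[1, 1, 1],
          [1, 1, 0],
          [1, 0, 0],
          [0, 0, 0]]
print(cronbach_alpha(scores))  # prints 0.75
```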

cognitive assessment: The process of systematically collecting test scores and related data in order to make judgments about an individual's ability to perform various mental activities involved in the processing, acquisition, retention, conceptualization, and organization of sensory, perceptual, verbal, spatial, and psychomotor information.

cognitive labs: A method of studying the cognitive processes test takers use when completing a task such as solving a mathematics problem or interpreting a passage of text, typically involving examinees thinking aloud while responding to the task and/or responding to interview questions after completing the task.

cognitive science: The interdisciplinary study of learning and information processing.

comparability (score comparability): In test linking, the degree of score comparability resulting from the application of a linking procedure varies along a continuum that depends on the type of linking conducted. See alternate forms, equating, calibration, linking, moderation, projection, and vertical scaling.

composite score: A score that combines several scores according to a specified formula.

computer-administered test: A test administered by computer; test takers respond using a keyboard, mouse, or other response device.

computer-based mastery test: A test administered by computer that indicates whether or not the test taker has achieved a specified level of competence in a certain domain, rather than the test taker's degree of achievement in that domain. See mastery test.

computer-based test: See computer-administered test.

computer-prepared interpretive report: A programmed interpretation of a test taker’s test results, based on empirical data and/or expert judgment using various formats such as narratives, tables, and graphs. Sometimes referred to as automated score/narrative reports.

computerized adaptive test: An adaptive test administered by computer. See adaptive test.

concordance: In linking test scores for tests that measure similar constructs, the process of relating a score on one test to a score on another, so that the scores have the same relative meaning for a group of test takers.

conditional standard error of measurement: The standard deviation of measurement errors that affect the scores of test takers at a specified test score level.

confidence interval: An interval within which, with specified probability, the parameter of interest lies.

consequences: The outcomes, intended and unintended, of using tests in particular ways in certain contexts and with certain populations.

construct: Concept or characteristic the test is designed to measure.

construct domain: The set of interrelated attributes (e.g., behaviors, attitudes, values) that are included under a construct's label.

construct equivalence: 1. The extent to which the construct measured by one test is essentially the same as the construct measured by another test. 2. The degree to which a construct measured by a test in one cultural or linguistic group is comparable to the construct measured by the same test in a different cultural or linguistic group.

construct-irrelevant variance: Variance in test-taker scores that is attributable to extraneous factors that distort the meaning of the scores, and thereby, decrease the validity of the proposed interpretation.

construct underrepresentation: The extent to which a test fails to capture important aspects of the construct domain that the test is intended to measure resulting in test scores that do not fully represent that construct.

constructed-response items/tasks/exercises: An exercise or task for which test takers must create their own responses or products rather than choose a response from an enumerated set. Short-answer items require a few words or a number as an answer, whereas extended-response items require at least a few sentences and may include diagrams, mathematical proofs, essays, problem solutions such as network repairs, or other work products.

content domain: The set of behaviors, knowledge, skills, abilities, attitudes or other characteristics to be measured by a test, represented in detailed test specifications, and often organized into categories by which items are classified.

content standard: In educational assessment, a statement of content and skills that students are expected to learn in a subject matter area often at a particular grade or at the completion of a particular level of schooling.

content-related validity evidence: Evidence based on test content that supports the intended interpretation of test scores for a given purpose. Such evidence may address issues such as the fidelity of test content to performance in the domain in question and the degree to which test content representatively samples a domain such as a course curriculum or job.

convergent evidence: Evidence based on the relationship between test scores and other measures of the same or related construct.

credentialing: Granting to a person, by some authority, a credential, such as a certificate, license, or diploma, that signifies an acceptable level of performance in some domain of knowledge or activity.

criterion domain: The construct domain of a variable that is used as a criterion. See construct domain.

criterion-referenced score interpretation: The meaning of a test score for an individual or an average score for a defined group, indicating an individual's or group's level of performance in relationship to some defined criterion domain. Examples of criterion-referenced interpretations include comparison to cut scores, interpretations based on expectancy tables, and domain-referenced score interpretations. Contrast with norm-referenced score interpretation.

cross-validation: A procedure in which a scoring system for predicting performance, derived from one sample, is applied to a second sample in order to investigate the stability of prediction of the scoring system.

cut score: A specified point on a score scale, such that scores at or above that point are reported, interpreted, or acted upon differently from scores below that point.

differential item functioning: For a particular item in a test, a statistical indicator of the extent to which different groups of test takers who are at the same ability level have different frequency of correct responses or, in some cases, different rates of choosing various item options. Also known as DIF.

differential test functioning: Differential performance at the test or dimension level indicating that individuals from different groups who have the same standing on the characteristic assessed by a test do not have the same expected test score. Referred to as DTF.

discriminant evidence: Evidence indicating whether two tests interpreted as measures of different constructs are sufficiently independent (uncorrelated) to be considered two distinct constructs.

documentation: The body of literature (e.g., test manuals, manual supplements, research reports, publications, user's guides, etc.) developed by a test’s author, developer, test user, and publisher to support test score interpretations for their intended use.

domain/content sampling: The process of selecting test items, in a systematic way, to represent the total set of items measuring a domain.

effort: Extent to which a test taker appropriately participates in test taking.

empirical evidence: Evidence based on some form of data, as opposed to that based on logic or theory.

English language learner: An individual who is not yet proficient in English, including an individual whose first language is not English and a language minority individual just beginning to learn English, as well as an individual who has developed considerable proficiency in English. Related terms include limited English proficient (LEP), English as a second language (ESL), and culturally and linguistically diverse (CLD).

equated forms: Alternate forms of a test whose scores have been related through a statistical process known as equating, which allows scale scores on equated forms to be used interchangeably.

equating: A process for relating scores on alternate forms of a test so that they have essentially the same meaning. The equated scores are typically reported on a common score scale.
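One common family of equating methods can be illustrated with linear (mean/sigma) equating, which matches standardized scores across forms. The function name and the form means and standard deviations below are illustrative assumptions, not a description of any particular equating study.

```python
def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Map a score x on form X to the form-Y scale by matching
    standardized scores: y* = mean_y + (sd_y / sd_x) * (x - mean_x)."""
    return mean_y + (sd_y / sd_x) * (x - mean_x)

# Illustrative moments: form X has mean 50, SD 10; form Y has mean 52, SD 8.
# A score one SD above the mean on form X maps one SD above the mean on Y.
print(linear_equate(60.0, 50.0, 10.0, 52.0, 8.0))  # prints 60.0
print(linear_equate(50.0, 50.0, 10.0, 52.0, 8.0))  # prints 52.0
```

In practice, operational equating typically uses more elaborate designs (e.g., anchor items) and methods (equipercentile, IRT-based), but the goal is the same: scores with the same relative meaning across forms.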

equivalent forms: See alternate forms and parallel forms.

error of measurement: The difference between an observed score and the corresponding true score. See standard error of measurement, systematic error, random error and true score.

factor: Any variable, real or hypothetical, that is an aspect of a concept or construct.

factor analysis: Any of several statistical methods of describing the interrelationships of a set of variables by statistically deriving new variables, called factors, that are fewer in number than the original set of variables.

fairness: The validity of test score interpretations for intended use(s) for individuals from all relevant subgroups. A test that is fair minimizes the construct irrelevant variance associated with individual characteristics and testing contexts that otherwise would compromise the validity of scores for some individuals.

fake bad: Extent to which test takers exaggerate their responses (e.g., symptom over- endorsement) to test items in an effort to appear impaired.

fake good: Extent to which test takers exaggerate their responses (e.g., superior adjustment or excessive virtue) to test items to present themselves in an overly positive way.

false negative: An error of classification, diagnosis, or selection in which an individual does not meet the standard based on the assessment for inclusion in a particular group but in truth does (or would) meet the standard. See sensitivity and specificity.

false positive: An error of classification, diagnosis, or selection in which an individual meets the standard based on the assessment for inclusion in a particular group but in truth does not (or would not) meet the standard. See sensitivity and specificity.

field test: A test administration used to check the adequacy of testing procedures, and the statistical characteristics of new test items or new test forms. A field test is generally more extensive than a pilot test. See pilot test.

flag: An indicator attached to a test score, a test item, or other entity to indicate a special status. A flagged test score generally signifies a score obtained from a modified test resulting in a change in the underlying construct measured by the test. Flagged scores may not be comparable to scores that are not flagged.

formative assessment: An assessment process used by teachers and students during instruction that provides feedback to adjust ongoing teaching and learning with the goals of improving students' achievement of intended instructional outcomes.

gain score: In testing, the difference between two scores obtained by a test taker on the same test or two equated tests taken on different occasions, often before and after some treatment.

generalizability coefficient: An index of reliability/precision based on generalizability theory (G theory). A generalizability coefficient is the ratio of universe score variance to observed score variance, where the observed score variance is equal to the universe score variance plus the total error variance. See generalizability theory.

generalizability theory: A methodological framework for evaluating reliability/precision in which various sources of error variance are estimated through the application of the statistical techniques of analysis of variance. The analysis indicates the generalizability of scores beyond the specific sample of items, persons, and observational conditions that were studied.

group testing: Tests that are administered to groups of test takers, usually in a group setting, typically with standardized administration procedures and supervised by a proctor or test administrator.

growth models: Statistical models that measure students' progress on achievement tests by comparing the test scores of the same students over time. See value-added modeling.

high-stakes test: A test used to provide results that have important, direct consequences for individuals, programs, or institutions involved in the testing. Contrast with low-stakes tests.

holistic scoring: A method of obtaining a score on a test, or a test item, based on a judgment of overall performance using specified criteria. Contrast with analytic scoring.

individualized education program (IEP): A document that delineates special education services for a special-needs student and that includes any adaptations that are required in the regular classroom or for assessments and any additional special programs or services.

informed consent: The agreement of a person, or that person's legal representative, for some procedure to be performed on or by the individual, such as taking a test or completing a questionnaire.

intelligence test: A test designed to measure an individual's level of cognitive functioning in accord with some recognized theory of intelligence. See cognitive assessment.

interim assessments/tests: Assessments administered during instruction to evaluate students' knowledge and skills relative to a specific set of academic goals to inform policymaker or educator decisions at the classroom, school, or district level. (See benchmark assessments.)

internal consistency coefficient: An index of the reliability of test scores derived from the statistical interrelationships among item responses or scores on separate parts of a test. See coefficient alpha and split-halves reliability.

internal structure: In test analysis, the factorial structure of item responses or subscales of a test.

interpreter: Someone who facilitates cross-cultural communication by converting concepts from one language to another (including sign language).

inter-rater agreement/consistency: The level of consistency with which two or more judges rate the work or performance of test takers. See inter-rater reliability.

inter-rater reliability: Consistency in the rank ordering of ratings across raters. See inter-rater agreement.

intra-rater reliability: The degree of agreement among repetitions of a single rater in scoring test takers’ responses. Inconsistencies in the scoring process resulting from influences that are internal to the rater rather than true differences in test takers’ performances result in low intra-rater reliability.

inventory: A questionnaire or checklist that elicits information about an individual's personal opinions, interests, attitudes, preferences, personality characteristics, motivations, or typical reactions to situations and problems.

item: A statement, question, exercise, or task on a test for which the test taker is to select or construct a response, or perform a task. See prompt.

item characteristic curve: A mathematical function relating the probability of a certain item response, usually a correct response, to the level of the attribute measured by the item. Also called item response curve, item response function, or ICC.
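This definition can be illustrated with the two-parameter logistic (2PL) model, one common form of item response function; the function name and the parameter values below are illustrative assumptions.

```python
import math

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve:
    P(correct | theta) = 1 / (1 + exp(-a * (theta - b))),
    where a is the item's discrimination and b its difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When the test taker's level equals the item difficulty (theta == b),
# the 2PL probability of a correct response is exactly 0.5.
print(icc_2pl(theta=0.0, a=1.2, b=0.0))  # prints 0.5

# The curve increases monotonically in theta: higher-ability test takers
# have a higher probability of answering correctly.
print(icc_2pl(theta=2.0, a=1.2, b=0.0) > icc_2pl(theta=0.0, a=1.2, b=0.0))
```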

item context effect: Influence of item position, other items administered, time limits, administration conditions, etc., on item difficulty and other statistical item characteristics.

item pool/item bank: The collection or set of items from which a test or test scale's items are selected during test development, or the total set of items from which a particular subset is selected for a test taker during adaptive testing.

item response theory (IRT): A mathematical model of the functional relationship between performance on a test item, the test item’s characteristics, and the test taker's standing on the construct being measured.

job (or job classification): A group of positions that are similar enough in duties, responsibilities, necessary worker characteristics, and other relevant aspects that they may be properly placed under the same job title.

job analysis: The investigation of positions or job classes to obtain information about job duties and tasks, responsibilities, necessary worker characteristics (e.g., knowledge, skills, and abilities), working conditions, and/or other aspects of the work. See practice analysis.

job performance measurement: An incumbent's observed performance of a job that can be evaluated by a job sample test, an assessment of job knowledge, or ratings of the incumbent's actual performance on the job. See job sample test.

job sample test: A test of the ability of an individual to perform the tasks that comprise the job. See job performance measurement.

licensing: The granting, usually by a government agency, of an authorization or legal permission to practice an occupation or profession. See certification, credentialing.

linking (score linking): The process of relating scores on tests. See alternate forms, equating, calibration, moderation, projection, and vertical scaling.

local evidence: Evidence (usually related to reliability/precision or validity) collected for a specific test and a specific set of test takers in a single institution or at a specific location.

local norms: Norms by which test scores are referred to a specific, limited, reference population of particular interest to the test user (e.g., locale, organization, or institution); local norms are not intended to be representative of populations beyond that limited setting.

low-stakes test: A test used to provide results that have only minor or indirect consequences for individuals, programs, or institutions involved in the testing. Contrast with high-stakes test.

mastery/mastery test: A test designed to indicate whether a test taker has or has not attained a prescribed level of competence in a domain. See cut score, computer-based mastery test.

matrix sampling: A measurement format in which a large set of test items is organized into a number of relatively short item sets, each of which is randomly assigned to a sub-sample of test takers, thereby avoiding the need to administer all items to all test takers. Equivalence of the short item sets, or subsets, is not assumed.

meta-analysis: A statistical method of research in which the results from independent, comparable studies are combined to determine the size of an overall effect or the degree of relationship between two variables.

moderation: A process of relating scores on different tests so that scores have the same relative meaning (e.g., using a separate test that is administered to all test takers).

moderator variable: A variable that affects the direction or strength of the relationship between two other variables.

modification/modified tests: A change in test content, format (including response formats), and/or administration conditions that is made to increase accessibility for some individuals but that also affects the construct measured and, consequently, results in scores that differ in meaning from scores from the unmodified assessment.

neuropsychological assessment: A specialized type of psychological assessment of normal or pathological processes affecting the central nervous system and the resulting psychological and behavioral functions or dysfunctions.

norm-referenced score interpretation: A score interpretation based on a comparison of a test taker's performance to the distribution of performance in a specified reference population. Contrast with criterion-referenced score interpretation.

norms: Statistics or tabular data that summarize the distribution or frequency of test scores for one or more specified groups, such as test takers of various ages or grades, usually designed to represent some larger population, referred to as the reference population. See local norms.

operational use: The actual use of a test, after initial test development has been completed, to inform an interpretation, decision, or action, based, in part, upon test scores.

opportunity to learn: Whether test takers have been exposed to the tested constructs through their educational program and whether students have had exposure or experience with the language of instruction or the majority culture represented by the test.

parallel forms: In classical test theory, strictly parallel forms of tests are assumed to measure the same construct and to have exactly the same means and the same standard deviations in the populations of interest. See alternate forms.

percentile: The score on a test below which a given percentage of scores for a specified population occurs.

percentile rank: The percentage of scores in a specified score distribution that are below a given score.
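A minimal sketch of this definition follows; the function name and sample distribution are invented for illustration. Note that some reporting conventions also credit half of any scores tied with the given score, while this sketch uses the strict percentage-below definition stated here.

```python
def percentile_rank(score, distribution):
    """Percentage of scores in the distribution that fall below `score`.
    (Some conventions additionally add half the percentage of ties.)"""
    below = sum(1 for s in distribution if s < score)
    return 100.0 * below / len(distribution)

# Illustrative score distribution for 10 test takers.
score_dist = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]

# Six of the ten scores fall below 70, so 70 has a percentile rank of 60.
print(percentile_rank(70, score_dist))  # prints 60.0
```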

performance assessments: Assessments for which the test taker actually demonstrates the skills the test is intended to measure by doing tasks that require those skills.

performance level: Labels or brief statements classifying a test taker’s competency in a particular domain; usually defined by a range of scores on a test. For example, terms such as "basic" to "advanced," or "novice" to "expert," constitute broad ranges for classifying proficiency. See achievement levels, cut score, proficiency level descriptor, and standard setting.

performance level descriptor: Verbal descriptors of what test takers know and can do at specific performance levels.

performance standards: Descriptions of levels of knowledge and skill acquisition contained in content standards, as articulated through performance level labels (e.g., “basic”, “proficient”, “advanced”), statements of what test takers at different performance levels know and can do, and cut scores or ranges of scores on the scale of an assessment that differentiate levels of performance. See cut scores, performance level, performance level descriptor.

personality inventory: An inventory that measures one or more characteristics that are regarded generally as psychological attributes or interpersonal tendencies.

pilot test: A test administered to a sample of test takers to try out some aspects of the test or test items, such as instructions, time limits, item response formats, or item response options. See field test.

policy study: A study that contributes to judgments about plans, principles, or procedures enacted to achieve broad public goals.

portfolio: In assessment, a systematic collection of educational or work products that have been compiled or accumulated over time, according to a specific set of principles or rules.

position: In employment contexts, the smallest organizational unit, a set of assigned duties and responsibilities that are performed by a person within an organization.

practice analysis: An investigation of a certain occupation or profession to obtain descriptive information about the activities and responsibilities of the occupation or profession and about the knowledge, skills, and abilities needed to engage successfully in the occupation or profession. See job analysis.

precision of measurement: A general term that refers to the impact of measurement error on the outcome of the measurement. See standard error of measurement, error of measurement, reliability/precision.

predictive bias: The systematic under- or over-prediction of criterion performance for people belonging to groups differentiated by characteristics not relevant to the criterion performance.

predictive validity evidence: Evidence indicating how accurately test data collected at one time can predict criterion scores that are obtained at a later time.

proctor: A person responsible, during the test administration, for monitoring the testing process and implementing the test administration procedures.

program evaluation: The collection and synthesis of evidence about the use, operation, and effects of a program; the set of procedures used to make judgments about a program’s design, implementation, and outcomes.

projection: A method of score linking in which scores on one test are used to predict scores on another test for a group of test takers, often using regression methodology.

prompt/item prompt/writing prompt: The question, stimulus, or instructions that elicit a test taker’s response. See response interaction probability.

proprietary algorithms: Procedures, often computer code, used by commercial publishers or test developers that are not revealed to the public for commercial reasons.

psychodiagnosis: Formalization or classification of functional mental health status based on psychological assessment.

psychological assessment: An examination of psychological functioning that involves collecting, evaluating, and integrating test results and collateral information, and reporting information about an individual.

psychological testing: The use of tests or inventories to assess particular psychological characteristics of an individual.

random error: A non-systematic error; a component of test scores that appears to have no relationship to other variables.

random sample: A selection from a defined population of entities according to a random process, with the selection of each entity independent of the selection of other entities. See sample.

raw score: The score on a test that is often calculated by counting the number of correct answers, but more generally a sum or other combination of item scores.

reference population: The population of test takers that a set of test norms is intended to represent, permitting estimation of the test score distribution for that population. The reference population may be defined in terms of test taker age, grade, or clinical status at time of testing, or other characteristics. See norms.

reliability/precision: The degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable, and consistent for an individual test taker; the degree to which scores are free of random errors of measurement for a given group. See generalizability theory, classical test theory, precision of measurement.

reliability coefficient: A unit-free indicator that reflects the degree to which scores are free of random measurement error. See generalizability theory.

relevant subgroup: A subgroup of the population for which the test is intended that is identifiable in some way that is relevant to the interpretation of test scores for their intended purposes.

response bias: A test taker's tendency to respond in a particular way or style to items on a test (e.g., acquiescence, choice of socially desirable options, choice of 'true' on a true-false test) that yields systematic, construct-irrelevant error in test scores.

response format: The mechanism that a test taker uses to respond to the test item, by either selecting from a list of options (multiple-choice questions), providing a written response (fill-in, verbal or written response to an open-ended or constructed-response question), or other responses (oral response, physical performance).

response interaction probability (RIP): 1. The probability of an interactive response to a test item. 2. The probability that a prompt will elicit an interactive response. See prompt.

response protocol: A record of the responses given by a test taker to a particular test.

restriction of range or variability: Reduction in the observed score variance of a test-taker sample, compared to the variance of the entire test-taker population, as a consequence of constraints on the process of sampling test takers. See adjusted validity/reliability coefficient.

retesting: A repeat administration of a test; either the same test or an alternate form, sometimes with additional training or education between administrations.

rubric: See scoring rubric.

sample: A selection of a specified number of entities called sampling units (test takers, items, etc.) from a larger specified set of possible entities, called the population. See random sample and stratified random sample.

scale: 1. The system of numbers, and their units, by which a value is reported on some dimension of measurement. 2. In testing, the set of items or subtests used to measure a specific characteristic (e.g., a test of verbal ability or a scale of extroversion-introversion).

scale score: A score obtained by transforming raw scores. Scale scores are typically used to facilitate interpretation.

scaling: The process of creating a scale or a scale score in order to enhance test score interpretation by placing scores from different tests or test forms onto a common scale or by producing scale scores designed to support score interpretations. See scale.

school district: A local education agency administered by a public board of education or other public authority within a State that oversees public elementary or secondary schools in a political subdivision of a State.

score: Any specific number resulting from the assessment of an individual, such as a raw score, scale score, estimate of a latent variable, a production count, an absence record, a course grade, or a rating.

scoring rubric: The established criteria, including rules, principles, and illustrations, used in scoring constructed responses to individual tasks and clusters of tasks.

screening test: A test that is used to make broad categorizations of test takers as a first step in selection decisions or diagnostic processes.

selection: The acceptance or rejection of applicants for a particular educational or employment opportunity.

sensitivity: In classification, diagnosis, and selection, the proportion of cases that in truth meet the criteria that are correctly assessed or predicted to meet the criteria.

specificity: In classification, diagnosis, and selection, the proportion of cases that in truth do not meet the criteria that are correctly assessed or predicted not to meet the criteria.
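The two proportions above follow directly from classification counts. A minimal sketch in Python, using the standard true-positive/false-negative formulation; the counts and function names are hypothetical, not taken from the glossary:

```python
def sensitivity(true_pos, false_neg):
    # Proportion of cases that truly meet the criteria
    # that were correctly predicted to meet them.
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    # Proportion of cases that truly do not meet the criteria
    # that were correctly predicted not to meet them.
    return true_neg / (true_neg + false_pos)

# Hypothetical screening results: 40 true positives, 10 false negatives,
# 45 true negatives, 5 false positives.
sens = sensitivity(40, 10)   # 0.8
spec = specificity(45, 5)    # 0.9
```

Note that both indices condition on the true status of the cases, not on the prediction; a screening test can have high sensitivity yet yield many false positives when the condition is rare.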

speededness: The extent to which test takers’ scores are dependent upon the rate at which work is performed as well as the correctness of the responses. The term is not used to describe tests of speed.

split-halves reliability coefficient: An internal consistency coefficient obtained by using half the items on the test to yield one score and the other half of the items to yield a second, independent score. See internal consistency, coefficient alpha.

stability: The extent to which scores on a test are essentially invariant over time, as assessed by correlating the test scores of a group of individuals with scores on the same test or an equated test taken by the same group at a later time. See test-retest reliability.

standard error of measurement: The standard deviation of an individual's observed scores from repeated administrations of a test (or parallel forms of a test) under identical conditions. Because such data cannot generally be collected, the standard error of measurement is usually estimated from group data. See error of measurement.
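One common group-based estimate uses the score standard deviation and a reliability coefficient: SEM = SD × √(1 − reliability). A minimal sketch with illustrative numbers (not from the glossary):

```python
import math

def standard_error_of_measurement(sd, reliability):
    # Group-based estimate: SEM = SD * sqrt(1 - reliability)
    return sd * math.sqrt(1 - reliability)

# Hypothetical scale with SD = 15 and reliability = 0.91
sem = standard_error_of_measurement(15, 0.91)  # ≈ 4.5
```

On this hypothetical scale, an observed score of 100 would carry an approximate band of ±4.5 points of random error around it.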

standard setting: The process, often judgment-based, of setting cut scores using a structured procedure that seeks to determine cut scores that define different levels of performance as specified by performance levels and performance level descriptors.

standardization: 1. In test administration, maintaining a consistent testing environment and conducting the test according to detailed rules and specifications, so that testing conditions are the same for all test takers on the same and multiple occasions. 2. In test development, establishing norms based on the test performance of a representative sample of individuals from the population with which the test is intended to be used.

standards-based assessment: Assessment of an individual’s standing with respect to systematically described content and performance standards.

stratified random sample: A set of random samples, each of a specified size, from several different sets, which are viewed as strata of the population. See random sample, sample.
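A stratified random sample amounts to drawing an independent simple random sample within each stratum. A hypothetical illustration (the strata, sizes, and function name are invented for the example):

```python
import random

def stratified_sample(strata, n_per_stratum, seed=0):
    # strata: dict mapping stratum name -> list of sampling units.
    # Draw an independent simple random sample from each stratum.
    rng = random.Random(seed)
    return {name: rng.sample(units, n_per_stratum)
            for name, units in strata.items()}

# Hypothetical population of students stratified by grade
population = {"grade 4": list(range(100)),
              "grade 5": list(range(100, 200))}
sample = stratified_sample(population, n_per_stratum=5)
```

Drawing within strata guarantees each stratum its specified representation, which a single simple random sample of the whole population does not.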

summative assessment: The assessment of a test taker’s knowledge and skills typically carried out at the completion of a program of learning, such as the end of an instructional unit.

systematic error: An error that consistently increases or decreases the scores of all test takers or some subset of test takers, but is not related to the construct that the test is intended to measure. See bias.

technical manual: A publication prepared by test developers and publishers to provide technical and psychometric information about a test.

test: An evaluative device or procedure in which a systematic sample of a test taker’s behavior in a specified domain is obtained and scored using a standardized process.

test design: The process of developing detailed specifications for what a test is to measure and the content, cognitive level, format, and types of test items to be used.

test developer: The person(s) or organization responsible for the design and construction of a test and for the documentation regarding its technical quality for an intended purpose.

test development: The process through which a test is planned, constructed, evaluated, and modified, including consideration of content, format, administration, scoring, item properties, scaling, and technical quality for its intended purpose.

test documents: Documents such as test manuals, technical manuals, user's guides, specimen sets, and directions for test administrators and scorers that provide information for evaluating the appropriateness and technical adequacy of a test for its intended purpose.

test form: A set of test items or exercises that meet requirements of the specifications for the testing program. Many testing programs use alternate test forms, each built according to the same specifications but with some or all of the test items unique to each form. See alternate forms.

test format/mode: The manner in which the test content is presented to the test taker, such as in paper-and-pencil, via a computer terminal, through the internet, or verbally by an examiner.

test information function: A mathematical function relating each level of an ability or latent trait, as defined under item response theory (IRT), to the reciprocal of the corresponding conditional measurement error variance.

test manual: A publication prepared by test developers and publishers to provide information on test administration, scoring, and interpretation and to provide selected technical data on test characteristics. See user's guide and technical manual.

test modification: Changes made in the content, format, and/or administration procedure of a test to increase the accessibility of the test for test takers who are unable to take the original test under standard testing conditions. In contrast to test accommodations, test modifications change the construct being measured by the test to some extent and hence change score interpretations. See adaptation, test adaptation, modification, modified tests. Contrast with accommodation / accommodated tests or assessments.

test publisher: An entity, individual, organization, or agency that produces and/or distributes a test.

test-retest coefficient/reliability: A reliability coefficient obtained by administering the same test a second time to the same group after a time interval and correlating the two sets of scores; typically used as a measure of stability of the test scores. See stability.

test security: Protecting the content of a test from unauthorized release or use, and protecting the integrity of the test scores so that they are valid for their intended use.

test specifications: Documentation of the purpose and intended uses of the test, as well as the content, format, test length, psychometric characteristics of the items and test, delivery mode, administration, scoring, and score reporting.

test-taking strategies: Strategies that test takers might use while taking the test to improve their performance, such as time management or the elimination of obvious incorrect options on a multiple-choice question before responding to the question.

test user: The person(s) or entity responsible for the choice and administration of a test, for the interpretation of test scores produced in a given context, and for any decisions or actions that are based, in part, on test scores.

timed test: A test administered to test takers who are allotted a prescribed amount of time to respond to the test.

top-down selection: Selecting applicants on the basis of rank-ordered test scores, from highest to lowest.

true score: In classical test theory, the average of the scores that would be earned by an individual on an unlimited number of strictly parallel forms of the same test.

unidimensional: Describing a test that measures only one dimension or only one latent variable.

universal design: An approach to assessment development that attempts to maximize the accessibility of a test for all of its intended test takers.

universe score: In generalizability theory, the expected value over all possible replications of the measurement procedure for the test taker. See generalizability theory.

user norms: Descriptive statistics (including percentile ranks) for a group of test takers that does not represent a well-defined reference population, for example, all persons tested during a certain period of time, or a set of self-selected test takers. See local norms; norms.

user's guide: A publication prepared by test developers and publishers to provide information on a test's purpose, appropriate uses, proper administration, scoring procedures, normative data, interpretation of results, and case studies. See test manual.

validation: The process through which the validity of the proposed interpretation of test scores for their intended uses is investigated.

validity: The degree to which accumulated evidence and theory support a specific interpretation of test scores for a given use of a test. If multiple interpretations of a test score for different uses are intended, validity evidence for each interpretation is needed.

validity argument: An explicit justification of the degree to which accumulated evidence and theory support the proposed interpretation(s) of test scores for their intended uses.

validity generalization: Applying validity evidence obtained in one or more situations to other similar situations on the basis of methods such as meta-analysis.

value-added modeling: A collection of complex statistical techniques that use multiple years of student outcome data, typically standardized test scores, to estimate the contribution of individual schools or teachers to student performance. See growth models.

variance components: Variances accruing from the separate constituent sources that are assumed to contribute to the overall variance of observed scores. Such variances, estimated by methods of the analysis of variance, often reflect situation, location, time, test form, rater, and related effects. See generalizability theory.

vertical scaling: In test linking, the process of relating scores on tests that measure the same construct but differ in difficulty, typically used with achievement and ability tests with content or difficulty that spans a variety of grade or age levels.

vocational assessment: A specialized type of psychological assessment designed to generate hypotheses and inferences about interests, work needs and values, career development, vocational maturity, and indecision.

weighted scores/scoring: A method of scoring a test in which a number of points is awarded for a correct (or diagnostically relevant) response. In some cases, the scoring formula awards more points for one response to an item than for another response.
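A weighted scoring rule can be sketched as a lookup of option-level point values, item by item. The items, options, and weights below are hypothetical, purely for illustration:

```python
def weighted_score(responses, option_weights):
    # responses: the option chosen for each item, in item order.
    # option_weights[i]: dict mapping each option of item i to its points;
    # unlisted options earn 0.
    return sum(option_weights[i].get(resp, 0)
               for i, resp in enumerate(responses))

# Item 1: "A" is correct (1 point). Item 2: "B" is the best answer
# (2 points) and "C" is partially correct (1 point).
weights = [{"A": 1, "B": 0},
           {"A": 0, "B": 2, "C": 1}]
score = weighted_score(["A", "C"], weights)  # 1 + 1 = 2
```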