Glossary of Important Assessment and Measurement Terms
The purpose of this glossary is to provide an accessible dictionary of assessment-related terms for use by a non-technical, lay audience. The intention of the glossary is to offer definitions for “technical terms” or “specialized jargon” in a way that might be readily understood by those who have not studied in the field or have not developed expertise through work experience in the field.
To find the meaning of a word or phrase, use the A–Z listing shown on this page or enter the word in the search box provided on this page. If the term is in the glossary, the term and its definition will appear immediately below the search box. If the term is not in the glossary, nothing will appear below the search box. In that case, try again by entering only the first few letters or by using the A–Z listing, in case the word was misspelled.
A psychological characteristic or trait that usually is the target when testing. Examples are achievement, intelligence, academic aptitude, and attitude. (See also construct).
A change made in a standard test-administration procedure to reduce or remove the influence of a test taker's disability on the assessment process. Examples include extended testing time limits and having certain tests read aloud. When implemented appropriately, such changes do not alter the meaning of the scores. (See also modification).
A program, often legislated, that attributes the responsibility for student learning to teachers, school administrators, and/or students. Test results typically are used to judge accountability, and often consequences are imposed for shortcomings.
The extent of knowledge or skill possessed by a student within some specific area of the school curriculum, such as mathematics, science, or writing.
A set of ordered descriptions of levels of competence or achievement (for example, low to high) that is used for classifying student test performance. An example of one set is: "Basic", "Proficient", and "Advanced". (See also proficiency levels and performance levels).
A computer-administered test in which the next item or set of items selected to be administered depends on the correctness of the test taker’s responses to the most recent items administered.
adequate yearly progress (AYP)
The amount of annual achievement growth expected of students in a particular school, district, or state under the U.S. federal accountability system, No Child Left Behind (NCLB).
The chronological age of students in a given population for whom a given test score is the median (middle) score. For example, if Jim obtained an age-equivalent score of 10-2, this means the typical student getting that same raw score was 10 years and two months old. (See also grade-equivalent score).
The extent to which the content and cognitive demands of an assessment tool are consistent with (or match) those given in a set of content standards or benchmarks that describe the curriculum with which the assessment was designed to be used.
An assessment used as a substitute for certain students who are unable, generally because of disabilities, to take the one given to most students.
Two or more versions of the same test that have been demonstrated to be interchangeable in what they measure and how well they measure it. Also known as equivalent forms, parallel forms, or comparable forms.
A method of scoring work products, like writing, or performances, like a musical selection, in which specific aspects of the performance are rated and given separate scores. For example, an assessment of descriptive writing might provide separate scores for organization, use of sensory detail, and mechanics of writing. (See also holistic scoring.)
A written paper or work document that is considered representative in quality of a particular score point on the score scale being used by judges or raters during scoring.
A subset of items, common to two or more test forms, that are administered for purposes of score equating. (See also equating.)
A test designed and used to predict how well someone might perform in a certain ability area in the future. Examples include scholastic, musical, clerical, verbal, and mechanical aptitude.
A tool or method of obtaining information from tests or other sources about the achievement or abilities of individuals. Often used interchangeably with test.
An assessment containing items that are judged to be measuring the ability to apply and use knowledge in real-world contexts.
Scoring of test responses by mechanical or electronic means without direct human observation of the responses, used with scoring both objective and subjective tests.
battery of tests
A set of separate tests, usually administered as a unit, that measure distinct but related abilities or skills. Resulting scores are often used to identify relative strengths and weaknesses.
Short assessments used by teachers at various times throughout the school year to monitor student progress in some area of the school curriculum. These also are known as interim assessments. (See also formative use of assessments.)
Systematic errors in test content, test administration, and/or scoring procedures that can cause some test takers to get either lower or higher scores than their true ability would merit. The source of the bias is irrelevant to the trait the test is intended to measure. (See also error score).
A credentialing test used to determine whether individuals are knowledgeable enough in a given occupational area to be labeled “competent to practice” in that area. (See also licensure testing).
The score a test taker would be expected to obtain if all responses were made randomly, or strictly by guessing. It is most relevant in testing situations in which examinees respond by selecting from a set of given options, as in multiple-choice testing.
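As a rough illustration of the expected score under pure guessing, the sketch below assumes every item is worth one point and offers the same number of options, exactly one of which is correct. The function name and the example values are hypothetical.

```python
def chance_score(num_items: int, options_per_item: int) -> float:
    """Expected number of correct answers if every response is a random guess.

    Assumes one point per item and exactly one correct option per item.
    """
    return num_items / options_per_item


# A 40-item test with four options per item
print(chance_score(40, 4))  # 10.0
```

Under these assumptions, a score near 10 on such a test is about what guessing alone would be expected to produce.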
classical test theory
A theory of testing based on the idea that a person’s observed or obtained score on a test is the sum of a true score (error-free score) and an error score.
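The decomposition above can be shown with a small numeric sketch. The true score and the error values are made up, and the errors are chosen to sum to zero, which is an idealization: in practice the true score and the error score are never observed directly.

```python
# Hypothetical true score for one test taker
true_score = 80

# Hypothetical random errors on four separate testing occasions
error_scores = [-3, 2, 1, 0]

# Observed score = true score + error score
observed_scores = [true_score + e for e in error_scores]
print(observed_scores)  # [77, 82, 81, 80]

# Because these errors sum to zero, the average observed score
# equals the true score in this idealized example.
mean_observed = sum(observed_scores) / len(observed_scores)
print(mean_observed)  # 80.0
```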
Planned, short-term activities for prospective test takers that are intended to maximize the scores of those individuals on an upcoming test. Activities may include new learning, practice with previous learning, and using test-taking strategies.
The ability of an individual to perform the various mental activities most closely associated with learning and problem solving. Examples include verbal, spatial, psychomotor, and processing-speed ability.
(See equivalent forms)
Scores from two or more tests that might reasonably be compared, or used interchangeably, because the tests have been shown to measure similar content and skills with about the same level of accuracy. (See also equating).
A score formed by combining (for example, adding or averaging) the scores from several tests to obtain a total score. Most often the tests are part of a test battery. (See also battery of tests).
computer adaptive testing (CAT)
(See adaptive test.)
A test administered on a computer rather than with paper and pencil: items are displayed on a screen and the test taker responds using a keyboard, mouse, or similar device.
Information gathered in the process of validation to show the extent to which scores from one test might be used in place of, or interchangeably with, those from another test. (See also criterion-related evidence and validity.)
confidence band (interval)
A range of scores that indicates, with a certain degree of probability, where a person’s true score (containing no random errors) lies. A confidence band is formed with a person’s observed score ± the standard error of measurement.
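A minimal sketch of the band described above, with hypothetical numbers. The `num_sems` parameter is an assumption added for illustration: it widens the band to more than one standard error of measurement, which score reports sometimes do to obtain a higher degree of probability.

```python
def confidence_band(observed_score: float, sem: float, num_sems: float = 1.0):
    """Return (lower, upper) limits: observed score plus or minus num_sems * SEM."""
    margin = num_sems * sem
    return (observed_score - margin, observed_score + margin)


# Observed score of 100 with a standard error of measurement of 3
print(confidence_band(100, 3))  # (97.0, 103.0)
```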
Evidence gathered in a validation process to examine the positive and negative implications of using scores from a test or testing program to support certain decisions (often policy related). (See also validity.)
The psychological trait or characteristic that an assessment tool has been designed to measure. Examples include achievement, cognitive ability, and interests.
Situations in which the scores of test takers are influenced, positively or negatively, by factors that are different from those the test is intended to measure. For example, when the reading requirements for a science test interfere with the ability of some students to respond, reading comprehension is considered an irrelevant construct that diminishes the meaning of the science scores obtained.
A deficiency in the alignment between a test’s content and what the content should be; a situation in which certain important aspects of the content of a test are missing or only included to a lesser extent than the test specifications require.
Information gathered to show that a score on a certain test is a measure of the construct intended by the developer or is not a measure of some competing construct. This validation evidence might be in the form of correlations with other variables, factor analyses, internal-consistency reliability coefficients, observations of response processes, or judgments about the relevance of test content. (See also validity.)
A test item in which the responders must create a response or product rather than choose a response from a set supplied with the item. A short-answer item, mathematics problem, and writing sample are examples.
The entire realm of behaviors, knowledge, skills, abilities, or other characteristics that a particular test is intended to measure, as reflected by its test specifications, and about which the scores are intended to be generalized.
A statement or goal that describes something a student is expected to know in a certain subject matter area at the completion of a particular grade or level of schooling.
A statistic used to show how the scores from one measure relate to scores on a second measure for the same group of individuals. A high value (approaching +1.00) is a strong direct relationship, a low negative value (approaching -1.00) is a strong inverse relationship, and values near 0.00 indicate little, if any, relationship.
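One common statistic of this kind is the Pearson correlation coefficient; the glossary does not name a specific formula, so treating it as the intended one is an assumption. The sketch below computes it for two small, made-up sets of paired scores.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of paired scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from each mean
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each measure
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


test_a = [1, 2, 3, 4, 5]
print(pearson_r(test_a, [2, 4, 6, 8, 10]))  # 1.0  (strong direct relationship)
print(pearson_r(test_a, [10, 8, 6, 4, 2]))  # -1.0 (strong inverse relationship)
```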
Scores used in obtaining criterion-related validity evidence (concurrent evidence or predictive evidence) for a set of scores. (See also criterion-related evidence.)
criterion-referenced score interpretation
An interpretation that involves comparing a test taker’s score with an absolute standard, or an ordered set of performance descriptions, rather than scores of other individuals. Comparisons with a cut score, an expectancy table, or an ordered set of behavior descriptions are examples. These are contrasted with norm-referenced score interpretations.
Information gathered to support the argument that a test does measure the same thing as some other instrument or that it does not measure the same thing as some other particular instrument. Scores from the “other” instrument are referred to as criterion scores. (See also validity.)
The point on a score scale that differentiates the interpretations made about those scoring above it from those scoring below it. Pass-fail, accepted-rejected, and proficient-not proficient are examples. Cut scores also are known as cutoff scores.
The extent to which decisions based on the scores from one measure are the same as those made with scores from a comparable measure, or the same measure on another occasion. An index of agreement is often used to indicate the extent of consistency.
A score scale to which raw scores are converted to enhance their interpretation. Examples are percentile ranks, standard scores, and grade-equivalent scores.
A characteristic such as gender, race/ethnicity, geographic residence, poverty index, or socio-economic status, that is often used to classify individuals when interpreting or reporting test scores of subgroups.
developmental standard score
A type of derived score that describes the level of growth or development represented by an individual’s test performance.
A process of gathering information about an individual to permit placing the person’s behavior into categories or classifications, which are then used to design interventions or treatments.
differential item functioning (DIF)
A statistical characteristic of an item that shows the extent to which the item might be measuring different abilities for members of separate subgroups. Average item scores for subgroups having the same overall score on the test are compared to determine whether the item is measuring in essentially the same way for all subgroups. The presence of DIF requires review and judgment, and it does not necessarily indicate the presence of bias.
A property of a test item, usually represented by the proportion of a group that correctly answers the item, obtained from a specific group on a single occasion. The term is also used in item response theory (IRT) models (the “b” parameter) to locate the item on the ability scale: it is the ability level at which test takers have a specified probability (50 percent in the simplest models) of answering the item correctly.
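The proportion-correct form of item difficulty can be sketched as follows, assuming each response is scored 1 (correct) or 0 (incorrect). The response data are invented for illustration.

```python
def item_difficulty(item_scores):
    """Proportion of the group that answered the item correctly (scores are 0 or 1)."""
    return sum(item_scores) / len(item_scores)


# Ten test takers: 1 = correct, 0 = incorrect
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
print(item_difficulty(responses))  # 0.7
```

Note that with this index a higher value means an easier item for the group tested, since more test takers answered it correctly.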
The ability of a test item to differentiate individuals who know from those who don’t know, those who have a skill from those who don’t have the skill, or those who understand from those who don’t understand whatever the test item is intended to measure. Several different indexes are used to represent item discrimination. The term is also used in item response theory (IRT) models (the “a” parameter) to represent the extent to which the probability of a correct response changes as ability level increases or decreases.
The incorrect options that are listed with the keyed response in a multiple-choice or other selected-response test item. Sometimes called foils.
The universe or entire set of behaviors or content knowledge that a test is intended to measure and that its scores should be interpreted to represent. Most tests contain only a sample of the domain they are intended to be measuring. (See also content domain.)
English Language Learner (ELL)
A student whose native language is not English and who is in the early stages of learning English while studying other subject areas of the school curriculum.
The process of placing scores from two or more parallel test forms onto a common score scale. The result is that scores from two different test forms can be compared directly, or treated as though they came from the same test form. When the tests are not parallel, the general process is called linking.
(See parallel forms.)
In classical test theory, the score that represents the random error portion of a test taker’s obtained or observed score. The other portion is the true score. (See also classical test theory.)
The process of gathering information to make a judgment about the quality or worth of some program or performance. The term also is used to refer to the judgment itself, as in “My evaluation of his work is . . . .”
The variability in test scores that occurs among individuals in a group because of differences in those persons that are irrelevant to what the test is intended to measure. For example, a science test that requires mathematics skills and a reading ability beyond what its content domain specifies will have two sources of extraneous variance. In this case, students’ science scores might differ, not only because of differences in their science achievement, but also because of differences in their (extraneous) mathematics and reading abilities. (See also construct irrelevance.)
field test (tryout)
A test administration used during the test development process to check on the quality and appropriateness of test items, administration procedures, scoring, and/or reporting. Sometimes the field-test items are included as part of an operational test administration.
(See test forms.)
formative use of assessments
The use of assessments during the instructional process to monitor the progress of learning and the effectiveness of instruction so that adjustments can be made, as needed. This use is contrasted with the summative use of assessments.
grade-equivalent score (GE)
A developmental score that describes the typical grade and month in school for students who obtained a particular raw score. A student who is assigned a grade equivalent of 5.3, for example, obtained the same raw score on the test as the typical student in the third month of fifth grade.
A testing program for which there are important consequences for students, teachers, administrators, or schools, based on the level of scores students attain. Promotion, salary level, job retention, and required school reorganization are examples of such consequences.
A method of scoring work products, like writing, or performances, like a musical selection, in which all aspects of the performance are judged collectively and the overall performance is assigned a single score. (See also analytical scoring.)
individualized educational plan (IEP)
A written plan to document particular goals and actions that should be incorporated into a given student’s instructional program. Aspects of the plan include learning objectives, methods of instruction, and procedures for assessing student progress. Generally, such plans are required for students who receive special education services.
A psychological construct that represents the general cognitive-functioning ability of an individual. The exact meaning of the term varies according to the several theories of intelligence that psychologists tend to recognize.
intelligence quotient (IQ)
Historically, a score obtained by dividing a person’s mental age score, obtained by administering an intelligence test, by the person’s chronological age, both expressed in terms of years and months. The resulting fraction is multiplied by 100 to obtain the IQ score.
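The historical calculation described above can be sketched with ages converted to months so that years and months are handled together. The helper names and the example ages are hypothetical.

```python
def to_months(years: int, months: int) -> int:
    """Convert an age given in years and months to total months."""
    return years * 12 + months


def ratio_iq(mental_age_months: int, chronological_age_months: int) -> float:
    """Historical ratio IQ: mental age divided by chronological age, times 100."""
    return 100 * mental_age_months / chronological_age_months


# Mental age of 12 years 0 months, chronological age of 10 years 0 months
print(ratio_iq(to_months(12, 0), to_months(10, 0)))  # 120.0
```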
An assessment tool, usually in the form of a self-report questionnaire or checklist, that is used to obtain information about a person’s attitudes, interests, preferences, or some other personality trait.
The label, often used informally, to refer to scores from an intelligence test or a test of cognitive ability.
A question, exercise, task, or statement on a test for which the test taker is asked to select a response, create a response, or perform an activity that will be scored.
A procedure used by test developers to examine the quality of an item prior to its selection for use on a test, or to determine how the item might be revised before its subsequent selection. Often statistical properties such as difficulty, discrimination, and DIF are evaluated in the process.
Statistics that summarize the performance on a single test item by particular groups. (See also norms.)
item response theory (IRT)
A theory of testing based on the relationship between individuals’ performances on a test item and the test takers’ levels of performance on an overall measure of the ability that item was designed to measure. Several different statistical models are used to represent both item and test taker characteristics.
keyed response (option)
The response to an item for which the test taker is awarded the maximum possible score on that item. It is also known as the correct response. Sometimes called the key.
The use of tests to obtain information for making a decision about whether to award an individual a license to practice in a particular field, such as medicine, law, or real estate. (See also certification testing.)
The process of placing scores from two test forms onto a common score scale. When the test forms are parallel, the specific process is called equating.
Scores (norms) obtained from a local group of test takers, as opposed to a national group, for purposes of making norm-referenced score interpretations. The local group often is a school district, but it could be a state or a meaningful collection of districts with which comparisons are desired.
The average score obtained by some identified group. All scores are added and the sum is divided by the number of scores. (See also mode and median.)
The process of assigning a number to a person, or a person’s trait, according to specified rules. Often the rules involve using a test and counting the number of items each person answered correctly. That number represents how much of the trait the person has, and it can be compared with other information to obtain further meaning about their performance.
The score above which and below which exactly half of the scores in a certain group are located when the scores are placed in order from high to low. It also is known as the middle score or 50th percentile.
The score(s) obtained by the largest number of individuals in a group. It is the most-frequently-occurring score(s) within a given set of scores.
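The mean, median, and mode can be compared on one small set of scores. This sketch uses Python's standard `statistics` module; the scores are made up.

```python
from statistics import mean, median, multimode

scores = [55, 70, 80, 80, 90, 95, 97]

# Mean: sum of all scores divided by the number of scores
print(mean(scores))       # 81

# Median: the middle score when scores are placed in order
print(median(scores))     # 80

# Mode(s): the most frequently occurring score(s)
print(multimode(scores))  # [80]
```

`multimode` returns a list because, as the definition notes, more than one score can be tied for most frequent.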
A change in a standard test-administration procedure, or the test content and presentation, to reduce or remove the influence of a test taker's disability on the assessment process. In contrast to an accommodation, a modification changes the nature of the trait being measured by the assessment. Scores from a modified assessment should not be compared to or combined with the scores of students who have taken the original assessment. Examples include tests converted to Braille, reading tests that are read aloud to students, and mathematics estimation tests with which a calculator is used by the test taker.
An assessment that has been changed in content, mode of display, or administration to make it more accessible to students who are unable to respond to the original version, generally because of a disability. Scores from a modified assessment should not be compared with or combined with those from the original assessment.
National Assessment of Educational Progress (NAEP)
An assessment program funded by the U.S. federal government to monitor the change over time in achievement of students nationally in selected grades and subject areas. Scores of individual students and schools are not reported, but those of states and various national subgroups are reported.
Scores (norms) obtained from a national group of test takers, as opposed to a local or state group, for purposes of making norm-referenced score interpretations. The national group often is a carefully selected sample of school districts and buildings, chosen to collectively represent key demographic characteristics of the national population of students at the time the scores are obtained.
No Child Left Behind (NCLB)
The U.S. federal reauthorization of Title I of the Elementary and Secondary Education Act of 1965. It is a state-level accountability program designed to narrow the gap in achievement between subgroups of students who are regarded as disadvantaged versus those who are not.
norm-referenced score interpretation
An interpretation that involves comparing a test score with the scores of individuals in some identifiable group, known as the norm group. Norm-referenced scores are contrasted with criterion-referenced score interpretations, which involve comparisons with an absolute standard or an ordered set of performance descriptions. Also, the average score of a group might be compared with the averages of other groups with which meaningful comparisons are desired by the user.
A type of standard score having a mean of 50 and a standard deviation of about 21. These scores are used most often to show change or growth for program evaluation purposes. Unlike percentile ranks, which they resemble, it is appropriate to do arithmetic computations with normal-curve equivalents.
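A sketch of how a percentile rank might be converted to a normal-curve equivalent. It assumes scores are normally distributed and uses the conventional standard deviation of 21.06 (the "about 21" in the definition), chosen so that NCEs of 1, 50, and 99 line up with percentile ranks of 1, 50, and 99. The function name is hypothetical.

```python
from statistics import NormalDist


def nce_from_percentile_rank(percentile_rank: float) -> float:
    """Convert a percentile rank (1 to 99) to a normal-curve equivalent.

    Assumes a normal distribution; 21.06 is the conventional NCE standard deviation.
    """
    z = NormalDist().inv_cdf(percentile_rank / 100)  # z-score for that percentile
    return 50 + 21.06 * z


for pr in (1, 25, 50, 75, 99):
    print(pr, round(nce_from_percentile_rank(pr), 1))
```

The two scales agree at 1, 50, and 99 but differ in between, which is why NCEs and percentile ranks should not be treated as interchangeable.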
A graphic display of scores for a large group that has the shape of a bell: many persons have scores in the middle and a much smaller number have very high or very low scores. Many physical and psychological characteristics demonstrate a normal, or bell-shaped, distribution when shown graphically. (It is often called the normal curve.)
A collection of scores from some group that describes how the group performed on a particular test. The set of scores is used as a basis of comparison, to permit norm-referenced interpretations, in trying to obtain meaning from a test taker’s score. The different kinds of collections are often named after the group they represent: national norms for students nationwide, local norms for students in a local school district, and Texas norms for all students in the state of Texas. The norms also can be labeled by grade level, gender, or even region of the country (as Midwest norms). (See also norm-referenced score interpretation.)
A test containing items that can be scored without any personal interpretation (subjectivity) required on the part of the scorer. Tests that contain multiple choice, true-false, and matching items are examples. (See also subjective test.)
The use of an achievement test that is designed for students either above or below the actual grade level of the student who takes it. Often a lower-level test is used with students who are working on or receiving instruction on content well below that of their grade peers. Sometimes a higher-level test is used with students who are achieving well above their grade peers in most areas of the school curriculum.
Two or more forms of a test that have been demonstrated (statistically) to measure the same ability, with the same degree of accuracy. They yield scores for an individual that are interchangeable in meaning. Also known as alternate forms, equivalent forms, or comparable forms.
A score that shows the percent of total possible score points a test taker obtained. It is the test taker’s raw score divided by the perfect score and multiplied by 100. For example, the percent-correct score for a student who got 15 right on a 25-item test is 60%. It is often confused with a percentile-rank score.
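The calculation in the example above, written out as a short sketch. The function name is hypothetical.

```python
def percent_correct(raw_score: float, perfect_score: float) -> float:
    """Raw score divided by the perfect (maximum possible) score, times 100."""
    return 100 * raw_score / perfect_score


# 15 right on a 25-item test
print(percent_correct(15, 25))  # 60.0
```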
The test score below which a certain percent of a norm group has scored. For example, if a test taker’s score of 25 is at the 80th percentile, this means that 80 percent of the group with which their score is being compared obtained scores below 25. (See also percentile rank.)
A ranking from 1 to 99 that indicates what percent of a norm group obtained scores lower than the one a certain student obtained. For example, if a test taker gets a percentile rank of 80, this means that 80 percent of the group with which their score is being compared had lower scores on the test. (See also percentile.)
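A minimal sketch that follows the definition above literally: the percent of the norm group scoring lower, kept within 1 to 99. Published tests may use other conventions (for example, counting half of the tied scores), so this is one possible reading and not a standard procedure. The norm-group scores are invented.

```python
def percentile_rank(score, norm_scores):
    """Percent of the norm group with scores below `score`, limited to 1..99."""
    below = sum(1 for s in norm_scores if s < score)
    pr = round(100 * below / len(norm_scores))
    return min(99, max(1, pr))


# Hypothetical norm group of ten scores
norm_group = [12, 15, 18, 20, 21, 22, 23, 24, 26, 28]
print(percentile_rank(25, norm_group))  # 80
```

Here 8 of the 10 norm-group scores fall below 25, so the percentile rank is 80 even though 25 itself is only a raw score.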
A range of percentile ranks, often appearing on a score report, that shows the range within which the test taker’s “true” percentile rank probably occurs. The “true” value refers to the rank the test taker would obtain if there were no random errors involved in the testing process. (See also standard error of measurement and confidence band.)
An assessment tool that requires test takers to perform—develop a product or demonstrate a process—so that the observer can assign a score or value to that performance. A science project, an essay, a persuasive speech, a mathematics problem solution, and a woodworking project are examples. (See also authentic assessment.)
When reporting the results of achievement testing for groups of students, the score range can be divided into segments or levels. Each level is given a label, and the performance of students scoring within each range is described in terms of what those students know or can do. Examples of such levels are “basic”, “proficient”, and “advanced”. They also are known as proficiency levels or achievement levels.
A description of the minimum performance required to judge a test taker’s score to be “good enough”. The description is used in a standard setting procedure to determine the cut score required to differentiate performance levels, such as “pass” rather than “fail”, be certified or not, or be judged proficient or not.
A test administration that occurs during the development process to check on the quality and appropriateness of test items, administration procedures, scoring, and/or reporting. Sometimes the purpose is to check on the impact of optional ways of administering, scoring, or reporting. (See also field test.)
A test designed to determine which course, in a sequence of courses, would be optimal for a student to enroll in to begin study. These tests often are used by colleges to determine which of several mathematics, chemistry, or foreign language courses is the best starting place for a student who has taken courses in these areas in high school.
A systematic collection of work samples provided by an individual to demonstrate their level of proficiency in some field or curricular area. Often students in art and music develop such collections to communicate their style and abilities for admission to universities or as evidence in applying for special awards. An athlete’s videotapes represent another example. Portfolios also are used to document or determine growth or improvement, as with writing portfolios.
Information gathered in the process of validation to show that scores from one test are related to criterion scores collected at some later point in time. Correlation coefficients often are used as a form of such evidence. (See also criterion-related evidence and validity.)
primary trait scoring
A method of holistic scoring of work products (like writing) or performances (like a musical selection) in which overall performance is rated and given a single score. However, the scoring guide is designed to evaluate the work in terms of the primary trait the test taker was asked to demonstrate. So, for example, if the test taker is asked to “describe” something in writing, the focus is on description, but if the test taker is asked to “persuade” in writing, the focus of scoring is on the persuasive arguments the writer demonstrated. (See also holistic scoring and analytical scoring.)
When reporting the results of achievement testing for groups of students, the score range can be divided into segments or levels. Each level is given a name and the performance of students scoring in that range is described in terms of what those students know or can do. Examples of such levels are “basic”, “proficient”, and “advanced”. These also are known as achievement levels or performance levels.
profile of scores
A graphic representation of scores from several tests or subtests, often from a test battery, for which it is appropriate to compare performances across tests. Relative strengths and weaknesses in performance might be identified from a line graph or bar graph of scores. Such a technique is often used for diagnostic purposes.
A verbal or pictorial display that is used in various types of performance assessments to present the context or theme about which the test taker should respond. A writing prompt, for example, tells the test taker what the writing topic should be and whether the writer should describe, persuade, inform, or explain something in the process.
Literally, the term refers to psychological measurement. Generally, it refers to the field in psychology and education that is devoted to testing, measurement, assessment, and related activities.
Race to the Top
A competition among the states for educational funding created by legislation enacted by the U.S. Congress in 2009 to promote reform in states and K-12 school districts. (Also known as R2T, RTT, and RTTT.)
An item response theory model, also known as the one-parameter logistic (1PL) model, that involves only one item parameter, difficulty. (See also item response theory and difficulty.)
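A sketch of the standard one-parameter logistic form, in which the probability of a correct response depends only on the difference between the test taker's ability and the item's difficulty. The function and parameter names are chosen for illustration, and both values are assumed to be on the same logit scale.

```python
from math import exp


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the one-parameter logistic model."""
    return 1 / (1 + exp(-(ability - difficulty)))


# Ability equal to item difficulty gives a 50 percent chance of success
print(rasch_probability(0.0, 0.0))  # 0.5

# Higher ability relative to difficulty raises the probability
print(round(rasch_probability(1.0, 0.0), 3))  # 0.731
```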
The score obtained by a test taker reflecting the number of items correctly answered, or the number of points awarded during scoring, for the responses given. Raw scores often are converted to some kind of derived score, such as percentile rank, scale score, or performance level, for interpretive purposes.
A test used to determine whether the test taker has the fundamental prerequisites needed to benefit from a particular instructional program. Such tests are sometimes used to determine whether young children have the pre-reading skills needed for beginning formal reading instruction. Historically, such tests sometimes were used to determine whether children were ready to attend school.
reliability
The characteristic of a set of test scores that relates to the amount of random error from the measurement process that might be embedded in the scores. Scores that are highly reliable are accurate, reproducible, and consistent from one testing occasion to another. That is, if the testing process were repeated with a group of test takers, essentially the same results would be obtained. Various kinds of reliability coefficients, with values ranging between 0.00 (much error) and 1.00 (no error), are usually used to indicate the amount of error in the scores. (See also standard error of measurement.)
rubric
A scoring guide used in the process of rating or scoring performance assessment items or groups of items. It shows the essential aspects that should be present in a response in order to assign it a value, ranging from the lowest possible score to the highest possible. For example, for an item for which 0 to 5 points can be awarded, the rubric would describe what must be present in the response in order for the scorer to assign it any one of the six possible scores.
scale score
A kind of score to which a raw score has been converted for ease of interpretation. (See also derived score.)
scaling
The process of assigning numbers to increasing levels of performance on a test. The resulting score scale shows numbers increasing as test performance increases. Generally, raw scores are used as the basis for developing scales such as standard scores, percentile ranks, or grade equivalents, which are more useful than raw scores for interpretation purposes.
score range
The distance between the lowest and the highest possible scores on whichever score scale is being used. In some cases, “the range” refers to the distance between the highest and lowest scores obtained by a particular group on a particular occasion.
selected-response item
An objective-test item that allows the test taker to select one or more of the given options or choices as their response. Examples include multiple-choice, true-false, and matching items. These are contrasted with constructed-response items.
self-report
An assessment instrument for which the respondent answers questions or makes ratings of their own behavior or performance, as opposed to such responses being made by an observer of that individual. (See also inventory.)
short-answer item
A test item that requires the test taker to furnish a word, phrase, sentence, or numerical response. These are constructed-response items that require only a brief response rather than several paragraphs or pages (an extended response).
standard age score
A standard score often reported with intelligence or aptitude tests to permit norm-referenced interpretations using various age groups.
standard deviation
A statistic that describes how much the scores in a particular group vary; it is a measure of variability. Conceptually, the number indicates the average amount by which the scores in a group differ from their mean score. It is also the square root of the variance.
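The relationship between the variance and the standard deviation can be shown with a small calculation. This is a minimal sketch using made-up scores; it divides by the number of scores (the "population" form), whereas some textbooks divide by one less than that number. The function names are illustrative.

```python
import math

def variance(scores):
    """Average squared distance of the scores from their mean."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores) / len(scores)

def standard_deviation(scores):
    """Square root of the variance."""
    return math.sqrt(variance(scores))

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up scores; the mean is 5
print(variance(scores))             # 4.0
print(standard_deviation(scores))   # 2.0
```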
standard error of measurement (SEM)
In classical test theory, a statistic computed for a set of scores using the standard deviation of those scores and a reliability coefficient. It is used in score interpretation to show the amount of error that likely would be associated with a particular obtained score. (See also confidence band.)
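One widely used form of this computation multiplies the standard deviation by the square root of one minus the reliability coefficient. The sketch below assumes that form and uses made-up values (a standard deviation of 10 and a reliability of 0.91); the function name is illustrative.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability), with reliability between 0 and 1."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)

sem = standard_error_of_measurement(sd=10.0, reliability=0.91)
print(round(sem, 2))  # 3.0

# A band of one SEM on either side of an obtained score of 75.
obtained = 75
print(round(obtained - sem, 1), round(obtained + sem, 1))  # 72.0 78.0
```

The higher the reliability, the smaller the SEM; perfectly reliable scores (1.00) would give an SEM of zero.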
standard score
A derived score, used for making norm-referenced interpretations, for which the mean and standard deviation are selected to simplify interpretations. Some common standard scores and their corresponding mean and standard deviation are: ACT scores (18 and 5), SAT scores (500 and 100), T-scores (50 and 10), and z-scores (0.0 and 1.0).
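Converting between these scales is simple arithmetic: a raw score is first expressed as a z-score (its distance from the group mean in standard deviation units) and then placed on a scale with the chosen mean and standard deviation. The sketch below uses a made-up raw score and group statistics; the function names are illustrative.

```python
def z_score(raw: float, group_mean: float, group_sd: float) -> float:
    """Distance of a raw score from the group mean, in standard deviation units."""
    return (raw - group_mean) / group_sd

def to_standard_score(z: float, scale_mean: float, scale_sd: float) -> float:
    """Place a z-score on a scale with the chosen mean and standard deviation."""
    return scale_mean + scale_sd * z

# Made-up example: a raw score of 34 in a group with mean 28 and SD 4.
z = z_score(34, 28, 4)
print(z)                              # 1.5
print(to_standard_score(z, 50, 10))   # T-score: 65.0
```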
standard setting
The process of identifying the scores (cut scores) on a score scale that define the starting and ending points of various performance levels used for reporting test performance. For example, the process is used to determine the lowest score that could be obtained to categorize performance as “pass” rather than “fail”. (See also performance levels.)
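Once cut scores have been chosen, sorting scores into levels is mechanical. The sketch below assumes two hypothetical cut scores (60 and 85) and three level names; none of these values comes from a real test, and each cut score is treated as the lowest score in its level.

```python
from bisect import bisect_right

# Hypothetical cut scores: 60 is the lowest "proficient" score,
# 85 is the lowest "advanced" score.
CUT_SCORES = [60, 85]
LEVELS = ["basic", "proficient", "advanced"]

def performance_level(score: float) -> str:
    """Return the name of the level whose score range contains the score."""
    return LEVELS[bisect_right(CUT_SCORES, score)]

for score in (59, 60, 84, 85):
    print(score, performance_level(score))
# 59 basic
# 60 proficient
# 84 proficient
# 85 advanced
```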
standardization
The process of obtaining norms from a representative sample of individuals under standard conditions of test administration and scoring. (See also standardized test and norm-referenced score interpretation.)
standardized test
An assessment tool that has a “sameness” to it for all who take it, in terms of the items presented, the procedures used to administer it, and the methods used to score it. Unless all conditions are the same when different groups are given the test on different occasions, it is not meaningful to compare their scores or to combine their scores to describe overall group performance. Such sameness is required when norms are acquired in a process called standardization. However, standard conditions also are essential, even when norms are not used for score interpretation purposes, if scores from multiple groups tested in different places at different times are to be combined.
standards-based assessments
Assessments that are developed to measure student attainment of a specific set of content standards. The test specifications detail the content and cognitive processes that have been the focus of student learning, as written in the school’s content standards. (See also content standards and test specifications.)
stanine
A kind of standard score for which the digits 1 through 9 are used to describe test performance. The mean value is 5 and the standard deviation is 2.
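Given the mean of 5 and standard deviation of 2, a z-score can be placed on this nine-point scale by doubling it, adding 5, rounding to the nearest whole number, and limiting the result to the range 1 through 9. This is a common approximation that assumes roughly normally distributed scores; in practice the nine values are often assigned from percentile bands instead. The function name is illustrative.

```python
import math

def stanine_from_z(z: float) -> int:
    """Approximate stanine for a z-score (mean 5, SD 2, limited to 1..9)."""
    nearest = math.floor(2 * z + 5.5)   # round half up to the nearest integer
    return max(1, min(9, nearest))

print(stanine_from_z(0.0))    # 5
print(stanine_from_z(1.0))    # 7
print(stanine_from_z(-3.0))   # 1
```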
state norms
Norms, often expressed as percentile ranks, that describe how students in the state at a particular grade level have scored on a certain test. State norms can be obtained when all students, or a representative sample of students, in the state at the grade level of interest have taken the same test or parallel tests.
stem
A question or incomplete sentence that poses a problem in a selected-response test item, most often a multiple-choice item. The stem is usually followed by a list of options, which includes the distracters and the correct answer (the keyed response).
subjective test
A test that requires some judgment or subjectivity in the scoring process. Examples include essay tests, writing assessments, and other types of performance assessments. (See also objective test.)
summative use of assessments
Using assessments at the end of an instructional segment to determine the level of students’ achievement of intended learning outcomes or whether learning is complete enough to warrant advancing the student to the next segment in the sequence. This is contrasted with formative use of assessments.
test
An evaluation instrument, usually composed of questions or items that have right answers or best answers, used to measure an individual’s aptitude or level of achievement in some domain. Tests are usually distinguished from inventories, questionnaires, and checklists as evaluation devices.
test forms
When more than one version of a particular test is available, the versions are generally referred to as forms. Multiple forms of the same test should not be used interchangeably unless the score user or test developer has provided evidence of comparability. (See also parallel forms.)
test preparation
Any of a number of activities in which a prospective test taker might participate, primarily for the purpose of optimizing their score on an upcoming test. The nature of the activities, the circumstances under which they are presented, and the purpose of the test must all be considered in deciding whether specific test preparation is appropriate and beneficial, inappropriate and unethical, or merely unhelpful. (See also coaching.)
test specifications
Written details prepared early in the test development process to describe many of the characteristics of the resulting test. Generally included in such a document are: the content to be covered, the level of cognitive functioning to be displayed with the content, the proportional weighting of the various content areas, the kinds of items to be used, the number of items of each type to be used, and how much time will be permitted for testing. Brief versions of the specifications are sometimes referred to as test plans or test blueprints.
test-wiseness
The amount of skill in test taking possessed by an individual. The skill relates to such things as time management, how to guess among options when the test taker has partial or little knowledge about the ideas in a test item, how to provide constructed responses that might be most appealing to scorers, and how to identify unintended cues in items prepared by less-experienced test developers.
true score
The score a test taker would receive if there were no influences of random error in the testing process. In classical test theory, it is the hypothetical score a person would receive if the same test could be given many times and the scores were averaged across those occasions. (See also error score and classical test theory.)
user norms
Norms obtained by collecting scores from the group of test takers in the same grade who have been given the same test under standard conditions. Generally, such norms are not based on a representative sample of the population of interest and, therefore, contain some amount of bias when used in making norm-referenced interpretations. For example, scores from a test given to students in many parts of the country might be labeled as “national” norms because of their varied geographic origin, but such norms would not be true “national norms” unless the students, or their schools, were shown to be a representative sample of students in the nation in those grades. (See also national norms.)
validation
The process of gathering evidence to (a) support particular meanings that the user would like to attribute to scores from a test and (b) demonstrate that it is appropriate to use the scores from that test in the way(s) the user has chosen. Though the burden generally is on the user of test scores to provide the evidence, often the test developer furnishes information about test purpose, test development procedures, and appropriate test administration and scoring procedures. (See also validity.)
validity
The degree to which the evidence obtained through validation supports the score interpretations and uses to be made of the scores from a certain test administered to a certain person or group on a specific occasion. Sometimes the evidence shows why competing interpretations or uses are inappropriate, or less appropriate than the proposed ones.
variance
A statistic that describes how much the scores in a particular group vary; it is a measure of variability. Some statistical techniques used in testing depend on being able to partition the variance and attribute the parts to various test, test administration, or test taker characteristics. Statistically, it is the standard deviation squared.