September 23, 2009

Guide to assessment jargon

Suzana Lopes presents a jargon buster that demystifies some of the terminology used in assessment.

The world of assessment is a highly scientific one, and even the most traditional paper-based tests that have been delivered for decades are built on highly complex psychometric theory to ensure that they are the most accurate reflection of the skills, knowledge and competencies they are designed to assess. Meanwhile, more modern testing methods such as those delivered on-screen have given rise to a worldwide industry – and accompanying language – all of their own.

This brief guide lists some of the most common buzz-words and terms, and gives explanations in layman’s terms in the hope of breaking down some of the barriers that might otherwise threaten to grow up between testing companies and the awarding bodies, companies and organisations that test people every day.

Accommodations. Special provisions made for a test taker for a particular reason, most often a disability. Examples include a reader who reads the test on the test taker’s behalf, or extra time added to the test length to account for factors such as dyslexia.

Adaptive testing. An advanced psychometric method for accurately targeting an individual’s knowledge level. The test algorithm selects a starting item; if the candidate answers it correctly, a harder item is selected next, and if the candidate answers it incorrectly, an easier one. This continues until the candidate is consistently getting one right and one wrong, like an oscillating wave of limited height and depth. Adaptive testing is currently used to great success in nursing exams.
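The harder-after-right, easier-after-wrong loop described above can be sketched in a few lines of Python. This is a simplified "staircase" illustration only: operational adaptive tests generally select items using psychometric models rather than fixed difficulty steps, and the function name, the 1 to 10 difficulty scale, the stopping rule and the made-up candidate are all assumptions introduced for the example.

```python
def run_staircase_test(answers_correctly, start_level=5, min_level=1,
                       max_level=10, max_items=30, reversals_to_stop=6):
    """Give a harder item after a correct answer, an easier one after a wrong answer.

    answers_correctly: hypothetical callable that takes a difficulty level and
    returns True if the candidate answers an item of that level correctly.
    Returns (history, estimate): history is a list of (level, correct) pairs;
    estimate is the mean level at which the right/wrong pattern flipped
    (None if it never flipped).
    """
    level = start_level
    last_step = 0          # +1 = previous answer right, -1 = wrong, 0 = no answer yet
    reversal_levels = []
    history = []

    for _ in range(max_items):
        correct = answers_correctly(level)
        history.append((level, correct))
        step = 1 if correct else -1
        if last_step and step != last_step:
            reversal_levels.append(level)      # the pattern flipped at this level
            if len(reversal_levels) >= reversals_to_stop:
                break                          # candidate is oscillating: stop
        last_step = step
        level = min(max_level, max(min_level, level + step))

    estimate = (sum(reversal_levels) / len(reversal_levels)
                if reversal_levels else None)
    return history, estimate


# Usage: a made-up candidate who can answer items up to difficulty 6.
history, estimate = run_staircase_test(lambda level: level <= 6)
print(history)
print(estimate)
```

In this toy run the candidate ends up alternating between levels 6 and 7, which is the "oscillating wave" the definition refers to; the estimate is simply the average of the levels where the pattern flipped.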

CBT. The acronym for Computer Based Testing. The most high-profile example of a Computer Based Testing programme is the UK’s Driver Theory Test, which is currently the world’s largest CBT contract.

Classical Test Theory. A psychometric model that can be applied to the test creation and analysis process. Compared to IRT, it is an older model and can be used in conjunction with IRT modelling or on its own.

Constructed response. A type of task in a test that requires the test taker to create or produce an answer, such as writing an essay or filling in missing words or phrases without any prompts. As opposed to selected response.

Content domain. A technical term for a grouping of items based on their content or subject matter. Each content domain must be of a sufficient size to be meaningful. For example, in a health and safety test, if there are only two questions in the “Fire Safety” content domain, then it will be impossible to give the test taker worthwhile feedback if they miss one of them, or to gauge whether or not they really have any understanding of that topic.

Diagnostic. Diagnostic assessment is the term given to the type of testing that is often carried out at the beginning of a learning process or sometimes while a test taker is in the process of learning, so that ongoing teaching and subsequent learning can be tailored according to the outcome of the assessment.

Distractor. One of the choices in a multiple choice type of question. Not the correct answer, but one that is carefully designed to weed out certain incorrect thought processes. The more plausible a distractor is, the more difficult the question is.

E-Assessment. Another term for Computer Based Testing, often used in academic markets.

Equating. The process of ensuring that scores on different forms of a test (test forms) share the same statistical characteristics. For example, a test taker can re-sit a test a number of times, and although the test questions will be different each time, the scores the test taker receives will be directly comparable.
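One simple way to make scores from two forms comparable is linear equating, which maps a score on one form to the score on the other form that sits the same number of standard deviations from the mean. The sketch below is illustrative only: it assumes comparable groups of test takers sat each form, the function name and data are made up, and real equating programmes use more elaborate designs than this.

```python
from statistics import mean, pstdev


def linear_equate(score_x, form_x_scores, form_y_scores):
    """Place a Form X score on the Form Y scale by matching mean and spread."""
    sd_x = pstdev(form_x_scores)
    if sd_x == 0:
        raise ValueError("Form X scores have no spread, so they cannot be equated")
    slope = pstdev(form_y_scores) / sd_x
    return mean(form_y_scores) + slope * (score_x - mean(form_x_scores))


# Made-up data: Form Y turned out 10 points easier than Form X.
form_x = [10, 20, 30]
form_y = [20, 30, 40]
print(linear_equate(20, form_x, form_y))
```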

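Facility value. The difficulty level of an item, usually expressed as the proportion of test takers who answer it correctly, so that a higher facility value indicates an easier item.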
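Facility value is commonly calculated as the proportion of test takers who answer the item correctly. A minimal sketch, assuming responses have already been scored as 1 (correct) or 0 (incorrect); the function name is made up for the example:

```python
def facility_value(responses):
    """Proportion of test takers who answered the item correctly.

    responses: list of 1 (correct) and 0 (incorrect), one entry per test taker.
    """
    if not responses:
        raise ValueError("need at least one response")
    return sum(responses) / len(responses)


# Three of four test takers answered correctly, so this is a fairly easy item.
print(facility_value([1, 1, 0, 1]))
```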

Feedback. A report on how the test taker performed on a test, which can often be provided immediately after testing. Test takers may be provided with their pass/fail status, a numeric test score, and/or more in-depth analysis and advice based on how they performed on the various content domains within the test. This is particularly relevant for formative and diagnostic testing.

Formative. As with diagnostic assessment, a type of testing that runs alongside the learning programme, so that the testing process becomes integrated into, and part of, the test taker’s learning outcomes.

IRT or Item Response Theory. A psychometric model, applied in the test creation and analysis process, which describes how different types of test takers will respond to each item.
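As an illustration of how an IRT model "describes how test takers will respond", the sketch below uses the two-parameter logistic model, one commonly used IRT model (the guide itself does not name a specific one). It gives the probability of a correct answer from the test taker's ability and the item's difficulty and discrimination; the function and parameter names are chosen for the example.

```python
import math


def probability_correct(ability, difficulty, discrimination=1.0):
    """Two-parameter logistic model: chance of a correct response to one item."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# A test taker whose ability equals the item's difficulty has an even chance.
print(probability_correct(ability=0.0, difficulty=0.0))
# A more able test taker is more likely to answer the same item correctly.
print(probability_correct(ability=2.0, difficulty=0.0))
```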

Item. The generic term for an individual question or task that makes up a test.

Item bank. The pool or set of items from which the test creator builds a test. With item banking, the test creator can select items from an existing pool and create a new test form for each sitting or time period. This is as opposed to using static test forms, for which all new questions have to be created for each sitting.

Item discrimination. The ability of an item to determine the difference between those who have a required skill or competency and those who do not. Not to be confused with any negative or unwanted forms of discrimination – such as by cultural differences, language or disability – which tests must be carefully created to avoid.
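One simple way to put a number on item discrimination is the upper-lower index: the proportion of high scorers on the whole test who got the item right, minus the proportion of low scorers who did. The sketch below is one possible approach, not the only one; the 27% group size is a conventional choice, and the function name and data are made up.

```python
def discrimination_index(item_correct, total_scores, group_fraction=0.27):
    """Upper-lower discrimination index for a single item.

    item_correct: 1/0 per test taker for this item.
    total_scores: each test taker's total test score, in the same order.
    """
    if len(item_correct) != len(total_scores) or not total_scores:
        raise ValueError("need one item result and one total score per test taker")
    n = len(total_scores)
    k = max(1, round(n * group_fraction))          # size of each comparison group
    order = sorted(range(n), key=lambda i: total_scores[i])
    low_group, high_group = order[:k], order[-k:]
    p_high = sum(item_correct[i] for i in high_group) / k
    p_low = sum(item_correct[i] for i in low_group) / k
    return p_high - p_low


# Made-up data: the strongest test takers got the item right, the weakest did not.
print(discrimination_index([1, 1, 1, 0, 0, 0], [9, 8, 7, 3, 2, 1]))
```

A value near 1 means the item separates strong and weak test takers well; a value near 0, or a negative value, suggests the item needs reviewing.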

Item type. A specific kind of item, of which there are many – from multiple choice or essay questions to more high-tech on-screen item types such as simulations, 3D modelling tasks or drag-and-drop tasks.

Key. The correct answer to an item on a test.

Psychometrics. The field of study concerned with the theory and technique of educational and psychological measurement, which includes the measurement of knowledge, abilities, attitudes, and personality traits. It is used during test creation and analysis to ensure that these tests do what they are required to do in the best possible way.

Reliability. The consistency with which a test measures a trait or skill.
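Reliability is usually reported as a coefficient. The sketch below computes Cronbach's alpha, one widely used internal-consistency coefficient (the guide does not name a particular one), from a table of item scores. The function name and data are made up for the example.

```python
from statistics import pvariance


def cronbach_alpha(score_rows):
    """Cronbach's alpha for a score table: one row per test taker, one column per item."""
    n_items = len(score_rows[0])
    if n_items < 2:
        raise ValueError("need at least two items")
    item_variances = [pvariance([row[j] for row in score_rows])
                      for j in range(n_items)]
    total_variance = pvariance([sum(row) for row in score_rows])
    if total_variance == 0:
        raise ValueError("total scores have no spread")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Made-up data: every item agrees perfectly about who is strong and who is weak.
scores = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(cronbach_alpha(scores))
```

Values close to 1 indicate that the items behave consistently with one another; lower values indicate a less consistent test.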

Score report. A feature of CBT: the report the test taker receives on completing a test, which indicates his or her performance on the test. Sometimes this is a final result, sometimes it is a result for a portion of a test with a final overall result to follow.

Selected response. A type of task in a test that requires the test taker to choose a response from a given set of options, for example in a multiple-choice or multiple-response question. As opposed to constructed response.

Stakes. The outcome that depends on the passing of a test. For example, a high stakes test could be a medical or other professional exam, on which an entire career could hinge. Generally, the higher the stakes of the test, the more thorough and rigorous it needs to be. A high-stakes test is often characterised as a test on which a selection is based, for example for entry to a specific profession.

Stem. The part of the item that leads into the test taker’s response, whether it be a question or an instruction.

Summative. The type of assessment that takes place at the end of a course of learning, to sum up and measure the outcomes of that learning.

Test form. One version of a test. Multiple test forms (containing different sets of items and whose scores are equated) may be operational at any one time to eliminate the over-exposure of individual items when there are large numbers of test takers. Multiple test forms also ensure that test takers who sit the test on more than one occasion receive a different set of items at each administration.

Testing window. The time slot within which a test can be taken. Paper-based exams are typically administered in short testing windows (often just a single day) because of the operational issues involved in securely handling all the paper test forms. Computer Based Testing enables tests to be available in longer testing windows, even continuously, permitting test takers to sit the test on demand, and eliminating the need for managing large volumes of paper forms.

Validity. The extent to which a test measures what it is intended to measure.

Is there anything we should add to this guide? If so, let us know by commenting here on the article.

Suzana Lopes is EMEA VP Sales and Marketing at Pearson VUE.
