
White Paper Series: Current Issues in Test Development


High Fidelity in Items and Response Data

Anne Thissen-Roe, Ph.D.
May 17, 2016

The stereotypical test is a spartan thing: a spare setting, a #2 pencil, a booklet of pages of minimalist questions followed by four response options each, free of clues, free of context, very little like the outside world. A score, too, is stark and spare: a person is reduced to a number, for the purpose of comparison to a standard or to other competing persons. The information channels to and from the test-taker are, traditionally and stereotypically, narrow.

We have the technology and the understanding to create wider channels in both directions. We can use those channels to create a rich, appropriate, valid context around the test we administer, and an equally rich, accurate depiction of the candidate. We can, and should, create a symmetrical, high fidelity information exchange: a test with high fidelity to the work, and a score report with high fidelity to the candidate.

Fidelity to the Work
For most jobs, a work sample is the single most valid predictor of performance (Schmidt & Hunter, 1998). This is unsurprising: in a work sample, a candidate performs actual job tasks. The job is simulated for standardization, but is as realistic as possible.

For example, a plumber might be presented with a sink and a kit of parts, and asked to install its pipes in front of a pair of experienced plumbers, who act as evaluators. A software developer might be asked to write a program to meet a given set of specifications, just as she would be required to do on the job. Auditions in the performance arts work the same way: a candidate for first violin in an orchestra is asked to play the violin.

Certification and licensure exams make use of work samples in the form of practical or performance tests. For example, US Medical Licensing Examination (USMLE) Step 2 candidates treat a series of Standardized Patients, who are actually actors simulating various ailments. Chefs seeking any level of certification through the American Culinary Federation must cook representative food in an evaluation kitchen. The Operating Engineers Certification Program requires candidate crane operators to perform typical operations with a crane of appropriate type and a standard load, precisely and safely, within a specified time limit. In all these cases and more, having candidates actually do the job is as powerful for verifying minimum safe competence (e.g., for a license) as it is for identifying the best of many contenders (e.g., in high-prestige auditions).

The great advantage held by work samples and similar assessments is high fidelity to the work. By contrast, a traditional paper-and-pencil, multiple-choice test has low fidelity to the work. That is, there are numerous ways in which such a test does not resemble doing the job. A candidate might think, "My test score shows how well I fill in bubbles on a Scantron, not how well I work." To an extent, the candidate is right -- her score has a job-irrelevant component -- although a testing program that adheres to the National Commission for Certifying Agencies (NCCA) Standards for the Accreditation of Certification Programs (Institute for Credentialing Excellence, 2014), the Standards for Educational and Psychological Testing (AERA/APA/NCME, 2014), and/or the Principles for the Validation and Use of Personnel Selection Procedures (SIOP, 2003) will have gone to some trouble to ensure that the test score also has a valid, job-related component that makes testing each candidate worthwhile.

Low-fidelity test scores pose a challenge in the form of extracting a signal from noise. The "signal," which is job-related variation in the scores, must be heard by the test score's user (e.g., the agency granting licenses or certificates), despite the intrusion of "noise," which results from both non-systematic measurement error and the systematic effects of job-irrelevant aspects of the test.
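
To make the signal-and-noise language slightly more concrete, one simplified way to write the decomposition is sketched below; the notation is illustrative and is not drawn from any of the standards cited above.

    % X = observed score, T = job-related ("signal") component,
    % S = systematic job-irrelevant component (e.g., skill at filling in bubbles),
    % E = non-systematic measurement error. Notation is illustrative only.
    \[ X = T + S + E \]
    % Under the usual independence assumptions, the observed-score variance
    % splits the same way, and the "signal" is the job-related share:
    \[ \sigma^2_X = \sigma^2_T + \sigma^2_S + \sigma^2_E, \qquad
       \text{signal proportion} = \sigma^2_T / \sigma^2_X \]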

It follows logically that high-fidelity tests will show higher validity than low-fidelity tests, provided their measurement properties are good. I'll talk more about measurement quality in high-fidelity tests in a forthcoming white paper. For now, let's consider the tests themselves.

High-fidelity assessments are sometimes described as simulations, because a candidate performs real work tasks, but without real work's consequences to an employer -- or, in the case of licensure and certification, the job and employer themselves may be fictional elements in service of the assessment.

Certain types of written tests, and the items therein, are described as low-fidelity simulations. These include situational judgment items, which present a work situation and ask the candidate to rate, rank, or select among possible responses. The presentation of the situation can be as simple as a verbal description. The candidate makes a decision that is work-like in nature -- a simulation of that single element -- but the decision is framed by an interaction with the test (e.g., a multiple-choice item) that is not like anything on the job.

There are degrees of fidelity to the work in between high and low. In recent years, technology has enabled a profusion of moderate-fidelity -- or, one might say, higher-fidelity -- tests that can still be standardized, administered, and usually scored by computer. These are sometimes grouped under the name "innovative item types." Some examples follow.

Images, whether photographs, pictures or diagrams, have long been used in traditional testing. However, the use of images was constrained by the ability to reproduce the images clearly on a large enough scale. Black and white line drawings were common in twentieth-century paper test booklets. In the present day, the inclusion of a full-color, high-resolution digital photograph in a computer-administered item is no trouble at all, and can add to the realism and validity of a test. For example, as a component of a clinical diagnosis item for a veterinarian, a photograph of a sick or injured animal, taken during the course of actual treatment, can be included. The item can then test visual elements of diagnosis, thereby assessing a wider spectrum of the vet's diagnostic skills.

Audio and full-motion video can be used as well, with far greater convenience than in historical testing situations. Audio items are essential in tests of listening skills in various technical contexts: foreign language media monitoring by an intelligence agent; Morse Code transcription by a maritime radio operator; pitch recognition by a musician; understanding of specialized communications from Air Traffic Control by a pilot, or dispatch by an emergency responder; and many more. Presenting the stem or prompt of an item as a video adds to the realism of situational judgment items, but video items can just as well be used to predict a candidate's ability to interpret spoken or sign languages in an in-person, live setting.

Some innovative item types alter the method of making a response as well. Hotspot items ask the candidate to click or tap on the correct part of an image, and score the response correct if the tap or click is within a region designated "close enough." For example, a healthcare professional whose job requires interpreting radiographic or ultrasound images, or MRIs, might be asked to identify a feature of one: a tumor located somewhere in an MRI; a dental cavity visible in a set of X-rays; part of the anatomy of an embryo at twenty weeks gestation in an ultrasound still frame.
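
As a rough sketch of how such an item might be scored automatically, the fragment below checks whether a click falls inside a keyed rectangular region padded by a small tolerance; the region, tolerance, and coordinates are hypothetical, not drawn from any particular delivery platform.

    # Minimal sketch of hotspot scoring: the response is the (x, y) pixel the
    # candidate clicked, and the key is a rectangular "close enough" region
    # plus an optional tolerance. Region and tolerance values are hypothetical.

    def score_hotspot(click_x, click_y, region, tolerance=0):
        """Return 1 if the click falls within the keyed region (padded by
        `tolerance` pixels on every side), else 0."""
        left, top, right, bottom = region
        return int(left - tolerance <= click_x <= right + tolerance and
                   top - tolerance <= click_y <= bottom + tolerance)

    # Example: the keyed feature occupies the rectangle (412, 118)-(473, 166)
    # in the displayed image; clicks within 5 pixels of it are "close enough."
    print(score_hotspot(430, 150, region=(412, 118, 473, 166), tolerance=5))  # 1
    print(score_hotspot(45, 300, region=(412, 118, 473, 166), tolerance=5))   # 0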

Many work tasks are themselves executed on the computer. Some skills tests of computer-based tasks provide a limited or "sandboxed" version of a software interface actually used for the task in a work context. Software developers can insert fragments of code or fix bugs in a simulated project and development environment, and have them scored automatically. CPA Exam candidates may find themselves filling in an IRS form or computing missing values in a spreadsheet. USMLE Step 3 candidates use software to order medications in real time for a simulated patient whose condition is changing.
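
The automatic scoring step for a programming item can be illustrated, very loosely, as running the submitted code against keyed test cases. The sketch below is a toy version under that assumption, with invented test cases; a production system would add true sandboxing, resource limits, and partial-credit rules.

    # Toy sketch of automated scoring for a "fix the bug" programming item.
    # The candidate submits code as text; it is executed in a restricted
    # namespace and checked against keyed test cases (invented for this example).

    KEYED_CASES = [((2, 3), 5), ((10, -4), 6), ((0, 0), 0)]  # hypothetical item key

    def score_submission(source_code):
        namespace = {}
        try:
            exec(source_code, {"__builtins__": {}}, namespace)  # no builtins exposed
            candidate_fn = namespace["add"]                      # item asks for add()
        except Exception:
            return 0.0
        passed = 0
        for args, expected in KEYED_CASES:
            try:
                if candidate_fn(*args) == expected:
                    passed += 1
            except Exception:
                pass
        return passed / len(KEYED_CASES)

    print(score_submission("def add(a, b):\n    return a + b"))  # 1.0
    print(score_submission("def add(a, b):\n    return a - b"))  # ~0.33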

All of these innovative item types have a feature in common: higher "downstream bandwidth" from the test-maker to the test-taker. That is, they achieve greater fidelity to the work by providing more job-relevant contextual information to the candidate. The enriched context clarifies the problem statement for the qualified candidate, and allows her to orient herself to the situation. In addition, as some industrial-organizational psychologists put it, the test doubles as a "realistic job preview," which is considered a feature in employee selection because it offers poor-fitting candidates the opportunity to opt out, and good candidates the opportunity to prepare and be confident. The realistic job preview is less relevant in other settings (e.g., certification and licensure; promotion testing; educational testing), but qualified candidates still benefit from the enriched flow of information. (It is up to the test developer to ensure that unqualified candidates do not benefit, at least in terms of higher scores.)

Fidelity to the Candidate
Downstream bandwidth in a test, flowing from the test-maker to the test-taker, has an obvious and natural counterpart: upstream bandwidth, that is, information about the candidate's test performance or item response behavior. This kind of information flows from the test-taker to the test score user. In contrast to high downstream bandwidth tests, which can have high fidelity to the work, high upstream bandwidth tests provide high fidelity to the candidate.

Yet most of the innovative item types I described don't provide any additional upstream bandwidth to match their downstream bandwidth. The candidate may still be responding to what is now a very complex item by selecting a multiple-choice option. She may be clicking her way through hotspots, but only "close enough" or "not close enough" is recorded; the rest of the click information is thrown away. Sandboxed software simulations are the most likely to encode and use more candidate behavior information than the narrow channel of a single multiple-choice selection.

High upstream bandwidth is desirable from the perspective of the qualified candidate as well as the score user. The score user may, depending on her use case, prefer that the additional information manifest as increased precision or as profile detail; one or the other may be better supported by the test structure and validity evidence. Either way, score users don't ask for more error in their scores.

For the qualified candidate, higher upstream bandwidth means the opportunity to demonstrate her competence unambiguously and completely, in less time and with less effort. A given population of candidates may focus more on the rapid testing aspect, or on the accuracy and the opportunity to perform. But either way, the qualified candidate benefits, and the brazen unqualified candidate has a harder time passing herself off as qualified.

For decades, we have looked for additional information about our candidates by asking better questions. We improve our item development processes; we change response formats; we administer tests with Computerized Adaptive Testing systems that progressively tailor the test to the candidate. Asking better questions leads to better efficiency of measurement, even with narrow-channel responses.
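
To make the adaptive idea concrete, here is a minimal sketch of one common item-selection rule, maximum Fisher information under a two-parameter logistic model, with invented item parameters; operational CAT systems layer exposure control, content balancing, and other constraints on top of this.

    import math

    # One step of adaptive item selection under a 2PL IRT model: administer the
    # unused item with maximum Fisher information at the current ability
    # estimate. Item parameters (a = discrimination, b = difficulty) are invented.

    ITEM_BANK = {
        "item_01": (1.2, -0.5),
        "item_02": (0.8,  0.0),
        "item_03": (1.5,  0.7),
        "item_04": (1.0,  1.4),
    }

    def p_correct(theta, a, b):
        """2PL probability of a correct response at ability theta."""
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    def fisher_information(theta, a, b):
        p = p_correct(theta, a, b)
        return a * a * p * (1.0 - p)

    def next_item(theta_hat, administered):
        candidates = {k: v for k, v in ITEM_BANK.items() if k not in administered}
        return max(candidates, key=lambda k: fisher_information(theta_hat, *candidates[k]))

    print(next_item(theta_hat=0.6, administered={"item_01"}))  # -> 'item_03' here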

That is not to say we should ask bad questions, but we can now do more than tweak questions. We have the ability to record, model, and score information from additional sources. Some examples follow.

The simplest and most common additional element of response behavior to be recorded, aside from the response itself, is response time or response latency -- that is, the time between when the item is first presented and when the response is made. Response time in psychometrics is typically measured in seconds. (Some psychological research measures response time in milliseconds; those researchers usually call it reaction time.)

Response time gives us a clue about the candidate's response process -- how she came to make that response. Did the candidate recognize the right answer, having solved the same problem before? (The question might have been presented in a candidate's previous attempt at the test, or it might just be familiar, such as in the case of an arithmetic or vocabulary problem.) Did the candidate reason out the response? Was the response time atypically long, as in the case of an interruption, or an event we might term a security incident?

A pattern of response times, over the course of a whole test, can augment the pattern of responses by giving information about a candidate's familiarity with the material -- overall, or differentially between content areas -- as well as her overall level of caution. Answering many questions very quickly and correctly indicates familiarity; answering many questions very quickly, but some of them wrong, indicates that the candidate is willing to take risks in order to be fast.
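
A rough sketch of summarizing such a response-time pattern appears below; the ten-second threshold separating "rapid" responses is arbitrary here and would have to be set empirically, per item or per content area, in a real program.

    # Rough sketch of summarizing a candidate's response-time pattern.
    # Each record is (item_id, response_time_seconds, correct).

    responses = [
        ("item_01",  6.2, True),
        ("item_02",  4.8, True),
        ("item_03",  7.5, False),
        ("item_04", 31.0, True),
        ("item_05",  5.1, True),
    ]

    RAPID_SECONDS = 10.0  # arbitrary threshold for illustration

    rapid = [r for r in responses if r[1] < RAPID_SECONDS]
    rapid_accuracy = sum(r[2] for r in rapid) / max(len(rapid), 1)

    print(f"{len(rapid)} of {len(responses)} responses were rapid")
    print(f"accuracy on rapid responses: {rapid_accuracy:.2f}")
    # Many rapid *and* correct responses suggest familiarity with the material;
    # many rapid but incorrect responses suggest a fast, risk-tolerant style.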

We can obtain additional diagnostic information about the candidate from eye tracking or mouse tracking during testing. The path of a cursor in between clicks, as a time series, conveys information about what a candidate was thinking, and when, prior to making a response. Of course, the path is full of "noise" and accidental motions, but we may be able to see some indication of false starts, attention to distractors and irrelevant material, or early identification of the correct response followed by a period of self-checking. Or, ideally, the candidate glides the cursor over (or near) the relevant material in about the order we expect, and then down the responses until the correct one is reached -- and then clicks.
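
A cursor trace of this kind is simply a time series of samples. The sketch below records (time, x, y) samples and derives one simple feature, total path length, as an example of the sort of summary that might be extracted; the sampling rate and the feature itself are illustrative assumptions.

    import math

    # Sketch of a cursor trace as a time series of
    # (seconds_since_item_onset, x_pixels, y_pixels) samples.

    trace = [
        (0.00, 640, 400),
        (0.25, 598, 362),
        (0.50, 512, 340),
        (0.75, 505, 338),   # hovering near the stem
        (1.00, 470, 520),   # moving down the response options
        (1.25, 468, 585),   # settles on an option, then clicks
    ]

    def path_length(samples):
        """Total distance traveled, summed over consecutive samples."""
        return sum(
            math.hypot(x2 - x1, y2 - y1)
            for (_, x1, y1), (_, x2, y2) in zip(samples, samples[1:])
        )

    print(f"total cursor travel: {path_length(trace):.0f} px over {trace[-1][0]:.2f} s")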

Eye tracking conveys a candidate's thought process, particularly during passage reading and image interpretation, more reliably and accurately than mouse tracking. It is technically more demanding, and potentially requires more up-front investment, because there's a second device involved. However, in proctored, high-stakes testing settings, a camera may already be pointed at the candidate's face for the purpose of cheat detection. If that camera is recording video at a sufficient frame rate and at sufficient definition, the same raw video stream can be parsed a second time to track the direction of a candidate's gaze.

High fidelity in both directions is exemplified in recent "gamified" testing systems, not to mention the computer (and console) games themselves. Games and gamified tests present a rich experience for the player (candidate), and record a rich stream of user actions, often at the semantic "in-world" level, which are timestamped and can be coupled with their context. Did you move left or right? Did you stand in the fire? How many failed attempts did you make before you successfully tossed the ball into the basket? What did you change between each attempt?
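
The underlying record for such a stream is typically a list of timestamped, semantically labeled events carrying their context; the structure below is one hypothetical shape it could take, with invented field names and events.

    from dataclasses import dataclass, field

    # One hypothetical shape for a gamified assessment's event stream: each event
    # is timestamped, labeled at the semantic "in-world" level, and carries context.

    @dataclass
    class GameEvent:
        t: float                      # seconds since scenario start
        action: str                   # semantic label, e.g. "toss_ball"
        context: dict = field(default_factory=dict)

    events = [
        GameEvent(12.4, "move", {"direction": "left"}),
        GameEvent(15.0, "toss_ball", {"target": "basket", "result": "miss"}),
        GameEvent(21.7, "adjust_angle", {"delta_degrees": -8}),
        GameEvent(24.1, "toss_ball", {"target": "basket", "result": "hit"}),
    ]

    attempts = [e for e in events if e.action == "toss_ball"]
    failed_before_success = sum(1 for e in attempts if e.context["result"] == "miss")
    print(f"{failed_before_success} failed attempt(s) before the successful toss")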

The field of formative assessment makes good use of the available diagnostic information in games to help students learn subjects like physics. In credentialing and in selection, the data streams would need to be integrated through a decision support algorithm such that a single score or pass/fail decision is provided; however, the additional data streams could still increase the precision of that score or the reliability of that decision.
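
One deliberately generic form such a decision support algorithm could take is a weighted composite of standardized component signals compared against a cut score, as sketched below; the weights and cut score are placeholders that a real program would establish through its own validation and standard-setting work.

    # Generic sketch of integrating several data streams into a single decision:
    # standardize each component, combine with weights, compare to a cut score.
    # Weights, component names, and the cut score are placeholders.

    WEIGHTS = {"response_accuracy": 0.70, "response_time_pattern": 0.20, "process_trace": 0.10}
    CUT_SCORE = 0.0  # in composite z-score units (placeholder)

    def composite_decision(z_scores):
        """z_scores: dict mapping component name -> standardized score."""
        composite = sum(WEIGHTS[name] * z_scores[name] for name in WEIGHTS)
        return composite, ("pass" if composite >= CUT_SCORE else "fail")

    print(composite_decision({
        "response_accuracy": 0.8,
        "response_time_pattern": 0.3,
        "process_trace": -0.2,
    }))  # composite of about 0.60 -> 'pass'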

A Balance of Efficiencies
Bandwidth does not come for free. High-fidelity items are, in general, more challenging and expensive to develop than their traditional counterparts. The technical infrastructure must support them. Multimedia elements must be created. Scoring algorithms must be designed, implemented and validated. Complex items must be vetted extensively for peripheral elements that introduce construct-irrelevant variance, or lead to differential item functioning. One argument in favor of low (or lower) fidelity is cost-efficiency.

The relationship between development cost (and effort) and validity is not always linear -- or even monotonic. In a classic result, assessment centers, sometimes used for managerial selection, typically require a candidate to be on site for two days for a battery of tests, and yet Schmidt and Hunter (1998) report a mean validity of only 0.37, lower than that of the test of general mental ability typically included in the battery. Today, it is similarly not clear that "more" multimedia always means more validity.

Consider situational judgment items, in which a work situation can be presented as a video with live actors, a 3D animation (voiced or captioned), a 2D animation, a still photograph, or a verbal description. There is a clear cost savings associated with each of those options, relative to the one before it. However, recent studies of situational judgment items have shown that the presentation of the situation can be attenuated -- even omitted entirely -- without as much loss of validity as one might expect (Lievens & Motowidlo, 2016). Worse, items using animation are subject to the so-called Uncanny Valley, wherein computer animation is realistic, yet just-barely-perceptibly unreal, and is perceived by viewers as "disturbing." Never mind what that does to candidate reactions and face validity; "disturbing" one's candidates seems like a quick way to introduce construct-irrelevant variance into a test score!

I do not mean to dissuade my readers from using animation or video in their items, especially in cases where the animation or video communicates information that would be difficult to express through other means. I do believe that the most costly option -- or the one that looks the most "cutting edge" on paper before the project is initiated -- is not always the best or most valid alternative.

Yet, cost-efficiency is not the only efficiency worth considering. Higher bandwidth is generally good for the efficient use of a candidate's time and effort during administration. Operational efficiency on the part of the administrator might be considered as well; reduced administration time produces cost savings, while the need for additional equipment or modified facilities creates expenses associated with higher fidelity items.

All of these efficiencies should be balanced with each other, and considered in terms of the validity attainable, in deciding how much fidelity -- to the work and to the candidate -- to pursue in a given testing situation.

In the end, I recommend considering high-fidelity options in test development, both upstream and downstream. Their time has come.


References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Institute for Credentialing Excellence. (2014). National Commission for Certifying Agencies Standards for the Accreditation of Certification Programs. Washington, DC: Author.

Lievens, F. & Motowidlo, S.J. (2016). Situational judgment tests: From measures of situational judgment to measures of general domain knowledge. Industrial and Organizational Psychology, 9:1, 3-22.

Schmidt, F.L. & Hunter, J.E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124:2, 262-274.

Society for Industrial and Organizational Psychology, Inc. (2003). Principles for the Validation and Use of Personnel Selection Procedures, Fourth Edition. Bowling Green, OH: SIOP.