
White Paper Series: Technical Notes on Test Security


The Number Seven: Visualizing Item Compromise

Anne Thissen-Roe, Ph.D.
October 28th, 2016

"What I'm not seeing here is the number seven," I recently announced to my test security team. In response, I got some rather puzzled looks -- understandable, given that I was pointing to a scatterplot. Oops, I thought. Time to explain.

In the course of administering a test to candidates, occasionally we receive "tips" that tell us test security has been breached. For example, a third party might tell us that a school or test prep course has been teaching operational items. It is our duty to evaluate such a statement through the means available to us, including looking at the data for signs of unfairly advantaged candidates.

In this paper, I would like to share a simple visualization method for distinguishing "teaching the test" from merely "teaching." I'll show some simulation results in which I've modeled different candidate behaviors (or teaching behaviors), so that you, the reader, can learn to quickly recognize them. Finally, I'll provide my simulation code, written in the R language, so that you can experiment with the conditions, and perhaps compare model-generated results to your own data.

The Plot
The visualization itself is straightforward to construct from actual test data:

  1. Identify the suspect (or target) candidates and the non-suspect (or reference) candidates, using metadata or ancillary data about who attended the school or test prep course in question. This is likely the hardest part.
  2. Calculate the observed probability of getting each item correct based on the reference candidates who were presented that item. This is the number of correct responses divided by the number of presentations. If a candidate skipped the item, put it in the denominator but not the numerator; if it wasn't on the candidate's test form at all, don't put it in either one.
  3. Calculate the observed probability of getting each item correct based on the target candidates who were presented that item.
  4. Make a scatterplot where each item is a point. The horizontal axis is the probability of a correct response from a reference candidate (from Step 2) and the vertical axis is the probability of a correct response from a target candidate (from Step 3).
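Steps 2 through 4 can be sketched in a few lines of code. (This is a minimal illustration in Python with NumPy rather than R; the array layout, the toy data, and the function name are my own, not part of the method itself.)

```python
import numpy as np

def item_pvalues(responses):
    """Per-item proportion correct for one candidate group.

    `responses` is a candidates x items array: 1 = correct,
    0 = incorrect or skipped (both count in the denominator),
    NaN = item not on that candidate's form (excluded entirely).
    """
    presented = (~np.isnan(responses)).sum(axis=0)
    correct = np.nansum(responses, axis=0)
    return correct / presented

# Hypothetical toy data: rows are candidates, columns are items.
reference = np.array([[1, 0, 1],
                      [1, np.nan, 0],
                      [0, 1, 1]], dtype=float)
target = np.array([[1, 1, 1],
                   [1, 1, 0],
                   [np.nan, 1, 1]], dtype=float)

x = item_pvalues(reference)  # horizontal axis of the scatterplot
y = item_pvalues(target)     # vertical axis
# Each (x[i], y[i]) pair is one point per item; any scatterplot
# function will do, e.g. matplotlib's plt.scatter(x, y).
```

Note that scoring a skipped item as 0 while dropping unseen items to NaN implements the numerator/denominator rule from Step 2 directly.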

If the points roughly line up on the diagonal, running from bottom left to top right, there's nothing to see here. Your target candidates are just like your reference candidates. It's okay if there's some variation. It's normal for some items to be a little easier and some to be a little harder, depending on where you were trained. There's also some "spread" in the scatterplot due to pure chance. Figure 1 shows an example of a "no effect" plot.

 

Figure 1. A scatterplot of the probability of responding to each item correctly, for a candidate in the suspect group (vertical axis) compared to one in the reference or non-suspect group (horizontal axis). The candidates in this plot are equivalently prepared.

 

By the way, if you have multiple forms, with or without overlapping items, there's no need to plot them separately -- at least, not initially. When you get deep into an investigation, you might want to, but only if you've got good reason to think forms matter. Up front, however, all of the items can be lumped together. I've found that this technique is not very sensitive to the number of candidates involved, unless that number is very small (like ten). As the numbers of candidates get smaller, the plot "blurs" but doesn't lose its general shape.

The Number Seven
On this type of scatterplot, the teaching of operational items produces a distinctive pattern, one that resembles the number 7.

When candidates memorize specific operational items rather than learning the material being tested, those items not only have a high probability of a correct response; they also lose their relative difficulty or easiness. A memorized item is only as hard as recognizing the correct response, regardless of how hard or easy the item was originally. This causes the memorized items not only to appear at the top of the scatterplot, but to lie close to horizontal -- all equally easy.

Any items that have not been memorized, such as new or pretest items that may not yet have leaked, will still appear on the diagonal, provided the candidates' basic skills don't actually differ. (I'll discuss cases where the suspect candidates are actually more or less qualified later.)

Now, I'm not one of those who assume that candidates will study operational items until they achieve 100% perfect performance on them; based on reports from the field, 80% or 90% performance on memorized items seems more likely. I assume that a candidate who forgets a memorized item will fall back on working the item out with her own skills. Then item difficulty does become a factor: the candidate will get an easy item right either way, but is likely to respond incorrectly to a difficult one. This introduces a very slight slope to the top bar of the 7. It's still easy to distinguish from the diagonal point cloud.
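This "imperfect memorization with fallback" behavior can be written down directly. Here is a minimal sketch (in Python rather than R; the Rasch-style response curve, the function name, and the 0.85 recall rate are illustrative choices of mine, not fixed parts of the paper's models):

```python
import numpy as np

def p_correct(theta, b, memorized, recall=0.85):
    """Probability of a correct response for one candidate/item pair.

    With probability `recall`, a memorized item is answered correctly
    regardless of difficulty; on the remaining (1 - recall) of trials,
    the candidate falls back on her own skill, modeled here with a
    simple Rasch (1PL) curve. Unexposed items use the skill curve only.
    """
    p_skill = 1.0 / (1.0 + np.exp(-(theta - b)))
    return np.where(memorized, recall + (1.0 - recall) * p_skill, p_skill)

b = np.array([-2.0, 0.0, 2.0])       # easy, moderate, hard items
mem = np.array([True, True, True])
p_mem = p_correct(0.0, b, mem)       # memorized: all near `recall`
p_raw = p_correct(0.0, b, ~mem)      # unexposed: difficulty matters
```

Because the fallback term still depends on difficulty, `p_mem` is not perfectly flat: the easy item ends up slightly above the hard one, which is exactly the slight slope in the top bar of the 7 described above.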

Figure 2 shows the general 7 shape, for the case where the suspect candidates had memorized 60% of the test, and had adequate-but-unexceptional preparation for the remainder of the items. The 7 is recognizable for a wide range of exposure rates. In Figure 3, the candidates had memorized 20%, whereas in Figure 4, they had memorized 90%.

"But what if the candidates have memorized the entire item bank?" you ask. Well, with 100% memorization, the number 7 mnemonic doesn't work anymore, because there is no diagonal (Figure 5). However, the visualization can still be valuable. The top-bar-only shape is still distinct from that produced by legitimate teaching of the content, as discussed in the next section.

 

Figure 2. The suspect candidates had preknowledge of 60% of the items in the bank. Those items form the top bar of the 7; the rest are still on the diagonal.

 

 

Figure 3. The suspect candidates had preknowledge of 20% of the items in the bank.

 

 

Figure 4. The suspect candidates had preknowledge of 90% of the items in the bank.

 

 

Figure 5. The suspect candidates had preknowledge of 100% of the items in the bank. There is no diagonal point cloud at all.

 

Other Stories
Up to this point, we have assumed that, other than the presence or absence of preknowledge of operational items, the candidates in the target (suspect) and reference (non-suspect) groups are equivalent. Let's consider a couple of cases where they're not.

First, what if the suspect candidates just plain know their stuff? Well, in that case, the items will appear above the diagonal on the scatterplot, up toward the top left. However, for reasonable degrees of better preparation or higher ability, they don't go very far up and left. Figure 6 shows better preparation on the order of one standard deviation (with the mildly higher variability I'd expect from a longer or more intense training course); the candidates average 84th percentile in the overall population, but the upward curve of the point cloud is subtle. Even with two standard deviations better preparation (Figure 7), the upward movement isn't as dramatic as in the case of operational item preknowledge (compare Figure 5).

More to the point, the point cloud curves up, concave down. It doesn't make a 7 shape, or even look like the isolated top bar of a 7. The top-left edge, or the invisible "envelope" around it, looks to me more like a receiver operating characteristic (ROC) curve. The top right and bottom left corners are still anchored -- that is, easy items are easy for everyone, and truly impossible items are still impossible -- while the items at moderate levels of difficulty shift the most.
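The anchored-corners claim can be checked numerically. Under a simple Rasch model (again an illustrative choice, in Python rather than R), raising the group's ability by one standard deviation moves moderate items the most and barely moves the extremes:

```python
import numpy as np

def rasch(theta, b):
    """Rasch (1PL) probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

b = np.linspace(-4.0, 4.0, 9)    # item difficulties, easy to hard
p_ref = rasch(0.0, b)            # reference group, average ability
p_tgt = rasch(1.0, b)            # target group, 1 SD more able

shift = p_tgt - p_ref
# Every point sits above the diagonal (shift > 0), but the corners
# stay anchored: the easiest and hardest items move far less than
# the moderate ones, producing the concave-down, ROC-like bend.
```

Contrast this with memorization, where the upward shift is largest for the *hardest* items; that difference in which items move is what makes the two patterns visually distinct.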

Now, it is certainly possible to confuse the issue. Let's say some unusually well-prepared candidates are shown a small number of operational items -- perhaps in one of those cases where a teacher doesn't know that using "old" items is wrong, or perhaps because a few items showed up on Facebook. The resulting scatter of items may not display either characteristic pattern, but rather something indistinct, as shown in Figure 8.

One might, however, worry less about a failure to detect cases where highly-prepared candidates saw a small number of items, because they would have passed anyway. It is true that someone broke a rule; it is true, and more important, that some items have leaked and will probably reach less well-prepared candidates in time. But it's important to keep perspective.

In the test security literature, it's more classic to consider the opposite case: the worst candidates are the ones who seek and obtain item preknowledge. Alternatively, a test prep course might teach operational items and not bother training the candidates on the actual material. These candidates are definitely ones we want to catch, as they might pass the test without adequate preparation and, afterward, put someone's safety at risk due to their lack of actual qualifications.

If the candidates are indeed woefully underprepared, except for their item preknowledge, the 7 mnemonic suffers, but the resulting pattern is still recognizable. The diagonal point cloud curves down, concave up, and the left side of the top bar dips.

 

Figure 6. The suspect candidates had no preknowledge, but were unusually well-prepared for the exam. Their ability levels are higher (1 SD, or a group average at the 84th percentile overall). They are also more variable (50% greater variation), because top students are likely to benefit more from general (non-targeted) instruction. Visually, note that the point cloud has been "bent upward" and is slightly concave-down; it does not reach the upper left corner of the plot, and it still extends down to the left. That is, difficult items are still relatively difficult.

 

 

Figure 7. The suspect candidates had no preknowledge, but were exceptionally well-prepared for the exam. Their ability levels are very high (2 SD, or a group average at the 97th percentile overall). Even with extreme high ability, the point cloud doesn't attain the nearly-horizontal, consistently-narrow appearance produced by 100% item preknowledge. The easy and moderate items have shifted up higher, but there is still a thin scatter of items in the lower left quadrant of the scatterplot. That is, even for a group of very high ability candidates, the most difficult items are still difficult.

 

 

Figure 8. The suspect candidates were much better prepared overall, and also had preknowledge of 10% of the items in the bank. In this case, the top bar of the 7 has been reduced to an indistinct point atop the curved-up point cloud. It is not obvious and may not be recognizable at all.

 

 

Figure 9. The suspect candidates had preknowledge of 60% of the items in the bank, but were otherwise poorly prepared. Visually, the 7 has turned into a fish-hook. The point cloud for unexposed items sags, concave-up.

 

Conclusion
This simple scatterplot method can be used to recognize and distinguish a variety of cases of legitimate and illegitimate test preparation.

If you have a real case, and want to test out your theory of what might be happening behind the scenes, it can be helpful to produce simulated data from an appropriate model, and see what patterns it can make visually. Repeated draws of random data can give you an idea of the range of shapes possible from a given data generating process. For that reason, I've included (below) the simulation code that I used and modified to generate all of the scatterplots in this paper. Please feel free to use and tweak it as appropriate for your own situation. Good luck!


 

Simulation Code

The following code is written for the statistical package R. Only the base install is required.