White Paper Series: Current Issues in Test Development

Stretching Your Item Bank: Cloning, Sampling, and Item Assembly

Anne Thissen-Roe, Ph.D.
April 11th, 2016

It is a sad reality of testing, to which most of us in the industry are long since resigned: Nearly every item bank out there could use more items, but producing new items is expensive. The good news is, there are technical solutions that can help.

This paper describes methods such as automatic item generation and assessment engineering, which can rapidly populate a large item bank. These methods may require greater up-front investment, but result in large numbers of related items, which may then be exchanged to create numerous alternate forms of a test.

Why Do We Need More Items?
First, let's take a step back. Why do we need more items, provided we have enough to compose a test form in the first place?

The main class of reasons for increasing the size of an item bank has to do with test security, in one way or another. For tests which are taken on paper in a classroom setting, it is often desirable to have enough different forms that no two adjacent students, and no two diagonally adjacent students, share the same form. In a grid, this can be accomplished with four forms, or as few as two if front-to-back adjacency is not a concern. Large-scale educational testing programs sometimes distribute six or eight test forms in repeating sequence so that odd numbers of rows or non-grid seating do not create pockets of adjacent identical forms. Computer-based testing programs may go one step further and present each student with an individualized test form, through technologies such as LOFT (Linear On-the-Fly Testing) or CAT (Computerized Adaptive Testing). LOFT and CAT depend on having several forms' worth of available items in a ready and waiting item pool.

Similarly, it is useful for a credentialing program that allows several testing opportunities per year to use different forms. This reduces the probability that candidates testing during a later window will benefit from a "brain dump" from earlier candidates, a lost or stolen booklet, or other more insidious forms of content exposure. In addition, candidates who retest are tested on the same content outline, but not the same specific items.

A testing program may have a regular (e.g., annual) update cycle in place, in order to ensure that the content that appears on the test reflects current practice in the profession, current science, current events, or other elements. A kindergartener today probably wasn't born in 2003, the solar system doesn't have nine planets anymore, and regulations that "will" be implemented in 2013 are passé. However, every now and then, the need for an update comes as a surprise to us, and is urgent as well. In that case, it's good to have a reserve of potential replacement items ready to go.

All of these factors point to the need to have on hand a substantial bank of fully-vetted, fully-edited items, preferably in categories of nearly interchangeable items for the easy creation of equivalent alternate forms.

Item Cloning: An Example
We can often expand or update an item bank by making minor revisions to individual items, while leaving substantial portions of text or structure in place as a phrasal template. For example, that kindergartener born in 2003 could easily be replaced with one born in 2010 or 2011. When we create these variants systematically, it is known as item cloning. Here is an example, using an arithmetic word problem for simplicity:

Noah has 3 toy cars and 4 toy trucks. How many toys does Noah have?

This item has several arbitrary elements to it, which can be easily swapped out for equivalents, to produce isomorphs of the item. It's like filling in the blanks in a Mad Libs game:
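
A minimal Python sketch of the filled-in template follows. The names and the toy cars/toy trucks pair come from this paper; the remaining object triplets, the blank numbering, and the range of addends are illustrative assumptions, chosen so that the combinations multiply out to the count discussed next.

    import itertools

    # Phrasal template for the single-digit addition word problem above.
    # The blank numbering (1, 2a/2b, 3a/3b, 4) is assumed for illustration.
    TEMPLATE = ("{name} has {num_a} {obj_a} and {num_b} {obj_b}. "
                "How many {category} does {name} have?")

    # (1) Child's name: the top two boy and girl names of 2014 (SSA).
    NAMES = ["Noah", "Liam", "Emma", "Olivia"]

    # (3a, 3b, 4) Object pairs with their encompassing category.
    # The first pair comes from the example item; the rest are assumed.
    OBJECTS = [
        ("toy cars", "toy trucks", "toys"),
        ("cats", "dogs", "pets"),
        ("apples", "oranges", "pieces of fruit"),
        ("pencils", "crayons", "school supplies"),
        ("daisies", "roses", "flowers"),
    ]

    # (2a, 2b) Low single-digit addends; the exact range is an assumption.
    ADDENDS = range(2, 7)

    def generate_isomorphs():
        """Yield (item text, scoring key) for every combination of substitutions."""
        for name, (obj_a, obj_b, category), num_a, num_b in itertools.product(
                NAMES, OBJECTS, ADDENDS, ADDENDS):
            text = TEMPLATE.format(name=name, num_a=num_a, obj_a=obj_a,
                                   num_b=num_b, obj_b=obj_b, category=category)
            yield text, num_a + num_b

    items = list(generate_isomorphs())
    print(len(items))   # 4 names x 5 object triplets x 5 x 5 addends = 500
    print(items[0][0])  # "Noah has 2 toy cars and 2 toy trucks. ..."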

Combining just the listed substitutions, we now have 500 word problems calling for single-digit addition. That's plenty of items to keep students sitting next to each other from copying. Yet it would be trivial to construct more. Objects and categories are quick to brainstorm, and 3a and 3b can generally switch places without effect. The four names given are the two most popular boy names and the two most popular girl names of 2014 in the United States, as reported by the Social Security Administration; the SSA is quite happy to provide the top thousand names of each gender, for each of more than a hundred years.

Most importantly, the two addends can be exchanged for any other number that creates a problem of appropriate difficulty for the test-taker. For a beginning student, the numbers might be low single digits as shown. Later the same year, the student might answer the same question, but with the number range extended up to 9. Double-digit addition works just as well for most of the object categories, though perhaps cats and dogs would need to be removed.

The given template doesn't work well for fractions or decimal numbers, and it only covers addition. Similar templates could be constructed for other operations; for example, a phrasal template for the subtraction of fractions:
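
One possible form of such a template is sketched below. The wording, the names, and the parameter ranges are illustrative assumptions; only proper fractions are produced, with matched or mismatched denominators as the single difficulty lever exposed.

    import random
    from fractions import Fraction

    # Assumed wording for a subtraction-of-fractions word problem.
    TEMPLATE = ("{name} had {start} of a pizza and gave away {given} of a pizza. "
                "How much pizza does {name} have left?")

    NAMES = ["Noah", "Liam", "Emma", "Olivia"]
    DENOMINATORS = [2, 3, 4, 6, 8]

    def generate_item(matched_denominators=True):
        """Build one isomorph and its scoring key, using proper fractions only."""
        d1 = random.choice(DENOMINATORS)
        d2 = d1 if matched_denominators else random.choice(DENOMINATORS)
        start = Fraction(random.randint(1, d1 - 1), d1)
        given = Fraction(random.randint(1, d2 - 1), d2)
        if given > start:               # keep the result non-negative
            start, given = given, start
        text = TEMPLATE.format(name=random.choice(NAMES), start=start, given=given)
        return text, start - given

    print(generate_item(matched_denominators=False))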

The items produced could be adjusted considerably in difficulty through the use of matched or mismatched denominators, proper and improper fractions, and mixed numbers.

Models of Item Structure
When we use the term item model, we usually mean a model of item response. We often mean a mathematical model relating an underlying individual difference (be it subject matter knowledge or enthusiasm for a presidential candidate) to the probability of producing a particular response. We might also mean a model in which the production of a correct response depends on the accomplishment of a series of contingent mental tasks, a model of the decision to skip an item and go on to the next one, or a model of the time required to respond.

Here, however, we need a different type of item model: a model of item structure, including but not limited to a phrasal template. The preceding example demonstrated item templating (Bejar, 1996); the next step is to introduce a model of the cognitive processes involved in responding to a particular item (Embretson, 1999). The cognitive features of an item presentation are identified (or specified in advance). The features can then be used to predict the item's difficulty, and other parameters of the item response model.
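
As a rough illustration of that last step (a toy model, not one proposed in this paper), an item's difficulty might be predicted as a weighted sum of coded cognitive features, so that newly generated isomorphs can be assigned provisional difficulties before they are ever administered. The features and weights below are invented:

    # Hypothetical coded features of an isomorph, and a toy linear model
    # relating those features to an IRT-style difficulty parameter.
    FEATURE_WEIGHTS = {              # invented weights, for illustration only
        "carrying_required": 0.9,    # regrouping across a place value
        "double_digit_addends": 0.7,
        "abstract_category": 0.3,
        "extra_distractor_objects": 0.2,
    }
    INTERCEPT = -1.5                 # also invented

    def predicted_difficulty(features):
        """Linear prediction of item difficulty from coded cognitive features."""
        return INTERCEPT + sum(FEATURE_WEIGHTS[name] * value
                               for name, value in features.items())

    easy = {"carrying_required": 0, "double_digit_addends": 0,
            "abstract_category": 0, "extra_distractor_objects": 0}
    hard = {"carrying_required": 1, "double_digit_addends": 1,
            "abstract_category": 1, "extra_distractor_objects": 1}
    print(predicted_difficulty(easy), predicted_difficulty(hard))   # -1.5 and about 0.6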

Item templates are not limited to arithmetic word problems, although mathematics is a fertile subject for their development. They are effectively used in most if not all educational subjects, and are useful (but currently underutilized) in certification testing. In a certification context, many job tasks involve a consistent set of steps which a skilled professional applies to a variety of different cases. Such tasks may be genuinely quantitative, or may turn on a decision based on non-numeric case features. One task statement, resulting from a job analysis, often represents several, if not many, possible branchings. For example, a job analysis might state that a line cook at a diner must be able to "portion and plate an entrée listed on the menu, according to Standard Operating Procedure." The diner's breakfast menu may offer a customer any combination of four items out of a list of thirty, resulting in over eight hundred thousand possible breakfasts! The skilled professional successfully executes the branching and carries out the appropriate task isomorph. Item isomorphs can be generated once the branching is understood.

Automatic item generation can be used with items that call for comparative judgment, although the phrase item assembly may be more apt. In personality assessment, pairwise preference items and multidimensional forced choice items are groups of two or more statements, which a candidate must rank or choose between. Item models are available that allow for on-the-fly assembly of pairwise preference or multidimensional forced choice items. Statements are sampled from pools representing the dimensions to be measured, and whole items are assembled from the sampled statements (Stark & Chernyshenko, 2007).
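
A minimal sketch of such on-the-fly assembly for a pairwise preference item, assuming small statement pools keyed by dimension; the dimensions and statements are placeholders, not operational content:

    import random

    # Hypothetical statement pools, keyed by the dimension each statement measures.
    STATEMENT_POOLS = {
        "conscientiousness": [
            "I double-check my work before turning it in.",
            "I keep my workspace organized.",
            "I finish tasks ahead of deadlines.",
        ],
        "teamwork": [
            "I volunteer to help coworkers who are behind.",
            "I share credit for group accomplishments.",
            "I ask for input before making decisions that affect others.",
        ],
    }

    def assemble_pairwise_item(dim_a, dim_b, rng=random):
        """Sample one statement from each pool to assemble a pairwise preference item."""
        return {
            "stem": "Which statement describes you better?",
            "options": [(dim_a, rng.choice(STATEMENT_POOLS[dim_a])),
                        (dim_b, rng.choice(STATEMENT_POOLS[dim_b]))],
        }

    print(assemble_pairwise_item("conscientiousness", "teamwork"))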

Notice that in this case, the item stem is fixed; only the response options vary from isomorph to isomorph. This differs considerably from the multiple choice format typically used for knowledge items.

There are other cases in which item assembly models may be used to select between a variety of response options, while the stem remains fixed. This approach, Automatic Response Option Sampling, applies to a variety of intrinsically multidimensional multiple choice items, such as Situational Judgment Items (SJIs). SJI stems use words, images, video or other media to present job-relevant situations which call for professional judgment. An example might be an emergency that arises on the job. A candidate may then be asked to choose the "most effective" or "most like me" out of several possible responses to the situation. Under an Implicit Trait Policy (ITP) model (Motowidlo, Hooper & Jackson, 2006), the response options each reflect the contextual expression of a personality trait, work style or preference. A given stem may not have an objectively correct response, but nevertheless, subject matter experts may be able to agree on the relative effectiveness of each of a set of responses. However, even a small change to the presented situation may have a big effect on the effectiveness ratings. Because the experts' judgment rests on a complex process, swapping out response options is easier and more robust than attempting to model the whole item.

Other examples of items suitable for Automatic Response Option Sampling include some cognitive diagnostic items, and, as a special case, multidimensional forced choice items.
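
A sketch of how Automatic Response Option Sampling might work for the SJI case described above. The stem, the trait tags, and the expert effectiveness ratings are all invented for illustration:

    import random

    # Fixed SJI stem; only the response options are sampled.
    STEM = ("A kitchen fire alarm goes off during the dinner rush. "
            "What would you most likely do first?")

    # Hypothetical option pool. Each option is tagged with the trait it expresses
    # (an Implicit Trait Policy view) and a mean expert effectiveness rating.
    OPTION_POOL = [
        {"text": "Direct staff and customers toward the exits, then check the alarm panel.",
         "trait": "conscientiousness", "effectiveness": 4.6},
        {"text": "Keep cooking; the alarm is probably a false positive.",
         "trait": "risk tolerance", "effectiveness": 1.4},
        {"text": "Ask the shift lead what to do.",
         "trait": "deference", "effectiveness": 2.8},
        {"text": "Calmly reassure customers while a coworker investigates.",
         "trait": "emotional stability", "effectiveness": 3.9},
        {"text": "Grab the extinguisher and look for the source yourself.",
         "trait": "initiative", "effectiveness": 3.2},
    ]

    def assemble_sji(n_options=4, seed=None):
        """Sample response options to build one isomorph of the situational judgment item."""
        options = random.Random(seed).sample(OPTION_POOL, n_options)
        return {"stem": STEM, "options": options}

    for option in assemble_sji(seed=1)["options"]:
        print(option["effectiveness"], option["text"])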

Although item isomorphs are often kept off the same test form for reasons of content balance or clue-ins, there is no hard and fast rule that every item on a form must come from a different template. Consider a template for single-digit addition without a word-problem context:
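
Presumably the template is nothing more than two single-digit addends and a blank for the sum; a minimal sketch that generates all one hundred isomorphs in random order:

    import itertools
    import random

    def single_digit_addition_form(seed=None):
        """All 100 isomorphs of '__ + __ = ?' with addends 0-9, in random order."""
        items = [f"{a} + {b} = ?"
                 for a, b in itertools.product(range(10), repeat=2)]
        random.Random(seed).shuffle(items)
        return items

    form = single_digit_addition_form(seed=2016)
    print(len(form))    # 100
    print(form[:3])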

A form composed of all one hundred possible isomorphs, presented in random order, makes an effective speed test for a second- or third-grader.

The concept of assessment engineering (Luecht, 2008) has been introduced to describe a process of test development that begins from a model of cognitive development that encompasses all item content to be tested. This all-encompassing model is necessarily more complex than a model of a particular item family.

In certification and licensure, such a model might include independent axes for the development, from novice to master level, of knowledge and practical skills in several job-relevant areas -- coupled with a population model that acknowledges the frequent co-development of knowledge and skill among working professionals.

Such a model, a construct map, must account for skills which are neither developed nor used at the novice level but are introduced later, as well as skills which are more related to each other at some levels than at others. These discontinuities and dimensionality changes are common in professional development. I'm told that among pastry chefs, the production of chocolates is not taught to novices until they have proven themselves in other areas -- usually with years of work. Novice airplane pilots must be certificated to fly under VFR conditions (that is, they can see where they're flying) before they begin training for IFR conditions (flying on instruments, such as at night or in bad weather). Managerial, financial, and business skills are introduced at mid-levels of development in many technical and skilled professions.

Once this high-level construct map is constructed, task and evidence models relate the levels of knowledge and skill to specific demonstrations that may be expected of a novice, mid-level practitioner, expert or master on various trajectories through the map. Item templates and item families nest within the construct map, and are supported by task and evidence models. For example, the stem of an item might be derived from a task model, while the relationship between an evidence model and the response options or scoring rubric defines the expected level of difficulty of the item.
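
One way to picture this nesting is as linked data structures: a construct map holds dimensions and task models, each task model holds item templates appropriate to a level, and evidence models tie observed responses back to difficulty expectations. The sketch below is schematic only, not a representation of any operational assessment engineering system; the culinary example reuses the line cook task statement from earlier in this paper.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class EvidenceModel:
        """What an observed response says about a claim, e.g., a scoring rubric."""
        observable: str
        rubric: str
        expected_difficulty: float

    @dataclass
    class TaskModel:
        """A family of tasks a practitioner at a given level should perform."""
        task_statement: str                  # e.g., from a job analysis
        level: str                           # novice, mid-level, expert, master
        item_templates: List[str] = field(default_factory=list)
        evidence: List[EvidenceModel] = field(default_factory=list)

    @dataclass
    class ConstructMap:
        """High-level map of knowledge and skill development for a profession."""
        dimensions: List[str]
        task_models: List[TaskModel] = field(default_factory=list)

    line_cook_map = ConstructMap(
        dimensions=["food preparation", "food safety", "business operations"],
        task_models=[
            TaskModel(
                task_statement="Portion and plate an entree listed on the menu, per SOP.",
                level="novice",
                item_templates=["Plate {entree} for an order with {modifications}."],
                evidence=[EvidenceModel("plated entree", "matches SOP portion and layout", -0.5)],
            ),
        ],
    )
    print(line_cook_map.task_models[0].task_statement)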

A related notion occurs in games used for assessment. Electronic game design has always resembled assessment in a sense. Leveled games attempt to match tasks to the demonstrated mastery of the player, so that a player is challenged, but not overwhelmed. Some game types, such as RPGs (role-playing games) also attempt to maintain a coherent story, or similar entertainment elements; this requires the presentation of varied challenges of similar difficulty level, in a structured manner. The mastery which the player demonstrates may be multidimensional, with the same propensity for discontinuities as a profession; the game begins to resemble an operationalized construct map.

Recently, the testing industry has begun to use the reverse relationship, and build games with the purpose of assessing, rather than entertaining, individuals. Gamified tests can be high-fidelity simulations of a professional context, or may be set in fictional worlds which are conducive to testing an abstract or academic construct, such as knowledge of Newtonian physics. However, the deep mapping of items and responses to tasks and evidence, and up to a map of construct space and development trajectories, is necessary to produce an engaging game and a coherent set of scores.

Uses and Benefits
The original intent of automatic item generation, and of assessment engineering, was to transition the traditional test development process into one with greater up-front cost but considerable economies of scale. Other uses and incidental benefits have emerged as operational programs have adopted generative methods and/or assessment engineering.

One such benefit occurs in tests that must be produced in more than one language. Traditionally, items were produced individually, and then laboriously translated into each additional language, with a strict quality control process involving multiple translators, psychometricians, and content experts in each language. The pilot testing and calibration processes, with their associated expenses, were multiplied by the number of languages involved.

In a generative system, however, an item family can be translated as a single unit. Consider, for example, our original phrasal template item from earlier in this white paper.

To translate the item into Spanish, for example, only the stem and the object triplets need be translated; the syntax is as consistent in one language as the other, although the order of variablized elements (blanks) and structural words is not exactly the same in Spanish as English:
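
A sketch of what the translated family might look like: one stem per language, object triplets translated as parallel units, and names and numbers carried over unchanged. The Spanish wording is illustrative, not a vetted operational translation:

    import itertools

    # One stem per language; the blank order and structural words may differ.
    STEMS = {
        "en": "{name} has {a} {obj_a} and {b} {obj_b}. How many {cat} does {name} have?",
        "es": "{name} tiene {a} {obj_a} y {b} {obj_b}. ¿Cuántos {cat} tiene {name}?",
    }

    # Object triplets translated as parallel units (illustrative pairings only).
    # A fuller system would also handle grammatical gender, e.g., "¿Cuántas...?"
    # for feminine Spanish categories.
    OBJECTS = {
        "en": [("toy cars", "toy trucks", "toys"), ("cats", "dogs", "pets")],
        "es": [("coches de juguete", "camiones de juguete", "juguetes"),
               ("gatos", "perros", "animales")],
    }

    NAMES = ["Noah", "Emma"]      # shared across languages
    ADDENDS = range(2, 7)         # shared across languages

    def generate(language):
        stem = STEMS[language]
        for name, (obj_a, obj_b, cat), a, b in itertools.product(
                NAMES, OBJECTS[language], ADDENDS, ADDENDS):
            yield stem.format(name=name, a=a, obj_a=obj_a, b=b, obj_b=obj_b, cat=cat)

    print(next(generate("en")))
    print(next(generate("es")))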

It is worth noting that the reading level of items generated from translated substitutions is not guaranteed to be equivalent. A translation process for an item family should involve careful evaluation to make sure all the possibilities in the new language are appropriately difficult, not obscure, and not linguistically implausible -- as could happen in our template if the encompassing category of a triplet had to be made so much more general in the target language that it was no longer a meaningful category at all.

Cognitive or process models of complex items can partially or completely replace translation of items with the production of functionally equivalent items, that is, members of the same family, in other languages or with other localizations:

What country do you live in? Canada!

¿En qué país vive usted? ¡España!

In this simple example, when the item is exported from Canada into Spain, both the language and the correct answer change, although the item is essentially the same.

Consider, for another example, an English-language SJI wherein an American doctor must help the family of a palliative care patient to make a difficult medical decision. Appropriate dialogue and nonverbal cues are used to establish the situation in detail. For a doctor in Japan, testing in Japanese, the most appropriate dialogue will not be a direct translation of its English counterpart, and the nonverbal cues and social interactions described may be considerably different. (Of course, the relative value of the response options, or the scoring rubric, might differ as well; validity would need to be established in Japan. An assessment engineering model for a situational judgment test spanning languages and cultures would need to represent substantial complexity.)

Gierl, Fung, Lai and Zheng (2013) described a cognitive model developed for the purpose of generating items for surgeons in English and Chinese, related to the diagnosis and treatment of hernias. In this case, the target language was implemented as a substitution that applied to the whole test. Not only did the language of presentation vary, but the number of valid phrasings for certain problem features differed between English and Chinese. The underlying cognitive process, knowledge, and skill were unaffected.

Another emergent use of automatic item generation is to adapt or assort test elements at a finer grain than the item level. The convergence of item generation systems with LOFT and CAT is discussed in the next section.

Ecosystem
Operationally, automatic item generation dovetails well with LOFT (Linear On-the-Fly Testing) and CAT (Computerized Adaptive Testing). In both LOFT and CAT, the items a candidate receives are independently selected for each individual candidate from a large pool of available items. The difference between the two is that in LOFT, the items are selected in accordance with a set of predetermined rules, but do not depend on any of the candidate's responses. In CAT, each item response helps to determine the items that will be subsequently administered. CAT can be used to administer shorter tests with the same measurement quality; LOFT forms look like traditional forms, except that each one is unique, or nearly so.

One perspective on CAT is that it avoids asking redundant questions of the candidate. If you are conducting a medical intake, and your patient has told you that he is male, your next question should not be, "Are you pregnant?" In the same way, it is uninformative to give a very easy item to a candidate you already know to be a high performer, or to give a very hard item to a candidate who has been making mistakes on the easy ones. CAT systems respond to information about the candidate to select informative, rather than uninformative, items.

LOFT and CAT systems are often constrained in order to balance item content between domains or problem types. Traditionally, in CAT, a compromise must be made between selecting the most informative set of items and meeting content balance requirements. When automatic item generation is used, however, isomorphs of a single item typically meet many or all of the same content balance constraints, while systematically varying in difficulty. When item generation is used, the item pool is not just larger; it is also more evenly distributed across difficulty within each content area.
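
A minimal sketch of that interplay, assuming a two-parameter logistic (2PL) response model: isomorph families are tagged with a content domain and vary in difficulty within it, and the selector takes the most informative eligible item subject to a simple content cap. All parameter values are invented:

    import math

    def information_2pl(a, b, theta):
        """Fisher information of a two-parameter logistic item at ability theta."""
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        return a * a * p * (1.0 - p)

    # Hypothetical pool: each isomorph family sits in one content domain but
    # spans a range of difficulties (all parameter values are invented).
    POOL = [
        {"family": "single-digit addition", "domain": "arithmetic",  "a": 1.2, "b": -1.5},
        {"family": "single-digit addition", "domain": "arithmetic",  "a": 1.2, "b": -0.5},
        {"family": "fraction subtraction",  "domain": "arithmetic",  "a": 1.0, "b":  0.8},
        {"family": "addition word problem", "domain": "application", "a": 0.9, "b": -0.8},
        {"family": "addition word problem", "domain": "application", "a": 0.9, "b":  0.6},
    ]

    def select_next(theta, domain_counts, max_per_domain=2):
        """Most informative isomorph whose content domain is not yet filled."""
        eligible = [item for item in POOL
                    if domain_counts.get(item["domain"], 0) < max_per_domain]
        return max(eligible,
                   key=lambda item: information_2pl(item["a"], item["b"], theta))

    print(select_next(theta=0.4, domain_counts={"arithmetic": 1, "application": 0}))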

In multidimensional non-cognitive assessment, pairwise preference items, multidimensional forced choice items, or situational judgment items may be used along with an item assembly model, in order to allow a CAT to select for particular dimensions requiring additional measurement precision (Mangos & Hulse, 2016; Thissen-Roe & Gunter, 2016). This adaptation can occur below the item level, for added efficiency of testing.

Review
With the introduction of computers into test development and administration, it is no longer necessary to limit ourselves to a tradition of extensive effort in developing each individual item "from scratch." Item cloning, item templating, item assembly, response option sampling, and assessment engineering are just a few methods that stretch and multiply a merely sufficient item bank into a vast and rich one. Further, these methods join up well with flexible computerized administration methods such as LOFT and CAT.

An investment of up-front test development effort, in order to produce item templates, response option pools or engineered assessments, can pay off with years of sustainability and scalability for your testing program.

If you would like to learn more about Automatic Item Generation and Assessment Engineering, please join us at the Annual Meeting of the Society for Industrial and Organizational Psychology (SIOP), in Anaheim, CA. The symposium "Modeling Item Characteristics for Automatic Item Generation" will be held on Friday, April 15th, 2016, at 8:30 am, in room 201D of the Anaheim Convention Center.


References

Bejar, I. I. (1996). Generative response modeling: Leveraging the computer as a test delivery medium (RR-96-13). Princeton, NJ: Educational Testing Service.

Embretson, S. E. (1999). Generating items during testing: Psychometric issues and models. Psychometrika, 64, 407-433.

Gierl, M. J., Fung, K., Lai, H. & Zheng, B. (2013). Using automated processes to generate test items in multiple languages. Paper presented in the symposium "Advances in Automatic Item Generation," at the Annual Meeting of the National Council on Measurement in Education, San Francisco, CA.

International Personality Item Pool. http://ipip.ori.org/

Luecht, R. M. (2008). The application of assessment engineering to an operational licensure testing program. Invited paper presented at the Association of Test Publishers Annual Meeting, Dallas.

Mangos, P. M. & Hulse, N. (2016). Real-time assessment construction for predicting effectiveness in human-machine teams. Paper to be presented in the symposium "Modeling Item Characteristics for Automatic Item Generation," at the Annual Meeting of the Society for Industrial and Organizational Psychology, Anaheim, CA.

Motowidlo, S. J., Hooper, A. C. & Jackson, H. L. (2006). A theoretical basis for situational judgment tests. In J. A. Weekley & R. E. Ployhart (Eds.), Situational judgment tests: Theory, measurement, and application (pp. 57-82). Mahwah, NJ: Erlbaum.

Social Security Administration (2015). Popular baby names. https://www.ssa.gov/OACT/babynames/

Stark, S. & Chernyshenko, O. S. (2007). Adaptive testing with the multi-unidimensional pairwise preference model. In D. J. Weiss (Ed.), Proceedings of the 2007 GMAC Conference on Computerized Adaptive Testing. http://www.psych.umn.edu/psylabs/CATCentral/

Thissen-Roe, A. & Gunter, S. (2016). Automatic response option sampling for situational judgment items. Paper to be presented in the symposium "Modeling Item Characteristics for Automatic Item Generation," at the Annual Meeting of the Society for Industrial and Organizational Psychology, Anaheim, CA.