White Paper Series: Current Issues in Test Development

Nuisance Multidimensionality: Practical Approaches, Solutions and Opportunities

Anne Thissen-Roe, Ph.D.
June 22nd, 2016

To license a professional for practice in a given jurisdiction, or to certify a professional as meeting certain standards, it is usually necessary to evaluate several performance areas. These are KSAOs (Knowledge, Skills, Abilities, and Other characteristics): job requirements that can be identified by a job analysis. A certification's application requirements might include a certain number of years of recent experience, completed education or training (a degree or certificate), a sufficient score on a written test of knowledge, and one or more demonstrations of hands-on or face-to-face skill in a practical test setting. The successful candidate shows competence in all areas.

Within such a system of requirements, each numeric test score which is compared to a pass/fail standard must be reliable enough to support the pass/fail decision. For written tests made up of many individual items, reliability can usually be achieved most efficiently by a unidimensional test. A unidimensional test measures one coherent element of job performance, such that candidates can be definitively rank-ordered. The idea is that a written exam should test one thing, and test it well.

However, as the original Thissen in testing recently pointed out, when you ask whether a particular, actual test is unidimensional, the answer is invariably, "No." (Thissen, 2016)

The perfectly unidimensional test is an ideal. Real tests have character: small clusters of extremely similar or closely related items, faultlines between question formats or content areas, old questions that some candidates have seen before, or that some novice candidates don't have the general life experience to answer. None of these help determine whether a candidate is qualified; they are nuisance multidimensionality.

Know It When You See It
What is nuisance multidimensionality, and what can be done about it? Chances are that if you have it, the first thing you'll notice is low internal consistency reliability, or low values of Cronbach's α; these are indices of measuring one thing well. Beyond that, multidimensionality has many faces. Here are some examples that allow you to visualize and recognize common situations: a catalog of symptoms.

Type 1: Integrated components. Sometimes tests are made up of multiple distinct components, such as a written test and a practical test, or a section of multiple-choice items and a section of free-response or performance items. It may seem desirable to allow very high performance on one section to compensate for not-quite-sufficient performance on another. This is done by adding up weighted scores on the sections, and then comparing the total to a pass/fail standard. This procedure is referred to as compensatory scoring. Whether it is a good idea depends on the specifics of the situation; I'll discuss this more in a forthcoming white paper. For now, let me just say that scores on the components often correlate only moderately. In that case, they are measuring different elements or dimensions of candidate performance, and the resulting combined score is therefore not unidimensional.

Imperfect correlation of component tests is the least surprising, and the easiest to visually diagnose, of all the types of nuisance multidimensionality covered in this paper. In Figure 1, some example plots are shown in which candidate scores on a practical test are plotted against those candidates' scores on the written test. The two-dimensional sprawl of each point cloud indicates lower correlation.

Figure 1. Scores on a written test and a practical test for the same certification may be strongly correlated (left panels) or only moderately correlated (right panels). It is harder to see the degree of correlation when one test is particularly easy (second row). The lower the correlation between the tests, the more they measure distinct dimensions of performance -- which may or may not be intended, or useful.
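
If you have paired scores on hand, the check itself is quick. Below is a minimal Python sketch; the simulated written and practical scores are stand-ins for your own candidate records, and every number in it is an illustrative assumption.

    import numpy as np

    # Purely illustrative: a shared ability drives both components, plus
    # component-specific noise. Real data would come from your records.
    rng = np.random.default_rng(0)
    ability = rng.normal(70, 10, size=500)
    written = ability + rng.normal(0, 6, size=500)     # written test scores
    practical = ability + rng.normal(0, 12, size=500)  # practical test scores

    r = np.corrcoef(written, practical)[0, 1]
    print(f"written-practical correlation: {r:.2f}")
    # The lower this correlation, the more the two components measure distinct
    # dimensions of performance, and the less unidimensional a combined score is.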

Type 2: Passages and scenarios. When items on a test come in obvious groups, such as sets of three or four questions associated with the same passage, scenario, or figure, it is likely that a candidate's performance is more consistent on questions within a group than between groups. There is a nuisance dimension of understanding that particular passage (or scenario or figure) that contributes to a candidate's correct or incorrect responses, and that is not aligned with performance on the remainder of the test. This is nuisance multidimensionality in its most classic form.

A test can have several item groups with associated nuisance dimensions. It can be the case that all items belong to a group, or that some items stand alone and others are grouped. Different groups of items within a test can be more or less aligned with the test as a whole.
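
To make the testlet pattern concrete, here is a small, purely illustrative simulation: each passage contributes its own nuisance dimension on top of overall knowledge, and items that share a passage end up more strongly correlated with one another than with items from other passages. All of the numbers are assumptions chosen for the sketch.

    import numpy as np

    rng = np.random.default_rng(1)
    n_candidates, n_passages, items_per_passage = 1000, 5, 4

    knowledge = rng.normal(0, 1, n_candidates)                      # the intended, primary dimension
    passage_skill = rng.normal(0, 0.8, (n_candidates, n_passages))  # nuisance: understanding of each passage

    def p_correct(knowledge, nuisance, difficulty=0.0):
        # Logistic response probability driven by knowledge plus the passage-specific nuisance.
        return 1.0 / (1.0 + np.exp(-(knowledge + nuisance - difficulty)))

    responses = np.zeros((n_candidates, n_passages * items_per_passage))
    for g in range(n_passages):
        for j in range(items_per_passage):
            responses[:, g * items_per_passage + j] = rng.binomial(
                1, p_correct(knowledge, passage_skill[:, g]))

    # Items that share a passage correlate more strongly with each other than with
    # items from other passages: the classic testlet signature.
    item_corr = np.corrcoef(responses, rowvar=False)
    within = item_corr[0, 1]                    # two items from the same passage
    between = item_corr[0, items_per_passage]   # items from different passages
    print(f"within-passage r = {within:.2f}, between-passage r = {between:.2f}")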

Type 3: Structural, intended or balanced content areas. Another form of multidimensionality, which may not be strictly nuisance multidimensionality, occurs when a test encompasses several content areas. These content areas might be the major domains of job responsibility as reported at the end of a job analysis. The test developer then structures the test blueprint around those domains, and balances their measurement according to their relative importance in job performance, so that the test reflects all of the facets of the job.

However, balanced measurement across several content domains still represents a departure from perfect unidimensionality. The test is not measuring one thing, but several. You can think of a content-balanced test as having domain-specific subtests pointing in somewhat different directions, and the test as a whole as pointing in (or measuring along) a compromise axis.

Mathematically speaking, each domain has its own measurement vector, which sets the direction and relative scale of a perfect domain score; actual (non-perfect) scores are proportionally shorter vectors in the same direction. The whole test has a measurement vector which is the vector sum of the domain vectors.
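
Here is a tiny numerical sketch of that vector picture, with made-up directions and lengths; the angle between each domain vector and the vector sum shows how far each domain sits from the test's compromise axis.

    import numpy as np

    # Illustrative two-dimensional trait space: axis 0 = "theory", axis 1 = "operations".
    # Each row is a domain's measurement vector; its length reflects relative weight/scale.
    domain_vectors = np.array([
        [0.9, 0.2],   # a domain loaded mostly on theory
        [0.3, 0.8],   # a domain loaded mostly on operations
    ])

    test_vector = domain_vectors.sum(axis=0)   # the whole test's compromise axis

    def angle_deg(u, v):
        cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))

    for i, d in enumerate(domain_vectors):
        print(f"angle between domain {i + 1} and the whole test: {angle_deg(d, test_vector):.1f} degrees")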

This is different from Type 2 multidimensionality in that no item is expected to align to the main dimension of the test. The job itself is a combination of factors, and performance doesn't align to any one; the test should be the same. Figure 2 shows this distinction graphically.

Figure 2. Alignment of the intended construct with the primary dimension of subsets of items. When a test is composed of two or more structural content areas, its measurement vector does not, and should not, align with any particular content area. It is a compromise vector.

Simply matching the number of items in each domain to the domain's relative importance does not guarantee that the score will depend on each domain in the intended proportion. For proportional item counts to result in a balanced score, the items must be similarly difficult across domains. Each domain may have some easy and some hard items, but those should also be distributed in a balanced way.

A common problem occurs when the knowledge required in some content areas is systematically more basic than in others. It is then nearly impossible to write items of similar difficulty in the basic and advanced areas.

Consider a domain structure consisting of four domains:

  1. Safety and Regulatory Compliance
  2. Theory
  3. Operations
  4. Ethics and Professionalism

This general structure occurs frequently in certification, as do several minor variations (e.g., versions with multiple technical content areas, or extensions that add domains at the supervisory and managerial levels). It is a synthesis of many domain structures, and should not be used as a substitute for a proper analysis of your profession.

Certifying authorities have a responsibility to ensure public safety, and so minimally competent performance for a certification often has a strong safety element. Safety often goes hand in hand with regulatory compliance, because regulations often address public safety issues as well. In our example, Domain 1: Safety and Regulatory Compliance really addresses minimal competence to practice. I've heard certification board members, looking at such a test, comment, "None of our candidates should get any of these questions wrong!" And indeed, the items are often quite "easy," in that nearly all candidates get them right. (They are not necessarily "easy" in another sense; I, as a non-member of a given profession, cannot guess the answer.)

On the other hand, it's common for certifying bodies to produce much harder tests of theoretical knowledge and/or operational procedure. Domain 2: Theory, in our example, can generally be described as "what gets taught in a formal training or school setting." Practitioners in our hypothetical profession accept that the minimally competent candidate knows "most" of what falls under Domain 2, while knowing "all" of it is reserved for expert or overqualified candidates. Therefore, it doesn't bother our hypothetical board, and its subject matter experts, to see candidates passing the test despite a score of, say, 60% in Domain 2.

Domain 3: Operations covers procedural and tacit knowledge often learned on the job. Across professions I've worked with, domains of this type encompass peripheral skills and tasks such as working with computers or office equipment, dealing with specialized software, jargon for record-keeping, interacting with business partners like suppliers and shippers, and troubleshooting various kinds of customer issues. These skills may be the sort that are picked up immediately (or you wash out), or they may be expertise slowly acquired over years -- or, in cases such as healthcare record-keeping or insurance billing, they may require laborious refresher study every few years due to changes in the field. This variation between jobs means that Domain 3: Operations may be a "basic" domain in one profession and quite difficult in another. In our example, it is a moderately difficult domain.

Domain 4: Ethics and Professionalism is typically a small part of the test, but considered essential to include. It isn't interchangeable with Domain 1: Safety and Regulatory Compliance, but often rests on a few thorny issues specific to a profession. Again, boards and subject matter experts hold a view that any passing candidate "should get all these questions right," which corresponds to the generation of "easy" items. The Ethics area may be even trickier than the Safety area to write difficult questions for, because while it's usually possible to ask about an obscure regulation and still have an objectively correct answer, difficult Ethics questions are prone to be debatable by experts.

Returning to our example, let us say that the domain weights and average scores are as follows:

  1. Safety and Regulatory Compliance: weight=40%, average=90%
  2. Theory: weight=30%, average=50%
  3. Operations: weight=20%, average=60%
  4. Ethics and Professionalism: weight=10%, average=97%

Because the candidates' scores are much higher, and vary considerably less, on Domain 1, the effective weight of Domain 1 is much lower than the intended 40%, while Domains 2 and 3 each figure much more prominently than intended in the total score. Figure 3 shows our example graphically. You can compare the intended content distribution, at the top right, with the effective content distribution, after subtest difficulty has been accounted for, below it.

Figure 3. When the items in some content domains (Domains 1 and 4 here) are much easier than others (Domains 2 and 3), scores on the test as a whole are more reflective of the harder domains.

Recall that you can think of a content-balanced test as having domain-specific subtests pointing in somewhat different directions, and the test as a whole as pointing in (or measuring along) a compromise axis. The compromise axis resulting from the intended content distribution is matched to the axis of job performance by the job analysis. The compromise axis resulting from the effective content distribution is "tipped" toward the difficult areas, Domains 2 and 3.
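
To put rough numbers on that tipping, here is a sketch of an effective-weight calculation. The standard deviations are assumptions added for illustration (the example above gives only averages), and the calculation ignores correlations between domains, so treat it as a back-of-the-envelope approximation rather than a formal analysis.

    # Intended blueprint weights, plus illustrative score standard deviations (the
    # example gives only averages; these SDs are assumptions made for the sketch).
    domains = ["Safety/Regulatory", "Theory", "Operations", "Ethics/Professionalism"]
    weights = [0.40, 0.30, 0.20, 0.10]   # intended weights
    sds = [0.06, 0.15, 0.12, 0.03]       # proportion-correct SDs; easy domains vary less

    # In a weighted sum, a domain's contribution to total-score variation scales
    # roughly with (weight x SD), ignoring correlations between domains.
    contributions = [w * s for w, s in zip(weights, sds)]
    total = sum(contributions)
    for name, w, c in zip(domains, weights, contributions):
        print(f"{name:22s} intended {w:.0%}, effective ~{c / total:.0%}")

Under these assumed numbers, the Safety domain's effective weight falls well below its intended 40%, while Theory contributes far more than its intended 30%.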

Type 4: Non-structural content areas. Content areas may be the domains that come out of a job analysis, but they may be unintended or artifactual as well. They may reflect individual difference variables in the candidate population rather than behaviors that occur on the job, or they may reflect job tasks that are so fundamental that they were not identified in the job analysis. As an example, there might be a dimension associated with all the items on a test that require reading a value out of a table, and another dimension associated with all the items that require reading a map.

Non-structural content area dimensions behave less like structural content areas and more like passage dimensions, except for being unexpected and therefore trickier to identify. Many items will typically have no associated non-structural content area dimension, because any underlying content they contain is essentially unique.

It's time for a word about differential item functioning, or DIF. DIF means that some (but, in practice, not all) of the items on a test have different measurement properties, such as difficulty or correlation with the total score, depending on which of two or more groups a candidate is from. For example, a particular item is much easier for candidates who come from one group than another. Usually, when we talk about DIF, the groups are protected groups and DIF is a particularly scary finding. It's one thing to discover that you've been inadvertently measuring the ability to read a map; it's entirely another to discover that you've been inadvertently measuring ethnicity!

That all said, DIF is a form of multidimensionality due to non-structural content areas: contextual knowledge, associations, or beliefs that differ between groups. "Associations or beliefs" aren't usually what a certification sets out to test, but then, DIF is by definition unintentional and unwanted. Even "knowledge" can be an inappropriate, ethnocentric description of the content involved. I recall a long-ago example in which a verbal analogy item depended on associating bananas with yellow. Some candidates came from areas where bananas were typically eaten green. It isn't "knowledge" to eat your bananas the way the test developer does.

DIF, along with the methods used to isolate the items causing it, applies to differences between groups other than protected groups -- and need not always hold a stigma. Consider, for example, isolating the items that are particularly difficult for those students who got their training from a particular school; that might lead to useful feedback for the instructors. For another example, applicants for a civilian license who had held its military equivalent might have a near-perfect collective record on some items; perhaps a military credential could earn a candidate a shorter test. In its pure math sense, DIF is just a way of identifying a secondary dimension when the candidates cluster in two categories on that dimension.
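
If you want to screen items for DIF between any two groups, protected or otherwise, one widely used tool is the Mantel-Haenszel procedure, which compares item performance between groups after matching candidates on total score. The sketch below is a bare-bones version; the variable names are placeholders for however you actually store your data.

    import numpy as np

    def mantel_haenszel_odds_ratio(item, total, group):
        """Common odds ratio for one studied item, matching candidates on total score.

        item:  0/1 responses to the studied item
        total: total test scores used as the matching variable
        group: 0 = reference group, 1 = focal group
        """
        item, total, group = map(np.asarray, (item, total, group))
        num, den = 0.0, 0.0
        for t in np.unique(total):
            stratum = (total == t)
            ref = stratum & (group == 0)
            foc = stratum & (group == 1)
            a = item[ref].sum()      # reference group, correct
            b = ref.sum() - a        # reference group, incorrect
            c = item[foc].sum()      # focal group, correct
            d = foc.sum() - c        # focal group, incorrect
            n = stratum.sum()
            num += a * d / n
            den += b * c / n
        return num / den if den > 0 else np.nan

    # An odds ratio near 1 means the item behaves similarly for both groups once
    # overall score is held constant. On the ETS delta scale, MH D-DIF is
    # -2.35 * ln(odds ratio); values far from 0 flag the item for review.

The same machinery works whether the grouping is a protected class, a training school, or prior military credentialing; only the interpretation changes.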

Another non-structural content area with implications for fairness is reading comprehension. If your candidates are a mix of native speakers and language learners, or have widely varied educational levels, and the profession being tested doesn't require reading sophisticated passages, you should probably avoid making your items challenging to read. Sometimes, however, it is inevitable that some items are more reading-intensive than others; those items are likely to form the high end of a non-structural factor.

So far, I've mostly discussed dimensions of a test in terms of spacing that occurs between items, as well as between candidates. When items are polytomous, that is, scored in more categories than just "right/wrong," they can be intrinsically multidimensional. Usually, intrinsic multidimensionality in polytomous items is unintended and non-structural. An example is response sets in Likert-type items, in which some candidates tend toward using more high (or low) categories; some use the highest and lowest categories freely while others tend to the middle. In another example, raters using a rubric or anchored holistic scale to score a free-response written item, performance item, or practical test may end up using one of the possible scores almost exclusively for a particular mistake. Except in expert evaluation, that mistake does not necessarily fall "between" the response behaviors leading to adjacent scores.

Intentional intrinsic multidimensionality in polytomous items occurs primarily in personality testing (multidimensional forced choice items or MFCs; some kinds of situational judgment items or SJIs) and cognitive diagnostic testing, and is generally out of scope for a certification testing paper. I will discuss the issues related to practical tests more thoroughly in a forthcoming white paper. If you are working with a multiple-choice test with each item scored right/wrong ("dichotomous items"), there is no need to worry about intrinsic multidimensionality at all.

Type 5: Question types. Even among dichotomous items, not all question types and formats are created equal. For any given item format and style of questioning, some candidates will find it easier to demonstrate their actual underlying knowledge than others. This is "method variance."

Some question type effects are better characterized as non-structural content dimensions, involving underlying skills such as map-reading. However, others have to do with skills that are more specific to taking the test. For example, particularly on their first encounter, many candidates found the analogy and antonym questions on old versions of the SAT to be non-intuitive, and easy to inadvertently answer entirely backwards -- rather like the game show Jeopardy, with its instructions to "answer in the form of a question."

Recognizable question type effects can occur in certification and licensure testing as well, especially where a question is designed as a low-fidelity simulation of a problem that would actually occur on the job, rather than a question about general background knowledge.

Question type effects typically act by hindering some candidates, particularly first-time candidates. This hindrance can often be eliminated with practice. On some tests, it may account for a substantial fraction of the "retest effect" that causes candidates to improve their scores on their second attempt, even without additional study.

It is often difficult to disentangle the effects of question types and formats from the content they cover, because a question type is often only applicable to one domain of content, or is tied to a non-structural content area such as map-reading or chart interpretation. If the same format of question can be used across content areas, the common method variance it introduces to all area-specific scores can artificially inflate the correlation between those scores. More likely, though, the question types are specific to structural and non-structural content areas, and the method variance contributes to the specific dimensions.

Type 6: States and temporal factors. Here's a concept that isn't always obvious to, or fully understood by, those of us with our primary training in psychology: the dimensionality of the test is not always the dimensionality of the tested constructs. Tests can be more unidimensional than their constructs, due to common method variance, or more multidimensional within the same construct space (as in the case of different question types for different content areas). They can also have entirely extra dimensions, artifact dimensions.

Artifact dimensions aren't just unintended measurement of real individual differences. They're introduced by the test itself. Here is an example of how an artifact dimension -- actually, more than one -- can emerge from the sequential position of items on a test.

Consider a long and arduous hypothetical certification test. Candidates have eight hours of seat time over the course of the day: four hours before lunch to complete the first 150-question section, and four hours after lunch to complete the second section, also 150 questions long. Let us simplify our case somewhat by pretending we have a perfectly unidimensional test to begin with, because the other forms of multidimensionality we've discussed aren't actually relevant in the development of our artifact dimensions.

Now let's pull a candidate from each of three "groups" defined by sets of certification-irrelevant personal characteristics. (These groups don't have strict membership; one can argue about where to draw the lines between members and non-members.)

Candidate #1 is a night owl, slow to wake up in the morning, and is not performing at her best on the first questions.

Candidate #2, by contrast, is caffeine-dependent, and does her best at the beginning of the test, right after she finishes her coffee. By the end of each section, she's lost her buzz, and is worn out and sluggish.

Candidate #3 is alert throughout the morning, but eats a big lunch and finds herself feeling sleepy for a part of the early afternoon.

Let's assume, reasonably, that feeling alert makes each candidate respond a little better and a little faster -- not a lot, but enough to make the difference between "a little above average" when alert, and "a little below average" when sluggish and sleepy. (This degree of effect, by the way, is entirely compatible with generally accepted levels of internal consistency and test-retest reliability.)

Figure 4 shows simulated response patterns for each of these three candidates, along with their alertness profiles. Candidate #1 makes disproportionately many mistakes on the items positioned early in the first section, and Candidate #3 makes disproportionately many mistakes on the items positioned early in the second section; Candidate #2 makes fewer mistakes at those times and more mistakes at the end of each section. When we, test developers, turn around and look at the resulting dataset, the Candidate #1s and #3s of the world tell us the early questions are harder on the first and second sections respectively, and the Candidate #2s tell us those questions are easy and the late questions are hard. This is a DIF-like individual difference variable that is completely disconnected from content.

Figure 4. Three simulated candidates with different patterns of alertness over the course of a full-day test, administered before and after lunch. Alert candidates respond faster and more accurately; sleepy or worn out candidates take longer, make more mistakes, and are more likely to run out of time.
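
For the curious, here is roughly how response patterns like these can be simulated. The alertness profiles and the size of the effect are illustrative assumptions, not estimates from any real testing program.

    import numpy as np

    rng = np.random.default_rng(2)
    n_items = 300                   # 150 questions before lunch, 150 after
    position = np.arange(n_items)

    # Illustrative alertness profiles, on the same scale as ability (roughly
    # "a little above average" when alert, "a little below" when not).
    night_owl = np.where(position < 75, -0.3, 0.2)                           # slow start in the morning
    caffeine = np.where((position % 150) < 75, 0.3, -0.3)                    # fades by the end of each section
    post_lunch = np.where((position >= 150) & (position < 200), -0.3, 0.2)   # early-afternoon dip

    def simulate(theta, alertness, difficulty=0.0):
        # Logistic response model: momentary performance is trait plus alertness.
        p = 1.0 / (1.0 + np.exp(-(theta + alertness - difficulty)))
        return rng.binomial(1, p)

    for name, profile in [("night owl", night_owl),
                          ("caffeine-dependent", caffeine),
                          ("post-lunch dip", post_lunch)]:
        responses = simulate(theta=0.0, alertness=profile)
        early_am = 1 - responses[:75].mean()      # start of the morning section
        early_pm = 1 - responses[150:200].mean()  # start of the afternoon section
        late_pm = 1 - responses[225:].mean()      # end of the afternoon section
        print(f"{name:18s} errors: early a.m. {early_am:.2f}, "
              f"early p.m. {early_pm:.2f}, late p.m. {late_pm:.2f}")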

Think shuffling the items into different sequences for different candidates -- at least across a few different pre-generated test forms -- makes this problem go away? Well, you might balance out any overall difficulty effects, and you aren't being unfair to, say, an entire minority of caffeine-dependent candidates. Further, this is actually the best case for diagnosing a sequence-dependent performance effect, if you are looking for one.

However, if you don't look at position effects, and you have this compound of candidate variation and form variation, the first thing you'll see is low item discrimination parameters (if using Item Response Theory) or low item-total correlations (if using Classical Test Theory), without the usual accompanying indications of multidimensionality. This is because, while each group of candidates sees each item at an average alertness level, individual candidates still perceive individual items as harder or easier depending on their mental state. There are dimensions of the test which are associated with different items depending on the item ordering.

Unfortunately, some relatively complex modeling, done on a large sample, is needed to tease out three (or more) different patterns of alertness effects in the same candidate pool. It is more useful, in this case, to be aware that alertness-fatigue effects, and other state effects, can occur, and can interfere with the measurement quality of a long test, than it is to dig for those effects after the fact.

Type 7: Long-term temporal factors. Artifact dimensions can arise from the passage of time between administrations, as well as within an administration. One example occurs when some items become out of date; another occurs when the security of exam content is compromised.

Let's say we have a set of items that become obsolete through technological (rather than regulatory) change. They become out of date by degrees, as the new technology supplants the old one out in the field, as well as in training programs.

During the transition, it is likely that many candidates will remain familiar with the old technology, even if they are presently using the new technology, and will still know how to answer questions about it. However, as the old technology entirely disappears, a population of candidates emerges that have never used it, and didn't learn it in school. Eventually, candidates who did at one time use the old technology may forget relevant details. (I can no longer explain how to take a credit card imprint, let alone describe the full procedure for such a transaction. These days we have chip readers and multifunction PIN pads and Square.)

We now have a set of items that all depend on knowledge of obsolete technology of a recent era. The items may reference a single technology or several technologies that were replaced over approximately the same period. If we analyze the data collected from candidates over the last few years, knowledge of the technology shows up as a secondary, scenario-like dimension, with high scores tending to occur among earlier candidates, and/or candidates with more work experience.

Most cross-cohort psychometric analyses don't directly affect candidate scores. However, within a cohort during the transition period, the same factor appears, associated with experience. Novices are disadvantaged by their unfamiliarity with a technology they're not likely to ever use on the job, so their competence is underestimated; they may appear unqualified when they actually have sufficient qualifications.

An obvious, but often labor-intensive, way to avoid these problems is to review your item bank frequently (e.g., annually) for items that have become out of date, or have changed in difficulty with the advent of new technologies and procedures.

Essentially the opposite effect can occur when the security of exam content is compromised. The compromise might be sudden, as in the release of some or all exam items on the internet, and reach many future candidates. Alternatively, it might be a slow and subtle erosion of security, as when candidates who have taken the test discuss it on message boards. Even if no such candidate posts an entire item, enough "hints" may accumulate that later candidates come in with some useful preknowledge. Retesting candidates, too, have preknowledge stemming from their earlier attempts.

Those later candidates find the questions easier. Or, one might say that there are now two individual difference dimensions that contribute to the test score: professional knowledge, and test preknowledge. Population levels of preknowledge increase over time, which causes upward drift in test scores, and downward drift in item difficulty parameters.

In many cases, reasonable test security measures and regular replenishment of item content can limit subtle preknowledge effects. However, with some large-scale, high-stakes exams, essentially the whole test is regularly reconstructed online within days. There are even cases where candidates in early time zones inform candidates in later time zones, within a single day (Carson, 2013). The time scale of this effect may be quite short.

Some Useful Tools
So you've got nuisance multidimensionality, of one of the seven types listed above. Now what? Here are some useful tips, tricks, and tools for dealing with tests that aren't quite as unidimensional as you'd like, ordered roughly by the effort involved.

1. Just ignore it. In some cases, you can just proceed with a unidimensional model even though mild nuisance multidimensionality is present. It's best to assess the degree of multidimensionality present, and the degree of impact expected on each of your use cases, before deciding that you'll be okay.

There's more than one way to assess the degree of multidimensionality. You might look at the correlations between your domain scores, and the reliability of each domain score. You might compare a unidimensional IRT model to a multidimensional IRT model, or you might have your friendly neighborhood psychometrician do that for you. You might have a pretty good guess to begin with, from knowing something about the relationships between the domains you're measuring.
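
As a concrete starting point, here is a minimal sketch of those first two checks, internal consistency and domain-score correlations, assuming you have a candidates-by-items matrix of item scores and know which columns belong to which domain. The function and variable names are placeholders.

    import numpy as np

    def cronbach_alpha(item_scores):
        """item_scores: candidates x items matrix (e.g., 0/1 item scores)."""
        x = np.asarray(item_scores, dtype=float)
        k = x.shape[1]
        return (k / (k - 1)) * (1 - x.var(axis=0, ddof=1).sum() / x.sum(axis=1).var(ddof=1))

    def domain_summary(item_scores, domain_columns):
        """domain_columns: dict mapping each domain name to the columns it owns
        (a placeholder for however your data are organized)."""
        x = np.asarray(item_scores, dtype=float)
        domain_scores = {d: x[:, cols].sum(axis=1) for d, cols in domain_columns.items()}
        for d, cols in domain_columns.items():
            print(f"{d}: alpha = {cronbach_alpha(x[:, cols]):.2f}")
        names = list(domain_scores)
        print("domain-score correlations (", ", ".join(names), "):")
        print(np.round(np.corrcoef([domain_scores[d] for d in names]), 2))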

Or, at the other extreme, you might be running a small program, where you suspect the presence of nuisance multidimensionality, but you don't have enough data to pin it down completely. At some point, if you're only testing thirty candidates a year, you're going to get your test as fixed as you can. Then you keep calm and carry on.

If you are seeking accreditation, you may need to explain what you are doing to make your scores reliable and valuable; you may also need to explain any uncertainty you have about the dimensionality or other measurement properties of your test. However, unidimensionality is not itself a requirement for accreditation.

It's also possible to sometimes ignore nuisance multidimensionality, for some purposes. Some test development applications are much more vulnerable to violations of assumptions (like nuisance multidimensionality) than just plain comparing a candidate to a cut score. At borderline levels of multidimensionality, you may have to worry about it when you're establishing the equivalence of parallel forms, but not for scoring each candidate. (This is actually an unspoken assumption underlying how we write content specifications.)

If you think you might need to sometimes pay attention to multidimensionality, or you're not sure, you should give your friendly neighborhood psychometrician a call. Mixing models is not DIY test development.

2. Multidimensional Item Response Theory (mIRT) may be helpful under some conditions. Specifically, it can help with nuisance multidimensionality of Types 2, 3, 4, and 5, if you have sufficient test volume to set it up.

If you're already using item response theory (IRT) for scoring or development purposes, you can make use of multidimensional models to "sort out" contributions from nuisance dimensions, and get back to the business of scoring your candidates on one primary dimension.

If you're not using IRT, but you have high test volume (e.g., hundreds or thousands of candidates per year), it's worth considering. There are a number of ways in which IRT makes test development and administration more robust or convenient, of which dealing with multidimensionality is only one. However, you'll want to consider your testing context carefully, and you probably don't have the necessary software on hand. Give your friendly neighborhood psychometrician a call.

If your test volume is very low (a few dozen candidates per year), IRT is not likely to be useful to you. You can safely skip to section 3.

Scoring candidates. Under item response theory, scoring is not accomplished by adding up item scores. It is done, instead, by inferring a candidate's latent trait level. On a knowledge test, the candidate's latent trait is her actual knowledge, not her performance; it doesn't change depending on whether she is presented with an easy form or a hard one. IRT scoring methods inherently compensate for item properties like difficulty, using an item model.

Thus, a score for a candidate amounts to a prediction of that candidate's true position, as compared to all other candidates, on a standard scale.

The statistical model needed to accomplish scoring under item response theory can be thought of as having two components. First, there are models of individual items. Item models come in varying shapes and complexities, but they are all mathematical functions for calculating the probability of a particular item response (e.g., the correct response), given one or more trait levels.

("But that's backwards!" you say. "I know which items the candidate got correct, and I need her score." Yes; that's where the inference part comes in. We use Bayesian inference to predict the score from the item responses, given models that point the other direction.)

Second, item models are nested in a model of how the latent traits are distributed in the candidate population. How many traits are there to be measured? Are some of the traits correlated? The answers define a trait space.

Unidimensional IRT models are pretty straightforward. There's one latent trait (e.g., "knowledge"). The candidates' knowledge levels are, absent counterevidence, usually assumed to be distributed normally (that is, in a bell curve). The peak of the curve might be above or below the cut score, depending on how hard the test is, but in a good test they aren't too far from each other.

The simplest and most common family of unidimensional item models consists of S-shaped curves in which higher knowledge levels lead to higher probabilities of getting the item right. There are diminishing returns to knowing a very large amount, and diminishing returns (of a sort) to knowing less -- that is, you can't get the item more right than right, or more wrong than wrong, so the relationship between knowledge and item score flattens out as the probability of a correct answer approaches 1 or 0.
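
For readers who like to see the machinery, here is a sketch of the most common such curve, the two-parameter logistic item model, together with a bare-bones version of the Bayesian scoring described above (the expected a posteriori, or EAP, estimate). It assumes a standard normal distribution of the latent trait and is meant as an illustration, not as production scoring code.

    import numpy as np

    def p_correct_2pl(theta, a, b):
        """Two-parameter logistic item model: probability of a correct response at
        knowledge level theta, given discrimination a and difficulty b."""
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    def eap_score(responses, a, b, grid=np.linspace(-4, 4, 81)):
        """Expected a posteriori (EAP) estimate of theta for one candidate.

        responses: 0/1 vector, one entry per administered item
        a, b:      item parameter vectors, same length as responses
        Assumes a standard normal population distribution for the latent trait.
        """
        prior = np.exp(-0.5 * grid ** 2)            # unnormalized N(0, 1) prior
        likelihood = np.ones_like(grid)
        for u, a_i, b_i in zip(responses, a, b):
            p = p_correct_2pl(grid, a_i, b_i)
            likelihood *= p if u == 1 else 1.0 - p
        posterior = prior * likelihood
        return np.sum(grid * posterior) / np.sum(posterior)   # posterior mean

Because the item parameters enter the calculation directly, the same candidate gets essentially the same estimate whether she saw an easy form or a hard one; that is the form-independence discussed under form assembly and equating below.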

To model a multidimensional test, the population model needs to have more than one dimension, or trait. The item models may still be plain S-curves, but they can be oriented along more than one vector in the trait space. This sounds more complicated than it is. The items just now "point" in a direction that isn't necessarily "more knowledge." In a bifactor model, for example, each item has a measurement vector that is a combination of "more knowledge" and one specific dimension (e.g., one scenario or one domain, but not both).

Here's where it gets good: When you score a candidate using a multidimensional IRT model, you obtain a score on each trait. That means that you have a usable score on your primary dimension (e.g., "knowledge"), regardless of what's going on in the other dimensions. It doesn't matter which scenarios the candidate saw. It doesn't matter whether the candidate is stronger in Domain 2 or Domain 3. Multidimensional IRT scoring automatically removes the contaminant variance.
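
As an illustration of an item pointing in a direction that isn't purely "more knowledge," here is a sketch of a bifactor item model. The parameters are placeholders; fitting and scoring such a model is a job for dedicated software and, very likely, your friendly neighborhood psychometrician.

    import numpy as np

    def p_correct_bifactor(theta_general, theta_specific, a_general, a_specific, b):
        """Bifactor (multidimensional 2PL) item model: each item loads on the general
        knowledge dimension plus exactly one specific (nuisance) dimension."""
        z = a_general * theta_general + a_specific * theta_specific - b
        return 1.0 / (1.0 + np.exp(-z))

    # Scoring a model like this (with a multidimensional analogue of the EAP sketch
    # above) yields separate estimates of theta_general and each theta_specific, so
    # the reported score on the primary dimension is already purged of the scenario-
    # or domain-specific nuisance variance.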

Evaluation of dimensionality. If you model your items and candidates in multiple dimensions, you can use that model to determine how strong your measurement in your primary dimension is, relative to your nuisance dimensions. This can help you determine, during test development or maintenance, whether you are in a situation where you can ignore your multidimensionality, and/or for what purposes.

Form assembly and equating. IRT scores are form-independent, because the scoring process inherently compensates for the properties of the items administered. This simplifies the process of comparing scores across forms. On the other hand, you can use the modeled item properties to make sure your content balancing is working as intended.

Intrinsic multidimensionality. Intrinsically multidimensional polytomous items (Type 4) need intrinsically multidimensional polytomous item response models. Psychometricians, industrial-organizational psychologists, survey researchers, and a host of other scientists have spent decades trying to find ways to simply code, key, and add up responses to situational judgment items (SJIs) and Likert-type items, only to find that those methods do not perform well for multiple-trait SJIs, or in the presence of response sets. Intrinsically multidimensional polytomous item response models may be as unwieldy as their name, but they do get the job done.

(Alternatively, you can revise the items. See section 5.)

3. Compensatory scoring. If you have structural domains, and you want to calculate a single score across all of them, you may want to start by calculating each domain score separately, particularly if they are not strongly correlated. You can then calculate the total score as a weighted sum of standardized domain scores. Then, you can establish the measurement quality (and unidimensionality) of the item sets for each domain, and infer the properties of the sum. Maintenance tasks occur at the domain level.

(If you aren't using the domain scores for reporting or record-keeping, and you aren't using IRT for any of it, you can simplify the scoring process to one step by calculating a weighted sum across items, where you use the same weight for each item within a domain.)
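
Here is a minimal sketch of that weighted sum of standardized domain scores. The weights are placeholders, and any cut score on this standardized scale would still need to come from a proper standard-setting process.

    import numpy as np

    def compensatory_total(domain_scores, weights):
        """Weighted sum of standardized domain scores.

        domain_scores: candidates x domains matrix of raw domain scores
        weights:       intended blueprint weights, one per domain (placeholders here)
        """
        x = np.asarray(domain_scores, dtype=float)
        z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)   # standardize each domain
        return z @ np.asarray(weights, dtype=float)        # one compensatory total per candidate

    # Example: totals = compensatory_total(domain_scores, [0.4, 0.3, 0.2, 0.1])
    # The pass/fail standard is then set and applied on this standardized total scale.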

The sum-of-domain-scores approach works well for well-measured but poorly-correlated domains. It is an easy fix for inadvertent under-weighting of domains that tend toward easy items. However, it makes it harder to get away with measuring any domain weakly. Or: "No, you shouldn't give an Ethics section with five easy items on it and call yourselves done." The need for accurate and reliable measurement of subdomains used in scoring is called out in the National Commission for Certifying Agencies (NCCA) 2016 Standards for the Accreditation of Certification Programs (Institute for Credentialing Excellence, 2014):

Standard 20, Commentary 2: "If a program makes decisions using domain-level information, it should demonstrate that the reliability of that information is sufficient and provide a rationale for how it weights and uses domain-level information."

4. Non-compensatory scoring. Let's say you have a test with integrated practical and written components that don't have a solid correlation between them (Type 1), or a knowledge test covering such distinct domains that they don't correlate strongly (Type 3), and yet the measurement properties of each component or domain subtest are good. A question you should be asking yourself is whether you actually want to add up the scores prior to comparison to a pass/fail standard, or whether you want to evaluate and/or report more than one score. You may be better off with separate requirements.

I plan to discuss this issue, particularly as pertains to integrated components, in much greater detail in a forthcoming white paper.

5. Change the test. The highest-effort response to multidimensionality is to change something about the test. Not only are these options labor-intensive, they're the only ones that can't be done "on the back end," after the test has been administered. These changes must be prospective.

Changing the test might simply entail refreshing and replenishing the item pool, for example to phase out obsolete or compromised items (Type 7) or items that exhibit differential item functioning (DIF) between protected groups (Type 4). This is straightforward maintenance: item writing and form assembly, according to your existing blueprint. Sadly, "straightforward" does not mean "effortless."

If you have question type effects (Type 5), you might alter your blueprint to specify the proportions of different question types, or to remove a question type altogether by restricting new item writing to other formats and replacing the troublesome items.

Non-structural content areas (Type 4) that amount to foundational skills, such as map-reading, require a similar approach. However, good practice also calls for making sure that any skill tested, such as map-reading, is actually required for minimally competent performance on the job. If the foundational skill (e.g., reading comprehension) is not required for the job, or not required at the same level on the job as on the test, it is probably better to eliminate and replace the items.

Unintentional intrinsic multidimensionality (Type 4), for example in a rubric-scored performance item, is also likely best addressed by revising or replacing the item -- at least, if you can't reasonably ignore the multidimensionality. Because such items are complex, and their development process is lengthy and laborious, revision is likely to be useful, and an attractive option relative to replacement.

Finally, if you are seeing serious fatigue or time-of-day effects (Type 6), there is really no good way to "score around" them. If you can't ignore them, your options involve changing the format, length, or scheduling of the test. For example, the eight-hour, three-hundred item test could be given in four sessions of seventy-five questions each, perhaps over the course of two days. Or the test could be shortened to a fraction of its length, using methods such as computerized adaptive testing (CAT) or verification testing of candidates scoring just below the passing threshold (Wainer & Feinberg, 2015).

It's Not a Bug, It's a Feature
Nuisance multidimensionality may not always be a nuisance. Here are some contextual silver linings, or cases where those secondary dimension scores might be desirable information.

1. Subscore reporting. You may wish to provide candidates with feedback on their domain-specific performance, for example in order to allow failing candidates to prepare themselves more effectively prior to re-applying. I alluded to this use earlier, but haven't talked about it in as much detail as it deserves.

It is important to ensure that domain-level subscores are adequately reliable to be reported. As stated in the National Commission for Certifying Agencies (NCCA) 2016 Standards for the Accreditation of Certification Programs (Institute for Credentialing Excellence, 2014):

Standard 19, Commentary 4: "If domain-level information has low reliability, programs are advised against reporting it to candidates and other stakeholders. When domain-level or other specific feedback is given to candidates, the certification program should provide estimates of its precision and/or other guidance."

It is also important to make sure that each domain-level score provides a worthwhile amount of additional information, above and beyond the overall score. It's entirely possible for domain-level scores to be highly reliable, but still a waste of the score user's time. This happens when the domain-level scores are strongly correlated with the overall score, or when they contribute extensively to the compromise measurement vector. (In a simple sum score, these are equivalent.)

The best way to ensure that there is unique information in domain-level scores is to isolate the orthogonal, domain-specific vector components of those scores. This is easily done if you're working from an mIRT model. If you're using classical scoring methods, you can still extract the vector components, using the correlations and a little trigonometry.
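
One simple way to do that trigonometry (my illustration here, not necessarily the exact method you and your psychometrician would settle on) is to take the residual of each domain score after regressing it on the total score; what remains is the orthogonal, domain-specific component.

    import numpy as np

    def unique_domain_component(domain_score, total_score):
        """The part of a domain score that is orthogonal to the overall score: the
        residual from regressing the domain score on the total score."""
        d = np.asarray(domain_score, dtype=float)
        t = np.asarray(total_score, dtype=float)
        slope = np.cov(d, t)[0, 1] / t.var(ddof=1)
        return d - (d.mean() + slope * (t - t.mean()))

    # If this residual component has little variance, or poor reliability, the domain
    # score adds little information beyond the total and may not be worth reporting.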

2. Rebalancing a test score. If you have multidimensionality stemming from structural content areas or integrated components, there are some cases in which you might prefer to change the balance of content areas.

You might want to correct an effective content imbalance stemming from item difficulty. You might want to adjust for items deleted from a form. These are simple cases, and the procedures for adjusting a score to reflect one domain more and one less are straightforward. But they can't be done without recognizing and understanding the multidimensional nature of the test.

You might have an external criterion or another test that you validate against. This is more common in non-certification settings, such as employee selection (job performance criteria) and education (other tests; last or next year's performance). These types of validation provide empirical targets for effective content balance, which shift over time and require adjustment. In some selection settings, it may even make sense to test several work-related characteristics, construct a multidimensional model, and then use multiple regression or a similar statistical technique to produce empirically balanced, criterion-valid scores.

Finally, you might conduct a job analysis and, based on the results, modify your blueprint. Usually, then you would replace at least some of your items, whether to measure new knowledge or skills, or simply because you maintain your test regularly anyway. Best practices call for aggregate comparisons to be made between candidates' scores and profiles across changes in the test, so that you can provide guidance on how to interpret new scores relative to old ones; if you don't, it's as if you are starting a whole new credential every time you do a new job analysis, and that throws away a lot of good value. One way to handle blueprint changes, with simultaneous item changes, is to use detailed, multidimensional score information to predict a candidate's scores on the new test. In this case, the dimensions might not be domains, if question types and/or foundational content are relevant to your candidates' performance.

3. Data forensics. A third application of score information on secondary dimensions, whether structural or incidental, occurs in model-based data forensics for the monitoring of test security. Very simply, data forensics methods that depend on models of candidate data, such as IRT models, perform better when they make use of more thorough and complete models. The better the candidate data is modeled, the less residual noise is left over, and the more likely these methods are to pick up on aberrant patterns that require further investigation.

For data forensics purposes, it doesn't particularly matter where the multidimensionality is coming from -- unless it's due to item compromise. Secondary dimensions don't have to have sufficient reliability or uniqueness for individual reporting. Even fatigue effects may be useful, if they can be accurately modeled. Model-based forensics is thoroughly omnivorous.
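
As a flavor of how model-based forensics uses those residuals, here is a very simple sketch: compare each candidate's observed number-correct to the count the fitted model expects, and flag large standardized discrepancies for follow-up. The threshold is arbitrary, the independence assumptions are rough, and real forensic methods are considerably more sophisticated.

    import numpy as np

    def flag_aberrant(responses, expected_p, z_threshold=3.0):
        """Very simple model-based screen: compare each candidate's observed
        number-correct to the count the fitted model expects for them.

        responses:  candidates x items matrix of 0/1 scores
        expected_p: candidates x items matrix of model-implied probabilities of a
                    correct response (e.g., from the IRT model at each candidate's theta)
        """
        responses = np.asarray(responses, dtype=float)
        expected_p = np.asarray(expected_p, dtype=float)
        observed = responses.sum(axis=1)
        expected = expected_p.sum(axis=1)
        variance = (expected_p * (1.0 - expected_p)).sum(axis=1)
        z = (observed - expected) / np.sqrt(variance)
        return np.where(np.abs(z) > z_threshold)[0]   # candidate indices to investigate further

    # The richer the model (nuisance dimensions, position effects, and all), the smaller
    # these residuals are for ordinary candidates, and the more clearly genuinely
    # aberrant response patterns stand out.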

Conclusion
The perfectly unidimensional test is an ideal you won't encounter in practice. Don't fear the multidimensionality; it's a nuisance at worst, and can even help you at times. Keep calm and carry on. And if you're still worried, or feel like you're in over your head, give your friendly neighborhood psychometrician a call. We'll be happy to sort out your extra dimensions.


References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Carson, J.D. (2013). Certification/licensure testing case studies. In Wollack, J.A. & Fremer, J.J. (Eds.), Handbook of Test Security, pp. 261-283. New York: Routledge.

Institute for Credentialing Excellence. (2014). National Commission for Certifying Agencies Standards for the Accreditation of Certification Programs. Washington, DC: Author.

Society for Industrial and Organizational Psychology, Inc. (2003). Principles for the Validation and Use of Personnel Selection Procedures, Fourth Edition. Bowling Green, OH: SIOP.

Thissen, D. (2016). Bad questions: An essay involving item response theory. Journal of Educational and Behavioral Statistics, 41:1, 81-89.

Wainer, H. & Feinberg, R. (2015). For want of a nail: Why unnecessarily long tests may be impeding the progress of Western civilisation. Significance, 12:1, 16-21.