Name three different types of scores used in educational research and give an example of each.
There are two types of test scores: raw scores and scaled scores. A raw score is a score without any sort of adjustment or transformation, such as the simple number of questions answered correctly. A scaled score is the result of one or more transformations applied to the raw score.
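As a minimal sketch of the raw versus scaled distinction, the Python below starts from number-correct raw scores and applies two common transformations, percent correct and z-scores. The data, the test length of 20 items, and the choice of transformations are all assumptions made for illustration; they do not come from any particular test.

```python
from statistics import mean, pstdev

# Hypothetical raw scores: number of questions answered correctly
# by five test takers on an assumed 20-item test.
raw_scores = [12, 15, 9, 18, 16]
n_items = 20

# Scaled score 1: percent correct, a linear transformation of the raw score.
percent_correct = [100 * x / n_items for x in raw_scores]

# Scaled score 2: z-scores, which re-express each raw score as its
# distance from the group mean in standard deviation units.
group_mean = mean(raw_scores)
group_sd = pstdev(raw_scores)  # population SD of this group
z_scores = [(x - group_mean) / group_sd for x in raw_scores]

for raw, pct, z in zip(raw_scores, percent_correct, z_scores):
    print(f"raw={raw:2d}  percent={pct:5.1f}  z={z:+.2f}")
```

Both transformations keep the rank order of the test takers; only the units in which the scores are reported change.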
What is Measurement?
How do we define it?
This course focuses on cognitive and affective test scores as operationalizations of constructs in education and psychology. As noted above, these test scores often produce ordinal scales with some amount of meaning in their intervals. The particular rules for assigning values within these scales depend on the type of scoring mechanisms used. Here, we’ll cover the two most common scoring mechanisms, dichotomous and polytomous, and we’ll discuss how these are used to create rating scales and composite scores.
Dichotomous scoring refers to the assignment of one of two possible values based on a person’s performance or response to a test question. A simple example is the use of correct and incorrect to score a cognitive item response. These values are mutually exclusive, and describe the correctness of a response in the simplest terms possible, as completely incorrect or completely correct. Most cognitive tests involve at least some dichotomously scored items. Multiple-choice questions, which will be discussed further in Chapter 3, are usually scored dichotomously.
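Dichotomous scoring of multiple-choice responses can be sketched in a few lines of Python. The answer key, the responses, and the function name `score_dichotomous` are hypothetical, and coding correct as 1 and incorrect as 0 is a common convention assumed here rather than something the text prescribes.

```python
def score_dichotomous(responses, key):
    """Score each response as 1 (correct) or 0 (incorrect) against the key."""
    return [1 if response == answer else 0
            for response, answer in zip(responses, key)]

# Hypothetical answer key and one person's responses to four items.
key = ["b", "d", "a", "c"]
responses = ["b", "a", "a", "c"]

item_scores = score_dichotomous(responses, key)
print(item_scores)  # [1, 0, 1, 1]
```

Each item score takes exactly one of two mutually exclusive values, with no credit for a partially correct response.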
Polytomous scoring simply refers to the assignment of three or more possible values for a given test question or item. In cognitive testing, a simple example is the use of rating scales to score written responses such as essays. In this case, score values may still describe the correctness of a response, but with differing levels of correctness, for example, incorrect, partially correct, and fully correct. Polytomous scoring with cognitive tests can be less straightforward and less objective than dichotomous scoring, primarily because it usually requires human raters, and it is difficult to keep the meaning of assigned categories such as partially correct consistent across raters.
Likert (1932) detailed two methods for combining scores across multiple rating scale items to create a composite score that would be, in theory, a stronger measure of the construct than any individual item. One of these methods, which has become a standard technique in affective measurement, is to assign ordinal numerical values to each rating scale category and then calculate a sum or average across a set of these rating scale items.
The scaling technique demonstrated by Likert (1932) involves, first, the scoring of individual rating scale items using polytomous scales. For example, response options for one set of survey questions in Likert (1932) included five categories, ranging from strongly disapprove to undecided to strongly approve. These were assigned score values of 1 through 5. Then, a total score was obtained across all items in the set, with low scores interpreted as indicating strong disapproval and high scores as indicating strong approval. This process could be referred to as Likert scaling, but in this course we’ll refer to it as composite scaling, composite scoring, or simply creating a total or average score across multiple items.
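The two steps described above, scoring each rating scale item from 1 to 5 and then combining items into a total or average, can be sketched in Python as follows. The text names only the end points and midpoint of the scale, so the intermediate labels "disapprove" and "approve", the example responses, and the function name `composite_score` are assumptions made for illustration.

```python
# Five ordered response categories, scored 1 through 5.
# The two intermediate labels are assumed for this sketch.
LABELS = [
    "strongly disapprove",
    "disapprove",
    "undecided",
    "approve",
    "strongly approve",
]
VALUES = {label: value for value, label in enumerate(LABELS, start=1)}

def composite_score(item_responses):
    """Return the total and average score across a set of rating scale items."""
    item_scores = [VALUES[response] for response in item_responses]
    total = sum(item_scores)
    return total, total / len(item_scores)

# One hypothetical person's responses to four items in the same set.
person = ["approve", "strongly approve", "undecided", "approve"]
total, average = composite_score(person)
print(total, average)  # 16 4.0
```

The average stays on the original 1 to 5 metric, which makes it easier to compare composites built from different numbers of items; the total does not.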
Composites versus components
A composite score is simply the result of some combination of separate subscores, referred to as components. Most often, we will deal with total scores or factor scores on a test, where individual items make up the components. Factor scores refer to scores obtained from some measurement model, such as a classical test theory model, discussed in Chapter 5, or an item response theory model, discussed in Chapter 8. We will also encounter composite scores based on totals and means from rating scale items. In each case, the composite is going to be preferable to any individual component for a number of reasons.
From Physical to Intangible
With most physical measurements, the property that we’re trying to represent or capture with our values can be clearly defined and consistently measured. For example, amounts of food are commonly measured in grams. A cup of cola has about 44 grams of sugar in it. When you see that number printed on your can of soda pop or fizzy water, the meaning is pretty clear, and there’s really no need to question if it’s accurate. Cola has a lot of sugar in it.
But, just as often, we take a number like the amount of sugar in our food and use it to represent something abstract or intangible like how healthy or nutritious the food is. A food’s healthiness isn’t as easy to define as its mass or volume. A measurement of healthiness or nutritional value might account for the other ingredients in the food and how many calories they boil down to. Furthermore, different foods can be more or less nutritious for different people, depending on a variety of factors. Healthiness, unlike physical properties, is intangible and difficult to measure.
What makes measurement good?
In the last year of my undergraduate studies in psychology, I conducted a research study on the constructs of aggression, sociability, and victimization with Italian preschoolers (D. A. Nelson, Robinson, Hart, Albano, & Marshall, 2010). I spent about four weeks collecting data in preschools. Data collection involved covering a large piece of cardboard with pictures of all the children in a classroom, and then asking each child, individually, questions about their peers.
Progressing from nominal to ratio, the measurement scales become more descriptive of the variable they represent, and more statistical options become available. In general, the further from a nominal scale the better, as once the scale is designated it cannot be upgraded, only downgraded. For example, the variable age could be represented in the following four ways:
1. number of days spent living, from 0 to infinity;
2. day born within a given year, from 1 to 365;
3. degree of youngness, including toddler, adolescent, adult, etc.; or
4. type of youngness, such as the same as Mike, or the same as Ike.
The first of these four, a ratio scale, is the most versatile and can be converted into any of the scales below it. However, once age is defined based on a classification, such as “same as Mike,” no improvement can be made. For this reason, a variable’s measurement scale should be considered in the planning stages of test design, ideally when we identify the purpose of our test.
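The one-way nature of this conversion can be illustrated with a short Python sketch that downgrades age in days (ratio) to an ordinal category and to a nominal classification. The category cutoffs, the "child" category, the ages assigned to Mike and Ike, and the function names are all hypothetical, chosen only to make the example run.

```python
DAYS_PER_YEAR = 365.25

def age_category(age_days):
    """Downgrade ratio-scale age (days) to an ordinal category (assumed cutoffs)."""
    years = age_days / DAYS_PER_YEAR
    if years < 4:
        return "toddler"
    if years < 13:
        return "child"
    if years < 18:
        return "adolescent"
    return "adult"

def age_classification(age_days, reference_ages):
    """Downgrade ratio-scale age (days) to a nominal label such as 'same as Mike'."""
    for name, reference_days in reference_ages.items():
        if age_days == reference_days:
            return f"same as {name}"
    return "other"

# Hypothetical ages in days for four people, and for Mike and Ike.
ages_days = [900, 5500, 12000, 12500]
reference_ages = {"Mike": 12000, "Ike": 5500}

categories = [age_category(d) for d in ages_days]
classes = [age_classification(d, reference_ages) for d in ages_days]
print(categories)
print(classes)
```

In this sketch the ages 12000 and 12500 both fall in the category "adult", so the original number of days cannot be recovered from the ordinal label alone. That loss of information is why a scale can be downgraded but not upgraded.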
In the social sciences, measurement with the ratio scale is difficult to achieve because our operationalizations of constructs typically don’t have meaningful zeros. So, interval scales are considered optimal, though they too are not easily obtained. Consider the sociability measure described above. What type of scale is captured by this measure? Does a zero score indicate a total absence of sociability? This is required for a ratio scale. Does an incremental increase at one end of the scale mean the same thing as an incremental increase at the other end of the scale? This is required for an interval scale.