
Examining Exams Using Rasch Models and Assessment of Measurement Invariance

Achim Zeileis
Universität Innsbruck
ORCID: 0000-0003-0918-3766

Abstract: Many statisticians regularly teach large lecture courses on statistics, probability, or mathematics for students from other fields such as business and economics, social sciences and psychology, etc. The corresponding exams often use a multiple-choice or single-choice format and are typically evaluated and graded automatically, either by scanning printed exams or via online learning management systems. Although further examinations of these exams would be of interest, they are frequently not carried out. For example, a measurement scale for the difficulty of the questions (or items) and the ability of the students (or subjects) could be established using psychometric item response theory (IRT) models. Moreover, based on such a model it could be assessed whether the exam is really fair for all participants or whether certain items are easier (or more difficult) for certain subgroups of students. Here, several recent methods for assessing measurement invariance and for detecting differential item functioning in the Rasch IRT model are discussed and applied to results from a first-year mathematics exam with single-choice items. Several categorical, ordered, and numeric covariates like gender, prior experience, and prior mathematics knowledge are available to form potential subgroups with differential item functioning. Specifically, all analyses are demonstrated with a hands-on R tutorial using the psycho* family of R packages (psychotools, psychotree, psychomix), which provide a unified approach to estimating, visualizing, testing, mixing, and partitioning a range of psychometric models. The paper is dedicated to the memory of Fritz Leisch (1968–2024) and his contributions to various aspects of this work are highlighted.

Keywords: multiple choice, item response theory, differential item functioning, psychometrics, R

Address:
Achim Zeileis
Universität Innsbruck
Department of Statistics
Universitätsstr. 15
6020 Innsbruck, Austria
E-mail: [email protected]
URL: https://www.zeileis.org/

1 Introduction

1.1 Large-scale exams

Statisticians often teach large lecture courses with introductions to statistics, probability, or mathematics in support of other curricula such as business and economics, social sciences, psychology, etc. Due to the large number of students, and possibly also of lecturers teaching the lectures and/or tutorials in parallel, it is often necessary to rely on exams and other assessments based on large pools of so-called closed items (as opposed to open-ended ones), i.e., exercises that can be evaluated and graded automatically.

The most widely used item types for such assessments are multiple-choice (also known as multiple-answer) and single-choice exercises. However, due to the widespread adoption of learning management systems such as Moodle (Moodle), Canvas (Canvas), or Blackboard (Blackboard), especially since the Covid-19 pandemic, other item types are also being used increasingly. Often the evaluation is binary (correct vs. incorrect), but scores with partial credit for partially correct items are also frequently used.

1.2 Examining exams

Traditionally, mostly simple summary statistics have been used to evaluate the results from such large-scale exams, e.g., the proportion of students who correctly solved each item and the number of items solved per student. Recently, however, there has been increasing interest in so-called learning analytics, which connects the results from different exams or assessments with covariates such as the field and duration of study, prior knowledge from previous courses, etc., in order to better understand and shape the learning environment for the students (see Wiki+LearningAnalytics for an overview and further references).

However, in Austria, to the best of our knowledge, it is still not common to apply standardized and/or automated psychometric assessments to exam results. A notable exception is the multiple-choice monitor at WU Wirtschaftsuniversität Wien introduced by Nettekoven+Ledermueller:2012; Nettekoven+Ledermueller:2014. In addition to various exploratory techniques they also employ probabilistic statistical models from psychometrics to gain more insights into exam results. More specifically, they use models from item response theory (IRT, Fischer+Molenaar:1995; VanDerLinden+Hambleton:1997), including the Rasch model (Rasch:1960) which is also employed in the analysis of international educational attainment studies such as PISA (Programme for International Student Assessment, https://www.oecd.org/pisa/).

1.3 Measurement invariance in IRT

Based on an exam’s item responses, IRT models can estimate various quantities of interest, most importantly the ability of the individual students and the difficulty of the different items (or exercises). A fundamental assumption is that the models’ parameters are invariant across all observations, which is also known as measurement invariance (see Horn+McArdle:1992, for an early overview in psychometrics). Otherwise observed differences in the items solved cannot be reliably attributed to the latent variable that the model purports to measure.

Typical sources for violations of measurement invariance in IRT models are multidimensionality (i.e., more than one latent variable instead of a single ability) or differential item functioning (DIF, see Debelak+Strobl:2024). The latter refers to the situation where the same item is relatively easier or more difficult (compared to the remaining items) for different groups of students, even when these students have the same latent ability.
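For reference, the core of the Rasch model discussed throughout the paper can be stated compactly; the notation below (ability θ_i, difficulty β_j) is the standard one from the IRT literature and is added here only for illustration:

\[
P(Y_{ij} = 1 \mid \theta_i, \beta_j) = \frac{\exp(\theta_i - \beta_j)}{1 + \exp(\theta_i - \beta_j)},
\]

where Y_{ij} indicates whether student i solved item j. Measurement invariance requires the item difficulties β_j to hold for all students alike; differential item functioning means that some β_j (relative to the remaining items) differ across subgroups, e.g., defined by gender or prior experience.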

1.4 Our contribution

In Section 2 we introduce a data set from one of our own mathematics courses, containing binary responses (correct vs. incorrect) from 13 items in an end-term exam of an introductory mathematics course for economics and business students. In Section 3 the Rasch IRT model is briefly introduced, fitted to the data, and interpreted with respect to the items' difficulties and the students' abilities. Subsequently, in Section 4 various methods for capturing violations of measurement invariance are applied: (1) Classical two-sample comparisons of two exogenously given groups, along with modern methods for anchoring the item difficulty estimates. (2) Rasch trees based on generalized measurement invariance tests for data-driven detection of subgroups affected by DIF. (3) Rasch finite mixture models as an alternative way of characterizing DIF clusters in a data-driven manner. Section 5 wraps up with a discussion, and the epilogue in Section 6 concludes the paper by highlighting Fritz Leisch's influence on different aspects of this work.

In all sections, emphasis is given to the hands-on application of the methods in R (R) – notably using the packages psychotools (psychotools), psychotree (psychotree), and psychomix (psychomix) – along with practical insights about the analyzed exam.
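To give a flavor of this workflow, the sketch below lists the kinds of function calls used for steps (1)–(3) on the MathExam14W data introduced in Section 2. It is a minimal illustration only; the exact arguments employed later in the paper may differ.

## Minimal sketch of the psycho* workflow (illustrative, not the paper's exact code).
library("psychotools")   ## Rasch model estimation and anchoring
library("psychotree")    ## Rasch trees (model-based recursive partitioning)
library("psychomix")     ## finite mixtures of Rasch models

data("MathExam14W", package = "psychotools")
mex <- subset(MathExam14W, nsolved > 0 & nsolved < 13)

## Rasch model for all students jointly, with item difficulty profile.
ram <- raschmodel(mex$solved)
plot(ram, type = "profile")

## (1) Two-sample comparison of the exogenously given exam groups,
##     based on anchored item difficulties.
ram1 <- raschmodel(subset(mex, group == "1")$solved)
ram2 <- raschmodel(subset(mex, group == "2")$solved)
anchortest(ram1, ram2)

## (2) Rasch tree: data-driven detection of DIF subgroups along covariates.
rt <- raschtree(solved ~ group + tests + gender + attempt, data = mex)
plot(rt)

## (3) Rasch mixture models: DIF clusters without pre-specified covariates
##     (depending on the package version, the item responses may need to be
##     supplied as a plain 0/1 matrix rather than an itemresp object).
rmix <- raschmix(solved ~ 1, data = mex, k = 1:3, scores = "saturated")
BIC(rmix)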

2 Data: Mathematics 101 at Universität Innsbruck

The data considered for examination in the following sections come from the end-term exam in our “Mathematics 101” course for business and economics students at Universität Innsbruck. This is a course in the first semester of the bachelor program and it is attended by about 600–1,000 (winter) or 200–300 (summer) students per semester.

Due to the large number of students in the course, frequent online tests are carried out in the university's learning management system OpenOlat (OpenOlat) as part of the tutorial groups, along with two written exams. All assessments are conducted with support from the R package exams (Gruen+Zeileis:2009; Zeileis+Umlauf+Leisch:2014), which allows a large variety of similar exercises to be generated automatically and rendered into many different output formats.
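As a rough illustration of this approach (with purely hypothetical exercise file names, not the actual item pool of this course), such dynamic exercises can be turned into randomized written exams or online tests roughly as follows:

## Illustrative sketch only: generate randomized exams from a pool of dynamic
## exercise templates (the file names below are made up for illustration).
library("exams")

pool <- list("derivative.Rmd", "matrix_rank.Rmd", "annuity.Rmd")

set.seed(1)
## Written exam: three randomized PDF versions.
exams2pdf(pool, n = 3, name = "mathexam")
## Online test: import file for the OpenOlat learning management system.
exams2openolat(pool, n = 3, name = "mathtest")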

In the following, the individual results from an end-term exam are analyzed for 729 students (out of 941 who had registered at the beginning of the semester). The exam consisted of 13 single-choice items with five answer alternatives each, covering the basics of analysis, linear algebra, and financial mathematics. Due to the high number of participants, the exam was conducted in two groups, back to back, using partially different item pools (on the same topics). All students received individual versions of their items, generated via R/exams. Correctly solved items yielded 100% of the points associated with an exercise. Items without a correct solution were either left unanswered (0%) or answered incorrectly (−25%). In the following, the item responses are treated as binary (correct vs. not correct).
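As a minimal illustration of this dichotomization (with made-up scores, not the actual exam data), scores of this kind could be converted to the binary coding as follows:

## Hypothetical scores for 3 students and 4 items:
## 1 = solved correctly (100%), 0 = unanswered, -0.25 = incorrect answer (-25%).
scores <- matrix(
  c( 1.00, -0.25,  0.00,  1.00,
     0.00,  1.00, -0.25,  1.00,
     1.00,  1.00,  0.00, -0.25),
  nrow = 3, byrow = TRUE
)
## Binary coding: only a fully correct item counts as solved.
solved <- (scores == 1) * 1L
solved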

The data are available in the R package psychotools as MathExam14W, where solved is the main variable of interest. This is an object of class itemresp, which is internally essentially a 729 × 13 matrix with binary 0/1 coding plus some meta-information. In addition to the item responses, there are a number of covariates of interest:

  • group: Factor for the exam group (1 vs. 2).

  • tests: Number of previous online exercises solved (out of 26).

  • nsolved: Number of exam items solved (out of 13).

  • gender, study, attempt, semester, …

For a first overview, we load the package and data. Then we exclude those participants with the extreme scores of 0 and 13, respectively, because these students do not discriminate between the items (either none solved or all solved). The R code below employs the print() and plot() methods for itemresp objects, printing the first couple of item responses and visualizing the proportion of correct responses per item.

R> library("psychotools")
R> data("MathExam14W", package = "psychotools")
R> mex <- subset(MathExam14W, nsolved > 0 & nsolved < 13)
R> head(mex$solved)
[1] {1,1,1,0,1,1,0,1,1,0,1,1,0} {1,1,1,1,0,0,0,1,1,1,1,1,1}
[3] {0,0,1,0,0,1,0,1,1,0,0,1,0} {0,1,0,1,1,1,1,1,1,1,1,1,1}
[5] {1,0,0,1,1,0,0,0,0,1,1,0,1} {1,0,0,1,1,0,0,0,1,1,0,0,0}
R> plot(mex$solved)

Figure 1: Bar charts with relative frequencies of items solved correctly (1, dark gray) …
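Beyond the bar chart, the raw scores and covariates can be inspected with standard R summaries; a small sketch (not code from the paper):

## Number of students remaining after excluding the extreme scores of 0 and 13.
nrow(mex)
## Distribution of raw scores (number of items solved out of 13).
table(mex$nsolved)
## Covariates considered later as potential sources of DIF.
summary(mex[, c("group", "tests", "gender", "attempt")])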