RASCH MODEL-BASED EVALUATION OF TOEFL LISTENING ITEMS: ANALYZING DIFFICULTY, DISCRIMINATION, AND FIT
DOI:
https://doi.org/10.52657/js.v11i2.2931
Keywords:
Discrimination, Item Difficulty, Rasch analysis, TOEFL Listening, Validity
Abstract
This study analyzed TOEFL Listening Section items using the Rasch model. Quantitative analysis of responses from 200 participants revealed three significant problems: (1) 7 misfitting items (Infit MNSQ > 1.5), particularly items of extreme difficulty; (2) 4 items with negative discrimination, indicating contamination by construct-irrelevant variance; and (3) measurement gaps in the Wright Map (dead zones for low-ability participants and ceiling effects for high-ability groups). The findings confirm structural weaknesses in the test design and support recommendations for item revision, strategic additions, and redistribution to enhance validity and assessment fairness. This study underscores the need for psychometrically sensitive approaches in high-stakes language assessment.
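To make the two item-level criteria in the abstract concrete, the following is a minimal Python sketch of how an Infit MNSQ value and a discrimination index could be computed for one dichotomous item. It is an illustration under stated assumptions, not the authors' procedure: the abstract does not name the software used, and the function names, the use of the point-biserial correlation as the discrimination index, and the sample data are all hypothetical. Person abilities and the item difficulty (in logits) are assumed to have been estimated already.

```python
import math


def rasch_prob(theta, b):
    """Rasch model: probability of a correct response for ability theta
    and item difficulty b (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def infit_mnsq(responses, thetas, b):
    """Information-weighted mean-square fit for one item:
    sum of squared residuals divided by the sum of model variances."""
    num = 0.0
    den = 0.0
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, b)
        num += (x - p) ** 2      # squared residual
        den += p * (1.0 - p)     # model variance of the response
    return num / den


def point_biserial(item_scores, total_scores):
    """Pearson correlation between 0/1 item scores and total scores,
    used here as an assumed stand-in for item discrimination."""
    n = len(item_scores)
    mx = sum(item_scores) / n
    my = sum(total_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(item_scores, total_scores))
    vx = sum((x - mx) ** 2 for x in item_scores)
    vy = sum((y - my) ** 2 for y in total_scores)
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    # Hypothetical data: low-ability persons succeed, high-ability persons fail.
    thetas = [-2.0, -1.0, 1.0, 2.0]
    responses = [1, 1, 0, 0]
    totals = [10, 15, 25, 30]

    mnsq = infit_mnsq(responses, thetas, b=0.0)
    r_pb = point_biserial(responses, totals)
    print("Infit MNSQ:", round(mnsq, 2), "misfit:", mnsq > 1.5)
    print("Point-biserial:", round(r_pb, 2), "negative:", r_pb < 0)
```

In this reversed response pattern the item would be flagged on both criteria: the Infit MNSQ exceeds the 1.5 cutoff cited in the abstract, and the discrimination index is negative.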