RASCH MODEL-BASED EVALUATION OF TOEFL LISTENING ITEMS: ANALYZING DIFFICULTY, DISCRIMINATION, AND FIT

Authors

  • Bambang Abdul Syukur, University of Kusuma Husada Surakarta
  • Ari Febru Nurlaily, Diploma Three Nursing Study Program, University of Kusuma Husada Surakarta

DOI:

https://doi.org/10.52657/js.v11i2.2931

Keywords:

Discrimination, Item Difficulty, Rasch analysis, TOEFL Listening, Validity

Abstract

This study analyzed TOEFL Listening Section items using the Rasch model. Quantitative analysis of responses from 200 participants revealed three significant problems: (1) 7 misfitting items (Infit MNSQ > 1.5), particularly items of extreme difficulty; (2) 4 items with negative discrimination, indicating contamination by construct-irrelevant variance; and (3) measurement gaps in the Wright map (dead zones for low-ability participants and ceiling effects for high-ability groups). The findings point to structural weaknesses in the test design and support item revision, strategic additions, and redistribution of items to improve validity and assessment fairness. The study underscores the need for psychometrically sensitive approaches in high-stakes language assessment.
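The misfit criterion in the abstract can be illustrated with a minimal sketch. It assumes the standard dichotomous Rasch model and the usual information-weighted definition of Infit MNSQ (sum of squared residuals divided by sum of model variances). The function names, the ability values, and the response patterns are hypothetical and are not taken from the study, which does not describe its computation here.

```python
import math


def rasch_prob(theta, b):
    """Probability of a correct response under the dichotomous Rasch model.

    theta: person ability in logits; b: item difficulty in logits.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def infit_mnsq(responses, thetas, b):
    """Information-weighted mean square (Infit MNSQ) for a single item.

    responses: 0/1 scores, one per person.
    thetas: person ability estimates, assumed to be already estimated.
    b: item difficulty estimate, assumed to be already estimated.
    """
    squared_residuals = 0.0
    total_variance = 0.0
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, b)
        squared_residuals += (x - p) ** 2   # observed minus expected, squared
        total_variance += p * (1.0 - p)     # model variance of the response
    return squared_residuals / total_variance


# Hypothetical illustration data (not from the study).
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
expected_pattern = [0, 0, 1, 1, 1]   # abler persons succeed, as the model predicts
reversed_pattern = [1, 1, 0, 0, 0]   # abler persons fail, contrary to the model

MISFIT_CUTOFF = 1.5  # threshold quoted in the abstract

for label, pattern in [("expected", expected_pattern), ("reversed", reversed_pattern)]:
    value = infit_mnsq(pattern, thetas, b=0.0)
    status = "misfit" if value > MISFIT_CUTOFF else "acceptable"
    print(f"{label}: Infit MNSQ = {value:.2f} ({status})")
```

In this toy case the reversed pattern, where lower-ability persons outperform higher-ability persons, exceeds the 1.5 cutoff, while the model-consistent pattern does not. A reversed pattern of this kind is also what a negative discrimination index would reflect.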

Author Biography

Bambang Abdul Syukur, University of Kusuma Husada Surakarta

Diploma III in Nursing (English Language Education)

Published

01-08-2025

How to Cite

Syukur, B. A., & Nurlaily, A. F. (2025). RASCH MODEL-BASED EVALUATION OF TOEFL LISTENING ITEMS: ANALYZING DIFFICULTY, DISCRIMINATION, AND FIT. Jurnal Smart, 11(2), 176–191. https://doi.org/10.52657/js.v11i2.2931

Section

Articles
