
Detecting Rater Drift on an Oral English Performance Test with a Multi-faceted Rasch Model

Supervisor: Liu Jianda (刘建达)

Discipline code: 040102

Degree awarded: Master's

Author: ;

Institution: Guangdong University of Foreign Studies (广东外语外贸大学)

Abstract: Performance tests of language proficiency measure examinees' ability to respond to real-life language tasks. Scores from performance testing therefore provide accurate and valid information about examinees' ability to use the language. However, one unavoidable limitation of such testing is that examinees' performances are rated by human raters, who are known to introduce error into the rating process. Such rater effects have an important impact on examinees' scores and, in high-stakes tests, even on their futures. It is therefore necessary to control the impact of rater effects on the validity of test scores.

This study investigates the stability of rater severity over time and other major rater effects on an oral test of the NMET (National Matriculation English Test) in a province of China. Ratings of 360 examinees by 15 raters were pooled and analyzed with FACETS 3.58.0, a multi-faceted Rasch analysis program (Linacre, 2005). The study found that the raters differed in severity and that their severity did not stay invariant over time. Despite these differences, the changes in severity across sessions were within an acceptable range; only Rater 9 and Rater 10 changed more than expected. The majority of raters applied the rating scale consistently, but Rater 10 and Rater 15 did not use it as consistently as the others, showing more variation than was acceptable under the expectations of the Rasch model. The raters as a whole showed a central tendency effect on the traits of fluency and language, overusing a particular band of the rating scale. In addition, Raters 3, 5, 10 and 13 exhibited a possible halo effect, although no halo effect was found for the group as a whole. Six of the fifteen raters had outfit values larger than their infit statistics, showing that they had assigned some ratings unexpected by the model.

Test administrators should follow up on the unexpected ratings identified by FACETS in order to retrain or replace unqualified raters or to revise the rating scale. The study shows that FACETS is a useful tool for studying rater performance. Its results can be used by test administrators to target individual raters and improve rater accuracy, so that rater effects on examinees' scores are reduced to a minimum.
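The abstract compares each rater's outfit and infit statistics. As an illustration only, the sketch below computes both using the standard Rasch mean-square definitions: outfit is the unweighted mean of squared standardized residuals, and infit is the information-weighted version. The function name and all sample values (observed ratings, model-expected ratings, model variances) are hypothetical and are not taken from the thesis. FACETS derives these quantities from its own parameter estimates.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean-square statistics for one rater.

    observed: ratings the rater actually gave
    expected: ratings the Rasch model expected for the same performances
    variance: model variance of each rating (all values must be > 0)
    """
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: plain average of squared standardized residuals,
    # so a single very surprising rating can dominate it.
    outfit = sum(r / w for r, w in zip(squared_residuals, variance)) / len(squared_residuals)
    # Infit: residuals weighted by model variance (information),
    # which damps the influence of isolated outlying ratings.
    infit = sum(squared_residuals) / sum(variance)
    return infit, outfit


# Hypothetical ratings for one rater; the fourth rating is far from expectation.
observed = [3, 4, 2, 5, 3]
expected = [3.2, 3.8, 2.5, 3.0, 3.1]
variance = [0.8, 0.7, 0.9, 0.4, 0.8]

infit, outfit = fit_mean_squares(observed, expected, variance)
print(f"infit = {infit:.2f}, outfit = {outfit:.2f}")
```

In this made-up case the single unexpected rating pushes outfit well above infit, which is the pattern the abstract reports for six of the fifteen raters.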

Keywords: rater severity; change; halo effect; central tendency; multi-faceted Rasch model

Classification number: [H319.3]

Field: [Language and Script]

