|Year : 2015 | Volume
| Issue : 3 | Page : 220-225
Angoff's method: The impact of raters' selection
Assad A Rezigalla
Department of Anatomy, College of Medicine, International University of Africa, Khartoum, Sudan
Date of Web Publication: 3-Aug-2015
Background: Several methods have been proposed for setting an examination pass mark (PM), and Angoff's method or its modified version is the preferred one. Selection of raters is important and affects the PM.
Aims and Objectives: This study aims to investigate the selection of raters in Angoff's method and the impact of their academic degrees and experience on the resulting PM.
Materials and Methods: A Type A MCQ examination was used in this study as a model. Raters with different academic degrees and experience participated in the study. Raters' estimations were statistically analyzed.
Results: The selection of raters was crucial. Agreement among raters could be achieved by those with relevant qualifications and expertise. There was an association between high estimation, academic degree, expertise and high PM.
Conclusion: Selection of raters for Angoff's method should include those with different academic degrees, backgrounds and experience so that a satisfactory PM may be reached by means of a reasonable agreement.
Research summary:
The study aimed to evaluate the selection of raters and the effect of their academic qualifications and experience on determining the pass mark. An examination consisting of Type A multiple-choice questions was used. Raters with different academic degrees and teaching experience participated in the study, and the resulting estimations were analyzed statistically. The selection of raters had a pivotal effect. Agreement was found among raters with similar academic degrees and experience. In conclusion, the selection of raters should include different degrees, experience and backgrounds so that a pass mark can be reached by agreement among the raters.
Keywords: Academic degree, Angoff's method, experience, raters' selection, setting pass mark
|How to cite this article:|
Rezigalla AA. Angoff's method: The impact of raters' selection. Saudi J Med Med Sci 2015;3:220-5
Introduction
The pass mark (PM) in educational testing is the standard criterion that determines whether a student passes or fails an examination, and thus whether the student is considered competent. Accordingly, the PM and the procedures used to set it depend on a number of legal, professional, theoretical, and psychometric issues.
Several methods have been proposed for setting a PM. However, Angoff's method is the preferred and most often used method. It is the most popular method for multiple-choice questions. It can be used for both medium- and high-stakes examinations and is also appropriate for an OSCE or even for computer-based testing.
The Angoff method was developed from extensive research on a footnote to a chapter of a book written by Angoff. This criterion-referenced method suggests how judgments about minimally competent students can be used to set a cut-off score.
The Angoff method involves asking judges to estimate the probability that a minimally competent student will answer each item on a test correctly. Application of Angoff's method depends on the definition of the minimally competent student and on the raters' (judges') judgments about the questions.
This definition of a minimally competent student forms the basis of any judgment about setting the PM. A common procedure is to allow the raters to determine a definition as a group and have them all use this definition to make their judgments. In other cases, a preexisting definition is given to the raters, or the raters are asked to define the minimally competent student independently. The latter approach eventually results in a wide divergence of the judges' estimates.
The other component on which Angoff's method depends is the raters. The raters should be familiar with Angoff's method, the students, the curriculum and the course being assessed. Many modifications have been made to the original method with regard to the raters. A common modification is to allow the raters to discuss their estimates with each other, although this has drawbacks such as the dominance of one rater on the committee. Another modification is iteration of estimations. These modifications have been applied to increase the reliability of the rating (judgments) by increasing intra- and inter-rater consistency and by reducing variability among raters and in the cut-off score. By increasing the reliability of the judgments and reducing variability, the degree of error in the resulting PM is reduced.
Few studies have considered the selection of raters, their training and their interaction. This study aims to investigate the effect of the selection of raters in Angoff's method on the suggested PM.
Materials and methods
This study was conducted in the Department of Anatomy, College of Medicine, King Khalid University. The College of Medicine adopted the traditional curriculum in teaching medicine in 12 semesters. According to the university regulations, the examination PM is 60. The examination sample used in this study was a final exam on Anatomy given to semester four students (March 2014). Standard settings were followed during preparation and administration of the examination. The examination consisted of 55 MCQs of Type A variety and the time allowed was 2 h.
This article aimed at studying the impact of academic degree and experience on the PM arrived at.
Staff members in this study were chosen according to standard settings. All were familiar with both the students and the curriculum. The staff members (raters) were categorized as associate professor (ST3), assistant professor (ST2) and lecturer (ST1). There were 15 raters in total, five in each category. They had teaching experience of 15 ± 2.0, 12 ± 2.1 and 26 ± 3.5 years for ST1, ST2, and ST3 respectively. These staff members underwent a short training course on Angoff's method and the setting of a PM. A committee was formed from each category of raters and tasked with setting a PM. A final committee comprising all raters then set the final PM.
The raters were instructed that the estimated probability of a minimally competent student answering a question correctly could be neither 100% nor <25%. Differences among raters' estimations for a question were accepted if they were within 30% or if the standard deviation of the estimations was at or below 10 units.
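The acceptance rules above can be sketched as a small check on one question's estimates. This is an illustrative sketch only: the function name and defaults are hypothetical, "within 30%" is read here as a range (highest minus lowest estimate) of at most 30 percentage points, the standard deviation is taken as the population SD, and a question is accepted when either tolerance is met. The article does not specify these details.

```python
from statistics import pstdev


def check_item_estimates(estimates, floor=25.0, ceiling=100.0,
                         max_range=30.0, max_sd=10.0):
    """Return a list of problems found in one question's estimates (in percent).

    Hypothetical helper; thresholds follow the text, their exact
    interpretation is an assumption.
    """
    problems = []
    for value in estimates:
        # Instruction to raters: never 100%, never below 25%.
        if value >= ceiling or value < floor:
            problems.append(f"estimate {value} outside [{floor}, {ceiling})")
    spread = max(estimates) - min(estimates)
    sd = pstdev(estimates)  # assumption: population SD
    # Assumption: the question is acceptable if EITHER tolerance is met,
    # so it is flagged only when both are exceeded.
    if spread > max_range and sd > max_sd:
        problems.append(f"raters disagree: range {spread:.1f}, SD {sd:.1f}")
    return problems


if __name__ == "__main__":
    print(check_item_estimates([50, 60, 70]))          # no problems expected
    print(check_item_estimates([30, 95, 60, 40, 90]))  # wide disagreement
```

A question that returns a non-empty list would be a candidate for discussion in the consensus meeting.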
The raters were given the exam and asked to individually estimate how a minimally competent student would perform on the questions. The raters had a meeting to discuss the estimations and reach a consensus. The PM was calculated from the mean of the estimations.
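As a minimal sketch of how a PM can be derived from the mean of the estimations, the example below averages the raters' estimates per question and then averages across questions. The function name and the toy data are hypothetical, and the estimates are assumed to be percentages so that the result is a PM out of 100.

```python
from statistics import mean


def angoff_pass_mark(estimates_by_rater):
    """Compute a pass mark (out of 100) from Angoff estimates.

    estimates_by_rater: list of per-rater lists, each holding one
    percentage estimate per question (hypothetical layout).
    """
    n_items = len(estimates_by_rater[0])
    if any(len(row) != n_items for row in estimates_by_rater):
        raise ValueError("every rater must estimate every question")
    # Mean estimate per question across raters.
    item_means = [mean(column) for column in zip(*estimates_by_rater)]
    # The pass mark is the mean of the per-question means.
    return mean(item_means)


if __name__ == "__main__":
    toy = [
        [50, 60, 70],  # rater 1
        [60, 60, 80],  # rater 2
        [40, 60, 60],  # rater 3
    ]
    print(angoff_pass_mark(toy))
```

With equal numbers of estimates per question, this is the same as the grand mean of all estimates.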
The raters' estimations were recorded. The degree of agreement among raters (inter-rater agreement) was calculated using the kappa statistic.
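For readers unfamiliar with the kappa statistic, the following sketch computes Cohen's kappa for two sets of categorical ratings. Kappa requires categories, and the article does not state how the percentage estimates were categorized, so the banding helper `to_band` (10-point bands) is purely an assumption for illustration, as are the function names.

```python
from collections import Counter


def to_band(estimate, width=10):
    """Hypothetical categorization: map a percentage estimate to a band index."""
    return int(estimate // width)


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equally long sequences of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from the marginal frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    committee_1 = [to_band(x) for x in [55, 62, 48, 71, 66]]
    committee_2 = [to_band(x) for x in [58, 69, 52, 74, 61]]
    print(cohens_kappa(committee_1, committee_2))
```

Kappa corrects the raw percentage of agreement for the agreement that would be expected by chance, which is why it can differ markedly from a simple match rate.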
To calculate the percentage of high estimations (HEs) for each category of raters, questions on which two categories gave equal HEs were omitted. This left 48 of the 55 questions. The percentage of HEs for each category was then calculated.
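One plausible reading of this step can be sketched as follows: for each question, the category with the single highest estimate is credited with an HE, and questions on which the highest value is shared are omitted. The function name and data layout are hypothetical. Taking the percentages over all questions (rather than over the non-omitted ones) is an assumption, although it appears consistent with the reported figures, since 25/55 ≈ 45.5% and 9/55 ≈ 16.4%.

```python
def high_estimation_share(category_means):
    """Percentage of questions on which each category gave the highest estimate.

    category_means: dict mapping a category name to a list with one
    (mean) estimate per question. Returns (shares, omitted_count).
    """
    names = list(category_means)
    n_items = len(category_means[names[0]])
    wins = {name: 0 for name in names}
    omitted = 0
    for i in range(n_items):
        values = {name: category_means[name][i] for name in names}
        top = max(values.values())
        # Exact equality is used for ties; real data may need rounding first.
        leaders = [name for name, v in values.items() if v == top]
        if len(leaders) > 1:
            omitted += 1  # tie for the highest estimate: question omitted
            continue
        wins[leaders[0]] += 1
    # Assumption: the denominator is the total number of questions.
    shares = {name: 100.0 * wins[name] / n_items for name in names}
    return shares, omitted


if __name__ == "__main__":
    toy = {
        "ST1": [50, 60, 70, 40],
        "ST2": [55, 60, 65, 55],
        "ST3": [60, 58, 70, 50],
    }
    print(high_estimation_share(toy))
```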
The PMs for each committee of raters and for the committee of all raters were calculated. The PMs were calculated from the means of the raters' estimations [Figure 1]. The final PM of the all-raters committee was found to be 58.9 out of 100.
Figure 1: Pass marks and percentage of high estimations of raters and the success rates of the students. PM – Pass mark; FS – Fixed standard; SR – Success rate; HE – High estimations
The credibility of the PM was determined by comparing the PMs obtained by Angoff's method with the fixed pass mark (FPM) and norm-referenced (NR) PMs.
Raters' estimations were analyzed statistically and the results presented as mean ± standard deviation. Differences, correlations and inter-rater agreement were evaluated (SPSS for Windows, version 15.0; Armonk, NY: IBM Corp, USA).
The present study discusses the impact of raters' academic degrees and the experience on the resulting PM.
Results
The number of estimations recorded from all categories of raters (15 raters) for the 55 questions was 825. The percentage of HEs was calculated for each category. The ST3 committee had the highest percentage of HEs (45.5%) and the ST1 committee the lowest (16.4%). The percentage of HEs increased in association with both academic degree and experience [Figure 1].
The kappa statistic was used to determine agreement between raters. The highest percentage of agreement was recorded between the committees of ST2 and ST3, followed by ST1 and ST3, and the lowest was between ST1 and ST2 [Table 1].
There was a strong correlation between the committees of ST3 and ST1 (0.771) and, to a lesser extent, between ST2 and ST1 (0.529) and between ST2 and ST3 (0.473) [Table 2].
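The article does not state which correlation coefficient was used. As an illustration only, the sketch below computes Pearson's r between two committees' per-question estimates; the function name and the toy data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    committee_a = [55, 62, 48, 71, 66]  # toy per-question estimates
    committee_b = [65, 70, 60, 82, 75]  # consistently higher, similar ordering
    print(pearson_r(committee_a, committee_b))
```

A correlation of this kind reflects whether two committees order the questions similarly by difficulty. It can be high even when one committee rates every question higher than the other, so it does not by itself indicate agreement on the estimates.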
All PMs were calculated out of 100. The all-raters (AR) PM was 58.9. The PMs of ST3, ST2 and ST1 were 61.8, 58.4 and 58.1 respectively. A one-way ANOVA test showed a non-significant difference between all PMs [Table 3]. The committees of ST3 and ST2, ST3 and ST1, ST2 and ST1, and AR ended with PMs of 60.1, 60, 58.3 and 58.9 respectively.
A paired-sample test showed a significant difference between the FPM, NR and final committee (AR) PMs (P < 0.05).
Discussion
Many modifications have been made and applied to the original Angoff's method. Most of these were directed towards reaching a better agreement among raters, but few were directed at the raters themselves, their selection and their interaction. This study investigated raters' selection and its effect on the resulting PM.
The raters in this study had varying academic degrees and experience and were of a higher level than the students. Being involved in teaching, they were considered qualified. Norcini emphasizes the importance of a mixed committee over its size. The committee should include different professional roles and a balance of personal attributes including gender, race and age. Committees in the present study differed in academic degree, background and age. This mix of members had no conflict of interest. According to Verhoeven et al., differences in background and expertise can offset the influence of a small number of raters in the committee. In the literature, the recommended number of raters varies from a few (5-10 or 10-15) to many (5-30) or as many as possible, and the root mean squared error has even been used to determine the number of raters.
Verheggen et al. reported that Angoff's estimates were significantly affected by the rater's ability to answer the questions correctly or give the model answers. These findings stress the importance of a careful selection of raters in Angoff's method. This implies that the judges should be selected from those who are not only capable of conceptualizing the "minimally competent student," but also capable of answering all the items correctly, and who have expertise in the domain assessed by the test.
The use of recently graduated staff members as raters is justifiable in Angoff's method. In the present work, the use of lecturers in estimating the PM led to the same result. The ST1 group were the most recent consumers of the curriculum with regard to examinations, whether general or specific. Since students' learning is driven by examinations, examinations constitute the real curriculum. This group, therefore, has the effective knowledge and can target the borderline student more accurately. The limited teaching experience of the ST1 group did not affect their estimations, as Angoff's method does not target the delivery of knowledge.
Angoff's method targets the minimally competent student as the cut-off score by reaching an agreement between raters. In the present study, the committees of ST3 and ST2 had a high percentage of agreement and a low correlation. Although both the ST2 and ST1 and the ST3 and ST2 committees had lower correlations, that of ST3 and ST1 was higher. The differences in academic degrees and experience between ST3 and ST1 affected the agreement within the committee in spite of the correlation between their estimations. In the present study, there were associations between high academic degrees, experience, HEs and PM on the one hand and inter-rater agreement on the other. These findings suggest that acceptable degrees of agreement can be reached by selecting a committee of raters with relevant academic degrees and experience. These findings also support the work by Verheggen et al., who indicated that the rating in Angoff's method depended on the quality of the panel members.
A comparison of the percentages of agreement among committees shows that, although ST3 and ST1 differed greatly in experience and academic degree and differed significantly in their estimations, they showed a stronger correlation and better agreement than ST2 and ST1. ST2 was related to both ST3 and ST1 but was more in agreement with ST3. The committees of ST2 and ST1 were closer to the students, but their correlation and agreement were intermediate and low respectively. Thus, inter-rater agreement appeared to be affected more by experience than by closeness to students.
High estimations by raters did not affect the agreement within committees. The percentage of HEs increased with both academic degree and experience. Schoon et al. noted unrealistically high PMs among expert judges, although Angoff's method is associated with low PMs. HEs, consequently, had an effect on the resulting PM in the case of a committee with a single category of highly qualified expert raters.
Angoff's method and its modifications concentrate on reaching a high degree of agreement between raters without regard to whether the resulting PM is low or high. Although the method has been linked to high pass rates and low PMs, the PM itself is not of concern in Angoff's method, as it focuses mainly on both the minimally competent student and the exam. The PM of the final committee of all raters (ST3, ST2 and ST1) was 58.9 out of 100. There was no significant difference between the PMs of the final committee and those of the committees of the different categories of raters. The PM developed by Angoff's method, the FPM and the NR were not significantly different in the present work. The ST3 committee gave the highest PM, and the ST1 committee the lowest of all PMs. The present result is in accordance with a previous work by Norcini and Shea, who indicated that different groups of experts set the same standard for the same test material, and with the finding that a committee of expert raters set an unrealistically high PM. The combined ST3 and ST1 committee produced a medium PM. The PM correlated positively with both the academic degree and the experience of the raters.
Conclusion
The present study showed that agreement can be reached by a selection of raters with the same or similar qualifications and experience. Moreover, the percentage of HEs and the PM increased with an increase in the academic degree. A committee of raters with high academic degrees and experience resulted in high PMs and vice versa. Thus, the mode of committee selection can alter the resulting PM.
The selection of raters for Angoff's method should include raters with different academic degrees and experience to arrive at an agreement. This method of selection will produce a reasonable PM by means of a satisfactory agreement.
Acknowledgments
The author acknowledges the effort of the raters who participated in the study. Great appreciation is extended to Dr. S. Bashir, Dr. O. Elfaki, Prof. J. Haidera and Prof. M. Habieb for their comments. Great thanks are due to Mr. Abid MK for the statistical analysis and the helpful comments. The comments of Dr. El. Mekki A are highly appreciated. Special thanks go to Prof. M. Atiff. The College Dean and Administration of the College of Medicine (KKU, KSA) are thanked for their help and for allowing the use of facilities.
References
Biddle RE. How to set cut off scores for knowledge tests used in promotion, training, certification, and licensing. Public Pers Manag 1993;22:63-80. Last accessed November 15, 2014.
Cascio WF, Alexander RA, Barrett GV. Setting cutoff scores: Legal, psychometric, and professional issues and guidelines. Pers Psychol 1988;41:1-24.
Cizek GJ. Reconsidering standards and criteria. J Educ Meas 1993;30:93-106.
Kane M. Validating the performance standards associated with passing scores. Rev Educ Res 1994;64:425-61.
Maurer TJ, Alexander RA. Methods of improving employment test critical scores derived by judging test content: A review and critique. Pers Psychol 1992;45:727-62.
Ahn DS, Ahn S. Reconsidering the cut score of Korean National Medical Licensing Examination. J Educ Eval Health Prof 2007;4:1.
Angoff W. Scales, norms, and equivalent scores. Educational Measurement: Theories and Applications. Vol. 2. Educational Testing Service. Princeton, New Jersey; 1996. p. 121.
Berk RA. A consumer's guide to setting performance standards on criterion-referenced tests. Rev Educ Res 1986;56:137-72.
Impara JC, Plake BS. Teachers' ability to estimate item difficulty: A test of the assumptions in the Angoff standard setting method. J Educ Meas 1998;35:69-81.
Kaufman DM, Mann KV, Muijtjens AM, van der Vleuten CP. A comparison of standard-setting procedures for an OSCE in undergraduate medical education. Acad Med 2000;75:267-71.
Boursicot KA, Roberts TE, Pell G. Using borderline methods to compare passing standards for OSCEs at graduation across three medical schools. Med Educ 2007;41:1024-31.
Senthong V, Chindaprasirt J, Sawanyawisuth K, Aekphachaisawat N, Chaowattanapanit S, Limpawattana P, et al. Group versus modified individual standard-setting on multiple-choice questions with the Angoff method for fourth-year medical students in the internal medicine clerkship. Adv Med Educ Pract 2013;4:195-200.
Siriwardena AN, Dixon H, Blow C, Irish B, Milne P. Performance and views of examiners in the Applied Knowledge Test for the nMRCGP licensing examination. Br J Gen Pract 2009;59:e38-43.
Reilly RR, Zink DL, Israelski EW. Comparison of direct and indirect methods for setting minimum passing scores. Appl Psychol Meas 1984;8:421-9.
Hurtz GM, Auerbach MA. A meta-analysis of the effects of modifications to the Angoff method on cutoff scores and judgment consensus. Educ Psychol Meas 2003;63:584-601.
Ricker KL. Setting cut-scores: A critical review of the Angoff and modified Angoff methods. Alberta J Educ Res 2006;52:53-64.
Cizek GJ, Bunch MB. Standard Setting: A Guide to Establishing and Evaluating Performance Standards on Tests. Thousand Oaks: SAGE Publications Ltd.; 2007.
Fehrmann ML, Woehr DJ, Arthur W. The Angoff cutoff score method: The impact of frame-of-reference rater training. Educ Psychol Meas 1991;51:857-72.
Norcini JJ. Research on standards for professional licensure and certification examinations. Eval Health Prof 1994;17:160-77.
Norcini J, Shea J. The reproducibility of standards over groups and occasions. Appl Meas Educ 1992;5:63-72.
Hambleton RK. Setting performance standards on educational assessments and criteria for evaluating the process. Setting Performance Standards: Concepts, Methods, and Perspectives. Mahwah, NJ: Lawrence Erlbaum Publishers. 2001. p. 89-116.
Yudkowsky R, Downing SM, Popescu M. Setting standards for performance tests: A pilot study of a three-level Angoff method. Acad Med 2008;83:S13-6.
Busch JC, Jaeger RM. Influence of type of judge, normative information, and discussion on standards recommended for the National Teacher Examinations. J Educ Meas 1990;27:145-63.
Hambleton RK, Plake BS. Using an extended Angoff procedure to set standards on complex performance assessments. Appl Meas Educ 1995;8:41-55.
Truxillo DM, Donahue LM, Sulzer JL. Setting cutoff scores for personnel selection tests: Issues, illustrations, and recommendations. Hum Perf 1996;9:275-95.
Plake BS, Impara JC, Irwin PM. Consistency of Angoff-based predictions of item performance: Evidence of technical quality of results from the Angoff standard setting method. J Educ Meas 2000;37:347-55.
Plake BS, Impara JC. Ability of panelists to estimate item performance for a target group of candidates: An issue in judgmental standard setting. Educ Assess 2001;7:87-97.
Chang L. Judgmental item analysis of the Nedelsky and Angoff standard-setting methods. Appl Meas Educ 1999;12:151-65.
Wheaton A, Parry J, editors. Using the Angoff Method to Set Cut Scores. Users Conference; 2012. Available from: https://www.questionmark.com/us/seminars/Documents/webinar_anoff_handout_may_2012.pdf. Last accessed April 23, 2012.
George S, Haque MS, Oyebode F. Standard setting: Comparison of two methods. BMC Med Educ 2006;6:46.
Bhandary S. Standard setting in health professions education. Kathmandu Univ Med J 2012;9:3-4.
Norcini JJ. Setting standards on educational tests. Med Educ 2003;37:464-9.
Verhoeven BH, Verwijnen GM, Muijtjens AM, Scherpbier AJ, van der Vleuten CP. Panel expertise for an Angoff standard setting procedure in progress testing: Item writers compared to recently graduated students. Med Educ 2002;36:860-7.
Verhoeven BH, van der Steeg AF, Scherpbier AJ, Muijtjens AM, Verwijnen GM, van der Vleuten CP. Reliability and credibility of an Angoff standard setting procedure in progress testing using recent graduates as judges. Med Educ 1999;33:832-7.
Hurtz GM, Hertz NR. How many raters should be used for establishing cutoff scores with the Angoff method? A generalizability theory study. Educ Psychol Meas 1999;59:885-97.
Zieky M, Livingston SA. Manual for setting standards on the basic skills assessment tests. Princeton, NJ: Educational Testing Service; 1977. p. 235.
Cizek GJ. An NCME instructional module on: Setting passing scores. Educ Meas Issues Pract 1996;15:20-31.
Fowell SL, Fewtrell R, McLaughlin PJ. Estimating the minimum number of judges required for test-centred standard setting on written assessments. Do discussion and iteration have an influence? Adv Health Sci Educ Theory Pract 2008;13:11-24.
Verheggen MM, Muijtjens AM, Van Os J, Schuwirth LW. Is an Angoff standard an indication of minimal competence of examinees or of judges? Adv Health Sci Educ Theory Pract 2008;13:203-11.
Jaeger RM. Selection of judges for standard setting. Educ Meas Issues Pract 1991;10:3-14.
Newble DI, Jaeger K. The effect of assessments and examinations on the learning of medical students. Med Educ 1983;17:165-71.
Frederiksen N. The real test bias: Influences of testing on teaching and learning. Am Psychol 1984;39:193.
Schoon CG, Gullion CM, Ferrara P. Bayesian statistics, credentialing examinations, and the determination of passing points. Eval Health Prof 1979;2:181-201.
Wayne DB, Fudala MJ, Butter J, Siddall VJ, Feinglass J, Wade LD, et al. Comparison of two standard-setting methods for advanced cardiac life support training. Acad Med 2005;80:S63-6.
[Table 1], [Table 2], [Table 3]