Evaluating Comparability in the Scoring of Performance Assessments for Accountability Purposes

by

Susan Lyons, Ph.D., is an associate at the National Center for the Improvement of Educational Assessment. Carla Evans is a doctoral candidate in Assessment, Evaluation and Policy at the University of New Hampshire.

Researchers report on their evaluation of comparability claims in local scoring of performance assessments across districts participating in New Hampshire’s Performance Assessment of Competency Education pilot project.

This brief summarizes “Comparability in Balanced Assessment Systems for State Accountability,” published in Educational Measurement: Issues and Practice (Evans & Lyons, 2017). This study evaluated comparability claims in local scoring of performance assessments across districts participating in New Hampshire’s Performance Assessment of Competency Education (PACE) pilot project.

With the passage of the Every Student Succeeds Act (ESSA), there has been increasing attention on how states could design innovative assessment and accountability systems to submit for approval under the law. The challenge lies in designing assessment and accountability systems that can support instructional uses while serving accountability purposes (Baker & Gordon, 2014; Gong, 2010; Marion & Leather, 2015). New Hampshire’s PACE project is an example of one such system. Within PACE, local and common performance assessments administered throughout the school year contribute to students’ overall competency scores, which are in turn used to make annual determinations of student proficiency for state and federal accountability. A central challenge is using information from these multiple, local assessment sources to support comparable scoring across districts.

Comparability

Comparability is not an attribute of a test or test form, nor is it a yes/no decision. Instead, comparability is the degree to which scores resulting from different assessment conditions can support the same inferences about what students know and can do. In other words, can scores resulting from different assessment conditions support the same uses (e.g., school evaluation)? Comparability becomes important when we claim that students and schools are being held to the same standard, particularly when those designations are used in a high-stakes accountability context.

Methods and Results

There are many different methods for gathering evidence to support score comparability evaluations. Contrasting conceptions of comparability typically include statistical and judgmental approaches, or some combination of the two (Baird, 2007; Newton, 2010). The appropriate approach depends on the nature of the assessments and the intended interpretations and uses of the test scores (Gong & DePascale, 2013). We applied two judgmental methods for estimating comparability that are used in international contexts: consensus scoring and rank-ordering.1

Consensus Scoring

The consensus scoring method pairs teachers from different districts to score work samples from students outside either of their districts. The student work samples come from common performance tasks administered across districts in particular grade and subject areas. Examining the work samples one at a time, the judges discuss their individual scores and then come to an agreement on a consensus score. The purpose of collecting consensus score data is to approximate “true scores” for the student work. To detect any systematic discrepancies in the relative leniency or stringency of district scoring, we calculated the average difference between local teacher scores and the consensus scores (the mean deviation index). Using this index, a negative mean deviation indicates an underestimation of student scores by classroom teachers (i.e., district stringency), and a positive mean deviation indicates an overestimation of student scores by classroom teachers (i.e., district leniency).
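To make the mean deviation index concrete, the following sketch (in Python) computes a per-district average of teacher-minus-consensus differences. The data layout, function name, and sample scores are our own illustration, not code or data from the study.

# Minimal sketch of the mean deviation index described above.
# Each hypothetical record holds a district label, the local teacher-given
# score, and the cross-district consensus score for one work sample.
from collections import defaultdict

def mean_deviation_by_district(records):
    """Average of (teacher score - consensus score) per district.

    Positive values suggest local leniency (teachers scored above consensus);
    negative values suggest local stringency (teachers scored below consensus).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for district, teacher_score, consensus_score in records:
        sums[district] += teacher_score - consensus_score
        counts[district] += 1
    return {d: sums[d] / counts[d] for d in sums}

# Illustrative scores on a four-point rubric (invented for the example):
sample = [
    ("District A", 3, 3), ("District A", 4, 3),
    ("District B", 2, 3), ("District B", 3, 3),
]
print(mean_deviation_by_district(sample))  # {'District A': 0.5, 'District B': -0.5}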

Across all districts, the mean deviation indices were positive, meaning the consensus scores were somewhat lower than the teacher-given scores. This finding is not necessarily problematic from a comparability perspective, as long as the relative leniency of the teacher-given scores is consistent across districts. The analysis, however, revealed uneven scoring across districts, suggesting a continued need for additional training on scoring and within-district calibration, as well as for increased cross-district calibration.

Rank-Order Method

High school biology presented a unique challenge in calibrating cross-district scores because no common performance assessment was administered across districts in this discipline; each district developed and implemented its own distinct tasks. Typically, score calibration procedures require one of two conditions: 1) common persons across tasks, or 2) common tasks across persons. Because neither condition was satisfied in the 2014-15 implementation of high school science in PACE, we looked to alternative methods of score calibration and modeled our approach after the rank-ordering cross-moderation method used in England.

The seven participating judges, all high school science teachers, were given packets of student work that had been grouped by average rubric score and that represented biology performance tasks from all four districts. After training and an opportunity to familiarize themselves with the different performance assessments from the four districts, the judges were instructed to rank the papers within each packet based on merit, evidence of student understanding, demonstrated competence, and student knowledge of science, which are all different ways of saying “better,” as Bramley (2007) succinctly puts it. The rank orders from the teacher judges were then converted to scores, which were compared with the original teacher-given scores.
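The study’s rank-order analysis follows Bramley’s method, which places papers on a common scale from judges’ rankings. The sketch below is a deliberately simplified stand-in, assuming a convention of rank 1 = strongest paper: it converts within-packet ranks to standardized scores and averages them by the district that produced each paper, just to make the cross-district comparison logic visible. All names and data are hypothetical.

# Simplified illustration of converting within-packet ranks to comparable
# scores and summarizing them by district. This is not the latent-trait
# model used in the published analysis; it standardizes ranks within a
# packet so that higher values indicate stronger papers.
import statistics

def ranks_to_scores(packet_ranks):
    """Convert ranks within one packet (1 = strongest paper) to z-like scores."""
    n = len(packet_ranks)
    mean_rank = (n + 1) / 2
    sd_rank = statistics.pstdev(range(1, n + 1)) or 1.0
    # Negate so that a higher score means a stronger paper.
    return {paper: -(rank - mean_rank) / sd_rank for paper, rank in packet_ranks.items()}

def district_means(judge_scores, paper_district):
    """Average the judge-derived scores by the district that produced each paper."""
    by_district = {}
    for paper, score in judge_scores.items():
        by_district.setdefault(paper_district[paper], []).append(score)
    return {d: statistics.mean(v) for d, v in by_district.items()}

# One hypothetical packet of four papers, one per district:
packet = {"p1": 1, "p2": 2, "p3": 3, "p4": 4}             # a judge's rank order
districts = {"p1": "A", "p2": "B", "p3": "C", "p4": "D"}  # producing district per paper
print(district_means(ranks_to_scores(packet), districts))

Judge-derived district means computed this way can then be compared with the standardized original teacher scores to flag districts whose local scoring runs noticeably lenient or stringent.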

The results revealed scoring differences across districts, most notably in one district whose teachers scored their students’ work a full standard deviation below (i.e., more stringently than) where the judges placed the same work within the sample.

Conclusion

We found that applying the two methods in the context of the PACE system highlighted strengths and limitations of each. First, both methods provide evidence about the comparability of districts’ local scoring. Both methods also involve teachers from multiple districts reviewing student work samples from other districts, which has the added benefit of providing a rich context for professional development. In previous research on the effects of high-stakes performance-based assessment systems on student performance (Borko, Elliott, & Uchiyama, 2002; Lane, Parke, & Stone, 2002; Parke, Lane, & Stone, 2006), professional development had a strong mediating effect on the relationship between the performance-based assessment system and changes in teacher instructional practices. Using one of these methods not only provides the necessary evidence of comparability in local scoring but also builds in a professional development opportunity for teachers.

That said, reviewing student work samples across districts is costly and time-consuming. The practicality and feasibility of scaling up the proposed methods in a large-scale performance assessment program is a real concern, particularly in a state with many more districts, or other units, each operating its own local assessment system. One way New Hampshire has attempted to address issues of scale is through improved technology. As the project continues to grow, New Hampshire is undergoing an intensive research and development process to procure additional software that will support the scaling of this effort through virtual task design and scoring.

States awarded flexibility under ESSA’s Innovative Assessment Demonstration Authority will have to demonstrate that all students have the same opportunity to learn and are held to the same performance expectations. Accordingly, accountability systems based on school-based assessments or other innovative assessment systems permitted under that authority must provide evidence to support comparability claims. The methods presented in this brief offer tools to strengthen the body of evidence related to the comparability of scores across districts implementing an innovative system of assessments.


1 Consensus scoring is a version of external moderation used in Queensland, Australia (Queensland Studies Authority, 2014). Rank-ordering is a version of cross-moderation used in England (Bramley, 2005, 2007).

Baird, J.-A. (2007). Alternative conceptions of comparability. In P. Newton, J.-A. Baird, H. Goldstein, H. Patrick, & P. Tymms (Eds.), Techniques for monitoring the comparability of examination standards (pp. 124-165). London: Qualifications and Curriculum Authority. Retrieved from http://hdl.handle.net/1983/1004

Baker, E. L., & Gordon, E. W. (2014). From the assessment OF education to the assessment FOR education: Policy and futures. Teachers College Record, 116(11).

Borko, H., Elliott, R., & Uchiyama, K. (2002). Professional development: A key to Kentucky’s educational reform effort. Teaching and Teacher Education, 18, 969-987. http://doi.org/10.1016/S0742-051X(02)00054-9

Bramley, T. (2005). A rank-ordering method for equating tests by expert judgment. Journal of Applied Measurement, 6(2), 202-223.

Bramley, T. (2007). Paired comparison methods. In P. E. Newton, J.-A. Baird, H. Goldstein, H. Patrick, & P. Tymms (Eds.), Techniques for monitoring the comparability of examination standards (pp. 246-300). London: Qualifications and Curriculum Authority.

Evans, C., & Lyons, S. (2017). Comparability in balanced assessment systems for state accountability. Educational Measurement: Issues and Practice. https://doi.org/10.1111/emip.12152

Gong, B. (2010). Using balanced assessment systems to improve student learning and school capacity: An introduction. Washington, DC: Council of Chief State School Officers and Renaissance Learning.

Gong, B., & DePascale, C. (2013). Different but the same: Assessment “comparability” in the era of the Common Core State Standards. Washington, DC: The Council of Chief State School Officers.

Lane, S., Parke, C. S., & Stone, C. A. (2002). The impact of a state performance-based assessment and accountability program on mathematics instruction and student learning: Evidence from survey data and school performance. Educational Assessment, 8(4), 279-315. http://doi.org/10.1207/S15326977EA0804

Marion, S., & Leather, P. (2015). Assessment and accountability to support meaningful learning. Education Policy Analysis Archives, 23(9). Retrieved from http://dx.doi.org/10.14507/epaa.v23.1984

Newton, P. E. (2010). Contrasting conceptions of comparability. Research Papers in Education, 25(3), 285-292. http://doi.org/10.1080/02671522.2010.498144

Parke, C. S., Lane, S., & Stone, C. A. (2006). Impact of a state performance assessment program in reading and writing. Educational Research and Evaluation, 12(3), 239-269. http://doi.org/10.1080/13803610600696957

Queensland Studies Authority. (2014). School-based assessment: The Queensland assessment. Queensland, Australia: The State of Queensland (Queensland Studies Authority).