
Introduction to Assessment Part II

Objective

In this workshop, you will familiarize yourself with the concept of classroom assessment. You will develop the ability to discuss the purposes and uses of different types of assessments, the conditions for quality assessments, and multiple modes of classroom assessment. The video will summarize concepts from the written lessons and provide tips for writing strong assessment items.

This lesson will continue reviewing types of assessments, including:

Performance
Criterion-Referenced
Norm-Referenced

Then, we will explore the factors that are used to evaluate and ensure the quality of assessment instruments:

Validity
Reliability
Bias

Types of Assessments

In this lesson, we will continue exploring a few different types of assessments that you may come into contact with during your teaching career or choose to utilize in your own classroom.

Types of Assessments: Performance Assessments

A performance-based assessment, which may also be referred to as an authentic or alternative assessment, is a form of testing in which the assessment is not a traditional paper-and-pencil test, but rather an exhibition of skills. The tasks are typically based on real-life (authentic) scenarios or are career-specific, and require the application of the requisite skills and knowledge for that task. Performance assessments are popular because students can actively demonstrate what they know rather than select an answer from a given list. Therefore, performance assessments can feel like a truer indicator of the student’s actual knowledge and ability level than more traditional tests.

Testing is often done to the student, but performance assessments are done by the student.

Performance assessments open a vast array of assessment types for use by the teacher. For instance, performance tasks may include open-ended questions, hands-on problem solving, cartoons, experiments, inventions, musical compositions, original plays, stories, dances, essays, and story illustrations. In certain classes, performance assessments are also called lab practicums because the students are required to demonstrate how to correctly use the equipment specific to that program.

A performance assessment consists of two parts, an authentic task and a rubric, or scoring criteria. To construct a performance assessment, the teacher identifies a well-defined task that requires the students to create, make, or do something that is within the intent of the curriculum and normal classroom instruction. The teacher then creates a holistic set of scoring criteria or a rubric that is based on curricular standards. This is used so that students’ responses are assessed using the approved curriculum and the common classroom procedures.

The best performance tasks allow the students to apply an array of curriculum-related skills and knowledge in a creative manner that is a personalized response to the assessment. This also assumes that there are multiple methods for defining a correct answer. The teacher might also design the assessment so that students search out additional facts or information or try novel approaches to elaborate or extend an answer.

It is important for the teacher to do a final review of the task design before asking the students to begin working. The final review must include a thorough analysis to catch technical constraints, equity issues, and bias. The task must be fair to all students. After the task is completed, teachers with training in holistic rubric scoring judge the quality of the students’ work based on the established standards.

Well-constructed performance tasks appear more like a normal classroom activity than a test. Tasks that are embedded into and align with the intent of the curriculum demonstrate the assessment-instruction feedback loop.

The simple way to create a performance assessment is to take a proven, existing activity, create a rubric, and refer to it as a performance assessment.

There are many instructionally significant reasons to implement performance assessments as part of the assessment-instruction system. Performance assessments tend to engage students more than traditional paper-and-pencil tests. As a result, students are more diligent and show greater motivation in their preparation. Likewise, the lesson planning that leads to a performance exam tends to involve more active learning than the lecture-and-review type of lesson that is used to prepare students for other types of tests. Students who are actively engaged in learning tend to be more interested and motivated than students who have a more passive learning experience.

Teachers also tend to feel that the data collected from a performance task is more representative of the students’ true knowledge. Students can rarely guess the correct answer on a performance assessment, because all of the information must typically be generated by the student. In this way, performance assessments separate those students who have studied and prepared for the event from those who may have taken shortcuts. Performance assessments differ from most other assessments because they allow great student freedom in constructing responses that require higher-order and/or analytical thinking. Students will generally respond more creatively and with greater depth and detail when they are not limited by a prompt. Another value of performance assessments is that they can be collected over time as evidence of growth, mastery, or achievement of curricular goals.

Yet, for all of the good that comes with performance assessments, there are also several items to consider during the planning stage:

The amount of time required to complete the testing sequence. Although traditional tests can usually be completed during the time allotted for a normal class session, more elaborate tasks may require multiple sessions.
The amount of time needed to score the tasks. It will most likely require a lot more time to score a performance assessment than just running a set of answer sheets through the scanner. In some cases, the scoring will take significantly longer than the actual assessment. When this happens, providing the students with immediate and informative feedback is not realistic.
Knowing how to use rubrics and holistic scoring prior to the event. Often a team of teachers will score a common assessment in a round-table format so that they have the opportunity to discuss borderline student answers and build a continually refined, common understanding as the process unfolds. In addition, anchor papers or exemplary models are needed to improve inter-rater reliability and scoring consistency if multiple teachers are scoring the same assessment.
Some students may not score well because they are not accustomed to a performance assessment. Students might not know what to expect and will not adequately prepare for the activity. In cases like this, the students may feel the tasks are unfair. To prevent this, teachers are advised to keep their instruction and performance assessments similar and to communicate that similarity to the students. It is also a good idea to inform the students ahead of time as to what the task entails and explain the standards that will be used to evaluate their performance. This will require a careful description of the elements that will be expected for a proficient response.
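One simple way to check scoring consistency between two teachers is to count how often they gave the same rubric score to the same piece of student work. The sketch below is only an illustration: the function name exact_agreement, the 1-4 rubric scale, and the sample scores are assumptions, not part of this lesson, and exact agreement is just one of several possible consistency measures.

```python
def exact_agreement(rater_a, rater_b):
    """Return the fraction of papers on which two raters gave the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of papers.")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


# Hypothetical rubric scores (1-4 scale) from two teachers on eight papers.
teacher_1 = [4, 3, 2, 4, 1, 3, 2, 4]
teacher_2 = [4, 3, 3, 4, 1, 2, 2, 4]

agreement = exact_agreement(teacher_1, teacher_2)
print(f"Exact agreement: {agreement:.0%}")
```

A low agreement rate suggests the scorers should revisit the anchor papers and discuss the borderline responses before scoring further.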

Performance assessments add a useful dimension to the assessment-instruction system which gives students more freedom in constructing and elaborating their responses to the stimulus. Like the other types of assessment we have covered, performance assessments can also drive future instructional experiences and provide students with the opportunity to self-reflect about their learning.

Types of Assessments: Criterion-Referenced Assessments

A criterion-referenced assessment is one that measures students’ success in reference to defined standards, or criteria. A criterion-referenced test is typically utilized in the classroom to determine how well students have mastered a particular curricular unit or standard and the students’ scores will reflect their level of mastery. The results of a criterion-referenced assessment are not determined by how well a student scored in relation to the other students taking the same exam. However, criterion-referenced assessment may be used to compare students’ results by criterion if they completed the same exam, such as a statewide exam in a particular subject area.

There are several instructional advantages afforded by criterion-referenced exams that teachers can utilize to maximize their benefits. First, since the students are working to accomplish mastery of a criterion and not working against each other, the students can work in teams, or the teacher can form cooperative groups. A lack of competition between students presents an opportunity for the students to work and study together for the benefit of everyone.

A criterion-referenced assessment also allows the teacher to clearly define the objectives and goals for the students and create instructional ladders or steps for each student to reach those expectations. The teacher should make the goals as obvious as possible and note their importance. Teachers should also provide their students with an understanding of criterion-referenced tests if they are unaccustomed to taking them. Once the goal is clearly understood by the students, they can assist in the development and implementation of instructional activities. In essence, they will know where they are and where they have to be to reach mastery, so it becomes a matter of connecting the dots.

A criterion-referenced assessment may not work in every situation. On a true criterion-referenced exam, all of the students may fail or they may all earn a perfect score. In the simplest terms, the students either make it to proficiency or they do not.
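The idea that every student is judged against the same fixed standard can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name mastery_report, the cutoff of 80, and the sample scores are hypothetical and do not come from any particular exam.

```python
def mastery_report(scores, cutoff):
    """Mark each student as proficient (True) or not (False) against a fixed cutoff."""
    return {name: score >= cutoff for name, score in scores.items()}


# Hypothetical class scores; the cutoff of 80 is an assumed proficiency standard.
scores = {"Ana": 92, "Ben": 78, "Cal": 85}
report = mastery_report(scores, cutoff=80)

for name, proficient in report.items():
    print(name, "proficient" if proficient else "not yet proficient")
```

Because each result depends only on the cutoff, nothing prevents the whole class from reaching proficiency, or the whole class from falling short.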

Types of Assessments: Norm-Referenced Assessments

A norm-referenced assessment, which is also known as a cohort-referenced assessment, does not measure student success against a defined standard or criterion, but against the achievement of the other students who took the test. This is also known as “grading on the curve” or “curving” the test results. When curving the test results, the top-scoring students always get an “A” (or some other indication of a high mark) and the rest of the students receive scores based on how well they scored on the assessment relative to the top-scoring students. That way, no matter how well or poorly the students demonstrated achievement, a set number or percentage of the students, such as 15% of the class, will always score an “A,” a set number or percentage will receive a “B,” and so on. Thus, the results of norm-referenced assessments rank students in comparison to other students who took the same test. The IQ test is one of the best-known norm-referenced assessments.
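The ranking logic behind curving can be sketched as follows. This is an illustrative sketch only: the function name curve_grades, the 20/30/50 percent quotas, and the sample scores are assumptions chosen for a class of ten, and real grading curves are often set differently.

```python
def curve_grades(scores, quotas):
    """Assign letter grades by rank.

    quotas is a list of (letter, percent) pairs, ordered from the top grade
    down; the percents are assumed to sum to 100.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    grades = {}
    start = 0
    cumulative_percent = 0
    for letter, percent in quotas:
        cumulative_percent += percent
        end = cumulative_percent * n // 100  # integer cut point in the ranking
        for name in ranked[start:end]:
            grades[name] = letter
        start = end
    return grades


# Hypothetical scores for ten students and an assumed set of quotas.
scores = {"S1": 95, "S2": 91, "S3": 88, "S4": 84, "S5": 80,
          "S6": 77, "S7": 73, "S8": 70, "S9": 66, "S10": 60}
quotas = [("A", 20), ("B", 30), ("C", 50)]

grades = curve_grades(scores, quotas)

# The same class scoring 40 points lower receives exactly the same grades.
lower_scores = {name: score - 40 for name, score in scores.items()}
lower_grades = curve_grades(lower_scores, quotas)
print(grades == lower_grades)
```

The comparison at the end shows the defining property of a curve: the grades depend only on rank within the cohort, not on how much the students actually demonstrated.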

Teachers use norm-referenced tests because of several unique benefits. First, norm-referenced tests are considered more fair or compassionate because they guarantee that a prescribed number of students will be successful regardless of the ability of the students, teacher, or institution. It is well known that students and parents of students who score well in a particular class will have very few negative comments to make. Thus, curving scores can minimize problems for the teacher. Norm-referenced tests are also useful whenever the teacher wants the students to understand how their scores compared to the remainder of the cohort who took the same assessment. In some cases, this is very motivational for the students. The teacher can also use a series of norm-referenced tests to move students toward a standard in such a way that the students do not feel hopeless or defeated based on the results of the test.

Recently, norm-referenced tests have received criticism. By definition, a norm-referenced assessment does not measure achievement with respect to a standard, so the assessment is seldom linked to lesson planning or mastery learning. Teachers can curve a benchmark test showing how well the class is progressing and, after the test, begin the next unit of study regardless of how well the students performed.

With norm-referenced tests, it is not uncommon to find that students will be unwilling to work together or cooperate to help other students learn. From a student’s perspective, it may seem disadvantageous to help another student in the same cohort. This type of behavior is obviously not very helpful in promoting the overall success of the class.

In response to standardized and high-stakes testing, teachers have moved away from a reliance on norm-referenced tests. With the loss of the protective umbrella provided by norm-referenced tests, teachers and schools have developed better pedagogy to differentiate instruction so that all students can learn.

Evaluating the Quality of Assessments

We have already noted the importance of and the many reasons for using assessments in the classroom. Assessments can target instructional needs, communicate valuable information about a student’s progress, and can indicate a student’s mastery of curricular content. However, this information will only be useful if teachers are using quality assessment tools. Poor assessments provide poor data, send teachers down the wrong path, and can harm students by misrepresenting their needs and skill levels. We will now discuss the conditions that are often used to evaluate and ensure the quality of assessment instruments.

Validity

There are many types of validity, but the most common in educational assessments is content validity. The best definition of content validity is also the simplest: a valid assessment is one that measures what it is intended to measure. Validity is concerned with what exactly is being measured and, more importantly, what the results of the assessment mean. As a silly example, a teacher would not assess the proficiency of a student regarding a social studies lesson using the results of a math test.

In the technical sense, validity refers to the appropriateness of the interpretation of the assessment results and not to the assessment instrument. As such, validity is considered in relation to the specified use of the assessment. For instance, a social studies assessment will have a high validity index if it measures what has been taught in the class. It may also have a degree of validity for predicting how well the students will do in the next social studies course. However, the validity would be suspect in determining how well the same student will do in a mathematics course.

For the teacher, the use of valid assessments is critical. Decisions regarding students, lesson planning, and special initiatives presume valid assessments. Within the classroom, teachers must be sure that the assessment of a particular curricular unit is reflective of the weight and importance given during the instruction on those particular topics. Heavily weighting students’ responses on a minuscule item is not as valid as scoring in proportion to the importance of the item. A good rule of thumb is to construct the assessments to represent an adequate sampling of the knowledge and skills taught in that course.
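One way to keep scoring in proportion to instruction is to distribute a test’s points according to the time spent teaching each topic. The sketch below assumes instructional hours are a reasonable stand-in for importance; the function name allocate_points, the topics, and the hours are all hypothetical.

```python
def allocate_points(instruction_hours, total_points):
    """Split total_points across topics in proportion to instructional time.

    Rounding means the parts may not always sum exactly to total_points.
    """
    total_hours = sum(instruction_hours.values())
    return {
        topic: round(total_points * hours / total_hours)
        for topic, hours in instruction_hours.items()
    }


# Hypothetical unit: hours of instruction spent on each topic.
hours = {"fractions": 6, "decimals": 3, "percents": 1}
blueprint = allocate_points(hours, total_points=100)
print(blueprint)
```

A teacher could compare a draft test against such a blueprint to spot a topic that received little instruction but carries a large share of the score.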

Testing students in a modality different from their classroom training will also have low validity. As an example, if students were asked during instruction to write extended essay-type responses to open-ended sample questions, which allowed them the freedom to express themselves in their answers, then a multiple-choice test over the same material will have lower validity. The same would be true of a mathematics test that contained large amounts of reading material.

Never hold students accountable for content on an assessment that was not part of the instruction. It is also difficult to justify a low score on a test or test question for material that is not within the confines of the approved curriculum.

Reliability

Reliability relates to the ability of an assessment to replicate the same results. It is a measure of test consistency. A reliable assessment is one which provides the same data with the same or a similar cohort of students and is consistent in its methods and criteria. To demonstrate a perfectly reliable test, a teacher would have to give the students the same test twice and have each student get the exact same score. This example assumes that the students would not remember anything from the first administration of the test, which is highly unlikely in most classroom situations.

No measurement, whether of the length of a chalkboard or of a student’s mastery, is perfectly reliable. As an example, if we wanted to determine the consistency of an instrument for measuring the length of a chalkboard, we would take many measurements with that instrument. Typically, these measurements would vary slightly at the smallest increments, but would cluster around a prominent value. How tightly the measurements cluster would indicate the reliability of that instrument. We can assume that the chalkboard did not change in length between measurements. However, this assumption cannot be made in educational circles because the students will always grow, mature, learn more, or change in some way. Therefore, reliability is even more elusive in the classroom.
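The chalkboard example can be made concrete with a short calculation. The sketch below uses made-up measurements and treats the standard deviation as one common way to describe how tightly repeated measurements cluster; the function name summarize and the numbers are assumptions for illustration.

```python
import statistics


def summarize(measurements):
    """Return the central value and the spread of repeated measurements."""
    center = statistics.mean(measurements)
    spread = statistics.stdev(measurements)  # sample standard deviation
    return center, spread


# Hypothetical repeated measurements of one chalkboard, in centimeters.
measurements_cm = [243.1, 242.9, 243.0, 243.2, 242.8]

center, spread = summarize(measurements_cm)
print(f"Central value: {center:.1f} cm, spread: {spread:.2f} cm")
```

A smaller spread means the instrument gives more consistent readings. With students, the same calculation is harder to interpret, because the thing being measured changes between administrations.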

There are a number of variables that affect the reliability of an educational instrument. The students themselves will affect the reliability. They may be having a good day or a bad day, which could affect their performance on a test and therefore change its reliability. The test may also be unreliable because it contains ambiguous questions, has unclear or missing directions, uses faulty scoring criteria, or is too long or too short.

Bias

A biased test is one in which the presence of some characteristic may unfairly influence the students’ scores. A biased test question results in inconsistent performance for individuals or groups of the same ability level who come from different ethnic, gender, cultural, or religious populations. There are typically three sources of bias in educational assessments: unfairness, prejudice, and stereotyping.

A fair test provides everyone with an equal chance of getting a good assessment of their achievement. There are several things that a teacher can do to avoid an unfair test. The teacher can analyze each test item to make sure that:

It does not include any non-essential vocabulary that will be challenging or unfamiliar to the students
It does not present a situation that is unlikely for certain students to experience or to which only certain students may have had prior access
It is equally familiar to all students
It is not too lengthy

There are many forms of prejudice that may bias an assessment. A test question would be prejudiced if it contains content, language, or situations that offer an advantage or disadvantage to subpopulations of the class. It may also be prejudiced if it contains an item structure or format that is differentially difficult for certain students. For instance, an assessment where students are scored on how much weight they can lift is prejudiced in favor of boys.

Stereotyping is a situation in which a test question may be offensive, contain negative connotations, or be historically charged. Although the presence of stereotyping may not make the item any harder or easier, it may upset certain students and affect their performance on the assessment. Teachers should make sure to avoid items that create an unfavorable representation of a particular group, depict members of a population in an unfavorable light, or make unnecessary assumptions about gender roles, such as using “businessman” instead of “businessperson.”

If data is collected from a biased assessment, it will be skewed and the teacher should be very careful how the data is used, if at all.

To remove the teacher as a source of bias, it is helpful to ask neutral teachers to conduct a sensitivity review to eliminate potential bias in the questions’ construction. It also helps to have multiple scorers rate each student’s assessment, or to create anonymous student answer sheets for scoring to minimize teacher scoring bias on subjective questions.

Review Questions

What can teachers do to make sure that their traditional pencil and paper assessments are valid, reliable, and unbiased?
How might these steps change for a performance assessment?
