Assessment Evidence

I. PPT Presentation:  decker_flt808_finalassessmentpresentation-1

II. Live Presentation

III. Written Report:

Kenwood Academy Mandarin 2 Test Modification, Administration & Analysis

Lindsay K. Decker

Michigan State University



This report details the modification, administration, and analysis of Mrs. Qiong Chen’s Mandarin exam. The exam is based primarily on textbook content and workbook practice from the revised edition of the Far East Chinese for Youth Mandarin textbook series. After observing her class for two days, I made minor alterations to the original test. I administered the test to Mrs. Chen’s students in her classroom at Kenwood Academy in Chicago, IL, on April 13th, 2016. Finally, I conducted an item analysis of the test results and made recommendations for improvements in test development and instruction.

Keywords: [Construct Validity, Content Validity, Assessment, Item Analysis, Face Validity, Reliability, Practicality, Impact, Consequences]



Kenwood Academy Mandarin 2 Test Modification, Administration & Analysis

Because I am not currently teaching, I reached out to other Chinese teachers for help modifying and evaluating an existing assessment. It took two months before I heard from the one and only respondent to my plea-for-help email, Mrs. Chen, who gave me some of her professional, personal, and instructional time. We met briefly after school one day to discuss assessment possibilities. Her students had only one week of school left before two weeks of Spring Break. There was also the complication of CPS testing and parent-teacher conferences taking place in the same week. In short, she had minimal instruction days, and a unit test was coming up. There was not enough instruction time for an additional test, so the upcoming unit test was the only feasible option for my final project.

Mrs. Chen explained to me that her students consistently struggled with her unit tests. She often cut content or simplified tasks to prevent students from failing. She also felt pressure from the administration to avoid failing students. The upcoming Unit 4 test was made up of items identical to in-class and homework exercises that the students had practiced extensively. Assuming she was actually teaching what she said she was, the unit test should be an accurate indication of learning and teaching success. It could also have been an indication of students’ ability to memorize. In the end, Mrs. Chen preferred not to make major changes to a test that would have a significant impact on her students’ class grades. I asked if I could observe her class anyway and decide whether any minor modifications could be made to improve it.

I was not able to observe much in the two days before exam day. The students were mostly working on a review for the test. However, it was during these observations that I realized 1) how much her students struggled with verbal and written instructions, and 2) how they used their textbook as a crutch. These two observations constitute the rationale behind the minor modifications made to the assessment itself, the administration process, and the short-answer item rubric.


As I stated in the abstract, Mrs. Chen created her original exam using the Far East textbook and workbook, including only tasks which students had practiced in class or for homework. She explained that before writing any of the chapter lesson plans, she always created the chapter exam first. This way she would know what and how to teach. Assuming that she really did teach what and how she was testing, I made a list of objectives describing what the students should be able to do based on her original chapter exam. Next, I observed her class on two different occasions. On both days I was only able to observe her students completing their test review. By comparing the objectives of the test review to the test itself, I made several minor modifications to the test. I wanted to make sure the assessment actually assessed its objectives and that the students were only tested on what they had practiced in class. I will address the specific modifications later on in the report.

Target Population 

This test was formatted for students at the high school level in their second year of Chinese. Based on my general observations over two days, the students’ speaking proficiency levels ranged from a Novice Low to a Novice Mid on the ACTFL proficiency scale. It was harder to assess their writing and reading proficiency. Most of them were completing their review sheets with the assistance of a textbook.


This test is a summative assessment of students’ overall proficiency in two communicative modes: reading and writing. While the test content is purely weather-related, it specifically assesses knowledge of vocabulary and grammar through writing and reading skills. The overall purpose of administering this type of test was to assess which skills and subskills her students have mastered and which they have not. Mrs. Chen could use this data to evaluate what to reteach and review. The test also recycles previously tested content and thus can serve as both a review and a motivating reminder to students of what they can already do.

Impact & Consequences

Washback (also called backwash) is “the effect that tests have on learning and teaching” (Hughes, 54). For example, tests should be able to motivate students to learn, and teachers to effectively teach, the specific language abilities prescribed by the test construct. However, if insufficient weight is given in the scoring of one or more of those abilities, relative to others on the test, then the test maker is not encouraging positive backwash for that particular language ability (Hughes, 54). This test emphasized vocabulary recall over grammar and over integrative reading and writing tasks, although those other tasks are worth more points per item than the vocabulary items and require more time. Unit exams are weighted at 20% of the students’ overall course grade. This particular chapter exam was worth 95 points. For all of the above reasons, and given timely feedback, both students and teachers should be able to clearly identify areas for improvement in learning and instruction.

Overall Design

Language Skills

This is an integrated-skills test, covering grammar, vocabulary, reading, and writing. Students are expected to write, translate, match vocabulary words to definitions, compose sentences, and read and answer questions about a passage in Chinese.


Specifically, this test examines students’ ability to read and write about weather conditions, and the verbs, grammar, frequency words, and timing words used to describe and compare weather in different locations.


Students were tested on their ability to do the following for this exam:
1) Identify the following characters given pinyin (phonetic spelling or pronunciation): 天气,太阳,很热,晴天,阴天,多云,少云,刮风,下雨,下雪,打雷,温度,最高,最低。
2) Write the pinyin (phonetic spelling or pronunciation) given any of the same characters.
3) Write a sentence describing what the weather “will” be like on a given day.
4) Write a sentence comparing the weather between two different places.
5) Write the English definition of different types of weather.
6) Identify the verb used with specific weather words.
7) Make sentences using the following verbs or grammar structures: 忘了,非常,没有,会。
8) Describe the weather today.
9) Tell whether or not you like certain kinds of weather.
10) Answer questions about a reading passage with past and present vocabulary.

Item Tasks/Types

Item tasks include filling in the blank with a word from a word bank, translating English sentences into Chinese, making complete sentences using a given term, filling in the definition of vocabulary words in English, answering questions posed in Chinese in complete Chinese sentences, circling the correct missing word in a sentence, matching the Chinese vocabulary word to the correct English definition, and answering questions about a Chinese reading passage in English. The breakdown of task items by skill is the following: 50% of the test is vocabulary (items in blue), 20% is grammar (green), 20% is integrated reading (yellow), and 10% is integrated writing (pink). I think there is a good distribution of skills on this assessment. It is practical to build up students’ vocabulary at the early levels and to concentrate on mastering fewer grammatical concepts. Writing should also be secondary to reading.

Test Item Modifications

It is my personal teaching philosophy that as language teachers we must maximize class time to teach real-life skills. In my experience, it is neither practical nor necessary to teach students how to write all the characters. While studying graduate-level Chinese in the U.S. and in mainland China, I observed that students rarely write characters by hand. They are most often seen texting or typing pinyin and selecting the corresponding character when it pops up. Matching pinyin to the correct character is a much more authentic sub-skill of the reading construct and more consistent with what Mrs. Chen’s students practiced during class. It was not evident during my observations that students had ever been expected to write a character from memory given only the pinyin. Mrs. Chen also confirmed that she had never assessed them in that way before. This is why I chose to add word banks to sections 1 and 3. Please see the assessment attached to this report. These word banks were also necessary for students translating and writing complete sentences in Chinese for short-answer items, as in sections 3, 6, and 7. Again, I say this was necessary because these types of tasks had previously been done in classwork and homework with the help of a textbook. No expectations had been stated, nor had any assessments been given, that indicated otherwise.

Lastly, I observed that students in Mrs. Chen’s class consistently struggled to follow directions. I foresaw this as a problem on the exam, and especially for the administration of the exam. I came up with a strict administration routine and modified the directions of the test to make the expectations and scoring of the tasks clearer. Please see the assessment attached to the end of this report. Directions were modified in sections 1, 3, 6, 7, and 10. As for the remaining test items, I thought they were attached to objectives that agreed well with the review and thus saw no reason to change them. I also did not feel that I was in a position to change them, given that the teacher created the test as her set of objectives and based her teaching on them.


Overall, the test is worth 95 points. Fill-in-the-blank items, matching items, and items that require circling the correct term are worth 1-2 pts each. Responses that require complete sentences in Chinese are worth 3-4 pts each. Students are given partial credit for correct information that does not meet the exact demands of the task, for example, if a student spells the pinyin correctly but writes the wrong tone, or correctly translates individual characters but does not put them in the correct word order. It is important to note that if the mistake concerned one of the tested objectives for this chapter, partial credit was not given.

Task items that required an integration of skills within a communicative mode were weighted more heavily than those that tested one skill only (for the latter, see sections 1, 2, 4, 8, and 9). Examples of integrated items are those that required students to build a sentence using a given word, answer a question in a complete sentence using characters, or translate a sentence into Chinese (see sections 3, 6, and 7). These all required students to read, translate, and construct a grammatically correct sentence. When a student answers a question posed in characters with a sentence in characters, the sentence structure for the response is built into the question. The same goes for items requiring students to translate: much guidance was given in the prompt to help them. Both of these types of tasks were worth 3 points (see section 7). For the items in which students had to create a sentence from scratch or read a passage in Chinese, they needed a knowledge of present and past unit vocabulary and grammar. These were worth 4 points (see sections 6 and 10).
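As a rough check on the weighting described above, the following sketch adds up the maximum points per section and reports each group's share of the total. The section maxima are transcribed from the header row of the score table included later in this report; the grouping follows the section references in the paragraph above. Section 5 is not named in either list, so treating it as its own group is an assumption made here for illustration only.

```python
# Maximum points per section, transcribed from the score table header.
SECTION_MAX = {1: 8, 2: 4, 3: 6, 4: 20, 5: 8, 6: 8, 7: 6, 8: 10, 9: 10, 10: 15}

# Groups follow the section references in the scoring description.
# Section 5 is not listed there, so its separate group is an assumption.
GROUPS = {
    "single-skill (1, 2, 4, 8, 9)": [1, 2, 4, 8, 9],
    "integrated sentence tasks (3, 6, 7)": [3, 6, 7],
    "reading passage (10)": [10],
    "not grouped in the text (5)": [5],
}


def group_shares(section_max, groups):
    """Return each group's share of the total available points."""
    total = sum(section_max.values())
    return {
        name: sum(section_max[s] for s in sections) / total
        for name, sections in groups.items()
    }


total_points = sum(SECTION_MAX.values())
shares = group_shares(SECTION_MAX, GROUPS)

print(f"Total points: {total_points}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The total comes to 95 points, which agrees with the stated value of the exam.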


Students went over the exam with Mrs. Chen in class once all of the students had taken the exam and all exams had been turned in. The test was returned to students the next school day, and class time was given for corrections immediately after students received their graded test papers. Students received a significant boost to their grade from the opportunity to do corrections: they could earn back up to 20% in points through corrections. Most students took advantage of this. Once all corrections were handed in, the test was reviewed with the students. Special attention was given to tasks that a majority of the students struggled with. Time was left at the end for questions about specific items and/or grading.


The test took an entire 50-minute period. Five minutes were spent collecting homework, separating desks, and clearing students’ workspaces of everything but a pencil. Ten minutes were devoted to handing out the test and explaining directions. Since I made the modifications, we decided it would be best for me to administer the paper test. I explained the changes I had made and what I was looking for in each section. Then I left 2 minutes for questions. The students were supposed to have 35 minutes to complete the exam. Since no talking was allowed once the test had begun, students were instructed to raise their hands individually, and either Mrs. Chen or I came around to answer questions.

Analysis and Evaluation


The average score on the test was 78.4%. The class has 27 students, but two of the students were unable to complete the exam due to special needs. Out of the 25 students who took the exam, 6 received A’s (90-100%), 8 received B’s (80-89%), 4 received C’s (70-79%), 6 received D’s (60-69%), and 1 received an F (59% and under). In comparison to past unit exams, these results showed huge improvements, not only in the number of students passing the test, but also in the number of students who received a C or above: 18 of the 25 students who took the test (72%). Students were most successful on the vocabulary sections of this exam, which accounted for about half of the test. They performed the worst on the integrated writing tasks, especially those without a Chinese question or guided prompts for translation. This makes sense given that they use their textbook as a crutch for everything. I reported both of these results to Mrs. Chen, congratulating her on her teaching of vocabulary and highlighting the need for students to practice creative production of sentences in Chinese. This also means more frequent informal and formal formative assessments of their ability to write independently without the textbook.
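The summary figures in this paragraph can be re-derived from the percentage column of the score table included later in this report. The sketch below is only a check under stated assumptions: the percentages are transcribed by hand from that table, and the C band is taken as 70-79% so that the two scores of 77 (graded C+ in the table) fall inside it.

```python
from collections import Counter
from statistics import mean

# Percent scores for the 25 students, transcribed from the score table.
PERCENTS = [100, 100, 95, 91, 91, 90, 89, 89, 83, 84, 82, 81, 82, 81,
            77, 77, 75, 73, 68, 69, 68, 62, 62, 62, 29]


def letter_band(pct):
    """Map a percentage to a letter band (C assumed to span 70-79%)."""
    if pct >= 90:
        return "A"
    if pct >= 80:
        return "B"
    if pct >= 70:
        return "C"
    if pct >= 60:
        return "D"
    return "F"


average = mean(PERCENTS)
band_counts = Counter(letter_band(p) for p in PERCENTS)
c_or_above = sum(band_counts[b] for b in ("A", "B", "C"))

print(f"Average: {average:.1f}%")
print(dict(band_counts))
print(f"C or above: {c_or_above} of {len(PERCENTS)} ({c_or_above / len(PERCENTS):.0%})")
```

Run on the transcribed scores, this gives an average of 78.4% and 6 A’s, 8 B’s, 4 C’s, 6 D’s, and 1 F, with 18 of 25 students at C or above.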

Reliability describes the likelihood that a test will produce the same results if it “had been administered to the same students with the same ability, but at a different time” (Hughes, 36). With careful construction, administration, and scoring of the test, test makers should be able to obtain similar results across different testing situations. However, if the test takers are not familiar with some aspect of the test, then “they are less likely to perform well than they would do otherwise.” Thus the results of the test would be unreliable (Hughes, 47). Mrs. Chen’s students should have been very familiar with the test items, as they were taken from classwork and homework exercises. One student in the middle of the test shouted for joy that he knew the answers because he remembered having done something similar for homework. The modifications I made to the instrument and administration instructions may have increased the overall reliability of the test because students knew what was expected of them.

Practicality in testing raises the question: is the test “easy and cheap to construct, administer, score and interpret?” (Hughes, 56). Tests should be cost- and time-effective so that teachers and students do not lose motivation in their efforts to teach and to learn. However, if too much concern is given to time and cost in testing, then tests that create real positive backwash may be avoided. For example, the introduction of a new test intended to positively change the way reading is taught may be considered impractical because of the significant time and money needed to train teachers. As a result, teachers and students will “waste…effort and time…in activities quite inappropriate to their learning goals” (Hughes, 56). It was impractical for me to introduce a new test within the constraints of the Kenwood Academy schedule. Luckily, the Mandarin 2 test was both easy to administer and appropriate to

Analysis of Results

The test has some construct validity. Construct validity refers to the degree to which a test actually measures what it intends to measure. Within a test construct are several underlying language abilities. In the reading sections, students were required to integrate their translation, grammar, and vocabulary skills. These are all essential elements of learning to read. It seems common sense to say that if there is a section on the test intended to measure each of the sub-abilities of the test construct, then the test has construct validity. On our test we integrated these subskills. Perhaps it would be better to test each of them as a separate entity in at least one section. However, without “extensive samples” of the writing ability of the group to whom the test is administered to confirm this, there would be zero evidence that the test is accurately measuring reading ability (Hughes, 31).

A test has content validity if “its content constitutes a representative sample of the language skills, structure, etc. with which it is meant to be concerned” (Hughes, 26). In other words, a grammar test should be made up of items related to grammar. However, without a “specification of the [grammar] skills or structures, etc.” to be tested, made prior to test construction, the test content may not reflect the true purpose of the test (Hughes, 26). It may include grammar items, but not a representative sample of the skills it was meant to test, and therefore would lack content validity. This test assessed grammar in isolated tasks as well as in integrated-skills tasks. I do think the test needs more writing about the weather with visual aids. Another option to increase content validity, as well as authenticity for writing, would be to have students type their answers for the short-answer section. The reading content spanned individual characters, grammar words, questions, and longer passages.


It would be interesting to see whether the students would have been more successful if they had been asked first to identify the correct English translation of the question and then to compose the answer in Chinese. This would allow the teacher to examine students’ mastery of the sub-skills separately. The section that was given the most weight was apparently taken straight from their homework. This seems encouraging, but I gathered that most students may have memorized answers from the homework. Thus, I am not sure whether they truly comprehended the passage. However, the other items on the test were also found verbatim on the worksheet. Perhaps if they are used to getting review sheets with the exact test items, they are inclined to devote less time to studying. Several students also seemed to get anxious during the test, and a few others ran out of time or shut down completely. I am not sure what more we could have done to lower the affective filter.




Assessment Instrument

中文二          Period:_________



一.写汉字 xie3han4zi4: (1 pt. each)

Word Bank: 刮 ,多,最,大,风,天,气,高,云

  1. Tian1 qi4 ___________ ___________


  2. Zui4 gao1 __________ ____________


  3. Duo1 yun2 __________ ____________

(Cloudy/Lots of Clouds)

  4. Gua1 da4 feng1 __________ ____________

(Strong Wind)

二.写拼音 xie3 pin1yin1:(1 pt. each; missed tone marks -1/4 pt, missed pinyin ½ pt)

  1. 温度 ___________ ___________


  2. 最低 ___________ ___________


  3. 下学 ___________ ___________


  4. 晴天 ___________ ___________

(Fine day)

三.Translate to Chinese Sentences: (3 pts. each)

Word Bank: 天气,太阳,很热,晴天,阴天,多云,少云,刮风,下雨,打雷,温度,最低,今天,明天,冷

  1. It will snow tomorrow. (Use the word 会)


  2. Hong Kong () is not as cold as Chicago (). (Use the sentence structure: …….没有 …….. )


四.Translate to English: (2 pts. each)

  1. 晴天 __________________________________ 6. 多云 _____________________________
  2. 阴天 __________________________________ 7. 少云 _____________________________
  3. 下雨 __________________________________ 8. 刮风 _____________________________
  4. 下雪 __________________________________ 9. 太阳 _____________________________
  5. 打雷 __________________________________ 10.很热 _____________________________

五.Fill in the verb for the weather words: (2pts each) (打,下,刮)

  1. ___________雪 2. ___________雨 3. ___________大风 4. ___________雷

六.Make a sentence in Chinese hanzi, using the given word. (1/2 credit ONLY for pinyin)

  1. 忘了______________________________________________________________________________
  2. 非常______________________________________________________________________________

七.Answer the questions in Chinese hanzi using complete sentences: (3 pts each; pinyin ½ credit)

  1. 今天天气怎么样?


  2. 你喜欢不喜欢冷的天气?


八.Circle the correct words: (2pts each).

  1. 今天天气很好,今天是(晴天, 阴天)。
  2. 今天天气不好,今天是(晴天,阴天)。
  3. (现在,有时候)在下雨。(It is raining).
  4. 今天(最高,最低)温度是华氏32度,(最高,最低)温度是华氏12度。
  5. 明天有时候刮(大风,太阳)。

九.Match the meanings: (2pts each).

________1. 很少          A. never

________2. 有时候        B. seldom

________3. 从来不        C. often

________4. 每天          D. sometimes

________5. 常常          E. Everyday




十.Read the passage and answer the questions in English.

一年中我最喜欢七月和八月,因为这两个月我们不上课,没有功课,而且天气也很好, 很少下雨。我天天都可以出去玩。我有很多朋友。他们都是我的邻居,所以他们可以走路来我家玩。我真高兴。

  1. What months does the writer like the most? ______________________________________________________________________________
  2. Why does the writer like these months? ______________________________________________________________________________
  3. Who often comes to the writer’s home? ______________________________________________________________________________
  4. Are their homes close to the writer’s home? ______________________________________________________________________________
  5. How do you know? ______________________________________________________________









Student  Grade  Percent  New  1) /8  2) /4  3) /6  4) /20  5) /8  6) /8  7) /6  8) /10  9) /10  10) /15
1        A-     100      100  8      4      6      20      8      6      6      10      10      15
2        A      100      100  8      4      6      20      8      8      6      10      10      15
3        A      95       97   8      4      5.5    19.5    8      5      5.5    10      10      15
4        A-     91       91   8      4      5.5    20      8      4.5    3      10      10      13.5
5        A-     91       100  8      4      6      20      8      8      6      10      10      6
6        A-     90       90   8      2.5    2.5    20      8      4      5.5    10      10      15
7        B+     89       89   8      2      5      20      8      5      6      6       10      15
8        B+     89       100  8      3.75   5      16      8      3      6      10      10      15
9        B-     83       100  7      2.75   2.5    13.5    8      7      6      8       10      9
10       B-     84       100  8      3.5    0      20      8      0      6      10      10      13.5
11       B-     82       100  8      3.5    5.5    16      8      4      6      8       6       13.5
12       B-     81       91   8      0      0      18      8      6      1      5       0       14.5
13       B-     82       100  8      0.75   4      14      8      5      5      10      10      12
14       B-     81       101  8      1.25   5      12      8      8      5      10      10      4.5
15       C+     77       97   8      1.5    1.5    16      8      0      5      10      10      13.5
16       C+     77       92   8      3.25   5.5    15      8      0      2.5    6       10      15
17       C      75       93   8      3.75   4.5    16      8      0      5      5.5     10      6
18       C      73       91   8      3.5    2.5    14      8      0      5.5    8       10      9
19       D+     68       90   6      1.5    5.5    12      8      4      2.5    10      10      5
20       D+     69       69   6      3      3      18      8      0      0      10      4       12
21       D+     68       83   8      2.75   5.5    18      8      0      0      6       10      9
22       D-     62       81   8      3.5    0      18      8      0      0      8       10      3
23       D-     62       72   8      1.5    0      14      8      0      0      10      10      7.5
24       D-     62       80   8      3.5    0      10      8      0      0      4       10      15
25       F      29       29   2      1      3      8       4      0      0      4       6       0



Blue: Vocabulary

Green: Grammar

Pink: Writing (Integrated/Reading/Grammar)

Yellow: Reading (Integrated/Translation, Grammar, Comprehension)
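One simple way to summarize the score table is a facility value for each section: the mean score on the section divided by the maximum points available. The sketch below computes it at the section level, because the table reports section totals and not individual items. The scores are transcribed by hand from the table above, the final column is assumed to be section 10, and the function name `section_facility` is introduced here for illustration only. This is not a full item analysis.

```python
# Maximum points for sections 1-10, from the score table header.
SECTION_MAX = [8, 4, 6, 20, 8, 8, 6, 10, 10, 15]

# One row per student (students 1-25), one column per section (1-10).
SCORES = [
    [8, 4, 6, 20, 8, 6, 6, 10, 10, 15],
    [8, 4, 6, 20, 8, 8, 6, 10, 10, 15],
    [8, 4, 5.5, 19.5, 8, 5, 5.5, 10, 10, 15],
    [8, 4, 5.5, 20, 8, 4.5, 3, 10, 10, 13.5],
    [8, 4, 6, 20, 8, 8, 6, 10, 10, 6],
    [8, 2.5, 2.5, 20, 8, 4, 5.5, 10, 10, 15],
    [8, 2, 5, 20, 8, 5, 6, 6, 10, 15],
    [8, 3.75, 5, 16, 8, 3, 6, 10, 10, 15],
    [7, 2.75, 2.5, 13.5, 8, 7, 6, 8, 10, 9],
    [8, 3.5, 0, 20, 8, 0, 6, 10, 10, 13.5],
    [8, 3.5, 5.5, 16, 8, 4, 6, 8, 6, 13.5],
    [8, 0, 0, 18, 8, 6, 1, 5, 0, 14.5],
    [8, 0.75, 4, 14, 8, 5, 5, 10, 10, 12],
    [8, 1.25, 5, 12, 8, 8, 5, 10, 10, 4.5],
    [8, 1.5, 1.5, 16, 8, 0, 5, 10, 10, 13.5],
    [8, 3.25, 5.5, 15, 8, 0, 2.5, 6, 10, 15],
    [8, 3.75, 4.5, 16, 8, 0, 5, 5.5, 10, 6],
    [8, 3.5, 2.5, 14, 8, 0, 5.5, 8, 10, 9],
    [6, 1.5, 5.5, 12, 8, 4, 2.5, 10, 10, 5],
    [6, 3, 3, 18, 8, 0, 0, 10, 4, 12],
    [8, 2.75, 5.5, 18, 8, 0, 0, 6, 10, 9],
    [8, 3.5, 0, 18, 8, 0, 0, 8, 10, 3],
    [8, 1.5, 0, 14, 8, 0, 0, 10, 10, 7.5],
    [8, 3.5, 0, 10, 8, 0, 0, 4, 10, 15],
    [2, 1, 3, 8, 4, 0, 0, 4, 6, 0],
]


def section_facility(scores, maxima):
    """Mean section score as a fraction of the section maximum."""
    n_students = len(scores)
    return [
        sum(row[i] for row in scores) / (n_students * maximum)
        for i, maximum in enumerate(maxima)
    ]


facility = section_facility(SCORES, SECTION_MAX)
hardest = facility.index(min(facility)) + 1  # 1-based section number
easiest = facility.index(max(facility)) + 1

for number, value in enumerate(facility, start=1):
    print(f"Section {number}: {value:.2f}")
print(f"Lowest facility: section {hardest}; highest: section {easiest}")
```

On these figures, section 6 (making a sentence from a given word) has the lowest facility and section 5 (filling in the weather verb) the highest, which is in line with the observation that students struggled most with unguided sentence writing.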






















Hughes, A. (2003). Testing for Language Teachers. Cambridge: Cambridge University Press.

Wu, W.L., Tsai, H.L. (2012). Far East, Chinese for Youth. Taiwan: The Far East Book Co., Ltd.




Tables-Item Analysis in Progress



Item Analysis-In Progress



I’m just curious- how does the corrections procedure work?  It may not matter to the overall objectives of this paper, but I thought it was important to know in terms of validity.

Great Question…I will clarify!

Concerning the peer review sheet: I did not see the original document or the new document; impact & consequences is understood, but not clearly stated; and information on reliability, validity, and fairness is also not present. I am very intrigued by the changes made to the test, and I think showing the two documents would be very beneficial to help us understand how changing the format, but not the content, helped improve scores. Your language is clear and easy to understand, which makes reading your report enjoyable.

Thanks Stephanie. That is a great point. My presentation works these points in a lot better. It’s sort of a revised rough draft. I will definitely include the testing instrument and the modified test as well.

Overall, I think you strongly presented the rationale for modifying the test.  I appreciated the specific scores to see that your modifications helped, but it would be helpful to know exactly what you modified for each task by showing both the original and new tests in your paper, and of course describing this as well.  Also, I think your paper would benefit from including validity and reliability arguments, as I did not see that presented.  You do give a suggestion for future improvements, but if you tie this in with validity/reliability, I’m sure you could find more.  There’s always room to improve assessments!  Great start overall though!

Thanks Lia, those were all super helpful suggestions. I hadn’t gotten around to building my sources into the paper. Do you think it’s worth it on this type of test to do an item analysis, or do you think it’s enough to just display the test?

I think it depends on whether or not it would illustrate your points further.  My assessment was short and I only tested it with two students, so it wasn’t necessary.  Your data might work with an item analysis to easily analyze certain aspects of your test.

Maybe describe each of these parts in your analysis section- it helps to see what items specifically were difficult.
