Winter Milestone 3 (stormsurfer) Reputation Idea: Sample HITs for the Worker
Describe (using diagrams, sketches, storyboards, text, or some combination) the idea in further detail.
Did I do the task right? Partially right? Completely wrong? There is no way for me to know! During Winter Milestone 1 (stormsurfer), even after two hours, all 20 of my submissions were still pending. I completed 20 HITs for the same task, and I might receive $0.00 for all of my work if it is incorrect. If I received feedback after my first 1-2 HITs, I could easily improve and complete the task better the next few times. However, it is difficult for the requester to give immediate feedback to each and every worker; he/she may not currently be online to approve the HITs or may be swamped with reviewing HITs from other workers.
The bottom line: workers don't know what a requester is looking for. Open-ended tasks will always leave room for interpretation, and each requester has his/her own scale for determining what counts as an "accepted HIT" vs. a "rejected HIT." Likewise, it is important that requesters are consistent when "grading" HITs: the same quality of work should always either be accepted or rejected. The goal, therefore, is threefold: to give workers an idea of what quality of work requesters are looking for; to give workers an idea of what quality of work other workers are submitting (as stated during the meeting, Bayesian truth serum, i.e. predicting others' responses, improves quality of work, and workers should be able to more accurately predict others' responses); and to hold requesters accountable for their "grading scale" (i.e. ensuring that they are consistent across HITs).
Pretend that tasks are "assignments" or "tests," requesters are "teachers/professors" or "graders," and workers are "students." As a high school student, this reminds me of the free response sections on AP/IB exams and the essay section on the SAT/ACT. Students come from a variety of different backgrounds, and what may be an "amazing" essay/answer for one student may be a "horrible" one for another; to the grader, the essay/answer may be perceived as simply "mediocre." The problem of standardization comes up: what qualifies as a 12/12 on the essay (for the SAT/ACT), and what qualifies as a perfect score, or a sub-par score, on an AP/IB exam?
Fortunately for students, College Board (as well as other standardized test companies) releases sample/example responses to questions. For example, on the SAT Critical Reading section, there is always a sample question at the beginning of each section:
[Image: sample question from the SAT Critical Reading section]
Now, students get a basic idea of what the graders are looking for. That's why, when studying for the SAT/ACT, students often take a look at several released exams: they want to be synced with the test and the grader. But these sample tests released by College Board only contain right answers--what about "partially right" and "wrong" ones? For the SAT essay and AP free response questions, College Board releases sample student responses for all three categories ("completely right," "partially right," and "completely wrong"). Here is a sample student response, for example:
[Image: sample student response released by College Board]
The above image shows part of a completely correct response, but College Board also gives responses for the other two categories. Not only does College Board give a score to each student response, but it also explains why an incorrect response is wrong. As a student, I can look at these responses to gain a better understanding of why my answers (during the test, but for a different question) may be marked incorrect. For example, AP graders sometimes have strict rules on what must be part of a correct answer, and it is quite possible that on MTurk, requesters might, too.
Now, what if MTurk had a similar feature enabled? Something like this:
[Image: mockup of the MTurk task page with a boxed link to sample responses]
Clicking the boxed link would allow the worker to view sample responses (HITs) for the task. These include 1-2 sample responses created by the requester (similar to the SAT sample response created by the "grader"), which give workers an accurate idea of what quality of work requesters are looking for, as well as a random sample of HITs (not chosen by the requester) submitted by other workers (similar to the AP sample response). This sample of HITs should include those that were both accepted and rejected as well as the comments that the requester gave (especially if the HIT was rejected) to the worker.
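To make the selection logic concrete, here is a minimal sketch in Python. All names here (`SampleHIT`, `build_sample_pool`, and their parameters) are hypothetical; MTurk exposes no such API. The sketch only illustrates the idea above: show the requester's own example responses, plus a random mix of previously accepted and rejected worker submissions together with the requester's comments.

```python
import random
from dataclasses import dataclass

# Hypothetical data model for this proposal; these names are assumptions,
# not part of any real MTurk API.
@dataclass
class SampleHIT:
    response_text: str
    accepted: bool
    requester_comment: str = ""
    requester_authored: bool = False  # True for the requester's own example responses

def build_sample_pool(requester_examples, worker_submissions, n_workers=4, seed=None):
    """Return the samples a worker would see before starting the task:
    the requester's example responses, followed by a random mix of
    accepted and rejected worker submissions (with requester comments)."""
    rng = random.Random(seed)
    accepted = [s for s in worker_submissions if s.accepted]
    rejected = [s for s in worker_submissions if not s.accepted]
    # Draw roughly half from each category so workers see both ends of the scale.
    k = n_workers // 2
    picks = (rng.sample(accepted, min(k, len(accepted))) +
             rng.sample(rejected, min(n_workers - k, len(rejected))))
    rng.shuffle(picks)
    return list(requester_examples) + picks
```

Drawing from both the accepted and rejected piles mirrors the College Board analogy: workers see not just model answers but also rejected work and the reasons it was rejected.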
The two worker-side goals are now satisfied: I now know what quality of work the requester is looking for because he/she gave sample responses for what constitutes an "accepted HIT," and I know what quality of work other workers are submitting and how the requester felt about their work. Before even submitting a single HIT, I have an increased level of confidence that my work should be accepted because work of a similar (or perhaps lower) quality was accepted.
Furthermore, the requester has now created a standard for the quality of work that is accepted/rejected. Because workers are able to view previously submitted work, they can voice their concerns to a requester if they feel that their work matched the quality of previously accepted work but was nevertheless rejected.
Slack usernames of all who helped create this wiki page submission: @shreygupta98