Winter Milestone 4 Team BPHC : Quality of Task Authorship Research Proposal (Science)

Quality of task authorship has emerged as a key issue in the field of crowd-sourced work. Workers on crowd-sourcing platforms like Amazon Mechanical Turk (AMT) are concerned with how well the requester communicates the task at hand. Workers often complain that the Human Intelligence Tasks (HITs) posted by requesters are not clearly outlined: they may be vaguely worded or poorly designed.

Unclear instructions may lead the worker to misunderstand the HIT, and the requester to subsequently reject the submitted work. This is an unfair loss of time, money and approval rating for the worker. With this problem in mind, workers attempt to resolve any ambiguity about the task (e.g., through email on AMT) before beginning work. On external forums like TurkerNation, workers often help each other by identifying and sharing HITs posted by requesters known to offer well designed, high-quality tasks. Clearly, the quality of task authorship affects the quality of the work performed.

Research Focus : Factors Affecting Quality of Task Authorship

In this proposal, we aim to examine the quality of task authorship more closely by identifying the factors that make a task poorly designed. By a poorly designed task we mean one that requires communication or feedback from the worker (which should ideally be unnecessary), or one that ultimately leads to frequent rejection.

Specifically, we want to determine if experienced requesters author tasks any differently from new requesters, and similarly if experienced workers perceive the quality of tasks any differently from new workers. We would also like to extend this experiment to the Boomerang System being implemented on the Daemo Platform. In the Boomerang system, in addition to experience, we would like to see the impact of reputation/ranking on the quality of task authorship.



Our expectation is that experienced requesters and new requesters are equally likely to author tasks of poor quality (i.e., both are equally likely to reject work or to receive considerable negative feedback from a group of workers). This prediction is based on the immense variety of HITs posted on platforms like AMT.


New workers, who stand to suffer a greater decline in approval rating, may be more concerned with rejection and may therefore give feedback or communicate more frequently. We predict that new and experienced workers will respond similarly to poorly designed tasks, but that new workers will also seek clarification, and may perceive a lack of clarity, even in well designed tasks owing to inexperience.

Experiment Design

We have split our experiment into two parts:

1. Studying how task quality is influenced by requester experience in task authorship and worker experience.

2. Studying how new and experienced workers respond to poorly and well designed tasks.

Part 1

Three groups will be involved in this experiment.

  • There will be two groups of requesters of equal size: one of new requesters with little or no experience in posting HITs on crowd-sourcing platforms like Amazon Mechanical Turk (for example, 0-6 months), and one of experienced requesters who have been using the platform for a considerable amount of time (for example, more than 1 year). Alternatively, we may base this classification on the number of HITs posted: for example, new requesters (<50 HITs posted) and experienced requesters (>150 HITs posted).

The third group will be an equal mix of new and experienced workers. As experimenters, we will know the experience of each worker, but to the requesters posting the HITs, the workers will appear as a randomized group. This enables us to generate additional information about the relationship between different pairs of groups, such as (experienced worker, experienced requester).
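The HIT-count variant of the requester classification could be sketched as follows. This is only an illustration: the function name `classify_requester` and its parameters are hypothetical, and the default thresholds are the example values given above (<50 and >150 HITs posted). Note that in this sketch a requester whose count falls between the two thresholds belongs to neither group.

```python
from typing import Optional


def classify_requester(hits_posted: int,
                       new_below: int = 50,
                       experienced_above: int = 150) -> Optional[str]:
    """Assign a requester to a group by number of HITs posted.

    The thresholds are the example values from the proposal, not fixed
    choices. Returns None for requesters that fall between the two
    thresholds and therefore fit neither group.
    """
    if hits_posted < 0:
        raise ValueError("hits_posted must be non-negative")
    if hits_posted < new_below:
        return "new"
    if hits_posted > experienced_above:
        return "experienced"
    return None
```

The same shape would work for the time-based classification (months on the platform) by swapping the thresholds.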

  • As experimenters, we will give the same task to be posted by both groups of requesters. For example, we will ask the requesters to post a HIT related to image annotation for a given dataset. Each requester, new or old, will post the HIT on the platform.
  • The group of workers will now work on the HITs posted by the two groups of requesters. We will give the workers the option to email the requester about any unclear or poorly designed HITs.
  • For each requester, we will count the number of emails received, the number of HITs successfully completed and the number of HITs rejected.
  • The three preceding steps (posting, working and counting) will be repeated with different types of tasks. For example, we next assign a translation task to be posted by the requesters, then a multiple choice survey, and so on. This accounts for the variety of tasks, and hence the different levels of difficulty, encountered on an actual crowd-sourcing platform.
  • Using the data generated, we will try to establish the relationship between the experience of the requester and the quality of the HIT posted (measured by the number of feedback emails received and the rejection rate of the requester).
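The counts collected in the steps above could be turned into per-group quality measures along the following lines. This is a minimal sketch, not a prescribed analysis: the record type `RequesterLog`, its field names and the helper functions are hypothetical, and the rejection rate is assumed to be rejected HITs divided by all submitted HITs (completed plus rejected).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class RequesterLog:
    """Counts recorded for one requester (hypothetical field names)."""
    requester_id: str
    group: str            # "new" or "experienced"
    emails_received: int
    hits_completed: int   # HITs successfully completed
    hits_rejected: int


def rejection_rate(log: RequesterLog) -> float:
    """Rejected HITs as a fraction of all submitted HITs (assumed definition)."""
    submitted = log.hits_completed + log.hits_rejected
    return log.hits_rejected / submitted if submitted else 0.0


def summarize_by_group(logs: Iterable[RequesterLog]) -> Dict[str, Tuple[float, float]]:
    """Return {group: (mean emails received, mean rejection rate)}."""
    sums: Dict[str, Tuple[int, float, int]] = {}
    for log in logs:
        emails, rates, count = sums.get(log.group, (0, 0.0, 0))
        sums[log.group] = (emails + log.emails_received,
                           rates + rejection_rate(log),
                           count + 1)
    return {group: (emails / count, rates / count)
            for group, (emails, rates, count) in sums.items()}
```

Running `summarize_by_group` once per task type would keep the differences between task categories visible instead of averaging them away.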

Part 2

This experiment is designed to understand how the experience of workers affects their response to poorly and well designed tasks. We want to understand if experienced workers are less likely to seek clarity about tasks because they've become accustomed to poorly designed tasks or if such clarifications are sought by new and experienced workers alike.

The experiment:

  • Identify a task that has been posted in both a poorly designed and a well designed form. The task can be picked from old tasks posted on Amazon Mechanical Turk. A task that has received much negative feedback (from a source like Turkopticon or Reddit) is chosen as the poorly designed task, and a similar task by a requester with a high Turkopticon rating is chosen as the well designed task. There is a significant element of subjectivity involved, so the tasks must be chosen carefully.
  • Post the poorly designed task and the well designed task, open to a pool of workers consisting of an equal number of experienced and new workers. (New and experienced are defined as described in the previous experiment.)
  • Record the number of emails sent to the requester by the workers and the number of tasks that have to be rejected, in each case broken down by the experience of the worker.
  • Repeat this experiment for multiple types of tasks (roughly 5), to account for the possible differences in ambiguity/difficulty in different categories of tasks. For example, labeling an image requires a different skill from translation of text.


The goal of the experiment is to understand how much the experience of a worker matters with respect to the quality of task design. To draw conclusions about this, we intend to analyze any possible relationship between these two aspects. For example, new workers may have issues with, or seek clarification for, both well and poorly designed tasks, or experienced workers may react much more sharply to poorly designed tasks than newer workers do.
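One way to look for such a relationship is to average the recorded emails within each (worker experience, task design) cell and compare the gaps. The sketch below assumes each observation is a tuple of worker experience ("new" or "experienced"), task design ("poor" or "well") and emails sent; the labels and the function names `mean_emails_by_cell` and `design_gap_difference` are hypothetical. It is a descriptive summary only, not a significance test.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Cell = Tuple[str, str]  # (worker experience, task design)


def mean_emails_by_cell(records: Iterable[Tuple[str, str, int]]) -> Dict[Cell, float]:
    """Mean number of emails sent for each (experience, design) cell."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [sum of emails, count]
    for experience, design, emails in records:
        cell = totals[(experience, design)]
        cell[0] += emails
        cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def design_gap_difference(means: Dict[Cell, float]) -> float:
    """(poor - well) gap for new workers minus the same gap for experienced workers.

    A negative value means experienced workers react more sharply to poor
    design than new workers do; a value near zero means both groups
    respond to poor design similarly.
    """
    new_gap = means[("new", "poor")] - means[("new", "well")]
    experienced_gap = means[("experienced", "poor")] - means[("experienced", "well")]
    return new_gap - experienced_gap
```

The same functions apply unchanged to rejection counts in place of email counts.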

Extension to Boomerang Platform

Apart from running this study on existing crowd-sourcing platforms like AMT, we would extend it to the Daemo platform. There, the experiment can be customized to the Boomerang Ranking System by substituting experience with the rank given to requesters and workers. The two parts of the experiment can be conducted with both Prototype tasks and full scale tasks, generating information on how well feedback between different groups in the Prototype stage improves the quality of work when the full HIT is posted.

Interpreting the Results

A correlation measure could be used to assess how closely feedback (emails sent) is related to the experience of the requester. For example, the Pearson correlation coefficient could be calculated between the number of emails sent to a requester and that requester's experience (measured in months or in number of past HITs posted). As per our hypothesis, we expect a Pearson correlation coefficient close to zero (implying little or no linear correlation between the two data series).
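For reference, the Pearson coefficient can be computed directly from its definition, as in the sketch below. The function name `pearson` and the argument names are our own; `xs` would hold requester experience and `ys` the matching email counts. A statistics library could be used instead; the hand-written version only makes the calculation explicit.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient near zero only rules out a linear relationship, so plotting emails against experience alongside the coefficient would guard against missing a non-linear pattern.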

This data can be useful for accurately determining what needs to be improved in the platform. For example, if it is revealed that newer workers seek clarification and claim a lack of clarity in a task more often than experienced workers do, the following situation may arise:

A requester receives many emails and requests to clarify a task, which would seem to indicate that the task is poorly designed. However, the reality may be that the task was attempted by many new workers, causing the flurry of emails, and so the emails may not necessarily indicate that the task design was of poor quality.

The goal of the experiment is to be able to draw such insights, which will help improve the platform.

Milestone Contributors

@aditya_nadimpalli , @sreenihit