Measuring variations in requester quality based on the efficacy of the requester-worker duo

From crowdresearch
  • Requester Variation Determines Result Quality in Crowdsourcing


{Credits: Prof. Michael} The dominant narrative in crowdsourcing is that low-quality workers lead to low-quality work and high-quality workers produce high-quality work. As a result, most techniques focus on identifying high-quality workers. We hypothesize that requesters vary significantly in their ability to create high-quality tasks, much as user interface designers vary significantly in their ability and training in UI design. We performed an experiment in which 30 requesters each authored ten varied crowdsourcing tasks, and workers on Mechanical Turk were randomized to complete one requester per task type. We found that while 20% of the variance in result quality is attributable to worker variation, fully 35% of the variance is attributable to requester variation. We introduce the concept of prototype tasks, which launch every new task to a small set of workers for feedback and revision, and find that they reduce requester variation in result quality to one-third of its baseline amount.

Study Introduction

Study Method


{paragraph 1, description of Daemo and the general worker population}

  • "Method specifics and details"

The basic set of tasks falls under the following primitives: binary, analytical, mathematical, search, behavioural motives, curiosity and exploration, and scientific analysis. These primitives have been reported in the crowdsourcing systems literature [Ref 2] and across various crowdsourcing task types [Ref 3]. We chose these primitives in order to understand the effect of variation in requester task authorship in juxtaposition with worker skill and result quality. In some cases, we generated questions specifically to benchmark worker skill; for example, SEARCH results were compared against the reporting in [ETA paper, Ref 1]. The data for these primitives was obtained from sources such as the SVHN dataset [Ref 4], the Caltech pedestrian dataset [Ref 5], and an assortment of pictures and resources collected off the web. The results for the generated questions were cross-checked thoroughly to remove any unintended ambiguities.

M5 karthikpaga Tasks.JPG

  • Experimental Design for the study

Within the mixed series of seven primitives, a couple are designed to account for the time limit (as posted by the requester) and the time taken by the worker (based on our recordings). The binary task is designed to test the null hypothesis, i.e., that variation in requester task authorship has no deterministic effect on result quality in crowdsourcing. For example: which of the following would be a suitable birthday gift, (a) a construction brick or (b) a bouquet of flowers? Such tasks have been reported to require very minimal effort from the requester, to be minimally error prone, and to be completed quickly by workers [ETA paper, Ref 1]. For timed tasks, the exact time limits were initialized using how long workers took when not under time pressure.

For each worker we presented a cascade of tasks such that they were randomized to complete one requester for each primitive. To preserve random requester-worker pairings in the marketplace, no worker was informed about the specifics or general overview of the tasks before the experiment began. The tasks were presented in randomized order, and within each primitive the timed conditions were presented wherever applicable. Workers were compensated $2.00, and repeat participation was disallowed. However, to better understand the effects of presenting the cascade of tasks with an overview, workers were allowed to comment publicly on the platform to assist the next batch of workers; such commentary is broadly applicable to scientific tasks, surveys, and other settings where the requester needs a large, diverse worker population. A single task was presented on each page, allowing us to record how long workers took to submit a response. Under timed conditions, a timer started as soon as the worker advanced to the next page, and input was disabled as soon as the timer expired, regardless of what the worker was doing (e.g., typing, clicking).
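The randomized assignment described above can be sketched as follows. This is a minimal illustration, not the study's actual implementation; the requester labels and the subset of primitives are placeholders.

```python
import random

def assign_cascade(worker_id, requesters, primitives):
    """Pair a worker with one randomly chosen requester per primitive,
    then shuffle the order in which the primitives are presented."""
    rng = random.Random(worker_id)  # deterministic per worker, for illustration
    cascade = [(primitive, rng.choice(requesters)) for primitive in primitives]
    rng.shuffle(cascade)  # randomized task order
    return cascade

primitives = ["binary", "analytical", "mathematical", "search"]  # subset for illustration
requesters = [f"R{i}" for i in range(1, 31)]  # the study used 30 requesters
cascade = assign_cascade(worker_id=7, requesters=requesters, primitives=primitives)
```

Each worker thus sees every primitive exactly once, with an independently drawn requester for each, matching the one-requester-per-primitive design.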

For each cascade, a log of the following parameters was collected: worker_id, requester_id, truth, worker_response, source (for the SEARCH task), time spent, time allotted, task_id, project_id, comments (for specific tasks), request an example (wherever applicable and available), clarity of task instructions (reported by the worker), approval rating of the worker (as provided by MTurk), requester rating (as provided by Turkopticon), difficulty (rating on a scale of 0-10), time spent by the requester to author the task, presentation quality (good vs. bad), request for feedback and revision (yes vs. no, as deemed necessary by the worker), simplicity of the task, and details to compute the task load index.
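A record in this log can be sketched as a dataclass. The field names mirror the parameters listed above, but the types and defaults are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskLog:
    """One row of the per-task log; types are illustrative assumptions."""
    worker_id: str
    requester_id: str
    task_id: str
    project_id: str
    truth: str
    worker_response: str
    time_spent: float               # seconds
    time_allotted: Optional[float]  # None for untimed tasks
    source: Optional[str] = None    # SEARCH tasks only
    instruction_clarity: Optional[int] = None  # worker-reported
    difficulty: Optional[int] = None           # 0-10 scale
    # The remaining logged parameters (ratings, comments, TLX details)
    # would follow the same pattern.

row = TaskLog(
    worker_id="W1", requester_id="R3", task_id="T42", project_id="P1",
    truth="brick", worker_response="brick",
    time_spent=12.4, time_allotted=None,
)
```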

  • Measures from the study

The information we logged allowed us to calculate behavioural measures for each primitive:

– Efficacy of the worker-requester duo (EWR). The EWR is a value between 0 and 1 representing the system efficacy for each worker-requester pair in the project. Although the mean EWR can be attributed to the extremes of the spectrum, notably exceptional task detailing paired with a low-quality worker and poor detailing paired with a high-quality worker, we observed that knowing the exact task (in other words, how hard the task is) can account for 36% of the variation. Knowing the worker accounts for an additional 21% of the variation beyond knowing the task, and knowing the requester accounts for an additional 12% beyond knowing the task [estimates based on the ETS experimental data]. We measured the obtained response against ground truth for each primitive. If there were many possible correct responses, we judged responses manually while blind to condition. Automatically computed distance metrics (e.g., edit distance) yielded empirically similar findings.

– Time. We measured how long workers took to complete the primitive without any time limit. After each task block was complete, we additionally asked workers to record several subjective reflections: estimated time, clarity of instructions, comments, worker feedback, and any request for revision. Time estimation has previously been used as an implicit signal of task difficulty [5].

– Relative subjective duration (RSD). RSD, a measure of how much task time is over- or underestimated [5], is obtained by dividing the difference between estimated and actual time spent by the actual time spent.

– Task load index (TLX). The NASA TLX [10] is a validated metric of mental workload commonly used in human factors research to assess task performance. Because rankings cannot quantify small changes in effort, as reported in [ETA paper, Ref 1], we analyse the results with a binomial linear regression model. {Thanks to Prof. Michael for helping with the analysis}
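The RSD definition above translates directly into code; the timing values below are illustrative.

```python
def relative_subjective_duration(estimated_s, actual_s):
    """RSD: how much a worker over- (positive) or underestimates
    (negative) task time, as a fraction of the actual time spent."""
    if actual_s <= 0:
        raise ValueError("actual time must be positive")
    return (estimated_s - actual_s) / actual_s

# A worker who spent 40 s but estimated 60 s overestimates by 50%.
rsd_over = relative_subjective_duration(estimated_s=60, actual_s=40)
# A worker who spent 40 s but estimated 30 s underestimates by 25%.
rsd_under = relative_subjective_duration(estimated_s=30, actual_s=40)
```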

Worker requester duo efficacy.jpg

  • What do we want to analyze?

We captured the effect of the factors driving result quality on the basis of the results obtained, using the EWR as a metric to better understand how requesters' task authorship quality affects each worker. The binary-primitive task served as a check on the efficiency and attentiveness of the worker. The vital observation in the EWR is the inter-columnar variation for a given worker, and the consistency of these variations across the other strata (each worker in the matrix being one stratum), which translates directly into variation in the quality of task authorship by a requester.
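The inter-columnar variation described above can be sketched as follows, assuming the EWR scores are arranged as a workers × requesters matrix; the numbers are made up for illustration, not study data.

```python
from statistics import mean, pvariance

# Illustrative EWR matrix: rows are workers (strata), columns are requesters.
ewr = [
    [0.90, 0.40, 0.70],
    [0.80, 0.30, 0.60],
    [0.95, 0.50, 0.75],
]

def column_means(matrix):
    """Mean EWR per requester (column), averaged across all workers."""
    return [mean(col) for col in zip(*matrix)]

def requester_variation(matrix):
    """Variance of the per-requester means: a simple proxy for how much
    task-authorship quality differs between requesters."""
    return pvariance(column_means(matrix))
```

If the column means differ while each worker's ranking of requesters stays consistent from row to row, the variation is attributable to the requesters' task authorship rather than to the workers.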


@karthikpaga @michaelbernstein