WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship
== Methods (for task authorship write-up) ==

We describe the complete experiment on task authorship, whose aim is to determine the role of the requester's input in worker output. We outline the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. The main direction of this experiment comes from the brainstorming sessions.

We borrow the methods section of this paper as an example: [[:Media:2015 eta (private).pdf | Cheng, J., Teevan, J. & Bernstein, M.S. (2015). Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015.]]. Please note how that section is divided into different parts, and please follow the same template.

=== Study introduction ===
[[Study 1]]: Measuring the Variance in Quality of Task Completion Due to Requester Quality

In our first study, we wish to quantify the effect of requesters on the quality of workers' task submissions, across a spread of tasks popular on crowdsourcing platforms.

The goal is to understand how important '''requesters are for enabling workers''' to produce high quality work. While requesters often claim they face issues with the quality of workers' submissions, the quality of task authorship is a major issue among workers as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design or outline the jobs they post, and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.

We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters, and determining how much of the variance in submission quality can be attributed to the requesters.


[[Study 2]]: Measuring the Potential Benefits of Design Interventions

We wish to determine whether the task design interventions for crowdsourcing tasks implemented in Daemo will help reduce the variance in worker task quality caused by the requesters.

We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers' output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality: because requesters pick workers they have rated higher, those are workers who responded well to the requester's task authorship and design. The reverse also holds, with workers rating requesters higher if they feel a requester authored a task well.

The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions, and comparing this quantity with the variance obtained in Study 1.
  
 
=== Study method ===
[[Study 1]]: We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks, each of which will be undertaken by every worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions, or general task format. At the same time, the task should ''not'' be one whose answers are subjective or open-ended.

Each requester is provided with the dataset to be used for the task (e.g. a set of images for image annotation or a collection of audio clips for audio transcription) and the ground truth for that dataset. From the requester's perspective, the goal will be to obtain output from the workers that is as close as possible to the ground truth. We will provide minimal details beyond that, to avoid giving the requesters an already designed task.

Each worker then proceeds to tackle each type of task from each requester. Thus, over the course of the experiment, each worker will complete 50 tasks. Each task may consist of 10 individual 'questions', so a worker will encounter 500 questions. The quality of each worker's output (the number of correct answers measured against the ground truth) is recorded.

Thus, there will be a constant set of workers who have all seen the same tasks designed by different requesters.

We then use this data to run a linear regression of task output on categorical (dummy) variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).

In addition, a t-test can be used to study the statistical significance of each independent variable in the model, and an F-test can be used to study the collective significance of all the variables included in the model.
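As a rough sketch of how the per-HIT quality measure and the regression table could be assembled (the file name and column names below are placeholders, not part of the study design):

<syntaxhighlight lang="python">
import pandas as pd

# Hypothetical log of submissions, one row per answered question:
# worker, requester, task_type, question_id, answer, truth (column names are placeholders).
submissions = pd.read_csv("submissions.csv")

# Score each response against the ground truth.
submissions["correct"] = (submissions["answer"] == submissions["truth"]).astype(int)

# Quality of worker output per HIT: correct responses / total responses required in the task.
quality = (
    submissions
    .groupby(["worker", "requester", "task_type"], as_index=False)["correct"]
    .mean()
    .rename(columns={"correct": "quality"})
)

# 10 workers x 10 requesters x 5 task types = 500 quality observations.
print(quality.head())
</syntaxhighlight>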
  
 
=== Method specifics and details ===
Proposed Tasks and Datasets:

As mentioned earlier, we have chosen tasks that are not too subjective, that require the requester to lucidly define the goal, and whose desired outputs can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets and ground truths, to make the experiment easier to perform.

*1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what is expected and why a review should be classified one way or the other.

Datasets are readily available from this source:
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets

*2. Spelling Correction: In spelling correction, a good requester might improve the quality of the task by clearly specifying factors like:
**What type of English is being used (American, British, etc.)
**Whether to look at each word independently or to make corrections based on context.

An example of such a task can be seen in this source:
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt

This is also one of the "ten primitives common to most crowdsourcing workflows" identified in [Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015.]

*3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed task requires instructions on how to handle garbled audio, whether to choose American or British spellings, etc.

There is a wealth of datasets available for this task. Example:
https://www.speech.cs.cmu.edu/databases/an4/

*4.

*5.

Crowdsourcing Platform:

For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers not to email or contact the requester in any way, as we study the impact of such interventions in the next study.

For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.
  
 
=== Experimental Design for the study ===
''(The following is the corresponding section from Cheng et al. (2015), retained as a template for this section.)'' We presented workers with a mixed series of tasks from the ten primitives and manipulated two factors: the time limit and the primitive. Each primitive had seven different possible time limits, and one untimed condition. The exact time limits were initialized using how long workers took when not under time pressure. The result was a sampled, not fully-crossed, design. For each worker we randomly selected five primitives for them to perform; for each primitive, three questions of that type were shown with each of the specified time limits. The images or text used in these questions were randomly sampled and shuffled for each worker. To minimize practice effects, workers completed three timed practice questions prior to seeing any of these conditions. The tasks were presented in randomized order, and within each primitive the time conditions were presented in randomized order. Workers were compensated $2.00 and repeat participation was disallowed. A single task was presented on each page, allowing us to record how long workers took to submit a response. Under timed conditions, a timer started as soon as the worker advanced to the next page. Input was disabled as soon as the timer expired, regardless of what the worker was doing (e.g., typing, clicking). An example task is shown in Figure 3.
 
=== Measures from the study ===
[[Study 1]]:

Linear Regression 1 -

y = b1 + b2w2 + b3w3 + ... + b10w10

where wi is a dummy variable equal to 1 if the HIT was submitted by worker i and 0 otherwise, and y, the quality of worker output, is defined as

y = (Number of Correct Responses) / (Total Number of Responses Required in the task)

The data for y is collected for each HIT that the worker has submitted.

The constant b1 represents the base case in which w2 ... w10 are all zero, i.e. the first worker. When w2 = 1 and w3 ... w10 are all zero, we are isolating the effect of the second worker, and so on. The coefficients b2 ... b10 therefore represent the increase or decrease in the performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b1.
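To make the dummy coding concrete, here is a minimal sketch of Linear Regression 1, assuming the quality table sketched above (column names are placeholders); worker 1 is dropped so that its effect is absorbed into the constant b1:

<syntaxhighlight lang="python">
import pandas as pd
import statsmodels.api as sm

# Indicator (dummy) variables w2 ... w10; drop_first=True drops worker 1 (the base case).
worker_dummies = pd.get_dummies(quality["worker"], prefix="w", drop_first=True).astype(float)

X = sm.add_constant(worker_dummies)   # the constant column corresponds to b1
y = quality["quality"]

reg1 = sm.OLS(y, X).fit()
print(reg1.params)     # b1 (const) plus b2 ... b10, the offsets for workers 2-10
print(reg1.rsquared)   # variance in quality explained by worker identity alone
</syntaxhighlight>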
Linear Regression 2 -

y = b1 + b2w2 + b3w3 + ... + b10w10 + a2r2 + a3r3 + ... + a10r10

Here, dummy variables r2 ... r10 representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.

If w2 = 1 and r8 = 1 and all other variables are zero, this represents the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contributions of worker 2 and requester 8 to task output, relative to the base pair of worker 1 and requester 1.
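Both regressions can equivalently be fit with a formula interface and compared directly; a sketch under the same assumptions about the quality table:

<syntaxhighlight lang="python">
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Linear Regression 1: worker dummies only.
reg1 = smf.ols("quality ~ C(worker)", data=quality).fit()

# Linear Regression 2: worker dummies plus requester dummies.
reg2 = smf.ols("quality ~ C(worker) + C(requester)", data=quality).fit()

print(reg1.rsquared, reg2.rsquared)   # R-squared of each regression
print(reg2.summary())                 # t-statistics per coefficient and the overall F-test

# Nested-model F-test: do the requester dummies add significant explanatory power?
print(anova_lm(reg1, reg2))
</syntaxhighlight>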
[[Study 2]]:

In Study 2, we run the regressions used in Study 1 again, but add an additional binary variable representing the presence or absence of the design interventions. Separate binary variables can be added for each intervention used, in order to compare the benefits of the different interventions.

The interventions include Prototype Tasks, features to email or message requesters for clarification, and the Boomerang feature itself.
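A sketch of how the Study 2 intervention indicators could enter the same regression (the 0/1 column names and the quality_study2 table are placeholders for whatever logging Study 2 produces):

<syntaxhighlight lang="python">
import statsmodels.formula.api as smf

# prototype_task, feedback_channel, boomerang: hypothetical 0/1 columns marking whether
# each intervention was active for the HIT in question.
reg3 = smf.ols(
    "quality ~ C(worker) + C(requester) + prototype_task + feedback_channel + boomerang",
    data=quality_study2,
).fit()
print(reg3.summary())   # coefficients on the intervention indicators estimate their benefit
</syntaxhighlight>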
  
 
=== What do we want to analyze? ===
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the y variable is explained by the variables on the right-hand side. Mathematically, the R-squared value never decreases as more variables are added to the model. Hence the second regression in our study will have an R-squared value at least as high as the first, simply because it has more variables.

Keeping this problem in mind, statistical packages generally calculate an adjusted R-squared, which takes into consideration the number of variables used in the model. The adjusted R-squared values of the two regressions can be compared to see whether the requesters explain a considerable amount of the variance in worker output quality.
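For reference, the adjusted R-squared corrects the ordinary R-squared for the number of regressors <math>p</math> given <math>n</math> observations:

<math>\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}</math>

In statsmodels this is available as the <code>rsquared_adj</code> attribute of a fitted model, so comparing <code>reg1.rsquared_adj</code> with <code>reg2.rsquared_adj</code> from the sketches above gives exactly the comparison described here.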
