Winter Milestone 5 (BPHC): Experimental Design and Statistical Analysis of Task Authorship
We describe the complete experiment for task authorship, designed to determine the role of the requester's input in worker output. We outline the structure of the experiment we wish to conduct and how we intend to analyze the resulting data. The main direction of this experiment comes from our brainstorming sessions.
Study 1: Measuring the Variance in Quality of Task Completion Due to Requester Quality
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms.
The goal is to understand how important requesters are in enabling workers to produce high-quality work. While requesters often claim they face issues with the quality of workers, the quality of task authorship is a major issue among workers as well. Workers on crowdsourcing platforms such as Amazon Mechanical Turk often feel that requesters do not clearly design or outline the jobs they post, and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters, and then determining how much of the variance in submission quality can be attributed to the requesters.
Study 2: Measuring the Potential Benefits of Design Interventions
We wish to determine whether the task design interventions (implemented in Daemo) for crowdsourcing tasks help reduce the variance in worker task quality caused by the requesters.
We predict that introducing design features like Prototype Tasks in Daemo, as well as feedback channels between workers and requesters, will improve task authorship and workers' output quality. We also predict that Boomerang itself will help reduce variance in quality: requesters pick workers they have rated higher, which implies that those workers responded well to the requester's task authorship and design. The reverse holds as well, with workers rating requesters higher if they feel a requester authored a task well.
The effect of these features will be quantified by measuring the variance in worker output attributable to requesters after adding the interventions, and comparing this quantity with the variance obtained in Study 1.
Study 1: We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks, each of which will be undertaken by every worker. We aim to pick tasks that are complex enough to require the requester to design the task well in terms of a clear explanation, sample questions, or general task format. At the same time, the answers to the task should not be subjective or open-ended.
Each requester is provided with the dataset to be used for the task (e.g., a set of images for image annotation or a collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is as close as possible to the ground truth. We will provide minimal details beyond that, to avoid giving the requesters an already designed task.
Each worker then proceeds to tackle each type of task from each requester. Thus, over the course of the experiment, each worker will perform 50 tasks (5 task types from each of 10 requesters). Each task may consist of 10 individual 'questions', so a worker will encounter 500 questions in total. The quality of each worker's output (the number of correct answers measured against the ground truth) is recorded.
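As a sanity check of the experiment's scale, the counts above follow directly from the design parameters (the variable names here are illustrative):

```python
# Design parameters from the study description above.
num_requesters = 10
num_workers = 10
task_types = 5
questions_per_task = 10

# Each worker sees every task type from every requester.
tasks_per_worker = num_requesters * task_types                  # 50 tasks
questions_per_worker = tasks_per_worker * questions_per_task    # 500 questions
total_submissions = tasks_per_worker * num_workers              # 500 worker-task pairs

print(tasks_per_worker, questions_per_worker, total_submissions)
```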
Thus, there will be a constant set of workers who will have seen the same tasks designed by different requesters.
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).
In addition, a t-test can be used to study the statistical significance of each independent variable in the model, and an F-test can be used to study the collective significance of all the variables included in the model.
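A minimal sketch of this analysis, using simulated scores in place of the real experiment data (the effect sizes, noise level, and variable names are illustrative assumptions; dummy variables encode workers and requesters with the first of each as the base category):

```python
import numpy as np

rng = np.random.default_rng(0)

# One row per (worker, requester) pair, as in Study 1.
n_workers, n_requesters = 10, 10
workers = np.repeat(np.arange(n_workers), n_requesters)
requesters = np.tile(np.arange(n_requesters), n_workers)
# Simulated quality scores; real submission data would replace this.
quality = 30 + 0.5 * workers + 1.0 * requesters + rng.normal(0, 2, size=workers.size)

def dummies(codes, n_levels):
    """One-hot encode a categorical variable, dropping the first (base) level."""
    d = np.zeros((codes.size, n_levels - 1))
    for level in range(1, n_levels):
        d[codes == level, level - 1] = 1.0
    return d

# Design matrix: intercept + worker dummies + requester dummies.
X = np.column_stack([np.ones(quality.size),
                     dummies(workers, n_workers),
                     dummies(requesters, n_requesters)])
beta, *_ = np.linalg.lstsq(X, quality, rcond=None)

# R-squared: proportion of variance in quality explained by the model.
resid = quality - X @ beta
ss_res = (resid ** 2).sum()
ss_tot = ((quality - quality.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

# Overall F-statistic for the joint significance of all predictors.
k = X.shape[1] - 1   # number of predictors (excluding the intercept)
n = quality.size
f_stat = (r_squared / k) / ((1 - r_squared) / (n - k - 1))
print(round(r_squared, 3), round(f_stat, 1))
```

In practice a statistics package (e.g., statsmodels or R) would also report per-coefficient t-statistics and p-values; the point here is just the shape of the design matrix and the R-squared computation.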
Method specifics and details
Proposed Tasks and Datasets:
As mentioned earlier, we have chosen tasks that are not too subjective, that require the requester to lucidly define the goal, and whose desired outputs can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets and ground truths, to make the experiment easier to perform.
- 1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what is expected and why something should be classified as so.
Datasets are readily available from this source: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets
- 2. Spelling Correction: In spelling correction, a good requester might improve the quality of the task by clearly specifying factors like:
- Which variety of English is being used (American, British, etc.)
- Whether to look at each word independently or make corrections based on context.
An example of such a task can be seen in this source: https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt
This is also one of the tasks in the "ten primitives common to most crowdsourcing workflows" [Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015].
- 3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well-designed task requires instructions on how to handle garbled audio, whether to choose American or British spellings, etc.
There is a wealth of datasets available for this task. Example: https://www.speech.cs.cmu.edu/databases/an4/
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers not to email or contact the requester in any way, since we study the impact of such interventions in the next study.
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.
Experimental Design for the study
We presented workers with a mixed series of tasks from the ten primitives and manipulated two factors: the time limit and the primitive. Each primitive had seven different possible time limits, and one untimed condition. The exact time limits were initialized using how long workers took when not under time pressure. The result was a sampled, not fully-crossed, design. For each worker we randomly selected five primitives for them to perform; for each primitive, three questions of that type were shown with each of the specified time limits. The images or text used in these questions were randomly sampled and shuffled for each worker. To minimize practice effects, workers completed three timed practice questions prior to seeing any of these conditions. The tasks were presented in randomized order, and within each primitive the time conditions were presented in randomized order. Workers were compensated $2.00 and repeat participation was disallowed. A single task was presented on each page, allowing us to record how long workers took to submit a response. Under timed conditions, a timer started as soon as the worker advanced to the next page. Input was disabled as soon as the timer expired, regardless of what the worker was doing (e.g., typing, clicking). An example task is shown in Figure 3.
Measures from the study
Linear Regression 1 -
b0 represents the base case when x1 … x9 are all zero; this refers to the first worker. When x1 = 1 and x2 … x9 are zero, we isolate the effect of the second worker. This pattern continues. The coefficients of x1 … x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.
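A sketch of this first regression in standard notation (y is a worker's task output, x1 … x9 are binary indicators for workers 2 through 10, e is the error term; symbol names follow the text above):

```latex
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_9 x_9 + e
```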
Linear Regression 2 -
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.
If w2 = 1, r8 = 1, and all other variables are zero, this represents the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contribution of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.
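Using the names in the text (w2 … w10 are indicator variables for workers 2 through 10, r2 … r10 for requesters 2 through 10, e is the error term), the second regression can be sketched as:

```latex
y = b_1 + b_2 w_2 + \dots + b_{10} w_{10} + a_2 r_2 + \dots + a_{10} r_{10} + e
```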
In study 2, we generate the regressions used in Study 1 again, but add an additional binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention included/used to compare the benefits of different interventions.
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.
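A minimal sketch of how an intervention dummy could be coded for the Study 2 regressions, using simulated pooled data (the sample size, the assumed 3-point intervention effect, and the single-dummy layout are all illustrative assumptions; the real analysis would append these columns to the worker/requester design matrix from Study 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200  # hypothetical pooled observations from Studies 1 and 2

# 0 = baseline (Study 1), 1 = design intervention present (Study 2).
intervention = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])
# Simulated quality scores; the intervention is assumed to add 3 points.
quality = 30 + 3 * intervention + rng.normal(0, 2, size=n)

# Regress quality on an intercept plus the intervention dummy.
X = np.column_stack([np.ones(n), intervention])
beta, *_ = np.linalg.lstsq(X, quality, rcond=None)
print(round(beta[1], 2))  # estimated intervention effect
```

Separate dummy columns, one per intervention (Prototype Tasks, feedback channels, Boomerang), would be appended in the same way to compare their individual effects.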
What do we want to analyze?
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the dependent variable is explained by the variables on the right-hand side. Mathematically, R-squared never decreases as variables are added to a model. Hence the second regression in our study will have an R-squared value at least as high as the first, simply because it has more variables.
Keeping this problem in mind, statistical packages generally calculate an adjusted R-squared, which takes into account the number of variables used in the model. The adjusted R-squared values of the two regressions can be compared to see whether requesters explain a considerable amount of the variance in worker output quality.
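For reference, the standard adjustment penalizes each added predictor; a minimal sketch (n is the number of observations, k the number of predictors; the example R-squared values are hypothetical):

```python
def adjusted_r_squared(r_squared, n, k):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# With the same raw R^2, the model using more predictors is penalized more.
print(round(adjusted_r_squared(0.80, 100, 9), 3))   # e.g., worker dummies only
print(round(adjusted_r_squared(0.80, 100, 18), 3))  # worker + requester dummies
```

So the second regression only "wins" if the requester dummies raise R-squared by more than the penalty for the extra nine variables.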