WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship
We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.
Study 1: Measuring the Variance in Quality of Task Completion Due to Requester Quality
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms.
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the quality of task authorship is a major issue among workers as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.
Study 2: Measuring the Potential Benefits of Design Interventions
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters.
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that Boomerang itself will help reduce variance in quality: requesters pick workers they have rated higher, which implies that those workers responded well to the requester's task authorship/design. The reverse also holds, with workers rating requesters higher if they feel a requester authored a task well.
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks which will be undertaken by each worker. We aim to pick tasks that are complex enough to require the requester to design the task well in terms of a clear explanation, sample questions or general task format. At the same time, the task should not be such that the task answers might be subjective or open ended.
Each requester is provided with the dataset to be used for the task (e.g. a set of images for image annotation or a collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is as close as possible to the ground truth. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.
Each worker then proceeds to tackle each type of task from each requester. Thus over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of 500 questions. The quality of the output of each worker (number of correct answers measured against the ground truth) is recorded.
Thus, there will be a constant set of workers who will have seen the same tasks designed by different requesters.
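The counts above follow directly from the design. A minimal sketch in Python, assuming every task has exactly 10 questions (the text only says it may) and using illustrative constant names:

```python
from itertools import product

N_WORKERS = 10
N_REQUESTERS = 10
N_TASK_TYPES = 5
QUESTIONS_PER_TASK = 10  # assumption: treated as fixed for every task

# One assignment = one worker doing one task type as authored by one requester.
assignments = list(product(range(1, N_WORKERS + 1),
                           range(1, N_REQUESTERS + 1),
                           range(1, N_TASK_TYPES + 1)))

# Tasks seen by a single worker (worker 1 is representative, the design is balanced).
tasks_per_worker = sum(1 for worker, _, _ in assignments if worker == 1)
questions_per_worker = tasks_per_worker * QUESTIONS_PER_TASK
total_observations = len(assignments)

print(tasks_per_worker, questions_per_worker, total_observations)
```

Under these assumptions the full experiment yields 500 worker-requester-task observations, which is the sample the regressions would be run on.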
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).
In addition, a t-test can be used to study the statistical significance of each independent variable used in the model, and an F-test can be used to study the collective significance of all the variables included in the model.
Method specifics and details
Proposed Tasks and Datasets:
As mentioned earlier, we have chosen tasks that are not too subjective, require the requester to lucidly define the goal and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets with ground truths, to make the experiment easier to perform.
1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what is expected and why a review should be classified one way or the other. Datasets are readily available from this source: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets
2. Spelling Correction: In spelling correction, a good requester might improve the quality of the task by clearly specifying factors such as which type of English is being used (American, British, etc.) and whether workers should look at each word independently or make corrections based on context. An example of such a task can be seen in this source: https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt This is also one of the tasks in "ten primitives common to most crowdsourcing workflow" - [Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015.]
3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well-designed task requires instructions on how to handle garbled audio, whether to choose American or British spellings, etc. There is a wealth of datasets available for this task. Example: https://www.speech.cs.cmu.edu/databases/an4/
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study.
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is the same for each requester, and we simply partition the overall data set for each task into ten subsets randomly and distribute these to the requesters. Requesters post these tasks on the selected crowdsourcing platform (e.g. Amazon Mechanical Turk).
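The random partition described above could be done along these lines. This is only a sketch: the `partition` helper, the fixed seed, and the 100 placeholder items are assumptions made for illustration and are not part of the study materials.

```python
import random

def partition(items, n_parts, seed=0):
    """Shuffle the items and deal them into n_parts subsets of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    # Subset i takes every n_parts-th item starting at position i.
    return [shuffled[i::n_parts] for i in range(n_parts)]

# Hypothetical data set for one task type.
dataset = [f"item_{i}" for i in range(100)]
subsets = partition(dataset, n_parts=10)

for requester_id, subset in enumerate(subsets, start=1):
    print(requester_id, len(subset))
```

Dealing items out after a shuffle keeps the subsets disjoint and the same size whenever the data set divides evenly, so no requester gets a systematically easier or larger share.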
Each worker takes up five tasks (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual tasks. We propose to give the workers five days to submit all their tasks.
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:
Here, Worker IDs and Requester IDs range from 1 - 10, and Job Types range from 1 - 5.
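One way to hold a row of such a table and to compute its score is sketched below. The field names and the exact-match scoring rule are assumptions; tasks such as transcription would likely need a more forgiving comparison, and the sample answers are made up.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    worker_id: int     # 1 to 10
    requester_id: int  # 1 to 10
    job_type: int      # 1 to 5
    correct: int       # number of answers matching the ground truth

def score(answers, ground_truth):
    """Count the answers that exactly match the ground truth, position by position."""
    return sum(a == g for a, g in zip(answers, ground_truth))

# Made-up example: one worker's answers to a five-question classification task.
truth = ["pos", "neg", "pos", "pos", "neg"]
submitted = ["pos", "neg", "neg", "pos", "neg"]

row = Observation(worker_id=2, requester_id=8, job_type=1,
                  correct=score(submitted, truth))
print(row)
```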
Study 2 follows the same procedure as Study 1, except that workers now have the option to give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Prototype Tasks in the Daemo Platform. An additional column will be present in the data table representing the intervention used by the worker or requester.
Measures from the study
Linear Regression 1 -
b0 represents the base case when x1 ... x9 are all zero. This refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we are isolating the effect of the second worker. This pattern continues. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.
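As an illustration of this coding, suppose the regression has the form y = b0 + b1*x1 + ... + b9*x9 + e. The coefficient names b1 to b9 are assumed here; the description above only names b0 and x1 ... x9. The dummy variables for a given worker could then be built as follows (the helper name is hypothetical):

```python
def worker_dummies(worker_id, n_workers=10):
    """Return [x1, ..., x9] for a worker: x_k is 1 when worker_id == k + 1, else 0.

    Worker 1 is the base case, so all of its dummies are zero.
    """
    return [1 if worker_id == k + 1 else 0 for k in range(1, n_workers)]

for worker_id in (1, 2, 10):
    print(worker_id, worker_dummies(worker_id))
```

Dropping the dummy for worker 1 is what lets b0 stand for that worker; including all ten dummies together with a constant would make the columns perfectly collinear.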
Linear Regression 2 -
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.
If w2 = 1 and r8 = 1 and other variables were zero, this would represent the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contribution of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.
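To make this interpretation concrete, the sketch below fits a regression of the assumed form y = b1 + b2*w2 + ... + b10*w10 + a2*r2 + ... + a10*r10 on made-up, noise-free scores using NumPy's least-squares solver. The effect sizes, the base score of 5.0 and the helper name are all hypothetical; real data would be noisy and the fit would not be exact.

```python
import numpy as np

def design_row(worker_id, requester_id, n=10):
    """Constant term, then dummies w2..w10, then dummies r2..r10."""
    w = [1.0 if worker_id == k else 0.0 for k in range(2, n + 1)]
    r = [1.0 if requester_id == k else 0.0 for k in range(2, n + 1)]
    return [1.0] + w + r

# Hypothetical effects relative to worker 1 and requester 1 (both zero for the base).
worker_effect = {w: 0.1 * (w - 1) for w in range(1, 11)}
requester_effect = {r: -0.2 * (r - 1) for r in range(1, 11)}

rows, scores = [], []
for w in range(1, 11):
    for r in range(1, 11):
        rows.append(design_row(w, r))
        scores.append(5.0 + worker_effect[w] + requester_effect[r])

X = np.array(rows)
y = np.array(scores)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

b1 = coef[0]    # base pair: worker 1 with requester 1
b2 = coef[1]    # worker k sits at index k - 1
a8 = coef[16]   # requester k sits at index k + 8
predicted = b1 + b2 + a8  # fitted score for worker 2 on requester 8's task
print(b1, b2, a8, predicted)
```

The fitted score for any pair is the constant plus that worker's coefficient plus that requester's coefficient, which is why b2 and a8 read as contributions relative to the base pair.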
In Study 2, we generate the regressions used in Study 1 again, but add an additional binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention included/used, to compare the benefits of different interventions.
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.
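The separate binary variables could be encoded per observation as in this small sketch. The column names and the helper are hypothetical labels for the three interventions listed above.

```python
# Hypothetical column names for the interventions named in the text.
INTERVENTIONS = ("prototype_task", "messaging", "boomerang")

def intervention_dummies(used):
    """Return one 0/1 flag per intervention, in the order of INTERVENTIONS."""
    return [1 if name in used else 0 for name in INTERVENTIONS]

# A Study 1 observation has no interventions; a Study 2 observation may have several.
print(intervention_dummies(set()))
print(intervention_dummies({"messaging", "boomerang"}))
```

These flags would be appended to the worker and requester dummies, so each intervention gets its own coefficient in the regression.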
What do we want to analyze?
Goodness of Fit (R-Squared Value)
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the y variable is explained by the variables on the right-hand side. It can be shown mathematically that R-squared never decreases as more variables are added to the model. Hence the second regression in our study will have an R-squared value at least as high as the first, simply because it has more variables.
Keeping this problem in mind, statistical packages generally calculate an adjusted R-squared, which takes into consideration the number of variables used in the model. Adjusted R-squared values of the two regressions can be compared to see if requesters can explain a considerable amount of the variance in the worker output quality.
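The standard adjustment is 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of explanatory variables. The sketch below applies it to made-up R-squared values, assuming n = 500 observations, 9 worker dummies in the first regression and 18 dummies in the second.

```python
def adjusted_r_squared(r2, n_obs, n_regressors):
    """Adjusted R-squared: penalizes R-squared for the number of regressors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

N_OBS = 500  # assumption: one observation per worker, requester and task type

# Hypothetical R-squared values, chosen only to show the penalty at work.
adj_workers_only = adjusted_r_squared(0.40, N_OBS, n_regressors=9)
adj_with_requesters = adjusted_r_squared(0.41, N_OBS, n_regressors=18)

print(adj_workers_only, adj_with_requesters)
```

In this made-up case the plain R-squared rises from 0.40 to 0.41, yet the adjusted value falls slightly, which is the situation the adjustment is meant to expose: the nine extra variables add almost nothing.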
F - Statistic Test
We can compare the two regression models using an F - Test, and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F - Test, we consider the first regression model to be a restricted form of the second regression model, i.e., we can imagine that the coefficients of r2 ... r10 are all zero in the first regression.
Steps in Conducting F - Test:
1. We state our null (default) hypothesis as: the coefficients of r2 ... r10 are all equal to zero. We would then like to use the data generated in the experiment to reject this null hypothesis.
2. A statistics package like Stata can take both regression models as inputs and generate a value called an F - Statistic. This value is internally computed from the R-Squared values of both regressions.
3. If the F - Statistic value is above a certain threshold value, we can confidently reject the null hypothesis and show that it is beneficial to include the extra requester variables in explaining worker output quality.
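The R-squared form of the F - Statistic is F = [(R2_ur - R2_r)/q] / [(1 - R2_ur)/(n - k - 1)], where R2_r and R2_ur are the R-squared values of the restricted and unrestricted regressions, q is the number of restrictions, and k is the number of explanatory variables in the unrestricted model. A minimal sketch, with made-up R-squared values and assuming n = 500, q = 9 and k = 18:

```python
def f_statistic(r2_restricted, r2_unrestricted, n_obs, n_restrictions, n_regressors):
    """F statistic for testing that the extra regressors are jointly zero.

    n_regressors counts the explanatory variables in the unrestricted model,
    not including the constant.
    """
    numerator = (r2_unrestricted - r2_restricted) / n_restrictions
    denominator = (1.0 - r2_unrestricted) / (n_obs - n_regressors - 1)
    return numerator / denominator

# Hypothetical values: workers only (restricted) vs. workers plus requesters.
F = f_statistic(r2_restricted=0.40, r2_unrestricted=0.55,
                n_obs=500, n_restrictions=9, n_regressors=18)
print(F)
```

The threshold in step 3 is the critical value of the F distribution with (q, n - k - 1) degrees of freedom at the chosen significance level, which a statistics package reports alongside the statistic.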
For a detailed background on R-squared and F - Statistic values, please refer to Chapters 3-4 of "Introductory Econometrics: A Modern Approach (2012)" by Jeffrey M. Wooldridge.
Concerns about this approach
We have a concern about this experiment whose validity we are unsure of. It is possible that a worker encounters a well authored/designed task halfway into the experiment and, as a consequence, performs very well whenever he/she encounters a task of the same type again over the rest of the experiment.
If this is a real possibility, it may skew the results of the experiment. We welcome any suggestions or counter-points regarding this.
@sreenihit , @aditya_nadimpalli