WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship<p>Sreenihitmunakala, 2016-02-14</p>
<hr />
<div>We describe the complete experiment for task authorship, designed to determine the role of the requester's input in worker output. We outline the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We have taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality: requesters pick workers they have rated highly, which implies that those workers responded well to the requester's task authorship/design, and, vice versa, workers rate requesters higher if they feel a requester authored a task well.<br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks, which will be undertaken by each worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the answers to the tasks should ''not'' be subjective or open-ended.<br />
<br />
Each requester is provided with the dataset to be used for the task (e.g., a set of images for image annotation or a collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus, over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of 500 questions. The quality of the output of each worker (the number of correct answers measured against the ground truth) is recorded.<br />
<br />
Thus, '''there will be a constant set of workers who will have seen the same tasks designed by different requesters'''. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, a t-test can be used to study the statistical significance of each independent variable used in the model, and an F-test can be used to study the collective significance of all the variables included in the model.<br />
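The regression described in the last two paragraphs can be sketched in Python with statsmodels; the data below is a simulated stand-in for the results table, and the column names (worker, requester, correct) are our own, not part of the experiment:<br />

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated stand-in for the results table: one row per (worker, requester)
# pair, with 'correct' answers out of 10 questions.
rows = [(w, r, min(10, max(0, round(5 + 0.2 * w + 0.3 * r + rng.normal(0, 1)))))
        for w in range(1, 11) for r in range(1, 11)]
df = pd.DataFrame(rows, columns=["worker", "requester", "correct"])

# Regress task output on categorical worker and requester variables.
model = smf.ols("correct ~ C(worker) + C(requester)", data=df).fit()

print(model.rsquared)        # coefficient of determination
print(model.f_pvalue)        # joint (F-test) significance of all variables
print(model.pvalues.head())  # per-coefficient t-test p-values
```

Here model.rsquared is the R-squared value discussed above, model.pvalues holds the per-variable t-test p-values and model.f_pvalue the p-value of the overall F-test.<br />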
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that are not too subjective, that require the requester to lucidly define the goal, and whose desired outputs can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets and ground truths, to make the experiment easier to perform.<br />
<br />
1. '''Classification of Product Reviews as Positive or Negative''': This would require a requester to clearly explain what is expected and why a review should be classified as such.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
2. '''Spelling Correction''': In spelling correction, a good requester might improve the quality of the task by clearly specifying factors like:<br />
<br />
- What type of English is being used (American, British, etc.)<br />
- Whether to look at each word independently or to make corrections based on context.<br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
This is also one of the "ten primitives common to most crowdsourcing workflows" - [Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015.]<br />
<br />
3. '''Audio Transcription''': Audio transcription is a common task on crowdsourcing platforms. A well designed task requires instructions on how to handle garbled audio, whether to choose American or British spellings, etc.<br />
There is a wealth of datasets available for this task. Example:<br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
4. '''Image Labeling''': Although it can be argued that this requires limited involvement from requesters, it is one of the most popular tasks on crowdsourcing platforms and is hard to ignore.<br />
Datasets of this kind are plentiful.<br />
<br />
5. '''Categorize What You See''': Amazon's tasks are often of this kind, where a product's category is to be determined. Other variations of this task exist as well, as do large datasets.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is the same for every requester: we simply partition the overall data set for each task into ten random subsets and distribute these to the requesters. Requesters post these tasks on the selected crowdsourcing platform (e.g., Amazon Mechanical Turk).<br />
<br />
Each worker takes up five tasks (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual tasks. We propose to give the workers five days to submit all their tasks.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker IDs and Requester IDs range from 1 - 10, and Job Types range from 1 - 5.<br />
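The workload arithmetic above (10 workers, 10 requesters, 5 job types) can be sanity-checked with a short sketch; the identifiers are illustrative:<br />

```python
from itertools import product

workers = range(1, 11)     # Worker IDs 1-10
requesters = range(1, 11)  # Requester IDs 1-10
job_types = range(1, 6)    # Job Types 1-5

# One row per (worker, requester, job type) cell of the design.
design = list(product(workers, requesters, job_types))
print(len(design))  # 500 rows in the full results table

# Each worker sees one task per (requester, job type) pair.
tasks_per_worker = sum(1 for w, r, j in design if w == 1)
print(tasks_per_worker)  # 50 tasks per worker
```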
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers have the option to now give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Prototype Tasks in the Daemo Platform. An additional column will be present in the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:eq3.jpg]]<br />
<br />
b0 represents the base case, when x1 ... x9 are all zero; this refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we are isolating the effect of the second worker, and this pattern continues. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
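To illustrate why the dummy coefficients behave this way, here is a small NumPy sketch with three hypothetical workers (made-up scores, five observations each). With worker dummies only, ordinary least squares recovers worker 1's mean score as the constant and each other worker's difference from it as a coefficient:<br />

```python
import numpy as np

# Toy scores: five observations each for three workers with means 6, 8 and 4.
scores = np.array([6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 4, 4, 4, 4, 4], dtype=float)
worker = np.repeat([1, 2, 3], 5)

# Design matrix: intercept (b0) plus dummies x1, x2 for workers 2 and 3;
# worker 1 is the base case absorbed into the intercept.
X = np.column_stack([np.ones_like(scores),
                     (worker == 2).astype(float),
                     (worker == 3).astype(float)])

b, *_ = np.linalg.lstsq(X, scores, rcond=None)
print(b)  # b0 = 6 (worker 1's mean), b1 = +2 and b2 = -2 (differences)
```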
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1 and r8 = 1 and other variables were zero, this would represent the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contribution of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.<br />
<br />
'''Study 2''':<br />
<br />
In Study 2, we generate the regressions used in Study 1 again, but add a binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention used, to compare the benefits of different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.<br />
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the dependent variable is explained by the variables on the right-hand side. Mathematically, R-squared never decreases as variables are added to the model. Hence the second regression in our study will have a higher R-squared value simply because it has more variables.<br />
<br />
With this problem in mind, statistical packages generally report an adjusted R-squared, which penalizes the model for the number of variables it uses. The adjusted R-squared values of the two regressions can be compared to see if requesters explain a considerable amount of the variance in worker output quality.<br />
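A common form of the penalty is adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - k - 1), where n is the number of observations and k the number of regressors. A minimal sketch; the R-squared values and counts below are illustrative, not experimental results:<br />

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k regressors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Illustrative numbers: Regression 1 has 9 worker dummies, Regression 2 adds
# 9 requester dummies, and the results table has n = 500 rows.
print(adjusted_r2(0.40, 500, 9))   # workers only
print(adjusted_r2(0.55, 500, 18))  # workers + requesters
```

Unlike plain R-squared, this quantity can fall when added variables contribute too little explanatory power for the degrees of freedom they consume.<br />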
<br />
==== F-Statistic Test ====<br />
<br />
We can compare the two regression models using an F-test and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F-test, we consider the first regression model to be a restricted form of the second, i.e., we imagine that r2 ... r10 are all zero in the first regression.<br />
<br />
Steps in conducting the F-test:<br />
<br />
1. We state our null (default) hypothesis: r2 ... r10 are all equal to zero. We then use the data generated in the experiment to try to reject this null hypothesis.<br />
<br />
2. A statistics package like Stata can take both regression models as inputs and generate an F-statistic. This value is computed internally from the R-squared values of both regressions.<br />
<br />
3. If the F-statistic is above a critical threshold, we can confidently reject the null hypothesis and show that it is beneficial to include the extra requester variables in explaining worker output quality.<br />
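The computation in step 2 can be written out directly: F = ((R2_ur - R2_r) / q) / ((1 - R2_ur) / (n - k - 1)), where R2_r and R2_ur are the R-squared values of the restricted and unrestricted models, q is the number of restrictions (here the 9 requester dummies) and k the number of regressors in the unrestricted model. A sketch using SciPy for the critical value; all numbers below are illustrative, not experimental results:<br />

```python
from scipy.stats import f as f_dist

def nested_f_stat(r2_r, r2_ur, n, k, q):
    """F-statistic for testing q restrictions against an unrestricted
    model with k regressors, computed from the two R-squared values."""
    return ((r2_ur - r2_r) / q) / ((1 - r2_ur) / (n - k - 1))

# Illustrative values: n = 500 rows, 18 dummies in the unrestricted model,
# and a null hypothesis that sets the 9 requester dummies to zero.
F = nested_f_stat(r2_r=0.40, r2_ur=0.55, n=500, k=18, q=9)
crit = f_dist.ppf(0.95, dfn=9, dfd=500 - 18 - 1)  # 5% critical value

print(F > crit)  # reject the null hypothesis if True
```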
<br />
''For a detailed background on R-squared and F-statistic values, please refer to Chapters 3-4 of "Introductory Econometrics: A Modern Approach" (2012) by Jeffrey M. Wooldridge.''<br />
<br />
<br />
==== Concerns about this approach ====<br />
<br />
We have one main concern about this experiment: a worker may encounter a well authored/designed task partway into the experiment and, having learned that task type from it, perform very well whenever he/she encounters a task of the same type again over the rest of the experiment.<br />
<br />
We are not sure how serious this learning effect is; if it is real, it may skew the results of the experiment. We welcome any suggestions or counter-points regarding this.<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality, because requesters pick workers they've rated higher, this implies that the workers responded well to the requester's task authorship/design. And vice-versa, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks which will be undertaken by each worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task should ''not'' be such that the task answers might be subjective or open ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (eg. set of images for image annotation or collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of '500' questions. The quality of the output of each worker (number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, '''there will be a constant set of workers who will have seen the same tasks designed by different requesters'''. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, the T-test can used to study the statistical significance of each independent variable used in the model and an F-test can be used to study the collective significance of all the variables included in the model.<br />
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that can are not too subjective, require the requester to lucidly define the goal and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets with ground truths, to make the experiment easier to perform.<br />
<br />
1. '''Classification of Product Reviews as Positive or Negative''': This would require a requester to clearly explain what <br />
is expected and why something should be classified as so.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
2. '''Spelling Correction''': In spelling correction,a good requester might improve the quality of the task by clearly <br />
specifying factors like:<br />
<br />
- What type of English is being used (American, British, etc)<br />
- Look at each word independently, or make corrections based on context. <br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
This is also one of the tasks in "ten primitives common to most crowdsourcing workflow" - [Measuring Crowdsourcing <br />
Effort with Error-Time Curves. CHI 2015.]<br />
<br />
3. '''Audio Transcription''': Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed <br />
task requires instructions on how to handle garbled audio, whether to choose American or British spellings etc.<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
4. '''Image Labeling''': Although it can be argued that this requires limited involvement from requesters, it is one of the most popular<br />
tasks on crowdsourcing platforms. It is hard to ignore this task.<br />
The sources for datasets of this kind are plenty.<br />
<br />
<br />
5. '''Categorize What You See''': Amazon's tasks are often of this kind, where a product's category is to be determined. <br />
Other variations of this task exist as well, as are large datasets.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is same for each requester, and we simply partition the overall data set for each task into ten subsets randomly and distribute these to the requesters. Requesters post these tasks on the selected crowd sourcing platform (eg. Amazon Mechanical Turk).<br />
<br />
Each worker takes up five tasks (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual tasks. We propose to give the workers five days to submit all their tasks.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker IDs and Requester IDs range from 1 - 10, and Job Types range from 1 - 5.<br />
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers have the option to now give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Prototype Tasks in the Daemo Platform. An additional column will be present in the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:eq3.jpg]]<br />
<br />
b0 represents the base case when x1 …… x9 are all zero. This refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we are isolating the effect of the second worker. This pattern continues. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1 and r8 = 1 and other variables were zero, this would represent the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contribution of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.<br />
<br />
'''Study 2''':<br />
<br />
In study 2, we generate the regressions used in Study 1 again, but add an additional binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention included/used to compare the benefits of different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.<br />
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the y variable is due to the variables in the right hand side. Mathematically it has been shown that R-squared values increase as the number of variables in the model increase. Hence the second regression in our study will definitely have a higher R-squared value simply because it has more variables. <br />
<br />
Keeping this problem in mind, statistical packages generally calculate a adjusted R-squared which takes into consideration the number of variables used in the model. Adjusted R-squared values of the two regressions can be compared to see if requesters can explain a considerable amount of the variance in the worker output quality.<br />
<br />
==== F - Statistic Test ====<br />
<br />
We can compare the two regression models using an F - Test, and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F - Test, we consider the first regression model to be a restricted form of the second regression model, i.e, we can imagine that r2 ... r10 are all zero in the first regression. <br />
<br />
Steps in Conducting F - Test:<br />
<br />
1. We state are null (default) hypothesis as r2 ... r10 are all equal to zero. Now we would like to use the data generated in the experiment to prove this null hypothesis wrong.<br />
<br />
2. A statistics package like Stata, can take both regression models as inputs and generate a value called an F - Statistic. This value is internally computed from the R-Squared values of both regressions.<br />
<br />
3. If the F - Statistic value is above a certain threshold value, we can confidently reject the null hypotheis and show that is beneficial to include the extra requester variables in explaining worker output quality.<br />
<br />
''For a detailed background on R-squared and F - Statistic values, please refer Chapters 3 -4, "Introductory Econometrics : A Modern Approach (2012)" by Jeffrey M. Wooldridge''<br />
<br />
<br />
==== Concerns about this approach ====<br />
<br />
We have some concerns about this experiment that we're not sure about. We suspect it is possible that a worker may encounter a well authored/designed task halfway into the experiment, and as a consequence will perform very well whenever he/she encounters a task of the same type again over the course of the rest of the experiment. <br />
<br />
We are not sure how valid our concerns are, if it is a real possibility it may skew the results of the experiment. We welcome any suggestions or counter-points regarding this.<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_5_BPHC_Experimental_Design_and_Statistical_Analysis_of_Task_Authorship&diff=17911WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship2016-02-14T17:20:17Z<p>Sreenihitmunakala: /* Concerns about this approach */</p>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality, because requesters pick workers they've rated higher, this implies that the workers responded well to the requester's task authorship/design. And vice-versa, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks which will be undertaken by each worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task should ''not'' be such that the task answers might be subjective or open ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (eg. set of images for image annotation or collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of '500' questions. The quality of the output of each worker (number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, '''there will be a constant set of workers who will have seen the same tasks designed by different requesters'''. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
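As a minimal sketch of this measure, the coefficient of determination can be computed directly from observed scores and the regression's fitted values. The scores and fitted values below are hypothetical placeholders for illustration, not results from the study.<br />

```python
def r_squared(y, y_hat):
    """Coefficient of determination: the share of variance in the
    observed task scores y that is explained by the fitted values y_hat."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)              # total variation
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained part
    return 1.0 - ss_res / ss_tot

# Hypothetical per-task scores and fitted values, for illustration only:
scores = [7, 9, 6, 8]
fitted = [7.5, 8.5, 6.5, 7.5]
print(r_squared(scores, fitted))  # prints 0.8
```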
<br />
In addition, a t-test can be used to study the statistical significance of each independent variable used in the model, and an F-test can be used to study the collective significance of all the variables included in the model.<br />
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that are not too subjective, require the requester to lucidly define the goal, and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets and ground truths, to make the experiment easier to perform.<br />
<br />
1. '''Classification of Product Reviews as Positive or Negative''': This would require a requester to clearly explain what is expected and why a review should be classified one way or the other.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
2. '''Spelling Correction''': In spelling correction, a good requester might improve the quality of the task by clearly specifying factors like:<br />
<br />
- What type of English is being used (American, British, etc.)<br />
- Whether to look at each word independently or make corrections based on context. <br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
This is also one of the tasks in the "ten primitives common to most crowdsourcing workflows" - [Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015.]<br />
<br />
3. '''Audio Transcription''': Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed task requires instructions on how to handle garbled audio, whether to choose American or British spellings, etc.<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
4.<br />
<br />
5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers not to email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is the same for every requester: we simply partition the overall data set for each task into ten subsets at random and distribute these to the requesters. Requesters post these tasks on the selected crowdsourcing platform (e.g., Amazon Mechanical Turk).<br />
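The random partitioning step can be sketched as follows. The dataset contents and the fixed seed are placeholders of our own choosing, used only to make the sketch reproducible.<br />

```python
import random

def partition_dataset(items, n_requesters=10, seed=42):
    """Shuffle the overall task dataset and split it into equal subsets,
    one per requester. The seed is fixed only for reproducibility."""
    pool = list(items)
    random.Random(seed).shuffle(pool)
    # Take every n-th item starting at offset i, so subsets are equal-sized.
    return [pool[i::n_requesters] for i in range(n_requesters)]

# e.g. 100 image IDs split across 10 requesters, 10 items each:
subsets = partition_dataset(range(100))
```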
<br />
Each worker takes up five tasks (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual tasks. We propose to give the workers five days to submit all their tasks.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker IDs and Requester IDs range from 1 to 10, and Job Types range from 1 to 5.<br />
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers now have the option to give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Prototype Tasks in the Daemo platform. An additional column will be present in the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:eq3.jpg]]<br />
<br />
b0 represents the base case when x1 ... x9 are all zero. This refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we are isolating the effect of the second worker. This pattern continues. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1, r8 = 1, and all other variables are zero, this represents the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contribution of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.<br />
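The dummy-variable encoding described above can be sketched in code. The column layout below (intercept first, then worker dummies w2-w10, then requester dummies r2-r10) is our own illustrative convention, not output from any statistics package.<br />

```python
def design_row(worker_id, requester_id, n_workers=10, n_requesters=10):
    """One design-matrix row for Linear Regression 2:
    [intercept, w2..w10, r2..r10]. Worker 1 and requester 1 are the
    base categories absorbed into the constant term."""
    row = [1.0]  # intercept: base worker 1 paired with base requester 1
    row += [1.0 if worker_id == w else 0.0 for w in range(2, n_workers + 1)]
    row += [1.0 if requester_id == r else 0.0 for r in range(2, n_requesters + 1)]
    return row

# Worker 2 on a task posted by requester 8: only the w2 and r8 dummies are set.
row = design_row(2, 8)
```

Regressing each per-task score on rows like these recovers the worker and requester coefficients relative to the base pair.<br />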
<br />
'''Study 2''':<br />
<br />
In Study 2, we run the regressions used in Study 1 again, but add an additional binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention used, to compare the benefits of different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.<br />
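In the same encoding, Study 2 appends one binary column per intervention to each row. The intervention labels below are placeholder names of our own for the features listed above, not identifiers from Daemo or Mechanical Turk.<br />

```python
# Placeholder labels for the three interventions discussed in the text:
INTERVENTIONS = ("prototype_task", "feedback_channel", "boomerang")

def intervention_columns(used):
    """One 0/1 column per design intervention, appended after the
    worker and requester dummies of the Study 1 design matrix."""
    return [1.0 if name in used else 0.0 for name in INTERVENTIONS]

# A task where the worker used only the feedback channel:
cols = intervention_columns({"feedback_channel"})
```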
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the y variable is explained by the variables on the right-hand side. Mathematically, the R-squared value never decreases as variables are added to the model. Hence the second regression in our study will have a higher R-squared value simply because it has more variables. <br />
<br />
Keeping this problem in mind, statistical packages generally calculate an adjusted R-squared, which penalizes the number of variables used in the model. The adjusted R-squared values of the two regressions can be compared to see whether the requesters explain a considerable amount of the variance in worker output quality.<br />
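The adjustment is the standard degrees-of-freedom correction; the R-squared values used below are made-up numbers for illustration, not results from the study.<br />

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k regressors
    (excluding the intercept): penalizes additional variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical: 500 observations; regression 1 has 9 worker dummies,
# regression 2 has 18 worker + requester dummies. Compare penalized fits:
adj1 = adjusted_r2(0.40, n=500, k=9)
adj2 = adjusted_r2(0.45, n=500, k=18)
```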
<br />
==== F - Statistic Test ====<br />
<br />
We can compare the two regression models using an F-test, and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F-test, we consider the first regression model to be a restricted form of the second regression model, i.e., we can imagine that the coefficients of r2 ... r10 are all zero in the first regression. <br />
<br />
Steps in Conducting F - Test:<br />
<br />
1. We state our null (default) hypothesis: the coefficients of r2 ... r10 are all equal to zero. We then use the data generated in the experiment to try to reject this null hypothesis.<br />
<br />
2. A statistics package like Stata can take both regression models as inputs and generate a value called an F-statistic. This value is computed internally from the R-squared values of both regressions.<br />
<br />
3. If the F-statistic is above a certain critical value, we can confidently reject the null hypothesis and conclude that it is beneficial to include the extra requester variables in explaining worker output quality.<br />
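The steps above can be sketched directly: for q = 9 exclusion restrictions, the F-statistic is computable from the two R-squared values alone. The inputs below are hypothetical placeholders, not measured results.<br />

```python
def f_statistic(r2_restricted, r2_unrestricted, n, k_unrestricted, q):
    """F-statistic for testing q exclusion restrictions, computed from
    the R-squared of the restricted and unrestricted regressions."""
    numerator = (r2_unrestricted - r2_restricted) / q
    denominator = (1.0 - r2_unrestricted) / (n - k_unrestricted - 1)
    return numerator / denominator

# Hypothetical: dropping the 9 requester dummies lowers R-squared
# from 0.45 to 0.40, with n = 500 observations and 18 regressors:
f = f_statistic(0.40, 0.45, n=500, k_unrestricted=18, q=9)
```

The resulting value is then compared against the critical value of the F distribution with (q, n - k - 1) degrees of freedom.<br />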
<br />
''For a detailed background on R-squared and F-statistic values, please refer to Chapters 3-4 of "Introductory Econometrics: A Modern Approach (2012)" by Jeffrey M. Wooldridge.''<br />
<br />
<br />
==== Concerns about this approach ====<br />
<br />
We have one main concern about this experiment. A worker may encounter a well authored/designed task halfway into the experiment and, as a consequence, perform very well whenever he/she encounters a task of the same type again over the course of the rest of the experiment. <br />
<br />
We are not sure how serious this learning effect is; if it is a real possibility, it may skew the results of the experiment. We welcome any suggestions or counter-points regarding this.<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_5_BPHC_Experimental_Design_and_Statistical_Analysis_of_Task_Authorship&diff=17900WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship2016-02-14T17:08:47Z<p>Sreenihitmunakala: /* Method specifics and details */</p>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality, because requesters pick workers they've rated higher, this implies that the workers responded well to the requester's task authorship/design. And vice-versa, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks which will be undertaken by each worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task should ''not'' be such that the task answers might be subjective or open ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (eg. set of images for image annotation or collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of '500' questions. The quality of the output of each worker (number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, '''there will be a constant set of workers who will have seen the same tasks designed by different requesters'''. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, the T-test can used to study the statistical significance of each independent variable used in the model and an F-test can be used to study the collective significance of all the variables included in the model.<br />
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that can are not too subjective, require the requester to lucidly define the goal and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets with ground truths, to make the experiment easier to perform.<br />
<br />
1. '''Classification of Product Reviews as Positive or Negative''': This would require a requester to clearly explain what <br />
is expected and why something should be classified as so.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
2. '''Spelling Correction''': In spelling correction,a good requester might improve the quality of the task by clearly <br />
specifying factors like:<br />
<br />
- What type of English is being used (American, British, etc)<br />
- Look at each word independently, or make corrections based on context. <br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
This is also one of the tasks in "ten primitives common to most crowdsourcing workflow" - [Measuring Crowdsourcing <br />
Effort with Error-Time Curves. CHI 2015.]<br />
<br />
3. '''Audio Transcription''': Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed <br />
task requires instructions on how to handle garbled audio, whether to choose American or British spellings etc.<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
4.<br />
<br />
5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is same for each requester, and we simply partition the overall data set for each task into ten subsets randomly and distribute these to the requesters. Requesters post these tasks on the selected crowd sourcing platform (eg. Amazon Mechanical Turk).<br />
<br />
Each worker takes up five tasks (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual tasks. We propose to give the workers five days to submit all their tasks.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker IDs and Requester IDs range from 1 - 10, and Job Types range from 1 - 5.<br />
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers have the option to now give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Prototype Tasks in the Daemo Platform. An additional column will be present in the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:eq3.jpg]]<br />
<br />
b0 represents the base case when x1 …… x9 are all zero. This refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we are isolating the effect of the second worker. This pattern continues. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1 and r8 = 1 and other variables were zero, this would represent the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contribution of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.<br />
<br />
'''Study 2''':<br />
<br />
In study 2, we generate the regressions used in Study 1 again, but add an additional binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention included/used to compare the benefits of different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.<br />
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the y variable is due to the variables in the right hand side. Mathematically it has been shown that R-squared values increase as the number of variables in the model increase. Hence the second regression in our study will definitely have a higher R-squared value simply because it has more variables. <br />
<br />
Keeping this problem in mind, statistical packages generally calculate a adjusted R-squared which takes into consideration the number of variables used in the model. Adjusted R-squared values of the two regressions can be compared to see if requesters can explain a considerable amount of the variance in the worker output quality.<br />
<br />
==== F - Statistic Test ====<br />
<br />
We can compare the two regression models using an F - Test, and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F - Test, we consider the first regression model to be a restricted form of the second regression model, i.e, we can imagine that r2 ... r10 are all zero in the first regression. <br />
<br />
Steps in Conducting F - Test:<br />
<br />
1. We state are null (default) hypothesis as r2 ... r10 are all equal to zero. Now we would like to use the data generated in the experiment to prove this null hypothesis wrong.<br />
<br />
2. A statistics package like Stata, can take both regression models as inputs and generate a value called an F - Statistic. This value is internally computed from the R-Squared values of both regressions.<br />
<br />
3. If the F - Statistic value is above a certain threshold value, we can confidently reject the null hypotheis and show that is beneficial to include the extra requester variables in explaining worker output quality.<br />
<br />
''For a detailed background on R-squared and F - Statistic values, please refer Chapters 3 -4, "Introductory Econometrics : A Modern Approach (2012)" by Jeffrey M. Wooldridge''<br />
<br />
<br />
==== Concerns about this approach ====<br />
<br />
We have some concerns about this experiment that we're not sure about. We suspect it is possible that a worker may encounter a well authored/designed task halfway into the experiment, and as a consequence will perform very well whenever he/she encounters a task of the same type again over the ccourse of the rest of the experiment. <br />
<br />
We are not sure how valid our concerns are, if it is a real possibility it may skew the results of the experiment. We welcome any suggestions or counter-points regarding this.<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_5_BPHC_Experimental_Design_and_Statistical_Analysis_of_Task_Authorship&diff=17899WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship2016-02-14T17:08:20Z<p>Sreenihitmunakala: /* Study method */</p>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality, because requesters pick workers they've rated higher, this implies that the workers responded well to the requester's task authorship/design. And vice-versa, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks which will be undertaken by each worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task should ''not'' be such that the task answers might be subjective or open ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (eg. set of images for image annotation or collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of '500' questions. The quality of the output of each worker (number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, '''there will be a constant set of workers who will have seen the same tasks designed by different requesters'''. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, the T-test can used to study the statistical significance of each independent variable used in the model and an F-test can be used to study the collective significance of all the variables included in the model.<br />
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that can are not too subjective, require the requester to lucidly define the goal and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets with ground truths, to make the experiment easier to perform.<br />
<br />
1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what <br />
is expected and why something should be classified as so.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
2. Spelling Correction: In spelling correction,a good requester might improve the quality of the task by clearly <br />
specifying factors like:<br />
<br />
- What type of English is being used (American, British, etc)<br />
- Look at each word independently, or make corrections based on context. <br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
This is also one of the tasks in "ten primitives common to most crowdsourcing workflow" - [Measuring Crowdsourcing <br />
Effort with Error-Time Curves. CHI 2015.]<br />
<br />
3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed <br />
task requires instructions on how to handle garbled audio, whether to choose American or British spellings etc.<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
4.<br />
<br />
5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is same for each requester, and we simply partition the overall data set for each task into ten subsets randomly and distribute these to the requesters. Requesters post these tasks on the selected crowd sourcing platform (eg. Amazon Mechanical Turk).<br />
<br />
Each worker takes up five tasks (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual tasks. We propose to give the workers five days to submit all their tasks.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker IDs and Requester IDs range from 1 - 10, and Job Types range from 1 - 5.<br />
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers have the option to now give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Prototype Tasks in the Daemo Platform. An additional column will be present in the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:eq3.jpg]]<br />
<br />
b0 represents the base case when x1 …… x9 are all zero. This refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we are isolating the effect of the second worker. This pattern continues. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1 and r8 = 1 while all other variables are zero, the observation corresponds to worker 2 completing a task posted by requester 8. The coefficients b2 and a8 then measure the additional contributions of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.<br />
<br />
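Both regressions can be sketched with ordinary least squares on worker and requester dummy variables. The data below is synthetic, standing in for the experiment table, and numpy is assumed to be available:<br />

```python
# Fit Regression 1 (worker dummies only) and Regression 2 (worker +
# requester dummies) by OLS on synthetic data, and report R-squared.
import numpy as np

def dummies(labels, base):
    """One 0/1 column per category except the base category."""
    cats = sorted(set(labels))
    cats.remove(base)
    return np.array([[1.0 if l == c else 0.0 for c in cats] for l in labels])

def ols_r2(X, y):
    X = np.column_stack([np.ones(len(y)), X])   # prepend the constant term
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(0)
workers = np.repeat(np.arange(1, 11), 10)    # each worker meets each requester once
requesters = np.tile(np.arange(1, 11), 10)
y = 5.0 + 0.3 * workers + 0.5 * requesters + rng.normal(0, 1, 100)

r2_workers = ols_r2(dummies(workers, 1), y)                    # Regression 1
r2_both = ols_r2(np.hstack([dummies(workers, 1),
                            dummies(requesters, 1)]), y)       # Regression 2
```

As expected for nested models fit on the same data, r2_both can never fall below r2_workers.<br />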
'''Study 2''':<br />
<br />
In Study 2, we run the regressions from Study 1 again, but add a binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention used, to compare the benefits of different interventions.<br />
<br />
The interventions include Prototype Tasks, features to email or message requesters for clarification, and the Boomerang feature itself.<br />
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures the proportion of the variance in the dependent variable that is explained by the variables on the right-hand side. Mathematically, R-squared never decreases as more variables are added to the model. Hence the second regression in our study will have an R-squared at least as high as the first, simply because it has more variables.<br />
<br />
Keeping this problem in mind, statistical packages generally calculate an adjusted R-squared, which takes the number of variables used in the model into account. The adjusted R-squared values of the two regressions can be compared to see whether the requesters explain a considerable amount of the variance in worker output quality.<br />
<br />
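The adjustment can be sketched with the standard formula; the R-squared inputs below are illustrative numbers, not results:<br />

```python
# Adjusted R-squared penalizes extra regressors: with n observations and
# k regressors (excluding the intercept), it rescales 1 - R^2 by the
# lost degrees of freedom.

def adjusted_r2(r2, n, k):
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Illustrative: 100 observations; 9 worker dummies vs. 9 worker + 9 requester dummies.
adj1 = adjusted_r2(0.40, n=100, k=9)    # Regression 1
adj2 = adjusted_r2(0.55, n=100, k=18)   # Regression 2
```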
==== F-Statistic Test ====<br />
<br />
We can compare the two regression models using an F-test, and thereby statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F-test, we consider the first regression model to be a restricted form of the second, i.e., we imagine that the coefficients of r2 ... r10 are all zero in the first regression.<br />
<br />
Steps in conducting the F-test:<br />
<br />
1. We state our null (default) hypothesis: the coefficients of r2 ... r10 are all equal to zero. We then use the data generated in the experiment to try to reject this null hypothesis.<br />
<br />
2. A statistics package such as Stata can take both regression models as inputs and generate an F-statistic. This value is computed internally from the R-squared values of both regressions.<br />
<br />
3. If the F-statistic exceeds the critical value for the chosen significance level, we can confidently reject the null hypothesis and conclude that it is beneficial to include the extra requester variables in explaining worker output quality.<br />
<br />
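The F-statistic for this restriction can be computed directly from the two R-squared values; the numbers plugged in below are illustrative, not real results:<br />

```python
# F-statistic for testing q joint restrictions (here, the 9 requester
# dummies being zero), computed from restricted and unrestricted R-squared.

def f_statistic(r2_restricted, r2_unrestricted, n, k_unrestricted, q):
    num = (r2_unrestricted - r2_restricted) / q
    den = (1.0 - r2_unrestricted) / (n - k_unrestricted - 1)
    return num / den

# Illustrative: 100 observations, full model with 18 dummies, 9 restrictions.
F = f_statistic(0.40, 0.55, n=100, k_unrestricted=18, q=9)
# Compare F against the critical value of F(9, 81): roughly 2 at the 5% level.
```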
''For a detailed background on R-squared and F-statistic values, please refer to Chapters 3-4 of "Introductory Econometrics: A Modern Approach" (2012) by Jeffrey M. Wooldridge.''<br />
<br />
<br />
==== Concerns about this approach ====<br />
<br />
We have one concern about this experiment: a worker may encounter a well-authored task halfway into the experiment and, as a consequence, perform very well whenever he/she encounters a task of the same type over the course of the rest of the experiment (a learning effect).<br />
<br />
We are not sure how serious this concern is; if it is a real possibility, it may skew the results of the experiment. We welcome any suggestions or counter-points regarding this.<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_5_BPHC_Experimental_Design_and_Statistical_Analysis_of_Task_Authorship&diff=17893WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship2016-02-14T17:02:07Z<p>Sreenihitmunakala: /* Measures from the study */</p>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality, because requesters pick workers they've rated higher, this implies that the workers responded well to the requester's task authorship/design. And vice-versa, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks which will be undertaken by each worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task should ''not'' be such that the task answers might be subjective or open ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (eg. set of images for image annotation or collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of '500' questions. The quality of the output of each worker (number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, there will be a constant set of workers who will have seen the same tasks designed by different requesters. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, the T-test can used to study the statistical significance of each independent variable used in the model and an F-test can be used to study the collective significance of all the variables included in the model.<br />
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that can are not too subjective, require the requester to lucidly define the goal and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets with ground truths, to make the experiment easier to perform.<br />
<br />
1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what <br />
is expected and why something should be classified as so.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
2. Spelling Correction: In spelling correction,a good requester might improve the quality of the task by clearly <br />
specifying factors like:<br />
<br />
- What type of English is being used (American, British, etc)<br />
- Look at each word independently, or make corrections based on context. <br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
This is also one of the tasks in "ten primitives common to most crowdsourcing workflow" - [Measuring Crowdsourcing <br />
Effort with Error-Time Curves. CHI 2015.]<br />
<br />
3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed <br />
task requires instructions on how to handle garbled audio, whether to choose American or British spellings etc.<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
4.<br />
<br />
5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is same for each requester, and we simply partition the overall data set for each task into ten subsets randomly and distribute these to the requesters. Requesters post these tasks on the selected crowd sourcing platform (eg. Amazon Mechanical Turk).<br />
<br />
Each worker takes up five tasks (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual tasks. We propose to give the workers five days to submit all their tasks.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker IDs and Requester IDs range from 1 - 10, and Job Types range from 1 - 5.<br />
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers have the option to now give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Prototype Tasks in the Daemo Platform. An additional column will be present in the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:eq3.jpg]]<br />
<br />
b0 represents the base case when x1 …… x9 are all zero. This refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we are isolating the effect of the second worker. This pattern continues. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1 and r8 = 1 and other variables were zero, this would represent the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contribution of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.<br />
<br />
'''Study 2''':<br />
<br />
In study 2, we generate the regressions used in Study 1 again, but add an additional binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention included/used to compare the benefits of different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.<br />
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the y variable is due to the variables in the right hand side. Mathematically it has been shown that R-squared values increase as the number of variables in the model increase. Hence the second regression in our study will definitely have a higher R-squared value simply because it has more variables. <br />
<br />
Keeping this problem in mind, statistical packages generally calculate a adjusted R-squared which takes into consideration the number of variables used in the model. Adjusted R-squared values of the two regressions can be compared to see if requesters can explain a considerable amount of the variance in the worker output quality.<br />
<br />
==== F - Statistic Test ====<br />
<br />
We can compare the two regression models using an F - Test, and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F - Test, we consider the first regression model to be a restricted form of the second regression model,i.e, we can imagine that r2 .......... r10 are all zero in the first regression. <br />
<br />
Steps in Conducting F - Test:<br />
<br />
1. We state are null (default) hypothesis as r2 ... r10 are all equal to zero. Now we would like to use the data generated in the experiment to prove this null hypothesis wrong.<br />
<br />
2. A statistics package like Stata, can take both regression models as inputs and generate a value called an F - Statistic. This value is internally computed from the R-Squared values of both regressions.<br />
<br />
3. If the F - Statistic value is above a certain threshold value, we can confidently reject the null hypotheis and show that is beneficial to include the extra requester variables in explaining worker output quality.<br />
<br />
''For a detailed background on R-squared and F - Statistic values, please refer Chapters 3 -4, "Introductory Econometrics : A Modern Approach (2012)" by Jeffrey M. Wooldridge''<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=File:Eq3.jpg&diff=17892File:Eq3.jpg2016-02-14T17:01:58Z<p>Sreenihitmunakala: </p>
<hr />
<div></div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_5_BPHC_Experimental_Design_and_Statistical_Analysis_of_Task_Authorship&diff=17888WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship2016-02-14T17:00:38Z<p>Sreenihitmunakala: /* Measures from the study */</p>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality, because requesters pick workers they've rated higher, this implies that the workers responded well to the requester's task authorship/design. And vice-versa, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks which will be undertaken by each worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task should ''not'' be such that the task answers might be subjective or open ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (eg. set of images for image annotation or collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of '500' questions. The quality of the output of each worker (number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, there will be a constant set of workers who will have seen the same tasks designed by different requesters. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, the T-test can used to study the statistical significance of each independent variable used in the model and an F-test can be used to study the collective significance of all the variables included in the model.<br />
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that can are not too subjective, require the requester to lucidly define the goal and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets with ground truths, to make the experiment easier to perform.<br />
<br />
1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what <br />
is expected and why something should be classified as so.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
2. Spelling Correction: In spelling correction,a good requester might improve the quality of the task by clearly <br />
specifying factors like:<br />
<br />
- What type of English is being used (American, British, etc)<br />
- Look at each word independently, or make corrections based on context. <br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
This is also one of the tasks in "ten primitives common to most crowdsourcing workflow" - [Measuring Crowdsourcing <br />
Effort with Error-Time Curves. CHI 2015.]<br />
<br />
3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed <br />
task requires instructions on how to handle garbled audio, whether to choose American or British spellings etc.<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
4.<br />
<br />
5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is same for each requester, and we simply partition the overall data set for each task into ten subsets randomly and distribute these to the requesters. Requesters post these tasks on the selected crowd sourcing platform (eg. Amazon Mechanical Turk).<br />
<br />
Each worker takes up five tasks (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual tasks. We propose to give the workers five days to submit all their tasks.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker IDs and Requester IDs range from 1 - 10, and Job Types range from 1 - 5.<br />
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers have the option to now give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Prototype Tasks in the Daemo Platform. An additional column will be present in the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:Oldeq.jpg]]<br />
<br />
b0 represents the base case when x1 …… x9 are all zero. This refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we are isolating the effect of the second worker. This pattern continues. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1 and r8 = 1 and other variables were zero, this would represent the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contribution of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.<br />
<br />
'''Study 2''':<br />
<br />
In study 2, we generate the regressions used in Study 1 again, but add an additional binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention included/used to compare the benefits of different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.<br />
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the y variable is due to the variables in the right hand side. Mathematically it has been shown that R-squared values increase as the number of variables in the model increase. Hence the second regression in our study will definitely have a higher R-squared value simply because it has more variables. <br />
<br />
Keeping this problem in mind, statistical packages generally calculate a adjusted R-squared which takes into consideration the number of variables used in the model. Adjusted R-squared values of the two regressions can be compared to see if requesters can explain a considerable amount of the variance in the worker output quality.<br />
<br />
==== F - Statistic Test ====<br />
<br />
We can compare the two regression models using an F - Test, and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F - Test, we consider the first regression model to be a restricted form of the second regression model,i.e, we can imagine that r2 .......... r10 are all zero in the first regression. <br />
<br />
Steps in Conducting F - Test:<br />
<br />
1. We state are null (default) hypothesis as r2 ... r10 are all equal to zero. Now we would like to use the data generated in the experiment to prove this null hypothesis wrong.<br />
<br />
2. A statistics package like Stata, can take both regression models as inputs and generate a value called an F - Statistic. This value is internally computed from the R-Squared values of both regressions.<br />
<br />
3. If the F - Statistic value is above a certain threshold value, we can confidently reject the null hypotheis and show that is beneficial to include the extra requester variables in explaining worker output quality.<br />
<br />
''For a detailed background on R-squared and F - Statistic values, please refer Chapters 3 -4, "Introductory Econometrics : A Modern Approach (2012)" by Jeffrey M. Wooldridge''<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=File:Oldeq.jpg&diff=17887File:Oldeq.jpg2016-02-14T17:00:20Z<p>Sreenihitmunakala: </p>
<hr />
<div></div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_5_BPHC_Experimental_Design_and_Statistical_Analysis_of_Task_Authorship&diff=17885WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship2016-02-14T16:58:32Z<p>Sreenihitmunakala: /* Measures from the study */</p>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality, because requesters pick workers they've rated higher, this implies that the workers responded well to the requester's task authorship/design. And vice-versa, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks which will be undertaken by each worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task should ''not'' be such that the task answers might be subjective or open ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (eg. set of images for image annotation or collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of '500' questions. The quality of the output of each worker (number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, there will be a constant set of workers who will have seen the same tasks designed by different requesters. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, the T-test can used to study the statistical significance of each independent variable used in the model and an F-test can be used to study the collective significance of all the variables included in the model.<br />
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that can are not too subjective, require the requester to lucidly define the goal and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets with ground truths, to make the experiment easier to perform.<br />
<br />
1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what <br />
is expected and why something should be classified as so.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
2. Spelling Correction: In spelling correction,a good requester might improve the quality of the task by clearly <br />
specifying factors like:<br />
<br />
- What type of English is being used (American, British, etc)<br />
- Look at each word independently, or make corrections based on context. <br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
This is also one of the tasks in "ten primitives common to most crowdsourcing workflow" - [Measuring Crowdsourcing <br />
Effort with Error-Time Curves. CHI 2015.]<br />
<br />
3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed <br />
task requires instructions on how to handle garbled audio, whether to choose American or British spellings etc.<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
4.<br />
<br />
5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is same for each requester, and we simply partition the overall data set for each task into ten subsets randomly and distribute these to the requesters. Requesters post these tasks on the selected crowd sourcing platform (eg. Amazon Mechanical Turk).<br />
<br />
Each worker takes up five tasks (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual tasks. We propose to give the workers five days to submit all their tasks.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker IDs and Requester IDs range from 1 to 10, and Job Types range from 1 to 5.<br />
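As a concrete illustration, here is a minimal Python sketch (with hypothetical rows) of how each row of this table could be one-hot encoded into the worker and requester dummy variables used in the regressions of the Measures section, with worker 1 and requester 1 as the base categories:<br />

```python
# Hypothetical rows of the results table:
# (worker_id, requester_id, job_type, correct_answers)
rows = [
    (1, 1, 1, 8),
    (2, 8, 3, 6),
    (10, 10, 5, 9),
]

def encode(row, n_workers=10, n_requesters=10):
    """One-hot encode a row into worker and requester dummies.

    Worker 1 and requester 1 are the base categories, so they get no
    dummy of their own: a row for worker 1 / requester 1 encodes to all
    zeros and is absorbed by the regression constant.
    """
    worker, requester, _job_type, score = row
    w = [1 if worker == i else 0 for i in range(2, n_workers + 1)]
    r = [1 if requester == j else 0 for j in range(2, n_requesters + 1)]
    return w + r, score

features, y = encode(rows[1])  # worker 2, requester 8
# features: 9 worker dummies followed by 9 requester dummies;
# exactly two of the 18 entries are set for this row
```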
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers have the option to now give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Prototype Tasks in the Daemo Platform. An additional column will be present in the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:Eq1.jpg]]<br />
<br />
b0 represents the base case, when x1 ... x9 are all zero; this corresponds to the first worker. When x1 = 1 and x2 ... x9 are zero, we isolate the effect of the second worker, and so on. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
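The equation itself is only available as an image above; based on the surrounding description, a plausible LaTeX form (a sketch, using the text's own notation, with y as the number of correct responses) is:<br />

```latex
% Sketch of Regression 1, reconstructed from the description in the text:
% y = correct responses; x_1 ... x_9 = worker dummy variables
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_9 x_9 + \varepsilon
```

where x_i = 1 when the submission comes from worker i+1, and 0 otherwise.<br />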
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1 and r8 = 1 and all other variables are zero, this represents the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case measure the additional contributions of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.<br />
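Again, since the equation is only an image, a plausible LaTeX sketch consistent with the description (constant b1, worker dummies w2 ... w10 with coefficients b2 ... b10, requester dummies r2 ... r10 with coefficients a2 ... a10) is:<br />

```latex
% Sketch of Regression 2, reconstructed from the description in the text
y = b_1 + \sum_{i=2}^{10} b_i w_i + \sum_{j=2}^{10} a_j r_j + \varepsilon
```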
<br />
'''Study 2''':<br />
<br />
In study 2, we run the regressions from Study 1 again, but add a binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention used, to compare the benefits of different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification, and the Boomerang feature itself.<br />
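A sketch of what the Study 2 regression might look like, extending Regression 2 with a single intervention indicator (d and its coefficient c are hypothetical names, not from the original text):<br />

```latex
% Sketch: Study 2 adds an intervention indicator d (1 if a design
% intervention was used, 0 otherwise) with hypothetical coefficient c
y = b_1 + \sum_{i=2}^{10} b_i w_i + \sum_{j=2}^{10} a_j r_j + c\,d + \varepsilon
```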
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures the proportion of the variance in the dependent variable explained by the variables on the right-hand side. Mathematically, the R-squared value can never decrease as variables are added to a model. Hence the second regression in our study will have an R-squared at least as high as the first, simply because it has more variables. <br />
<br />
Keeping this problem in mind, statistical packages generally report an adjusted R-squared, which takes into account the number of variables used in the model. The adjusted R-squared values of the two regressions can be compared to see whether the requesters explain a considerable amount of the variance in worker output quality.<br />
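For reference, the standard form of the adjusted R-squared (as in Wooldridge's textbook cited below) is:<br />

```latex
% Adjusted R-squared: n = number of observations,
% k = number of regressors excluding the intercept
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
```

The penalty factor (n-1)/(n-k-1) grows with k, so adding a variable raises the adjusted R-squared only if it improves the fit by more than chance would predict.<br />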
<br />
==== F-Statistic Test ====<br />
<br />
We can compare the two regression models using an F-test, and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F-test, we treat the first regression model as a restricted form of the second, i.e., we imagine that the coefficients of r2 ... r10 are all zero in the first regression. <br />
<br />
Steps in conducting the F-test:<br />
<br />
1. We state our null (default) hypothesis: the coefficients of r2 ... r10 are all equal to zero. We then use the data generated in the experiment to try to reject this null hypothesis.<br />
<br />
2. A statistics package like Stata can take both regression models as inputs and compute a value called the F-statistic. This value is computed internally from the R-squared values of the two regressions.<br />
<br />
3. If the F-statistic is above a certain critical value, we can confidently reject the null hypothesis and conclude that it is beneficial to include the extra requester variables in explaining worker output quality.<br />
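The F-statistic that such a package reports can also be computed directly from the two R-squared values. A minimal sketch in Python (the R-squared values below are hypothetical; the sample size assumes the 500 observations implied by 10 workers x 10 requesters x 5 job types):<br />

```python
def f_statistic(r2_restricted, r2_unrestricted, q, n, k_unrestricted):
    """F-statistic for testing q exclusion restrictions between two
    nested OLS models, computed from their R-squared values.

    q:              number of restrictions (extra variables in the
                    unrestricted model)
    n:              number of observations
    k_unrestricted: number of regressors (excluding the intercept) in
                    the unrestricted model
    """
    numerator = (r2_unrestricted - r2_restricted) / q
    denominator = (1.0 - r2_unrestricted) / (n - k_unrestricted - 1)
    return numerator / denominator

# Hypothetical example: 9 worker dummies in the restricted model,
# 9 extra requester dummies in the unrestricted model (k = 18, q = 9).
f = f_statistic(r2_restricted=0.40, r2_unrestricted=0.55,
                q=9, n=500, k_unrestricted=18)
# f would then be compared against the critical value of the
# F(9, 481) distribution at the chosen significance level.
```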
<br />
''For a detailed background on R-squared and F-statistic values, please refer to Chapters 3-4 of "Introductory Econometrics: A Modern Approach" (2012) by Jeffrey M. Wooldridge.''<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>
<br />
Each requester is provided with the dataset to be used for the task (eg. set of images for image annotation or collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of '500' questions. The quality of the output of each worker (number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, there will be a constant set of workers who will have seen the same tasks designed by different requesters. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, the T-test can used to study the statistical significance of each independent variable used in the model and an F-test can be used to study the collective significance of all the variables included in the model.<br />
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that can are not too subjective, require the requester to lucidly define the goal and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets with ground truths, to make the experiment easier to perform.<br />
<br />
1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what <br />
is expected and why something should be classified as so.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
2. Spelling Correction: In spelling correction,a good requester might improve the quality of the task by clearly <br />
specifying factors like:<br />
<br />
- What type of English is being used (American, British, etc)<br />
- Look at each word independently, or make corrections based on context. <br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
This is also one of the tasks in "ten primitives common to most crowdsourcing workflow" - [Measuring Crowdsourcing <br />
Effort with Error-Time Curves. CHI 2015.]<br />
<br />
3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed <br />
task requires instructions on how to handle garbled audio, whether to choose American or British spellings etc.<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
4.<br />
<br />
5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is same for each requester, and we simply partition the overall data set for each task into ten subsets randomly and distribute these to the requesters. Requesters post these tasks on the selected crowd sourcing platform (eg. Amazon Mechanical Turk).<br />
<br />
Each worker takes up five tasks (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual tasks. We propose to give the workers five days to submit all their tasks.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker IDs and Requester IDs range from 1 - 10, and Job Types range from 1 - 5.<br />
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers have the option to now give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Prototype Tasks in the Daemo Platform. An additional column will be present in the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:Equations.PNG]]<br />
<br />
b0 represents the base case when x1 …… x9 are all zero. This refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we are isolating the effect of the second worker. This pattern continues. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1 and r8 = 1 and other variables were zero, this would represent the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contribution of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.<br />
<br />
'''Study 2''':<br />
<br />
In study 2, we generate the regressions used in Study 1 again, but add an additional binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention included/used to compare the benefits of different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.<br />
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the y variable is due to the variables in the right hand side. Mathematically it has been shown that R-squared values increase as the number of variables in the model increase. Hence the second regression in our study will definitely have a higher R-squared value simply because it has more variables. <br />
<br />
Keeping this problem in mind, statistical packages generally calculate a adjusted R-squared which takes into consideration the number of variables used in the model. Adjusted R-squared values of the two regressions can be compared to see if requesters can explain a considerable amount of the variance in the worker output quality.<br />
<br />
==== F - Statistic Test ====<br />
<br />
We can compare the two regression models using an F - Test, and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F - Test, we consider the first regression model to be a restricted form of the second regression model,i.e, we can imagine that r2 .......... r10 are all zero in the first regression. <br />
<br />
Steps in Conducting F - Test:<br />
<br />
1. We state are null (default) hypothesis as r2 ...... r10 are all equal to zero. Now we would like to use the data generated in the experiment to prove this null hypothesis wrong.<br />
<br />
2. A statistics package like Stata, can take both regression models as inputs and generate a value called an F - Statistic. This value is internally computed from the R-Squared values of both regressions.<br />
<br />
3. If the F - Statistic value is above a certain threshold value, we can confidently reject the null hypotheis and show that is beneficial to include the extra requester variables in explaining worker output quality.<br />
<br />
For a detailed background on R-squared and F - Statistic values, please refer Chapters 3 -4, "Introductory Econometrics : A Modern Approach (2012)" by Jeffrey M. Wooldridge<br />
<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_5_BPHC_Experimental_Design_and_Statistical_Analysis_of_Task_Authorship&diff=17870WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship2016-02-14T16:47:48Z<p>Sreenihitmunakala: /* Experiment Design */</p>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality, because requesters pick workers they've rated higher, this implies that the workers responded well to the requester's task authorship/design. And vice-versa, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks which will be undertaken by each worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task should ''not'' be such that the task answers might be subjective or open ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (eg. set of images for image annotation or collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of '500' questions. The quality of the output of each worker (number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, there will be a constant set of workers who will have seen the same tasks designed by different requesters. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, the T-test can used to study the statistical significance of each independent variable used in the model and an F-test can be used to study the collective significance of all the variables included in the model.<br />
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that can are not too subjective, require the requester to lucidly define the goal and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets with ground truths, to make the experiment easier to perform.<br />
<br />
1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what <br />
is expected and why something should be classified as so.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
2. Spelling Correction: In spelling correction,a good requester might improve the quality of the task by clearly <br />
specifying factors like:<br />
<br />
- What type of English is being used (American, British, etc)<br />
- Look at each word independently, or make corrections based on context. <br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
This is also one of the tasks in "ten primitives common to most crowdsourcing workflow" - [Measuring Crowdsourcing <br />
Effort with Error-Time Curves. CHI 2015.]<br />
<br />
3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed <br />
task requires instructions on how to handle garbled audio, whether to choose American or British spellings etc.<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
4.<br />
<br />
5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is same for each requester, and we simply partition the overall data set for each task into ten subsets randomly and distribute these to the requesters. Requesters post these tasks on the selected crowd sourcing platform (eg. Amazon Mechanical Turk).<br />
<br />
Each worker takes up five tasks (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual tasks. We propose to give the workers five days to submit all their tasks.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker Ids and Requester Ids range from 1 - 10, and Job Types range from 1 - 5.<br />
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers have the option to now give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Protosype Tasks in the Daemo Platform. An additional column would be present in the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:Equations.PNG]]<br />
<br />
b0 represents the base case when x1 …… x9 are all zero. This refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we are isolating the effect of the second worker. This pattern continues. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1 and r8 = 1 and other variables were zero, this would represent the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contribution of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.<br />
<br />
'''Study 2''':<br />
<br />
In study 2, we generate the regressions used in Study 1 again, but add an additional binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention included/used to compare the benefits of different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.<br />
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the y variable is due to the variables in the right hand side. Mathematically it has been shown that R-squared values increase as the number of variables in the model increase. Hence the second regression in our study will definitely have a higher R-squared value simply because it has more variables. <br />
<br />
Keeping this problem in mind, statistical packages generally calculate a adjusted R-squared which takes into consideration the number of variables used in the model. Adjusted R-squared values of the two regressions can be compared to see if requesters can explain a considerable amount of the variance in the worker output quality.<br />
<br />
==== F - Statistic Test ====<br />
<br />
We can compare the two regression models using an F - Test, and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F - Test, we consider the first regression model to be a restricted form of the second regression model,i.e, we can imagine that r2 .......... r10 are all zero in the first regression. <br />
<br />
Steps in Conducting F - Test:<br />
<br />
1. We state are null (default) hypothesis as r2 ...... r10 are all equal to zero. Now we would like to use the data generated in the experiment to prove this null hypothesis wrong.<br />
<br />
2. A statistics package like Stata, can take both regression models as inputs and generate a value called an F - Statistic. This value is internally computed from the R-Squared values of both regressions.<br />
<br />
3. If the F - Statistic value is above a certain threshold value, we can confidently reject the null hypotheis and show that is beneficial to include the extra requester variables in explaining worker output quality.<br />
<br />
For a detailed background on R-squared and F - Statistic values, please refer Chapters 3 -4, "Introductory Econometrics : A Modern Approach (2012)" by Jeffrey M. Wooldridge<br />
<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_5_BPHC_Experimental_Design_and_Statistical_Analysis_of_Task_Authorship&diff=17866WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship2016-02-14T16:45:49Z<p>Sreenihitmunakala: /* Method specifics and details */</p>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality, because requesters pick workers they've rated higher, this implies that the workers responded well to the requester's task authorship/design. And vice-versa, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks which will be undertaken by each worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task should ''not'' be such that the task answers might be subjective or open ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (eg. set of images for image annotation or collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of '500' questions. The quality of the output of each worker (number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, there will be a constant set of workers who will have seen the same tasks designed by different requesters. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, the T-test can used to study the statistical significance of each independent variable used in the model and an F-test can be used to study the collective significance of all the variables included in the model.<br />
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that can are not too subjective, require the requester to lucidly define the goal and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets with ground truths, to make the experiment easier to perform.<br />
<br />
1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what <br />
is expected and why something should be classified as so.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
2. Spelling Correction: In spelling correction,a good requester might improve the quality of the task by clearly <br />
specifying factors like:<br />
<br />
- What type of English is being used (American, British, etc)<br />
- Look at each word independently, or make corrections based on context. <br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
This is also one of the tasks in "ten primitives common to most crowdsourcing workflow" - [Measuring Crowdsourcing <br />
Effort with Error-Time Curves. CHI 2015.]<br />
<br />
3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed <br />
task requires instructions on how to handle garbled audio, whether to choose American or British spellings etc.<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
4.<br />
<br />
5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is the same for every requester; we simply partition the overall data set for each task into ten subsets and distribute these to the requesters. Requesters then post these HITs on the selected crowdsourcing platform (e.g. Amazon Mechanical Turk).<br />
<br />
Each worker takes up five HITs (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual HITs. We propose to give the workers five days to submit all their HITs.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker IDs and Requester IDs range from 1 to 10, and Job Types range from 1 to 5.<br />
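As a rough sketch, the tabulation step might look like the following. The function names, the column order, and the toy data are our own illustration, not a fixed schema from the study:

```python
# Hypothetical sketch: score each (worker, requester, job type) submission
# against the ground truth and build rows of the results table.

def score_submission(answers, ground_truth):
    """Count how many of a worker's answers match the ground truth."""
    return sum(1 for a, g in zip(answers, ground_truth) if a == g)

def tabulate(submissions, ground_truths):
    """submissions: {(worker_id, requester_id, job_type): [answers]}
    ground_truths: {(requester_id, job_type): [correct answers]}
    Returns rows of the form (worker_id, requester_id, job_type, n_correct)."""
    rows = []
    for (w, r, j), answers in sorted(submissions.items()):
        rows.append((w, r, j, score_submission(answers, ground_truths[(r, j)])))
    return rows

# Toy example: one worker, one requester, one job type, 10 questions.
subs = {(1, 1, 1): ["pos", "neg", "pos", "pos", "neg",
                    "pos", "neg", "neg", "pos", "pos"]}
truth = {(1, 1): ["pos", "neg", "neg", "pos", "neg",
                  "pos", "neg", "pos", "pos", "pos"]}
print(tabulate(subs, truth))  # one row: worker 1 got 8 of 10 correct
```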
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers now have the option to give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Prototype Tasks in the Daemo platform. An additional column would be added to the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:Equations.PNG]]<br />
<br />
b0 represents the base case in which x1, ..., x9 are all zero; this refers to the first worker. When x1 = 1 and x2, ..., x9 are zero, we isolate the effect of the second worker, and so on. The coefficients of x1, ..., x9 therefore represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
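A minimal sketch of this dummy-variable regression, using NumPy's least-squares solver. The scores here are simulated with an arbitrary per-worker shift purely to illustrate the coding of the design matrix; they are not real data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, n_hits = 10, 50                  # 10 workers, 50 HIT scores each
worker = np.repeat(np.arange(n_workers), n_hits)

# Simulated scores with a per-worker shift (illustrative only).
true_effect = rng.normal(0, 2, n_workers)
y = 30 + true_effect[worker] + rng.normal(0, 1, worker.size)

# Design matrix: intercept b0 (worker 1 is the base case) plus
# dummies x1..x9 for workers 2..10.
X = np.column_stack([np.ones(worker.size)] +
                    [(worker == k).astype(float) for k in range(1, n_workers)])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef[0] is b0, worker 1's mean score; coef[1:] are each remaining
# worker's performance relative to worker 1.
print(np.round(coef, 2))
```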
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1, r8 = 1, and all other variables are zero, the observation corresponds to worker 2 working on a task posted by requester 8. The coefficients b2 and a8 then measure the additional contributions of worker 2 and requester 8 to task output, relative to the base pair of worker 1 and requester 1.<br />
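The second design matrix can be sketched the same way, adding requester dummies alongside the worker dummies. Again the effects are simulated for illustration; only the structure of the matrix reflects the study design:

```python
import numpy as np

rng = np.random.default_rng(1)
n_workers = n_requesters = 10

# One observation per (worker, requester) pair, as in the study design.
w, r = np.meshgrid(np.arange(n_workers), np.arange(n_requesters), indexing="ij")
w, r = w.ravel(), r.ravel()

worker_eff = rng.normal(0, 2, n_workers)
req_eff = rng.normal(0, 3, n_requesters)    # requester effects to recover
y = 30 + worker_eff[w] + req_eff[r] + rng.normal(0, 1, w.size)

# Constant b1 = base worker 1 with base requester 1;
# dummies w2..w10 for workers, r2..r10 for requesters.
X = np.column_stack(
    [np.ones(w.size)]
    + [(w == k).astype(float) for k in range(1, n_workers)]
    + [(r == k).astype(float) for k in range(1, n_requesters)]
)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a = coef[10:]   # requester coefficients a2..a10, relative to requester 1
print(np.round(a, 2))
```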
<br />
'''Study 2''':<br />
<br />
In Study 2, we run the regressions used in Study 1 again, but add a binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention used, to compare the benefits of the different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.<br />
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the y variable is explained by the variables on the right-hand side. Mathematically, the R-squared value never decreases as the number of variables in the model increases. Hence the second regression in our study will have an R-squared value at least as high as the first, simply because it has more variables. <br />
<br />
With this problem in mind, statistical packages generally calculate an adjusted R-squared, which takes the number of variables used in the model into consideration. The adjusted R-squared values of the two regressions can be compared to see whether requesters explain a considerable amount of the variance in worker output quality.<br />
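The standard adjustment is 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of regressors. A small sketch with made-up R-squared values shows how a higher raw R-squared can still yield a lower adjusted value once the extra variables are penalized:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k regressors
    (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Illustrative numbers only: regression 1 with 9 worker dummies vs.
# regression 2 with 18 worker + requester dummies, on 100 observations.
print(adjusted_r2(0.50, 100, 9))
print(adjusted_r2(0.52, 100, 18))  # raw R2 is higher, adjusted R2 is lower
```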
<br />
==== F - Statistic Test ====<br />
<br />
We can compare the two regression models using an F-test, and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F-test, we consider the first regression model to be a restricted form of the second regression model, i.e., we can imagine that r2, ..., r10 are all zero in the first regression. <br />
<br />
Steps in Conducting F - Test:<br />
<br />
1. We state our null (default) hypothesis as: r2, ..., r10 are all equal to zero. We would then like to use the data generated in the experiment to reject this null hypothesis.<br />
<br />
2. A statistics package like Stata can take both regression models as inputs and generate a value called an F-statistic. This value is computed internally from the R-squared values of both regressions.<br />
<br />
3. If the F-statistic is above a certain threshold (the critical value at the chosen significance level), we can confidently reject the null hypothesis and show that it is beneficial to include the extra requester variables in explaining worker output quality.<br />
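The steps above can be sketched directly from the two R-squared values. The numbers below are invented for illustration, and the critical value is taken from SciPy's F distribution rather than a table:

```python
from scipy.stats import f

def f_statistic(r2_restricted, r2_unrestricted, n, k_unrestricted, q):
    """F-statistic for testing q exclusion restrictions, computed from
    the R-squared values of the restricted and unrestricted models."""
    num = (r2_unrestricted - r2_restricted) / q
    den = (1 - r2_unrestricted) / (n - k_unrestricted - 1)
    return num / den

# Illustrative numbers: n = 500 observations; regression 1 has 9 worker
# dummies, regression 2 adds q = 9 requester dummies (18 regressors total).
F = f_statistic(0.40, 0.55, n=500, k_unrestricted=18, q=9)
critical = f.ppf(0.95, dfn=9, dfd=500 - 18 - 1)  # 5% significance level
print(F > critical)  # reject H0 that the requester dummies are jointly zero?
```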
<br />
For a detailed background on R-squared and F-statistic values, please refer to Chapters 3-4 of "Introductory Econometrics: A Modern Approach" (2012) by Jeffrey M. Wooldridge.<br />
<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_5_BPHC_Experimental_Design_and_Statistical_Analysis_of_Task_Authorship&diff=17858WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship2016-02-14T16:43:56Z<p>Sreenihitmunakala: /* Method specifics and details */</p>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality, because requesters pick workers they've rated higher, this implies that the workers responded well to the requester's task authorship/design. And vice-versa, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks which will be undertaken by each worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task should ''not'' be such that the task answers might be subjective or open ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (eg. set of images for image annotation or collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of '500' questions. The quality of the output of each worker (number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, there will be a constant set of workers who will have seen the same tasks designed by different requesters. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, the T-test can used to study the statistical significance of each independent variable used in the model and an F-test can be used to study the collective significance of all the variables included in the model.<br />
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that can are not too subjective, require the requester to lucidly define the goal and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets with ground truths, to make the experiment easier to perform.<br />
<br />
1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what is expected and why <br />
something should be classified as so.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
2. Spelling Correction: In spelling correction,a good requester might improve the quality of the task by clearly specifying factors like:<br />
<br />
- What type of English is being used (American, British, etc)<br />
- Look at each word independently, or make corrections based on context. <br />
- An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
<br />
This is also one of the tasks in "ten primitives common to most crowdsourcing workflow" - [Measuring Crowdsourcing Effort with Error-Time <br />
Curves. CHI 2015.]<br />
<br />
3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed <br />
task requires instructions on how to handle garbled audio, whether to choose American or British spellings etc.<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
4.<br />
<br />
5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is same for each requester, and we simply partition the overall data set for each task into ten subsets and distribute these to the requesters.Requesters post these HITs on the selected crowd sourcing platform (ex. Amazon Mechanical Turk).<br />
<br />
Each worker takes up five HITs (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual HITs. We propose to give the workers five days to submit all their HITs.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker Ids and Requester Ids range from 1 - 10, and Job Types range from 1 - 5.<br />
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers have the option to now give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Protosype Tasks in the Daemo Platform. An additional column would be present in the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:Equations.PNG]]<br />
<br />
b0 represents the base case when x1 …… x9 are all zero. This refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we are isolating the effect of the second worker. This pattern continues. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1 and r8 = 1 and other variables were zero, this would represent the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contribution of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.<br />
<br />
'''Study 2''':<br />
<br />
In study 2, we generate the regressions used in Study 1 again, but add an additional binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention included/used to compare the benefits of different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.<br />
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the y variable is due to the variables in the right hand side. Mathematically it has been shown that R-squared values increase as the number of variables in the model increase. Hence the second regression in our study will definitely have a higher R-squared value simply because it has more variables. <br />
<br />
Keeping this problem in mind, statistical packages generally calculate a adjusted R-squared which takes into consideration the number of variables used in the model. Adjusted R-squared values of the two regressions can be compared to see if requesters can explain a considerable amount of the variance in the worker output quality.<br />
<br />
==== F - Statistic Test ====<br />
<br />
We can compare the two regression models using an F - Test, and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F - Test, we consider the first regression model to be a restricted form of the second regression model,i.e, we can imagine that r2 .......... r10 are all zero in the first regression. <br />
<br />
Steps in Conducting F - Test:<br />
<br />
1. We state are null (default) hypothesis as r2 ...... r10 are all equal to zero. Now we would like to use the data generated in the experiment to prove this null hypothesis wrong.<br />
<br />
2. A statistics package like Stata, can take both regression models as inputs and generate a value called an F - Statistic. This value is internally computed from the R-Squared values of both regressions.<br />
<br />
3. If the F - Statistic value is above a certain threshold value, we can confidently reject the null hypotheis and show that is beneficial to include the extra requester variables in explaining worker output quality.<br />
<br />
For a detailed background on R-squared and F - Statistic values, please refer Chapters 3 -4, "Introductory Econometrics : A Modern Approach (2012)" by Jeffrey M. Wooldridge<br />
<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_5_BPHC_Experimental_Design_and_Statistical_Analysis_of_Task_Authorship&diff=17854WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship2016-02-14T16:43:13Z<p>Sreenihitmunakala: /* Method specifics and details */</p>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality, because requesters pick workers they've rated higher, this implies that the workers responded well to the requester's task authorship/design. And vice-versa, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks which will be undertaken by each worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task should ''not'' be such that the task answers might be subjective or open ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (eg. set of images for image annotation or collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of '500' questions. The quality of the output of each worker (number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, there will be a constant set of workers who will have seen the same tasks designed by different requesters. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, the T-test can used to study the statistical significance of each independent variable used in the model and an F-test can be used to study the collective significance of all the variables included in the model.<br />
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that can are not too subjective, require the requester to lucidly define the goal and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets with ground truths, to make the experiment easier to perform.<br />
<br />
1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what is expected and why <br />
something should be classified as so.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
2. Spelling Correction: In spelling correction,a good requester might improve the quality of the task by clearly specifying factors like:<br />
<br />
- What type of English is being used (American, British, etc)<br />
- Look at each word independently, or make corrections based on context. <br />
<br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
<br />
This is also one of the tasks in "ten primitives common to most crowdsourcing workflow" - [Measuring Crowdsourcing Effort with Error-Time <br />
Curves. CHI 2015.]<br />
<br />
3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed <br />
task requires instructions on how to handle garbled audio, whether to choose American or British spellings etc.<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
4.<br />
<br />
5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is same for each requester, and we simply partition the overall data set for each task into ten subsets and distribute these to the requesters.Requesters post these HITs on the selected crowd sourcing platform (ex. Amazon Mechanical Turk).<br />
<br />
Each worker takes up five HITs (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual HITs. We propose to give the workers five days to submit all their HITs.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker Ids and Requester Ids range from 1 - 10, and Job Types range from 1 - 5.<br />
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers have the option to now give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Protosype Tasks in the Daemo Platform. An additional column would be present in the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:Equations.PNG]]<br />
<br />
b0 represents the base case when x1 …… x9 are all zero. This refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we are isolating the effect of the second worker. This pattern continues. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1 and r8 = 1 and other variables were zero, this would represent the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contribution of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.<br />
<br />
'''Study 2''':<br />
<br />
In study 2, we generate the regressions used in Study 1 again, but add an additional binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention included/used to compare the benefits of different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.<br />
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the y variable is due to the variables in the right hand side. Mathematically it has been shown that R-squared values increase as the number of variables in the model increase. Hence the second regression in our study will definitely have a higher R-squared value simply because it has more variables. <br />
<br />
Keeping this problem in mind, statistical packages generally calculate a adjusted R-squared which takes into consideration the number of variables used in the model. Adjusted R-squared values of the two regressions can be compared to see if requesters can explain a considerable amount of the variance in the worker output quality.<br />
<br />
==== F - Statistic Test ====<br />
<br />
We can compare the two regression models using an F - Test, and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F - Test, we consider the first regression model to be a restricted form of the second regression model,i.e, we can imagine that r2 .......... r10 are all zero in the first regression. <br />
<br />
Steps in Conducting F - Test:<br />
<br />
1. We state are null (default) hypothesis as r2 ...... r10 are all equal to zero. Now we would like to use the data generated in the experiment to prove this null hypothesis wrong.<br />
<br />
2. A statistics package like Stata, can take both regression models as inputs and generate a value called an F - Statistic. This value is internally computed from the R-Squared values of both regressions.<br />
<br />
3. If the F - Statistic value is above a certain threshold value, we can confidently reject the null hypotheis and show that is beneficial to include the extra requester variables in explaining worker output quality.<br />
<br />
For a detailed background on R-squared and F - Statistic values, please refer Chapters 3 -4, "Introductory Econometrics : A Modern Approach (2012)" by Jeffrey M. Wooldridge<br />
<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_5_BPHC_Experimental_Design_and_Statistical_Analysis_of_Task_Authorship&diff=17851WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship2016-02-14T16:42:28Z<p>Sreenihitmunakala: /* Method specifics and details */</p>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of the requester's input in worker output quality. We describe the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality: requesters pick workers they've rated higher, which implies those workers responded well to the requester's task authorship/design. The reverse also holds, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks, each of which will be undertaken by every worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task answers should ''not'' be subjective or open-ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (e.g., a set of images for image annotation or a collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details beyond that, to avoid handing the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus, over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of 500 questions. The quality of the output of each worker (the number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, there will be a constant set of workers who will have seen the same tasks designed by different requesters. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, a t-test can be used to study the statistical significance of each independent variable used in the model, and an F-test can be used to study the collective significance of all the variables included in the model.<br />
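The regression described above can be sketched in Python with statsmodels; the simulated scores and variable names here are illustrative assumptions, not the study's actual data:<br />

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulated output quality for each (worker, requester) pair,
# scored out of 10 questions per task.
df = pd.DataFrame({
    "worker": np.repeat(np.arange(1, 11), 10),
    "requester": np.tile(np.arange(1, 11), 10),
    "correct": rng.integers(0, 11, size=100),
})

# Linear regression of task output on categorical worker/requester variables.
model = smf.ols("correct ~ C(worker) + C(requester)", data=df).fit()

print(model.rsquared)  # coefficient of determination
print(model.tvalues)   # t-statistics for each dummy variable
```

`C(...)` tells statsmodels to expand each ID column into categorical dummy variables, with the first level absorbed into the intercept.<br />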
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that are not too subjective, require the requester to lucidly define the goal, and have desired outputs that can easily be verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets and ground truths, to make the experiment easier to perform.<br />
<br />
1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what is expected and why something should be classified a particular way.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
2. Spelling Correction: In spelling correction, a good requester might improve the quality of the task by clearly specifying factors like:<br />
<br />
- What type of English is being used (American, British, etc.)<br />
- Whether to look at each word independently, or make corrections based on context.<br />
<br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
<br />
This is also one of the tasks in the "ten primitives common to most crowdsourcing workflows" - [Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015.]<br />
<br />
3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well-designed task requires instructions on how to handle garbled audio, whether to choose American or British spellings, etc. There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
4.<br />
<br />
5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as it is currently one of the most widely used platforms. We ask the workers not to email or contact the requester in any way, since we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is the same for every requester: we simply partition the overall data set for each task into ten subsets and distribute these to the requesters. Requesters then post these HITs on the selected crowdsourcing platform (e.g. Amazon Mechanical Turk).<br />
<br />
Each worker takes up five HITs (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual HITs. We propose to give the workers five days to submit all their HITs.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker Ids and Requester Ids range from 1 to 10, and Job Types range from 1 to 5.<br />
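As a hedged sketch (the field names here are our own, not fixed by the study design), the tabulated results above could be represented as simple records, one per (worker, requester, job type) cell, ready for later regression analysis:<br />

```python
# Hypothetical record layout for the results table: one row per
# (worker, requester, job type) cell, with the count of correct answers.
from collections import namedtuple

Result = namedtuple("Result", ["worker_id", "requester_id", "job_type", "correct"])

# Example rows: worker 1 answered 8 of 10 questions correctly on requester 3's
# job of type 2, and so on. Real data would have 10 x 10 x 5 = 500 rows.
results = [
    Result(worker_id=1, requester_id=3, job_type=2, correct=8),
    Result(worker_id=2, requester_id=3, job_type=2, correct=6),
]

# Sanity checks on the ID ranges described in the text.
assert all(1 <= r.worker_id <= 10 for r in results)
assert all(1 <= r.requester_id <= 10 for r in results)
assert all(1 <= r.job_type <= 5 for r in results)
```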
<br />
'''Study 2'''<br />
<br />
Study 2 is similar to Study 1, except that workers now have the option to give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between workers and requesters, or through Prototype Tasks in the Daemo platform. An additional column in the data table records the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:Equations.PNG]]<br />
<br />
b0 represents the base case, when x1 ... x9 are all zero; this refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we isolate the effect of the second worker, and so on. The coefficients of x1 ... x9 thus represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
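The dummy coding this regression assumes can be sketched minimally as follows (worker 1 is the base category, so x1 ... x9 flag workers 2 ... 10; the function name is ours):<br />

```python
def worker_dummies(worker_id, n_workers=10):
    """Return the indicator vector (x1 ... x9) for a worker ID.

    Worker 1 is the base category, so it maps to all zeros; worker k
    (k >= 2) sets the (k-1)-th indicator to 1.
    """
    return [1 if worker_id == k else 0 for k in range(2, n_workers + 1)]

print(worker_dummies(1))   # base case: all zeros -> intercept b0 alone
print(worker_dummies(2))   # x1 = 1, rest zero -> b0 plus the coefficient of x1
```

The same scheme extends to the requester dummies r2 ... r10 in the second regression.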
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1 and r8 = 1 and all other variables are zero, this represents the combination of worker 2 and requester 8. The coefficients b2 and a8 then measure the additional contributions of worker 2 and requester 8 to task output, relative to the base pair of worker 1 and requester 1.<br />
<br />
'''Study 2''':<br />
<br />
In Study 2, we run the regressions from Study 1 again, but add a binary variable representing the presence or absence of a design intervention. Separate binary variables can be added for each intervention to compare the benefits of the different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification, and the Boomerang feature itself.<br />
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the dependent (y) variable is explained by the variables on the right-hand side. It can be shown mathematically that R-squared never decreases as variables are added to a model. Hence the second regression in our study will have a higher R-squared value simply because it has more variables. <br />
<br />
Keeping this problem in mind, statistical packages generally report an adjusted R-squared, which takes into consideration the number of variables used in the model. The adjusted R-squared values of the two regressions can be compared to see whether the requesters explain a considerable amount of the variance in worker output quality.<br />
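The adjustment can be written out directly. A sketch using the standard textbook formula (the numeric R-squared values below are hypothetical, not results):<br />

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1).

    n = number of observations, k = number of regressors (excluding the
    intercept). Adding variables that do not pull their weight is penalized.
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With 500 observations, moving from 9 worker dummies to 9 + 9 worker and
# requester dummies must raise plain R^2; adjusted R^2 rises only if the
# fit improves by more than the penalty for the extra variables.
print(adjusted_r_squared(0.40, 500, 9))   # regression 1 (illustrative R^2)
print(adjusted_r_squared(0.46, 500, 18))  # regression 2 (illustrative R^2)
```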
<br />
==== F-Statistic Test ====<br />
<br />
We can compare the two regression models using an F-test, and thereby statistically justify including the requesters as explanatory variables for worker output quality. To apply the F-test, we treat the first regression model as a restricted form of the second, i.e., we imagine that the coefficients of r2 ... r10 are all zero in the first regression. <br />
<br />
Steps in conducting the F-test:<br />
<br />
1. We state our null (default) hypothesis as: the coefficients of r2 ... r10 are all equal to zero. We then use the data generated in the experiment to try to reject this null hypothesis.<br />
<br />
2. A statistics package like Stata can take both regression models as input and generate a value called an F-statistic. This value is computed internally from the R-squared values of the two regressions.<br />
<br />
3. If the F-statistic is above a certain critical value, we can confidently reject the null hypothesis and conclude that it is beneficial to include the extra requester variables in explaining worker output quality.<br />
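Under the hood, the F-statistic in step 2 follows directly from the two R-squared values. A sketch of that computation (the standard nested-model formula; the numbers in the example are hypothetical):<br />

```python
def f_statistic(r2_restricted, r2_unrestricted, q, df_unrestricted):
    """F-statistic for comparing nested regressions via their R-squared values:

        F = ((R2_ur - R2_r) / q) / ((1 - R2_ur) / df_ur)

    q = number of restrictions (here, the 9 requester dummies),
    df_ur = residual degrees of freedom of the unrestricted model.
    """
    return ((r2_unrestricted - r2_restricted) / q) / \
           ((1 - r2_unrestricted) / df_unrestricted)

# Hypothetical values: 500 observations, 18 regressors plus an intercept in
# the unrestricted model -> df_ur = 500 - 18 - 1 = 481, with q = 9 restrictions.
F = f_statistic(r2_restricted=0.40, r2_unrestricted=0.46,
                q=9, df_unrestricted=481)
print(F)  # compare against the F(9, 481) critical value at, e.g., the 5% level
```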
<br />
For a detailed background on R-squared and F-statistic values, please refer to Chapters 3-4 of "Introductory Econometrics: A Modern Approach" (2012) by Jeffrey M. Wooldridge.<br />
<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality, because requesters pick workers they've rated higher, this implies that the workers responded well to the requester's task authorship/design. And vice-versa, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks which will be undertaken by each worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task should ''not'' be such that the task answers might be subjective or open ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (eg. set of images for image annotation or collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of '500' questions. The quality of the output of each worker (number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, there will be a constant set of workers who will have seen the same tasks designed by different requesters. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, the T-test can used to study the statistical significance of each independent variable used in the model and an F-test can be used to study the collective significance of all the variables included in the model.<br />
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that can are not too subjective, require the requester to lucidly define the goal and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets with ground truths, to make the experiment easier to perform.<br />
<br />
1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what is expected and why <br />
something should be classified as so.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
2. Spelling Correction: In spelling correction,a good requester might improve the quality of the task by clearly specifying factors like:<br />
<br />
What type of English is being used (American, British, etc)<br />
Look at each word independently, or make corrections based on context. <br />
<br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
<br />
This is also one of the tasks in "ten primitives common to most crowdsourcing workflow" - [Measuring Crowdsourcing Effort with Error-Time <br />
Curves. CHI 2015.]<br />
<br />
3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed <br />
task requires instructions on how to handle garbled audio, whether to choose American or British spellings etc.<br />
<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
4.<br />
<br />
5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is same for each requester, and we simply partition the overall data set for each task into ten subsets and distribute these to the requesters.Requesters post these HITs on the selected crowd sourcing platform (ex. Amazon Mechanical Turk).<br />
<br />
Each worker takes up five HITs (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual HITs. We propose to give the workers five days to submit all their HITs.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker Ids and Requester Ids range from 1 - 10, and Job Types range from 1 - 5.<br />
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers have the option to now give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Protosype Tasks in the Daemo Platform. An additional column would be present in the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:Equations.PNG]]<br />
<br />
b0 represents the base case when x1 …… x9 are all zero. This refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we are isolating the effect of the second worker. This pattern continues. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1 and r8 = 1 and other variables were zero, this would represent the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contribution of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.<br />
<br />
'''Study 2''':<br />
<br />
In study 2, we generate the regressions used in Study 1 again, but add an additional binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention included/used to compare the benefits of different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.<br />
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the y variable is due to the variables in the right hand side. Mathematically it has been shown that R-squared values increase as the number of variables in the model increase. Hence the second regression in our study will definitely have a higher R-squared value simply because it has more variables. <br />
<br />
Keeping this problem in mind, statistical packages generally calculate a adjusted R-squared which takes into consideration the number of variables used in the model. Adjusted R-squared values of the two regressions can be compared to see if requesters can explain a considerable amount of the variance in the worker output quality.<br />
<br />
==== F - Statistic Test ====<br />
<br />
We can compare the two regression models using an F - Test, and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F - Test, we consider the first regression model to be a restricted form of the second regression model,i.e, we can imagine that r2 .......... r10 are all zero in the first regression. <br />
<br />
Steps in Conducting F - Test:<br />
<br />
1. We state are null (default) hypothesis as r2 ...... r10 are all equal to zero. Now we would like to use the data generated in the experiment to prove this null hypothesis wrong.<br />
<br />
2. A statistics package like Stata, can take both regression models as inputs and generate a value called an F - Statistic. This value is internally computed from the R-Squared values of both regressions.<br />
<br />
3. If the F - Statistic value is above a certain threshold value, we can confidently reject the null hypotheis and show that is beneficial to include the extra requester variables in explaining worker output quality.<br />
<br />
For a detailed background on R-squared and F - Statistic values, please refer Chapters 3 -4, "Introductory Econometrics : A Modern Approach (2012)" by Jeffrey M. Wooldridge<br />
<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_5_BPHC_Experimental_Design_and_Statistical_Analysis_of_Task_Authorship&diff=17836WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship2016-02-14T16:33:22Z<p>Sreenihitmunakala: /* Study method */</p>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality, because requesters pick workers they've rated higher, this implies that the workers responded well to the requester's task authorship/design. And vice-versa, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks which will be undertaken by each worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task should ''not'' be such that the task answers might be subjective or open ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (eg. set of images for image annotation or collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is '''as close as possible to the ground truth'''. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of '500' questions. The quality of the output of each worker (number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, there will be a constant set of workers who will have seen the same tasks designed by different requesters. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, the T-test can used to study the statistical significance of each independent variable used in the model and an F-test can be used to study the collective significance of all the variables included in the model.<br />
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that can are not too subjective, require the requester to lucidly define the goal and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets with ground truths, to make the experiment easier to perform.<br />
<br />
*1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what is expected and why something should be classified as so.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
*2. Spelling Correction: In spelling correction,a good requester might improve the quality of the task by clearly specifying factors like:<br />
<br />
What type of English is being used (American, British, etc)<br />
Look at each word independently, or make corrections based on context. <br />
<br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
<br />
This is also one of the tasks in "ten primitives common to most crowdsourcing workflow" - [Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015.]<br />
<br />
*3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed task requires instructions on how to handle garbled audio, whether to choose American or British spellings etc.<br />
<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
*4.<br />
<br />
*5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is same for each requester, and we simply partition the overall data set for each task into ten subsets and distribute these to the requesters.Requesters post these HITs on the selected crowd sourcing platform (ex. Amazon Mechanical Turk).<br />
<br />
Each worker takes up five HITs (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual HITs. We propose to give the workers five days to submit all their HITs.<br />
<br />
We collect the submissions from each requester and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker Ids and Requester Ids range from 1 - 10, and Job Types range from 1 - 5.<br />
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers have the option to now give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Protosype Tasks in the Daemo Platform. An additional column would be present in the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:Equations.PNG]]<br />
<br />
b0 represents the base case when x1 …… x9 are all zero. This refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we are isolating the effect of the second worker. This pattern continues. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1 and r8 = 1 and other variables were zero, this would represent the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contribution of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.<br />
<br />
'''Study 2''':<br />
<br />
In study 2, we generate the regressions used in Study 1 again, but add an additional binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention included/used to compare the benefits of different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.<br />
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the y variable is due to the variables in the right hand side. Mathematically it has been shown that R-squared values increase as the number of variables in the model increase. Hence the second regression in our study will definitely have a higher R-squared value simply because it has more variables. <br />
<br />
Keeping this problem in mind, statistical packages generally calculate a adjusted R-squared which takes into consideration the number of variables used in the model. Adjusted R-squared values of the two regressions can be compared to see if requesters can explain a considerable amount of the variance in the worker output quality.<br />
<br />
==== F - Statistic Test ====<br />
<br />
We can compare the two regression models using an F - Test, and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F - Test, we consider the first regression model to be a restricted form of the second regression model,i.e, we can imagine that r2 .......... r10 are all zero in the first regression. <br />
<br />
Steps in Conducting F - Test:<br />
<br />
1. We state are null (default) hypothesis as r2 ...... r10 are all equal to zero. Now we would like to use the data generated in the experiment to prove this null hypothesis wrong.<br />
<br />
2. A statistics package like Stata, can take both regression models as inputs and generate a value called an F - Statistic. This value is internally computed from the R-Squared values of both regressions.<br />
<br />
3. If the F - Statistic value is above a certain threshold value, we can confidently reject the null hypotheis and show that is beneficial to include the extra requester variables in explaining worker output quality.<br />
<br />
For a detailed background on R-squared and F - Statistic values, please refer Chapters 3 -4, "Introductory Econometrics : A Modern Approach (2012)" by Jeffrey M. Wooldridge<br />
<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_5_BPHC_Experimental_Design_and_Statistical_Analysis_of_Task_Authorship&diff=17833WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship2016-02-14T16:31:00Z<p>Sreenihitmunakala: /* Study introduction */</p>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important requesters are for enabling workers to produce high quality work. While requesters often claim they face issues with quality of workers, the '''quality of task authorship is a major issue among workers''' as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
[[File:Experiment_design.png]]<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality: because requesters pick workers they have rated higher, high ratings imply that those workers responded well to the requester's task authorship/design. The same holds in reverse, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
=== Study method ===<br />
<br />
We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks, each of which will be undertaken by every worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task should ''not'' be one whose answers might be subjective or open-ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (e.g., a set of images for image annotation or a collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is as close as possible to the ground truth. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus, over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker may encounter a total of 500 questions. The quality of the output of each worker (the number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, there will be a constant set of workers who will have seen the same tasks designed by different requesters. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, a t-test can be used to study the statistical significance of each independent variable in the model, and an F-test can be used to study the collective significance of all the variables included in the model.<br />
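As a sketch of this analysis pipeline (using Python's statsmodels; the column names and simulated scores are illustrative assumptions, not the study's data), the regression and the quantities mentioned above can be read off as follows:<br />

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative data: one score per (worker, requester) pair, out of 10 questions.
rng = np.random.default_rng(42)
df = pd.DataFrame(
    [(w, r, int(rng.integers(0, 11))) for w in range(1, 11) for r in range(1, 11)],
    columns=["worker", "requester", "score"],
)

# Linear regression of task output on categorical (dummy) variables
# representing the workers and the requesters.
model = smf.ols("score ~ C(worker) + C(requester)", data=df).fit()

r_squared = model.rsquared      # coefficient of determination
t_pvalues = model.pvalues       # t-test p-value for each variable
overall_f = model.fvalue        # F-statistic for joint significance
```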
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that are not too subjective, require the requester to lucidly define the goal, and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets and ground truths, to make the experiment easier to perform.<br />
<br />
*1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what is expected and why a review should be classified one way or the other.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
*2. Spelling Correction: In spelling correction, a good requester might improve the quality of the task by clearly specifying factors like:<br />
<br />
What variety of English is being used (American, British, etc.).<br />
Whether to look at each word independently, or to make corrections based on context. <br />
<br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
<br />
This is also one of the tasks in the "ten primitives common to most crowdsourcing workflows" - [Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015.]<br />
<br />
*3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed task requires instructions on how to handle garbled audio, whether to choose American or British spellings, etc.<br />
<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
*4.<br />
<br />
*5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
<br />
<br />
=== Experiment Design === <br />
<br />
'''Study 1'''<br />
<br />
Each of the 10 requesters receives 5 data sets corresponding to the 5 types of tasks discussed above. The task of each type is the same for each requester; we simply partition the overall data set for each task into ten subsets and distribute these to the requesters. Requesters post these HITs on the selected crowdsourcing platform (e.g. Amazon Mechanical Turk).<br />
<br />
Each worker takes up five HITs (one of each type of task) from each requester. The total workload on each worker is therefore 50 individual HITs. We propose to give the workers five days to submit all their HITs.<br />
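The dataset-partitioning step described above can be illustrated as follows (the item list is a hypothetical stand-in for, e.g., product reviews or audio clips):<br />

```python
# Partition an overall task dataset into ten subsets, one per requester.
items = [f"item_{i:03d}" for i in range(100)]  # hypothetical task items

num_requesters = 10
subsets = [items[i::num_requesters] for i in range(num_requesters)]

# Each requester receives a disjoint subset; together they cover the whole dataset.
assert all(len(s) == len(items) // num_requesters for s in subsets)
```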
<br />
We collect the workers' submissions to each requester's tasks and compare them with the ground truth to count the number of correct responses for each worker. The results of the experiment may be tabulated as follows:<br />
<br />
[[File:Sample_data.PNG]]<br />
<br />
Here, Worker IDs and Requester IDs range from 1 to 10, and Job Types range from 1 to 5.<br />
<br />
'''Study 2'''<br />
<br />
Similar to Study 1, except that workers now have the option to give feedback (before submitting) on poorly designed tasks. This feedback can be given through emails/messages between the workers and requesters or through Prototype Tasks in the Daemo platform. An additional column would be present in the data table representing the intervention used by the worker or requester.<br />
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
<br />
[[File:Equations.PNG]]<br />
<br />
b0 represents the base case when x1 …… x9 are all zero. This refers to the first worker. When x1 = 1 and x2 ... x9 are zero, we are isolating the effect of the second worker. This pattern continues. The coefficients of x1 ... x9 represent the increase or decrease in performance of workers 2 to 10 relative to the performance of worker 1, who is represented by the constant b0.<br />
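Written out, a regression equation consistent with this description (our reconstruction; y denotes a worker's task score, the x's are worker dummy variables, and u is the error term) would be:<br />
<br />
<math>y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_9 x_9 + u</math><br />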
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1 and r8 = 1 and other variables were zero, this would represent the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 in this case specifically measure the additional contribution of worker 2 and requester 8 to task output relative to the base pair of worker 1 and requester 1.<br />
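In the same notation, the second regression (again our reconstruction; <math>w_i</math> and <math>r_j</math> denote the worker and requester dummy variables) would take the form:<br />
<br />
<math>y = b_1 + b_2 w_2 + \dots + b_{10} w_{10} + a_2 r_2 + \dots + a_{10} r_{10} + u</math><br />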
<br />
'''Study 2''':<br />
<br />
In Study 2, we generate the regressions used in Study 1 again, but add an additional binary variable representing the presence or absence of design interventions. Separate binary variables can be added for each intervention included/used, to compare the benefits of different interventions.<br />
<br />
The interventions include prototyping tasks, features to email or message requesters for clarification and the Boomerang feature itself.<br />
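A minimal sketch of adding such a binary variable (statsmodels formula syntax; the column name 'intervention' and the simulated data are our assumptions, not the study's):<br />

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for both conditions: intervention absent (0) and present (1).
rng = np.random.default_rng(7)
df = pd.DataFrame(
    [(w, r, iv, int(rng.integers(0, 11)))
     for iv in (0, 1) for w in range(1, 11) for r in range(1, 11)],
    columns=["worker", "requester", "intervention", "score"],
)

# Study 1 dummies plus a binary variable for the design intervention.
model = smf.ols("score ~ C(worker) + C(requester) + intervention", data=df).fit()
effect = model.params["intervention"]  # estimated shift in score under the intervention
```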
<br />
=== What do we want to analyze? ===<br />
<br />
==== Goodness of Fit (R-Squared Value) ====<br />
The R-squared value of a regression ranges between 0 and 1 and measures what proportion of the variance in the y variable is explained by the variables on the right-hand side. Mathematically, R-squared can never decrease as the number of variables in the model increases. Hence the second regression in our study will almost certainly have a higher R-squared value simply because it has more variables. <br />
<br />
Keeping this problem in mind, statistical packages generally calculate an adjusted R-squared, which takes into consideration the number of variables used in the model. The adjusted R-squared values of the two regressions can be compared to see if requesters explain a considerable amount of the variance in worker output quality.<br />
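For reference, the standard adjustment, with <math>n</math> observations and <math>k</math> explanatory variables (excluding the intercept), is:<br />
<br />
<math>\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}</math><br />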
<br />
==== F-Statistic Test ====<br />
<br />
We can compare the two regression models using an F-test, and statistically justify the inclusion of requesters as explanatory variables for worker output quality. To apply the F-test, we consider the first regression model to be a restricted form of the second regression model, i.e., we can imagine that the coefficients of r2, ..., r10 are all constrained to zero in the first regression. <br />
<br />
Steps in conducting the F-test:<br />
<br />
1. We state our null (default) hypothesis: the coefficients of r2, ..., r10 are all equal to zero. We then use the data generated in the experiment to try to reject this null hypothesis.<br />
<br />
2. A statistics package like Stata can take both regression models as inputs and generate a value called an F-statistic. This value is internally computed from the R-squared values of both regressions.<br />
<br />
3. If the F-statistic is above a certain threshold (the critical value at the chosen significance level), we can confidently reject the null hypothesis and conclude that it is beneficial to include the extra requester variables in explaining worker output quality.<br />
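Concretely, writing <math>R^2_{ur}</math> and <math>R^2_{r}</math> for the R-squared values of the unrestricted and restricted regressions, <math>q</math> for the number of restrictions (here 9, one per requester dummy), <math>n</math> for the number of observations, and <math>k</math> for the number of explanatory variables in the unrestricted model, the F-statistic is:<br />
<br />
<math>F = \frac{(R^2_{ur} - R^2_{r})/q}{(1 - R^2_{ur})/(n - k - 1)}</math><br />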
<br />
For a detailed background on R-squared and F-statistic values, please refer to Chapters 3-4 of "Introductory Econometrics: A Modern Approach" (2012) by Jeffrey M. Wooldridge.<br />
<br />
<br />
== Milestone Contributors ==<br />
@sreenihit , @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_5_BPHC_Experimental_Design_and_Statistical_Analysis_of_Task_Authorship&diff=17741WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship2016-02-14T12:20:07Z<p>Sreenihitmunakala: /* Measures from the study */</p>
<hr />
<div>We have attempted to describe the complete experiment for task authorship, to determine the role of requester's input for worker output. We have described the structure of the experiment we wish to conduct and how we intend to analyze the data we obtain. We've taken the main direction of this experiment from the brainstorming sessions.<br />
<br />
<br />
=== Study introduction === <br />
<br />
'''Study 1''': Measuring the Variance in Quality of Task Completion Due to Requester Quality<br />
<br />
In our first study, we wish to quantify the effects of requesters on the quality of workers’ task submissions, over a spread of tasks popular in crowdsourcing platforms. <br />
<br />
The goal is to understand how important '''requesters are for enabling workers''' to produce high quality work. While requesters often claim they face issues with the quality of workers, the quality of task authorship is a major issue amongst workers as well. Workers on crowdsourcing platforms like Amazon Mechanical Turk often feel that requesters do not clearly design/outline the jobs they post and that this poor design quality subsequently leads to more rejections and/or necessitates a lot of feedback from the worker.<br />
<br />
We aim to perform this study by asking a fixed group of workers to take up similar tasks posted by different requesters and determine how much of the variance in submission quality can be attributed to the requesters.<br />
<br />
<br />
'''Study 2''': Measuring the Potential Benefits of Design Interventions<br />
<br />
We wish to determine if the task design interventions (implemented in Daemo) for crowdsourcing tasks will help reduce the variance in worker task quality caused by the requesters. <br />
<br />
We predict that introducing design features like Prototype Tasks in Daemo and feedback channels between workers and requesters will improve task authorship and workers’ output quality. We also predict that '''Boomerang''' itself will help reduce variance in quality: because requesters pick workers they have rated higher, high ratings imply that those workers responded well to the requester's task authorship/design. The same holds in reverse, with workers rating requesters higher if they feel a requester authored a task well. <br />
<br />
The effect of these features will be quantified by measuring the variance in worker output caused by requesters after adding the interventions and comparing this quantity with the variance obtained in Study 1.<br />
<br />
<br />
<br />
=== Study method ===<br />
<br />
'''Study 1''': We recruit a group of 10 requesters and 10 workers for the experiment. Each requester will be instructed to post 5 different types of tasks, each of which will be undertaken by every worker. We aim to pick tasks that are '''complex enough to require the requester to design the task well''' in terms of a clear explanation, sample questions or general task format. At the same time, the task should ''not'' be one whose answers might be subjective or open-ended. <br />
<br />
Each requester is provided with the dataset to be used for the task (e.g., a set of images for image annotation or a collection of audio clips for audio transcription) and the ground truth for these datasets. From the requester's perspective, the goal will be to obtain output from the workers that is as close as possible to the ground truth. We will provide minimal details apart from that, to avoid giving the requesters an already designed task.<br />
<br />
Each worker then proceeds to tackle each type of task from each requester. Thus, over the course of the experiment, each worker will perform 50 tasks. Each task may consist of 10 individual 'questions', so a worker will encounter a total of 500 questions. The quality of the output of each worker (the number of correct answers measured against the ground truth) is recorded. <br />
<br />
Thus, there will be a constant set of workers who will have seen the same tasks designed by different requesters. <br />
<br />
We then use this data to run a linear regression of task output on categorical variables representing the requesters and workers. The R-squared value of the regression (also referred to as the coefficient of determination) shows how much of the variance in the dependent variable (task output) is explained by the independent variables (workers and requesters).<br />
<br />
In addition, a t-test can be used to study the statistical significance of each independent variable in the model, and an F-test can be used to study the collective significance of all the variables included in the model. <br />
<br />
=== Method specifics and details === <br />
<br />
Proposed Tasks and Datasets:<br />
<br />
As mentioned earlier, we have chosen tasks that are not too subjective, require the requester to lucidly define the goal, and have desired outputs that can be easily verified as valid or invalid. We have also tried to pick tasks that are fairly representative of common tasks on crowdsourcing platforms like Amazon Mechanical Turk. We have sought out tasks with existing datasets and ground truths, to make the experiment easier to perform.<br />
<br />
*1. Classification of Product Reviews as Positive or Negative: This would require a requester to clearly explain what is expected and why a review should be classified one way or the other.<br />
<br />
Datasets are readily available from this source:<br />
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets<br />
<br />
*2. Spelling Correction: In spelling correction, a good requester might improve the quality of the task by clearly specifying factors like:<br />
<br />
What variety of English is being used (American, British, etc.).<br />
Whether to look at each word independently, or to make corrections based on context. <br />
<br />
An example of such a task can be seen in this source:<br />
https://www.ukp.tu-darmstadt.de/fileadmin/user_upload/Group_UKP/data/spelling/en_natural_train.txt<br />
<br />
This is also one of the tasks in the "ten primitives common to most crowdsourcing workflows" - [Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015.]<br />
<br />
*3. Audio Transcription: Audio transcription is a common task that is frequently seen on crowdsourcing platforms. A well designed task requires instructions on how to handle garbled audio, whether to choose American or British spellings, etc.<br />
<br />
There is a wealth of datasets available for this task. Example: <br />
https://www.speech.cs.cmu.edu/databases/an4/<br />
<br />
*4.<br />
<br />
*5.<br />
<br />
Crowdsourcing Platform:<br />
<br />
For the first study, tasks will be posted on Amazon Mechanical Turk, as this is currently one of the most widely used platforms. We ask the workers to not email or contact the requester in any way, as we study the impact of such interventions in the next study. <br />
<br />
For the second study, we switch to the Daemo system. Daemo has already implemented design interventions such as Prototype Tasks. Alternatively, we could continue to use Mechanical Turk, but allow workers to use the mailing option to communicate with requesters and give feedback.<br />
<br />
<br />
<br />
=== Experimental Design for the study === <br />
We presented workers with a mixed series of tasks from the<br />
ten primitives and manipulated two factors: the time limit<br />
and the primitive. Each primitive had seven different possible<br />
time limits, and one untimed condition. The exact time limits<br />
were initialized using how long workers took when not under<br />
time pressure. The result was a sampled, not fully-crossed,<br />
design. For each worker we randomly selected five primitives<br />
for them to perform; for each primitive, three questions of that<br />
type were shown with each of the specified time limits. The<br />
images or text used in these questions were randomly sampled<br />
and shuffled for each worker. To minimize practice effects,<br />
workers completed three timed practice questions prior<br />
to seeing any of these conditions. The tasks were presented<br />
in randomized order, and within each primitive the time conditions<br />
were presented in randomized order. Workers were<br />
compensated $2.00 and repeat participation was disallowed.<br />
A single task was presented on each page, allowing us to<br />
record how long workers took to submit a response. Under<br />
timed conditions, a timer started as soon as the worker advanced<br />
to the next page. Input was disabled as soon as the<br />
timer expired, regardless of what the worker was doing (e.g.,<br />
typing, clicking). An example task is shown in Figure 3.<br />
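The assignment procedure above can be sketched as follows; the primitive names and time-limit values here are illustrative placeholders, not the ones used in the study.<br />

```python
import random

PRIMITIVES = [f"primitive_{i}" for i in range(1, 11)]   # the ten primitives
TIME_LIMITS = [2, 4, 6, 8, 12, 16, 24, None]            # seven limits + one untimed

def assign_conditions(seed):
    """Build one worker's plan: five random primitives, and within each
    primitive three questions per time condition, in randomized order."""
    rng = random.Random(seed)
    chosen = rng.sample(PRIMITIVES, 5)   # five primitives, already in random order
    plan = []
    for prim in chosen:
        limits = TIME_LIMITS * 3         # three questions per time condition
        rng.shuffle(limits)              # time conditions in randomized order
        plan.extend((prim, t) for t in limits)
    return plan
```

Each worker's plan then contains 5 primitives x 8 conditions x 3 questions = 120 questions.<br />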
<br />
=== Measures from the study === <br />
<br />
'''Study 1''': <br />
<br />
Linear Regression 1 - <br />
[[File:eq1.jpg]]<br />
<br />
y = b0 + b1x1 + b2x2 + ... + b9x9, where y, the quality of a worker's output, is the number of correct responses divided by the total number of responses required in the task, and x1 ... x9 are binary variables identifying workers 2 to 10. The data for y is collected for each HIT a worker submits.<br />
<br />
b0 represents the base case in which x1 ... x9 are all zero, i.e., the first worker. When x1 = 1 and x2 ... x9 are zero, we isolate the effect of the second worker, and so on. The coefficients of x1 ... x9 therefore represent the increase or decrease in the performance of workers 2 to 10 relative to worker 1, who is captured by the constant b0.<br />
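As a sketch, this dummy-variable regression can be fit with ordinary least squares; the scores below are hypothetical data for three workers rather than the ten in the study.<br />

```python
import numpy as np

# Hypothetical per-HIT quality scores (fraction of correct responses) per worker.
scores = {1: [0.80, 0.75, 0.85], 2: [0.60, 0.65, 0.55], 3: [0.90, 0.95, 0.85]}

workers = sorted(scores)
rows, y = [], []
for w in workers:
    for s in scores[w]:
        # design row: intercept (b0) plus one 0/1 dummy per worker after the first
        rows.append([1.0] + [1.0 if w == other else 0.0 for other in workers[1:]])
        y.append(s)

beta, *_ = np.linalg.lstsq(np.array(rows), np.array(y), rcond=None)
# beta[0] is worker 1's mean quality; beta[k] is worker k+1's offset from it.
```

Because the dummy coding saturates the model, the fitted coefficients are exactly the group means and their differences.<br />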
<br />
<br />
Linear Regression 2 - <br />
<br />
[[File:eq2.jpg]]<br />
<br />
y = b1 + b2w2 + b3w3 + ... + b10w10 + a2r2 + a3r3 + ... + a10r10, where w2 ... w10 and r2 ... r10 are binary variables identifying workers 2 to 10 and requesters 2 to 10 respectively.<br />
<br />
Here, variables representing the requesters are added. The constant b1 represents the worker performance due to the base requester (requester 1) and the base worker (worker 1) when all other variables are zero.<br />
<br />
If w2 = 1, r8 = 1, and all other variables are zero, the prediction captures the effect of worker 2 and requester 8 on worker output. The coefficients b2 and a8 then measure the additional contributions of worker 2 and requester 8 to task output, relative to the base pair of worker 1 and requester 1.<br />
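The two-factor design matrix can be sketched the same way; the worker and requester offsets below are hypothetical, and the grid is shrunk to 3 workers by 2 requesters for brevity.<br />

```python
import numpy as np

# Hypothetical additive scores: baseline pair plus worker and requester offsets.
base, w_off, r_off = 0.7, [0.0, -0.1, 0.1], [0.0, 0.05]

rows, y = [], []
for w in range(3):
    for r in range(2):
        # columns: intercept b1, worker dummies w2..w3, requester dummy r2
        rows.append([1.0, float(w == 1), float(w == 2), float(r == 1)])
        y.append(base + w_off[w] + r_off[r])

beta, *_ = np.linalg.lstsq(np.array(rows), np.array(y), rcond=None)
# beta = [b1, b2, b3, a2]: baseline pair, worker offsets, requester offset.
```

With noiseless additive data the fit recovers the offsets exactly; real submissions would add an error term.<br />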
<br />
'''Study 2''':<br />
<br />
In Study 2, we run the regressions from Study 1 again, but add a binary variable representing the presence or absence of design interventions. A separate binary variable can be added for each intervention used, so that the benefits of the different interventions can be compared.<br />
<br />
The interventions include Prototype Tasks, features to email or message requesters for clarification, and the Boomerang feature itself.<br />
<br />
=== What do we want to analyze? ===<br />
The R-squared value of a regression ranges between 0 and 1 and measures the proportion of the variance in the y variable that is explained by the variables on the right-hand side. Mathematically, R-squared never decreases as variables are added to a model, so the second regression in our study will have a higher R-squared value simply because it has more variables. <br />
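A minimal sketch of this R-squared inflation, and of the adjusted value that corrects for it (assuming ordinary least squares and a design matrix that includes an intercept column):<br />

```python
import numpy as np

def r2_and_adjusted(X, y):
    """R^2 and adjusted R^2 for an OLS fit (X must include an intercept column)."""
    n, p = X.shape                      # p counts the intercept as a parameter
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p)   # penalise extra regressors
    return r2, adj

# Nested models: adding an irrelevant column can only raise R^2,
# while the adjusted value penalises the extra parameter.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
y = 0.5 + 0.3 * x1 + rng.normal(scale=0.1, size=50)
X_small = np.column_stack([np.ones(50), x1])
X_big = np.column_stack([X_small, rng.normal(size=50)])  # junk regressor
r2_s, adj_s = r2_and_adjusted(X_small, y)
r2_b, adj_b = r2_and_adjusted(X_big, y)
```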
<br />
Keeping this problem in mind, statistical packages generally report an adjusted R-squared, which takes the number of variables in the model into account. The adjusted R-squared values of the two regressions can then be compared to see whether the requesters explain a considerable amount of the variance in worker output quality.</div>
<hr />
<div>== Methods (for task authorship write up) ==<br />
<br />
We're going to borrow methods section from this paper as an example: [[:Media:2015 eta (private).pdf | Cheng, J., Teevan, J. & Bernstein, M.S. (2015). Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015.]]. Please note how this section was divided into different parts. Please follow the same template.<br />
<br />
=== Study introduction === <br />
STUDY 1: ETA VS. OTHER MEASURES OF EFFORT<br />
We begin by comparing ETA and other measures of difficulty<br />
(including time and subjective difficulty) across a number of<br />
common crowdsourcing tasks. After describing the experimental<br />
setup, designed to elicit the necessary data to generate<br />
error-time curves and other measures for each task, we show<br />
how closely the different measures matched.<br />
<br />
=== Study method ===<br />
Method: Study 1 and all subsequent experiments reported in this paper<br />
were conducted using a proprietary microtasking platform<br />
that outsources crowd work to workers on the Clickworker<br />
microtask market. The platform interface is similar to that<br />
of Amazon Mechanical Turk; users upload HTML task files,<br />
workers choose from a marketplace listing of tasks, and data<br />
is collected in CSV files. We restricted workers to those residing<br />
in the United States. Across all studies, 470 unique workers<br />
completed over 44 thousand tasks. A followup survey<br />
revealed that approximately 66% were female. We replicated<br />
Study 1 on Amazon Mechanical Turk and found empirically<br />
<br />
similar results, so we only report results using Clickworker in<br />
this paper.<br />
<br />
=== Method specifics and details === <br />
Primitive Crowdsourcing Task Types<br />
We began by populating our evaluation tasks with common<br />
crowdsourcing task types, or primitives, that appear commonly<br />
as microtasks or parts of microtasks. To do this, we<br />
looked at the types of tasks with the most available HITs<br />
on Amazon Mechanical Turk, at reports on common crowdsourcing<br />
task types [15], and at crowdsourcing systems described<br />
in the literature (e.g., [4]). After several iterations<br />
we identified a list of ten primitives that are present in most<br />
crowdsourcing workflows (Table 1, Figure 2). For example,<br />
the Find-Fix-Verify workflow [4] could be expressed using<br />
a combination of the FIND (identify sentences which need<br />
shortening), FIX (shortening these sentences), and BINARY<br />
primitives (verifying the shortening is an improvement). In<br />
many cases, the primitives themselves (or repetitions of the<br />
same primitive) make up the entire task, and map directly to<br />
common Mechanical Turk tasks (e.g., finding facts such as<br />
phone numbers about individuals (SEARCH)).<br />
We instantiated these primitives using a dataset of images of<br />
people performing different actions (e.g., waving, cooking)<br />
[34] and a corpus of translated Wikipedia articles selected because<br />
they tend to contain errors [1].<br />
<br />
=== Experimental Design for the study === <br />
We presented workers with a mixed series of tasks from the<br />
ten primitives and manipulated two factors: the time limit<br />
and the primitive. Each primitive had seven different possible<br />
time limits, and one untimed condition. The exact time limits<br />
were initialized using how long workers took when not under<br />
time pressure. The result was a sampled, not fully-crossed,<br />
design. For each worker we randomly selected five primitives<br />
for them to perform; for each primitive, three questions of that<br />
type were shown with each of the specified time limits. The<br />
images or text used in these questions were randomly sampled<br />
and shuffled for each worker. To minimize practice effects,<br />
workers completed three timed practice questions prior<br />
to seeing any of these conditions. The tasks were presented<br />
in randomized order, and within each primitive the time conditions<br />
were presented in randomized order. Workers were<br />
compensated $2.00 and repeat participation was disallowed.<br />
A single task was presented on each page, allowing us to<br />
record how long workers took to submit a response. Under<br />
timed conditions, a timer started as soon as the worker advanced<br />
to the next page. Input was disabled as soon as the<br />
timer expired, regardless of what the worker was doing (e.g.,<br />
typing, clicking). An example task is shown in Figure 3.<br />
<br />
=== Measures from the study === <br />
The information we logged allowed us to calculate behavioral<br />
measures for each primitive:<br />
– ETA. The ETA is the area under the error-time curve.<br />
– Time@10. We also calculated the time it takes to achieve<br />
an error rate at the 10th percentile.<br />
– Error. We measured the error rate against ground truth<br />
for each primitive. If there were many possible correct<br />
responses, we manually judged responses while blind to<br />
condition. Automatically computing distance metrics (e.g.,<br />
edit distance) resulted in empirically similar findings.<br />
– Time. We measured how long workers took to complete the<br />
primitive without any time limit.<br />
After each task block was complete, we additionally asked<br />
workers to record several subjective reflections:<br />
– Estimated time. We asked workers to report how long they<br />
thought they spent on a primitive absent time pressure.<br />
Time estimation has previously been used as an implicit<br />
signal of task difficulty [5].<br />
– Relative subjective duration (RSD). RSD, a measure of<br />
how much task time is over- or underestimated [5], is obtained<br />
by dividing the difference between estimated and<br />
actual time spent by the actual time spent.<br />
– Task load index (TLX). The NASA TLX [10] is a validated<br />
metric of mental workload commonly used in human factors<br />
research to assess task performance. It consists of a<br />
survey that sums six subjective dimensions (e.g., mental<br />
demand).<br />
A separate experimental design that contained all ten primitives,<br />
where each worker completed three untimed practice<br />
questions followed by three untimed questions for each primitive<br />
(with the primitives presented in random order), was used<br />
to obtain the following measure:<br />
– Subjective rank. Workers considered all of the primitives<br />
they completed and ranked them in order of effort required.<br />
As rankings produce sharper distinctions than individual ratings<br />
[2], we consider subjective rank to represent our ground<br />
truth ranking of the primitives. However, rank would not be a<br />
deployable solution for requesters. Ranking means that workers<br />
would need to test the new task against at least log(n)<br />
of the primitives, incurring a large fixed overhead. Further,<br />
ranking is ordinal, and cannot quantify small changes in effort.<br />
In contrast, ETA is an absolute ranking, can measure<br />
small changes in effort, and only needs to be measured for<br />
the target task to compare it with other tasks.<br />
<br />
=== What do we want to analyze? ===<br />
Analysis<br />
60 workers completed Study 1, with 30 performing each<br />
primitive. We averaged our dependent measures across all<br />
30 workers, and compared the ranking of primitives induced<br />
by each measure to the average subjective ranking (subjective<br />
rank was obtained by having 40 other workers rank all<br />
ten primitives). We used the Kendall rank correlation coefficient<br />
to capture how closely each measure approximated the workers’ ranks, with Holm-corrected p-values calculated under<br />
the null hypothesis of no association. A rank correlation<br />
of 1 indicates perfect correlation; 0 indicates no correlation.<br />
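As a concrete sketch of this analysis (with hypothetical rankings, since the per-measure ranks are not reproduced here), the Kendall coefficient for two rankings without ties can be computed directly; the Holm-corrected p-values are omitted from this sketch.<br />

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall rank correlation between two rankings without ties:
    (concordant pairs - discordant pairs) / total pairs."""
    n = len(a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        if (a[i] - a[j]) * (b[i] - b[j]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical example: a measure whose induced ranking swaps two
# adjacent pairs of primitives relative to the subjective ranking.
subjective_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
measure_rank = [1, 2, 4, 3, 5, 6, 8, 7, 9, 10]
tau = kendall_tau(subjective_rank, measure_rank)  # close to, but below, 1
```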
Measures that capture the subjective ranking accurately can<br />
analyze new task types without comparing them against multiple<br />
benchmark tasks.</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_5_BPHC_Experimental_Design_and_Statistical_Analysis_of_Task_Authorship&diff=17734WinterMilestone 5 BPHC Experimental Design and Statistical Analysis of Task Authorship2016-02-14T11:41:09Z<p>Sreenihitmunakala: Created page with "Please use the following template to write up your introduction section this week. == System (for task feed and open gov write up) == We're going to borrow systems section..."</p>
<hr />
<div>Please use the following template to write up your introduction section this week. <br />
<br />
== System (for task feed and open gov write up) ==<br />
<br />
We're going to borrow systems section from this paper as an example: [[:Media:Twitch Crowdsourcing (private).pdf | Vaish R, Wyngarden K, Chen J, et al. Twitch crowdsourcing: crowd contributions in short bursts of time. Proceedings of the 32nd annual ACM conference on Human factors in computing systems. ACM, 2014: 3645-3654.]] Please note how this section was divided into different parts. Please follow the same template. <br />
<br />
=== Brief introduction of the system === <br />
Twitch is an Android application that appears when the user<br />
presses the phone’s power/lock button (Figures 1 and 3).<br />
When the user completes the twitch crowdsourcing task, the<br />
phone unlocks normally. Each task involves a choice<br />
between two to six options through a single motion such as<br />
a tap or swipe.<br />
<br />
=== How is the system solving critical problems === <br />
To motivate continued participation, Twitch provides both<br />
instant and aggregated feedback to the user. An instant feedback display shows how many other users agreed via a<br />
fadeout as the lock screen disappears (Figure 4) or how the<br />
user’s contributions apply to the whole (Figure 5).<br />
Aggregated data is also available via a web application,<br />
allowing the user to explore all data that the system has<br />
collected. For example, Figure 2 shows a human generated<br />
map from the Census application.<br />
To address security concerns, users are allowed to either<br />
disable or keep their existing Android passcode while using<br />
Twitch. If users do not wish to answer a question, they may<br />
skip Twitch by selecting ‘Exit’ via the options menu. This<br />
design decision has been made to encourage the user to give<br />
Twitch an answer, which is usually faster than exiting.<br />
Future designs could make it easier to skip a task, for<br />
example through a swipe-up.<br />
<br />
=== Introducing modules of the system === <br />
Below, we introduce the three main crowdsourcing<br />
applications that Twitch supports. The first, Census,<br />
attempts to capture local knowledge. The following two,<br />
Image Voting and Structuring the Web, draw on creative<br />
and topical expertise. These three applications are bundled<br />
into one Android package, and each can be accessed<br />
interchangeably through Twitch's settings menu.<br />
<br />
=== Module 1: Census === <br />
<br />
==== Problem/Limitations ====<br />
Despite progress in producing effective understanding of<br />
static elements of our physical world — routes, businesses<br />
and points of interest — we lack an understanding of<br />
human activity. How busy is the corner cafe at 2pm on<br />
Fridays? What time of day do businesspeople clear out of<br />
the downtown district and get replaced by socializers?<br />
Which neighborhoods keep high-energy activities going<br />
until 11pm, and which ones become sleepy by 6pm? Users<br />
could take advantage of this information to plan their<br />
commutes, their social lives and their work.<br />
<br />
==== Module preview ====<br />
Existing crowdsourced techniques such as Foursquare are<br />
too sparse to answer these kinds of questions: the answers<br />
require at-the-moment, distributed human knowledge. We<br />
envision that twitch crowdsourcing can help create a<br />
human-centered equivalent of Google Street View, where a<br />
user could browse typical crowd activity in an area. To do<br />
so, we ask users to answer one of several questions about the world around them each time they unlock their phone.<br />
Users can then browse the map they are helping create.<br />
<br />
==== System details ==== <br />
Census is the default crowdsourcing task in Twitch. It<br />
collects structured information about what people<br />
experience around them. Each Census unlock screen<br />
consists of four to six tiles (Figures 1 and 3), each<br />
centered around a question such as:<br />
• How many people are around you?<br />
• What kinds of attire are nearby people wearing?<br />
• What are you currently doing?<br />
• How much energy do you have right now?<br />
While not exhaustive, these questions cover several types of<br />
information that a local census might seek to provide. Two<br />
of the four questions ask users about the people around<br />
them, while the other two ask about users themselves; in both<br />
cases, users are uniquely equipped to answer. Each<br />
answer is represented graphically; for example, in case of<br />
activities, users have icons for working, at home, eating,<br />
travelling, socializing, or exercising.<br />
To motivate continued engagement, Census provides two<br />
modes of feedback. Instant feedback (Figure 4) is a brief<br />
Android popup message that appears immediately after the<br />
user makes a selection. It reports the percentage of<br />
responses in the current time bin and location that agreed<br />
with the user, then fades out within two seconds. It is<br />
transparent to user input, so the user can begin interacting<br />
with the phone even while it is visible. The aggregated report<br />
allows Twitch users to see the cumulative effect of all<br />
users’ behavior. The data is bucketed and visualized on a<br />
map (Figure 2) on the Twitch homepage. Users can filter<br />
the data based on activity type or time of day.<br />
<br />
<br />
=== Module 2: Photo Ranking ===<br />
<br />
==== Problem/Limitations ====<br />
Beyond harnessing local observations via Census, we<br />
wanted to demonstrate that twitch crowdsourcing could<br />
support traditional crowdsourcing tasks such as image ranking (e.g., Matchin [17]). Needfinding interviews and<br />
prototyping sessions with ten product design students at<br />
Stanford University indicated that product designers not<br />
only need photographs for their design mockups, but they<br />
also enjoy looking at the photographs. Twitch harnesses<br />
this interest to help rank photos and encourage contribution<br />
of new photos.<br />
<br />
==== Module details ====<br />
Photo Ranking crowdsources a ranking of stock photos for<br />
themes from a Creative Commons-licensed image library.<br />
The Twitch task displays two images related to a theme<br />
(e.g., Nature Panorama) per unlock and asks the user to<br />
slide to select the one they prefer (Figure 1). Pairwise<br />
ranking is considered faster and more accurate than rating<br />
[17]. The application regularly updates with new photos.<br />
Users can optionally contribute new photos to the database<br />
by taking a photo instead of rating one. Contributed photos<br />
must be relevant to the day’s photo theme, such as Nature<br />
Panorama, Soccer, or Beautiful Trash. Contributing a photo<br />
takes longer than the average Twitch task, but provides an<br />
opportunity for motivated individuals to enter the<br />
competition and get their photos rated.<br />
Like with Census, users receive instant feedback through a<br />
popup message to display how many other users agreed<br />
with their selection. We envision a web interface where all<br />
uploaded images can be browsed, downloaded and ranked.<br />
This data can also connect to computer vision research by<br />
providing high-quality images of object categories and<br />
scenes to create better classifiers.<br />
<br />
=== Module 3: Structuring the Web === <br />
<br />
==== Problem/Limitations ==== <br />
Search engines no longer only return documents — they<br />
now aim to return direct answers [6,9]. However, despite<br />
massive undertakings such as the Google Knowledge Graph<br />
[36], Bing Satori [37] and Freebase [7], much of the<br />
knowledge on the web remains unstructured and unavailable for interactive applications. For example,<br />
searching for ‘Weird Al Yankovic born’ in a search engine<br />
such as Google returns a direct result ‘1959’ drawn from<br />
the knowledge base; however, searching for the equally<br />
relevant ‘Weird Al Yankovic first song’, ‘Weird Al<br />
Yankovic band members’, or ‘Weird Al Yankovic<br />
bestselling album’ returns a long string of documents but no<br />
direct answer, even though the answers are readily available<br />
on the performer’s Wikipedia page.<br />
<br />
==== Module preview ==== <br />
To enable direct answers, we need structured data that is<br />
computer-readable. While crowdsourced undertakings such<br />
as Freebase and dbPedia have captured much structured<br />
data, they tend to only acquire high-level information and<br />
do not have enough contributors to achieve significant<br />
depth on any single entity. Likewise, while information<br />
extraction systems such as ReVerb [14] automatically draw<br />
such information from the text of the Wikipedia page, their<br />
error rates are currently too high to trust. Crowdsourcing<br />
can help such systems identify errors to improve future<br />
accuracy [18]. Therefore, we apply twitch crowdsourcing to<br />
produce both structured data for interactive applications and<br />
training data for information extraction systems.<br />
<br />
==== Module details ====<br />
Contributors to online efforts are drawn to goals that allow<br />
them to exhibit their unique expertise [2]. Thus, we allow<br />
users to help create structured data for topics of interest.<br />
The user can specify any topic on Wikipedia that they are<br />
interested in or want to learn about, for example HCI, the<br />
Godfather films, or their local city. To do so within a one-to-two<br />
second time limit, we draw on mixed-initiative<br />
information extraction systems (e.g., [18]) and ask users to<br />
help vet automatic extractions.<br />
When a user unlocks his or her phone, Structuring the Web<br />
displays a high-confidence extraction generated using<br />
ReVerb, and its source statement from the selected<br />
Wikipedia page (Figure 1). The user indicates with one<br />
swipe whether the extraction is correct with respect to the<br />
statement. ReVerb produces an extraction in Subject-Relationship-Object<br />
format: for example, if the source<br />
statement is “Stanford University was founded in 1885 by<br />
Leland Stanford as a memorial to their son”, ReVerb<br />
returns {Stanford University}, {was founded in}, {1885}<br />
and Twitch displays this structure. To minimize cognitive<br />
load and time requirements, the application only includes<br />
short source sentences and uses color coding to<br />
match extractions with the source text.<br />
In Structuring the Web, the instant feedback upon accepting<br />
an extraction shows the user their progress growing a<br />
knowledge tree of verified facts (Figure 5). Rejecting an<br />
extraction instead scrolls the user down the article as far as<br />
their most recent extraction source, demonstrating the<br />
user’s progress in processing the article. In the future, we<br />
envision that search engines can utilize this data to answer a<br />
wider range of factual queries.<br />
<br />
== Methods (for task authorship write up) ==<br />
<br />
We're going to borrow methods section from this paper as an example: [[:Media:2015 eta (private).pdf | Cheng, J., Teevan, J. & Bernstein, M.S. (2015). Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015.]]. Please note how this section was divided into different parts. Please follow the same template.<br />
<br />
=== Study introduction === <br />
STUDY 1: ETA VS. OTHER MEASURES OF EFFORT<br />
We begin by comparing ETA and other measures of difficulty<br />
(including time and subjective difficulty) across a number of<br />
common crowdsourcing tasks. After describing the experimental<br />
setup, designed to elicit the necessary data to generate<br />
error-time curves and other measures for each task, we show<br />
how closely the different measures matched.<br />
<br />
=== Study method ===<br />
Method: Study 1 and all subsequent experiments reported in this paper<br />
were conducted using a proprietary microtasking platform<br />
that outsources crowd work to workers on the Clickworker<br />
microtask market. The platform interface is similar to that<br />
of Amazon Mechanical Turk; users upload HTML task files,<br />
workers choose from a marketplace listing of tasks, and data<br />
is collected in CSV files. We restricted workers to those residing<br />
in the United States. Across all studies, 470 unique workers<br />
completed over 44 thousand tasks. A follow-up survey<br />
revealed that approximately 66% were female. We replicated<br />
Study 1 on Amazon Mechanical Turk and found empirically<br />
similar results, so we only report results using Clickworker in<br />
this paper.<br />
<br />
=== Method specifics and details === <br />
Primitive Crowdsourcing Task Types<br />
We began by populating our evaluation with common<br />
crowdsourcing task types, or primitives, that frequently appear<br />
as microtasks or parts of microtasks. To do this, we<br />
looked at the types of tasks with the most available HITs<br />
on Amazon Mechanical Turk, at reports on common crowdsourcing<br />
task types [15], and at crowdsourcing systems described<br />
in the literature (e.g., [4]). After several iterations<br />
we identified a list of ten primitives that are present in most<br />
crowdsourcing workflows (Table 1, Figure 2). For example,<br />
the Find-Fix-Verify workflow [4] could be expressed using<br />
a combination of the FIND (identify sentences which need<br />
shortening), FIX (shortening these sentences), and BINARY<br />
primitives (verifying the shortening is an improvement). In<br />
many cases, the primitives themselves (or repetitions of the<br />
same primitive) make up the entire task, and map directly to<br />
common Mechanical Turk tasks (e.g., finding facts such as<br />
phone numbers about individuals (SEARCH)).<br />
We instantiated these primitives using a dataset of images of<br />
people performing different actions (e.g., waving, cooking)<br />
[34] and a corpus of translated Wikipedia articles selected because<br />
they tend to contain errors [1].<br />
<br />
=== Experimental Design for the study === <br />
We presented workers with a mixed series of tasks from the<br />
ten primitives and manipulated two factors: the time limit<br />
and the primitive. Each primitive had seven different possible<br />
time limits, and one untimed condition. The exact time limits<br />
were initialized using how long workers took when not under<br />
time pressure. The result was a sampled, not fully-crossed,<br />
design. For each worker we randomly selected five primitives<br />
for them to perform; for each primitive, three questions of that<br />
type were shown with each of the specified time limits. The<br />
images or text used in these questions were randomly sampled<br />
and shuffled for each worker. To minimize practice effects,<br />
workers completed three timed practice questions prior<br />
to seeing any of these conditions. The tasks were presented<br />
in randomized order, and within each primitive the time conditions<br />
were presented in randomized order. Workers were<br />
compensated $2.00 and repeat participation was disallowed.<br />
A single task was presented on each page, allowing us to<br />
record how long workers took to submit a response. Under<br />
timed conditions, a timer started as soon as the worker advanced<br />
to the next page. Input was disabled as soon as the<br />
timer expired, regardless of what the worker was doing (e.g.,<br />
typing, clicking). An example task is shown in Figure 3.<br />
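The sampled design above can be sketched as a scheduling routine. This is our own sketch: the primitive names beyond those mentioned in the text, and the concrete time-limit values, are placeholders (the real limits were initialized from untimed completion times).<br />

```python
import random

# FIND, FIX, BINARY and SEARCH appear in the text; the rest are placeholders.
PRIMITIVES = ["FIND", "FIX", "BINARY", "SEARCH",
              "P5", "P6", "P7", "P8", "P9", "P10"]
TIME_LIMITS = [2, 4, 8, 15, 30, 60, 120, None]  # placeholder values; None = untimed
QUESTIONS_PER_CONDITION = 3

def assign_conditions(rng=random):
    """One worker's schedule: five of the ten primitives, presented as
    blocks in random order; within each block, the eight time conditions
    in random order, with three questions per condition."""
    schedule = []
    for primitive in rng.sample(PRIMITIVES, 5):
        limits = list(TIME_LIMITS)
        rng.shuffle(limits)
        for limit in limits:
            schedule.extend((primitive, limit)
                            for _ in range(QUESTIONS_PER_CONDITION))
    return schedule
```

Each worker thus sees 5 × 8 × 3 = 120 questions, sampling the design rather than fully crossing it.<br />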
<br />
=== Measures from the study === <br />
The information we logged allowed us to calculate behavioral<br />
measures for each primitive:<br />
– ETA. The ETA is the area under the error-time curve.<br />
– Time@10. We also calculated the time it takes to achieve<br />
an error rate at the 10th percentile.<br />
– Error. We measured the error rate against ground truth<br />
for each primitive. If there were many possible correct<br />
responses, we manually judged responses while blind to<br />
condition. Automatically computing distance metrics (e.g.,<br />
edit distance) resulted in empirically similar findings.<br />
– Time. We measured how long workers took to complete the<br />
primitive without any time limit.<br />
After each task block was complete, we additionally asked<br />
workers to record several subjective reflections:<br />
– Estimated time. We asked workers to report how long they<br />
thought they spent on a primitive absent time pressure.<br />
Time estimation has previously been used as an implicit<br />
signal of task difficulty [5].<br />
– Relative subjective duration (RSD). RSD, a measure of<br />
how much task time is over- or underestimated [5], is obtained<br />
by dividing the difference between estimated and<br />
actual time spent by the actual time spent.<br />
– Task load index (TLX). The NASA TLX [10] is a validated<br />
metric of mental workload commonly used in human factors<br />
research to assess task performance. It consists of a<br />
survey that sums six subjective dimensions (e.g., mental<br />
demand).<br />
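The RSD definition above reduces to a one-line formula; a minimal sketch (the function name is ours):<br />

```python
def relative_subjective_duration(estimated, actual):
    """RSD = (estimated - actual) / actual.
    Positive values mean the worker overestimated time spent;
    negative values mean the worker underestimated it."""
    if actual <= 0:
        raise ValueError("actual time spent must be positive")
    return (estimated - actual) / actual
```

For example, a worker who reports 90 seconds for a task that actually took 60 seconds has an RSD of 0.5.<br />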
A separate experimental design that contained all ten primitives,<br />
where each worker completed three untimed practice<br />
questions followed by three untimed questions for each primitive<br />
(with the primitives presented in random order), was used<br />
to obtain the following measure:<br />
– Subjective rank. Workers considered all of the primitives<br />
they completed and ranked them in order of effort required.<br />
As rankings produce sharper distinctions than individual ratings<br />
[2], we consider subjective rank to represent our ground<br />
truth ranking of the primitives. However, rank would not be a<br />
deployable solution for requesters. Ranking means that workers<br />
would need to test the new task against at least log(n)<br />
of the primitives, incurring a large fixed overhead. Further,<br />
ranking is ordinal, and cannot quantify small changes in effort.<br />
In contrast, ETA is an absolute ranking, can measure<br />
small changes in effort, and only needs to be measured for<br />
the target task to compare it with other tasks.<br />
<br />
=== What do we want to analyze? ===<br />
Analysis<br />
60 workers completed Study 1, with 30 performing each<br />
primitive. We averaged our dependent measures across all<br />
30 workers, and compared the ranking of primitives induced<br />
by each measure to the average subjective ranking (subjective<br />
rank was obtained by having 40 other workers rank all<br />
ten primitives). We used the Kendall rank correlation coefficient<br />
to capture how closely each measure approximated the workers’ ranks, with Holm-corrected p-values calculated under<br />
the null hypothesis of no association. A rank correlation<br />
of 1 indicates perfect correlation; 0 indicates no correlation.<br />
Measures that capture the subjective ranking accurately can<br />
analyze new task types without comparing them against multiple<br />
benchmark tasks.</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=Winter_Milestone_4_Team_BPHC_:_Quality_of_Task_Authorship_Research_Proposal_(Science)&diff=16984Winter Milestone 4 Team BPHC : Quality of Task Authorship Research Proposal (Science)2016-02-07T18:55:52Z<p>Sreenihitmunakala: /* Part 2 */</p>
<hr />
<div>Quality of task authorship has emerged as a key issue in the field of crowdsourced work. Workers on crowdsourced platforms like Amazon Mechanical Turk are concerned with how well the requester communicates the task at hand. Oftentimes, workers complain that the Human Intelligence Tasks (HITs) posted by requesters are not clearly outlined: they can be vaguely worded or poorly designed. <br />
<br />
Unclear instructions may lead to the HIT being misunderstood by the worker and subsequent rejection of the submitted work by the requester. This is an unfair loss of time, money and approval ratings for the worker. Keeping this problem in mind, workers attempt to resolve any ambiguity about the task (e.g., through email on AMT) prior to beginning work. Using external forums like TurkerNation, workers oftentimes help each other out by identifying and sharing HITs posted by requesters known to offer well designed, quality tasks. Clearly, the quality of task authorship affects the quality of the work performed. <br />
<br />
== Research Focus: Factors Affecting Quality of Task Authorship ==<br />
<br />
In this proposal, we aim to delve deeper into the issue of quality of task authorship by trying to understand what factors make a task poorly designed: one that requires communication/feedback from the worker (which should ideally be unnecessary) or that ultimately leads to frequent rejection. <br />
<br />
Specifically, we want to determine if experienced requesters author tasks any differently from new requesters, and similarly if experienced workers perceive the quality of tasks any differently from new workers. We would also like to extend this experiment to the Boomerang System being implemented on the Daemo Platform. In the Boomerang system, in addition to experience, we would like to see the impact of reputation/ranking on the quality of task authorship.<br />
<br />
== Hypothesis ==<br />
<br />
=== Requesters ===<br />
<br />
Our expectation is that experienced requesters and new requesters are equally likely to author tasks of poor quality (i.e., both are equally likely to reject submissions or to receive considerable negative feedback from a group of workers). This prediction is based on the existence of an immense variety of HITs posted on platforms like AMT. <br />
<br />
=== Workers ===<br />
<br />
New workers, who stand to suffer a greater decline in approval ratings, may be more concerned with rejection and may tend to give feedback or communicate more frequently. We predict that new and experienced workers will respond similarly to poorly designed tasks, but that new workers will seek clarification, and may perceive a lack of clarity even in well designed tasks, owing to inexperience. <br />
<br />
== Experiment Design ==<br />
<br />
We have split our experiment into two parts: <br />
<br />
1. Studying how task quality is influenced by '''requester experience''' in task authorship and worker experience. <br />
<br />
2. Studying how '''new and experienced workers''' respond to poorly and well designed tasks.<br />
<br />
=== Part 1 ===<br />
<br />
[[File:Exp1f.png|600px|thumb|left|Part 1]]<br />
<br />
Three groups will be involved in this experiment.<br />
<br />
* There will be two groups of equal size: <br />
<br />
One of new requesters with little or no experience in posting HITs on crowd-sourcing platforms like Amazon Mechanical Turk (example: 0-6 months), and one of experienced requesters who have been using the platform for a considerable amount of time (example: greater than 1 year). Alternatively, we may classify requesters based on the number of HITs posted: for example, new requesters (<50 HITs posted) and experienced requesters (>150 HITs posted).<br />
<br />
The third group will be an equal mix of amateur and experienced workers. As experimenters, we will be aware of the experience of the worker, but to requesters posting the HITs, the workers appear as a randomized group. This enables us to generate additional information regarding the relationship between different pairs of groups, such as (experienced worker, experienced requester). <br />
<br />
* As experimenters, we will give the '''same task''' to be posted by both groups of requesters. For example, we will ask the requesters to post a HIT related to image annotation for a given dataset. Each requester, new or old, will post the HIT on the platform.<br />
<br />
* The group of workers will now work on the HITs posted by the two groups of requesters. We will give the workers the option to email the requester regarding any unclear/poorly designed HITs. <br />
<br />
* The '''number of emails''' received by each requester in each group as well as the number of HITs successfully completed and the number of '''rejected HITs''' will be counted for each requester. <br />
<br />
* Steps 2 - 4 will be repeated with different types of tasks. For example, we next assign a translation task to be posted by the requesters, then a multiple-choice survey, and so on. This is to account for the variety of (and hence different levels of difficulty in) the tasks encountered in an actual crowd-sourcing platform. <br />
<br />
* Using the data generated, we will try to establish the relationship between the experience of the requester and the quality of the HIT posted (measured by number of feedback emails and rejection rate of requester).<br />
<br />
=== Part 2 ===<br />
<br />
[[File:Exp2.png|600px|thumb|left|Part 2]]<br />
<br />
This experiment is designed to understand how the experience of workers affects their response to poorly and well designed tasks. We want to understand if experienced workers are less likely to seek clarity about tasks because they've become accustomed to poorly designed tasks or if such clarifications are sought by new and experienced workers alike. <br />
<br />
The experiment: <br />
<br />
* Identify a task that has been posted in a '''poorly designed''' and a '''well designed''' form. The task can be picked from old tasks posted on Amazon Mechanical Turk. A task that has received much negative feedback (from a source like Turkopticon or Reddit) is chosen as a poorly designed task, and a similar task by a requester with a high Turkopticon rating is chosen as the well designed task. There is a significant element of subjectivity involved, and the tasks must be chosen carefully. <br />
<br />
* Post the poorly designed task and the well designed task, open to a pool of workers consisting of an equal number of experienced and new workers. (The definition of new and experienced will be done as described in the previous experiment).<br />
<br />
* Record the number of emails sent to the requester from the workers, the number of tasks that have to be rejected (in the context of the experience of the worker).<br />
<br />
* '''Repeat this experiment for multiple types of tasks''' (roughly 5), to account for the possible differences in ambiguity/difficulty in different categories of tasks. For example, labeling an image requires a different skill from translation of text.<br />
<br />
Analysis:<br />
<br />
The goal of the experiment is to understand how much the experience of a worker matters with respect to quality of task design. To draw conclusions, we intend to analyze possible relationships between these aspects: for example, new workers may seek clarification for both well and poorly designed tasks, or experienced workers may react much more sharply to poorly designed tasks than newer workers do.<br />
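One way to formalize this comparison, assuming we tabulate counts of workers who did or did not email per group, is a chi-square test of independence on a 2x2 table. The test choice and the helper below are our own sketch, not part of the proposal:<br />

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table
    [[a, b], [c, d]], e.g. rows = new vs. experienced workers,
    columns = emailed vs. did not email about a poorly designed task.
    A large statistic suggests the response depends on worker experience."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("degenerate table: a row or column sums to zero")
    return n * (a * d - b * c) ** 2 / denom
```

The same table would be built separately for the well designed and poorly designed conditions, letting us see whether any experience effect is specific to poor task design.<br />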
<br />
=== Extension to Boomerang Platform ===<br />
<br />
Apart from running this study on existing crowd-sourcing platforms like AMT, we extend this activity to the Daemo platform. Here, the experiment can be customized to the Boomerang Ranking System by substituting rank for experience for both requesters and workers. The two parts of the experiment can be conducted with both prototype tasks and full-scale tasks, generating information on how feedback between different groups at the prototype stage improves quality of work when the full HIT is posted.<br />
<br />
== Interpreting the Results ==<br />
<br />
A correlation measure could be used to measure how closely feedback (emails sent) is related to the requester's experience. For example, the Pearson coefficient could be calculated between the number of emails sent and experience (measured in months or number of past HITs posted). As per our hypothesis, we expect a Pearson correlation coefficient close to zero (implying little or no correlation between the two data series).<br />
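A minimal sketch of this calculation (the data values here are hypothetical):<br />

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series,
    e.g. emails received per requester vs. months of experience."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: emails received by five requesters and their
# experience in months. A value near 0 would support our hypothesis.
emails = [12, 9, 11, 10, 13]
experience_months = [2, 18, 7, 24, 12]
r = pearson(emails, experience_months)
```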
<br />
This data can be useful to accurately determine what needs to be improved in the platform. For example, if it is revealed that newer workers often seek clarification and claim lack of clarity in a task when compared to experienced workers, the following situation may arise:<br />
<br />
A requester receives many emails and requests to clarify a task, which would suggest that the task is poorly designed. However, the reality may be that the task was attempted by many new workers, causing the flurry of emails; this does not necessarily indicate that the task design was of poor quality.<br />
<br />
The goal of the experiment is to be able to draw such insights, which will help improve the platform.<br />
<br />
== Milestone Contributors ==<br />
<br />
@aditya_nadimpalli , @sreenihit</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=Winter_Milestone_4_Team_BPHC_:_Quality_of_Task_Authorship_Research_Proposal_(Science)&diff=16983Winter Milestone 4 Team BPHC : Quality of Task Authorship Research Proposal (Science)2016-02-07T18:54:09Z<p>Sreenihitmunakala: /* Interpreting the Results */</p>
<hr />
<div>Quality of task authorship has emerged as a key issue in the field of crowdsourced work. Workers on crowdsourcing platforms like Amazon Mechanical Turk are concerned with how well the requester communicates the task at hand. Oftentimes, workers complain that the Human Intelligence Tasks (HITs) posted by requesters are not clearly outlined (they can be vaguely worded or poorly designed). <br />
<br />
Unclear instructions may lead to the HIT being misunderstood by the worker and subsequent rejection of the submitted work by the requester. This is an unfair loss of time, money and approval ratings for the worker. Keeping this problem in mind, workers attempt to resolve any ambiguity about the task (e.g., through email on AMT) prior to beginning work. Using external forums like TurkerNation, workers oftentimes help each other out by identifying and sharing HITs posted by requesters known to offer well designed, quality tasks. Clearly, the quality of task authorship affects the quality of the work performed. <br />
<br />
== Research Focus : Factors Affecting Quality of Task Authorship==<br />
<br />
In this proposal, we aim to delve deeper into the issue of quality of task authorship by trying to understand what factors make a task poorly designed: one that requires communication/feedback from the worker (which should ideally be avoidable), or one that ultimately leads to frequent rejection. <br />
<br />
Specifically, we want to determine if experienced requesters author tasks any differently from new requesters, and similarly if experienced workers perceive the quality of tasks any differently from new workers. We would also like to extend this experiment to the Boomerang System being implemented on the Daemo Platform. In the Boomerang system, in addition to experience, we would like to see the impact of reputation/ranking on the quality of task authorship.<br />
<br />
== Hypothesis ==<br />
<br />
=== Requesters ===<br />
<br />
Our expectation is that experienced requesters and new requesters are equally likely to author tasks of poor quality (i.e., both are equally likely to reject work or receive considerable negative feedback from a group of workers). This prediction is based on the existence of an immense variety of HITs posted on platforms like AMT. <br />
<br />
=== Workers ===<br />
<br />
New workers, who stand to suffer a greater decline in approval ratings, may be more concerned with rejection and may more frequently tend to give feedback or communicate. We predict that new and experienced workers will respond similarly to poorly designed tasks, and that new workers will seek clarification and may perceive a lack of clarity even in well designed tasks owing to inexperience. <br />
<br />
== Experiment Design ==<br />
<br />
We have split our experiment into two parts: <br />
<br />
1. Studying how task quality is influenced by '''requester experience''' in task authorship and worker experience. <br />
<br />
2. Studying how '''new and experienced workers''' respond to poorly and well designed tasks.<br />
<br />
=== Part 1 ===<br />
<br />
[[File:Exp1f.png|600px|thumb|left|Part 1]]<br />
<br />
Three groups will be involved in this experiment.<br />
<br />
* There will be two groups of equal size: <br />
<br />
One of new requesters with little or no experience posting HITs on crowd-sourcing platforms like Amazon Mechanical Turk (example: 0-6 months), and one of experienced requesters who have been using the platform for a considerable amount of time (example: greater than 1 year). Alternatively, we may choose to do this classification based on the number of HITs posted. For example, new requesters (<50 HITs posted) and experienced requesters (>150 HITs posted).<br />
<br />
The third group will be an equal mix of amateur and experienced workers. As experimenters, we will be aware of each worker's experience, but to the requesters posting the HITs, the workers appear as a randomized group. This enables us to generate additional information regarding the relationship between different pairs of groups, such as (experienced worker, experienced requester). <br />
<br />
* As experimenters, we will give the '''same task''' to be posted by both groups of requesters. For example, we will ask the requesters to post a HIT related to image annotation for a given dataset. Each requester, new or old, will post the HIT on the platform.<br />
<br />
* The group of workers will now work on the HITs posted by the two groups of requesters. We will give the workers the option to email the requester regarding any unclear/poorly designed HITs. <br />
<br />
* The '''number of emails''' received by each requester in each group as well as the number of HITs successfully completed and the number of '''rejected HITs''' will be counted for each requester. <br />
<br />
* Steps 2 - 4 will be repeated with different types of tasks. For example, we next assign a translation task to be posted by the requesters, then a multiple choice survey, and so on. This is to account for the variety of (and hence different levels of difficulty in) the tasks encountered on an actual crowd-sourcing platform. <br />
<br />
* Using the data generated, we will try to establish the relationship between the experience of the requester and the quality of the HIT posted (measured by number of feedback emails and rejection rate of requester).<br />
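The per-requester counts gathered above can then be aggregated per group. A short sketch of this tallying (the records and field names below are hypothetical placeholders, not the actual data format):<br />
<br />
```python
from collections import defaultdict

# Hypothetical per-requester records:
# (group, emails_received, hits_completed, hits_rejected)
records = [
    ("new", 9, 40, 10),
    ("new", 6, 45, 5),
    ("experienced", 8, 42, 8),
    ("experienced", 5, 47, 3),
]

# Sum each metric within each requester group.
totals = defaultdict(lambda: {"emails": 0, "completed": 0, "rejected": 0})
for group, n_emails, completed, rejected in records:
    totals[group]["emails"] += n_emails
    totals[group]["completed"] += completed
    totals[group]["rejected"] += rejected

# Report total feedback emails and rejection rate per group.
for group, t in totals.items():
    rejection_rate = t["rejected"] / (t["completed"] + t["rejected"])
    print(group, t["emails"], round(rejection_rate, 3))
```
<br />
Comparing these group-level email totals and rejection rates is what lets us relate requester experience to HIT quality in the final step.<br />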
<br />
=== Part 2 ===<br />
<br />
[[File:Exp2.png|600px|thumb|left|Part 1]]<br />
<br />
This experiment is designed to understand how the experience of workers affects their response to poorly and well designed tasks. We want to understand if experienced workers are less likely to seek clarity about tasks because they've become accustomed to poorly designed tasks or if such clarifications are sought by new and experienced workers alike. <br />
<br />
The experiment: <br />
<br />
* Identify a task that has been posted in a '''poorly designed''' and a '''well designed''' form. The task can be picked from old tasks posted on Amazon Mechanical Turk. A task that has received much negative feedback (from a source like Turkopticon or Reddit) is chosen as the poorly designed task, and a similar task by a requester with a high Turkopticon rating is chosen as the well designed task. There is a significant element of subjectivity involved, and the tasks must be chosen carefully. <br />
<br />
* Post the poorly designed task and the well designed task, open to a pool of workers consisting of an equal number of experienced and new workers. (The definition of new and experienced will be done as described in the previous experiment).<br />
<br />
* Record the number of emails sent to the requester by the workers and the number of tasks that have to be rejected, noting the experience level of the worker in each case.<br />
<br />
* Repeat this experiment for multiple types of tasks (roughly 5), to account for the possible differences in ambiguity/difficulty in different categories of tasks. For example, labeling an image requires a different skill from translation of text.<br />
<br />
Analysis:<br />
<br />
The goal of the experiment is to understand how much the experience of a worker matters with respect to the quality of task design. To draw conclusions about this, we intend to analyze any possible relationships between these aspects; for example, new workers may have difficulties or need to seek clarification for both well and poorly designed tasks, while experienced workers may react much more sharply to poorly designed tasks than newer workers.<br />
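One simple way to test whether new and experienced workers seek clarification at different rates is a chi-square test on a 2x2 contingency table. A sketch with invented counts (a statistics library such as scipy.stats.chi2_contingency would typically be used instead, and would also apply a continuity correction):<br />
<br />
```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: rows = worker group,
# columns = (sought clarification, did not seek clarification)
new_sought, new_not = 18, 32
exp_sought, exp_not = 15, 35

stat = chi_square_2x2(new_sought, new_not, exp_sought, exp_not)
# Compare against the chi-square critical value (3.84 at p = 0.05, 1 dof);
# a statistic below it means no significant difference between the groups.
print(round(stat, 3))
```
<br />
Running this comparison separately for the well designed and the poorly designed version of each task would show whether any worker-experience effect depends on task quality.<br />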
<br />
=== Extension to Boomerang Platform ===<br />
<br />
Apart from running this study on existing crowd-sourcing platforms like AMT, we extend this activity to the Daemo platform. Here, the experiment can be customized to the Boomerang Ranking System by substituting experience with the rank assigned to requesters and workers. The two parts of the experiment can be conducted with both Prototype tasks and full-scale tasks, generating information on how well feedback between the different groups in the Prototype stage improves the quality of work once the full HIT is posted.<br />
<br />
== Interpreting the Results ==<br />
<br />
A correlation measure could be used to determine how closely feedback (emails sent) is related to the experience of the requester. For example, the Pearson coefficient could be calculated between the number of emails sent and experience (measured in months or number of past HITs posted). As per our hypothesis, we expect a Pearson correlation coefficient close to zero (implying little or no correlation between the two data series).<br />
<br />
This data can be used to accurately determine what needs to be improved in the platform. For example, if it is revealed that newer workers seek clarification and claim lack of clarity in a task more often than experienced workers do, the following situation may arise:<br />
<br />
A requester receives many emails and requests to clarify a task, which would seem to indicate that the task is poorly designed. However, the reality may be that the task was attempted by many new workers, causing the flurry of emails, and may not necessarily indicate that the task design was of poor quality.<br />
<br />
The goal of the experiment is to be able to draw such insights, which will help improve the platform.<br />
<br />
== Milestone Contributors ==<br />
<br />
@aditya_nadimpalli , @sreenihit</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=Winter_Milestone_4_Team_BPHC_:_Quality_of_Task_Authorship_Research_Proposal_(Science)&diff=16973Winter Milestone 4 Team BPHC : Quality of Task Authorship Research Proposal (Science)2016-02-07T18:43:26Z<p>Sreenihitmunakala: /* Experiment Design */</p>
<hr />
<div>Quality of task authorship has emerged as a key issue in the field of crowd sourced work. Workers on crowd sourced platforms like Amazon Mechanical Turk are concerned with how well the requester communicates the task at hand. Often times, workers complain that the Human Intelligence Tasks (HITs) posted by requesters are not clearly outlined (can be vaguely worded or poorly designed). <br />
<br />
Unclear instructions may lead to the HIT being misunderstood by the worker and subsequent rejection of the submitted work by the requester. This is an unfair loss of time, money and approval ratings for the worker. Keeping this problem in mind, workers attempt to resolve any ambiguity about the task, (eg: through email on AMT) prior to beginning work. Using external forums like TurkerNation, workers oftentimes help each other out by identifying and sharing HITs posted by requesters known to offer well designed and quality tasks. Clearly, the quality of the task authorships affects the quality of the work performed. <br />
<br />
== Research Focus : Factors Effecting Quality of Task Authorship==<br />
<br />
In this proposal, we aim to delve deeper into the issue of quality of task authorship by trying to understand what factors constitute a poorly designed task that may require communication/feedback (that should ideally be avoided) from the worker, or tasks that ultimately lead to frequent rejection. <br />
<br />
Specifically, we want to determine if experienced requesters author tasks any differently from new requesters, and similarly if experienced workers perceive the quality of tasks any differently from new workers. We would also like to extend this experiment to the Boomerang System being implemented on the Daemo Platform. In the Boomerang system, in addition to experience, we would like to see the impact of reputation/ranking on the quality of task authorship.<br />
<br />
== Hypothesis ==<br />
<br />
=== Requesters ===<br />
<br />
Our expectation is that experienced requesters and new requesters are equally likely to author tasks of poor quality (i.e, both are equally likely to reject or receive considerable negative feedback from a group of workers). This prediction is based on the existence of an immense variety of HITs posted on platforms like AMT. <br />
<br />
=== Workers ===<br />
<br />
New workers, who stand to suffer a greater decline in approval ratings may be more concerned with rejection, and may more frequently tend to give feedback or communicate. We predict that new and experienced workers will respond similarly to poorly designed tasks, and that new workers will seek clarification and may perceive lack of clarity even in well designed tasks owing to inexperience. <br />
<br />
== Experiment Design ==<br />
<br />
We have split our experiment into two parts: <br />
<br />
1. Studying how task quality is influenced by '''requester experience''' in task authorship and worker experience. <br />
<br />
2. Studying how '''new and experienced workers''' respond to poorly and well designed tasks.<br />
<br />
=== Part 1 ===<br />
<br />
[[File:Exp1f.png|600px|thumb|left|Part 1]]<br />
<br />
Three groups will be involved in this experiment.<br />
<br />
* There will be two groups of equal size: <br />
<br />
One of new requesters with little or no experience in postings HITs on crowd-sourcing platforms like Amazon Mechanical Turk (example: 0-6 months), and one of experienced requesters who have been using the platform for considerable amount of time (example: greater than 1 year). Alternatively, we may choose to do this classification based on the number of HITs posted. For example, new requesters (<50 HITs posted) and old experienced requesters (>150 HITs posted).<br />
<br />
The third group will be an equal mix of amateur and experienced workers. As experimenters, we will be aware of the experience of the worker, but to requesters posting the HITs, the workers appear as a randomized group. This enables us to generate additional information regarding the relationship between different pairs of group such as (experienced worker, experienced requester). <br />
<br />
* As experimenters, we will give the '''same task''' to be posted by both groups of requesters. For example, we will ask the requesters to post a HIT related to image annotation for a given dataset. Each requester, new or old, will post the HIT on the platform.<br />
<br />
* The group of workers will now work on the HITs posted by the two groups of workers. We will give the workers the option to email the requester regarding any unclear/poorly designed HITs. <br />
<br />
* The '''number of emails''' received by each requester in each group as well as the number of HITs successfully completed and the number of '''rejected HITs''' will be counted for each requester. <br />
<br />
* Steps 2 - 4 will be repeated with different types of tasks.For example, we next assign a translation task to be posted by the requesters, then a multiple choice survey and so on. This is to account for the variety of (and hence different levels of difficulty in) the tasks encountered in an actual crowd-sourcing platform. <br />
<br />
* Using the data generated, we will try to establish the relationship between the experience of the requester and the quality of the HIT posted (measured by number of feedback emails and rejection rate of requester).<br />
<br />
=== Part 2 ===<br />
<br />
[[File:Exp2.png|600px|thumb|left|Part 1]]<br />
<br />
This experiment is designed to understand how the experience of workers affects their response to poorly and well designed tasks. We want to understand if experienced workers are less likely to seek clarity about tasks because they've become accustomed to poorly designed tasks or if such clarifications are sought by new and experienced workers alike. <br />
<br />
The experiment: <br />
<br />
* Identify a task that has been posted in a '''poorly designed''' and a '''well designed''' form. The task can be picked from old tasks posted on Amazon Mechanical Turk. A task that has received much negative feedback (from a source like Turkopticon or Reddit) is chosen as a poorly designed task and a similar task by a requester with a high Turkopticon rating is chosen is the well designed task. There is a significant element of subjectivity involved and the tasks must be chosen carefully. <br />
<br />
* Post the poorly designed task and the well designed task, open to a pool of workers consisting of an equal number of experienced and new workers. (The definition of new and experienced will be done as described in the previous experiment).<br />
<br />
* Record the number of emails sent to the requester from the workers, the number of tasks that have to be rejected (in the context of the experience of the worker).<br />
<br />
* Repeat this experiment for multiple types of tasks (roughly 5), to account for the possible differences in ambiguity/difficulty in different categories of tasks. For example, labeling an image requires a different skill from translation of text.<br />
<br />
Analysis:<br />
<br />
The goal of the experiment is to understand how much the experience of a worker matters with respect to quality of task design. To draw conclusions about this, we intend to analyze any possible relationship between these aspects, for example, new workers may have issues or will have to seek clarification for both well and poorly designed tasks or experienced workers may react much more sharply to poorly designed tasks in comparison to newer workers.<br />
<br />
=== Extension to Boomerang Platform ===<br />
<br />
Apart from running this study on existing crowd-sourcing platforms like AMT, we extend this activity to the Daemo platform. Here, the experiment can customized to the Boomerang Ranking System by substituting experience with the rank given to requesters and workers. The two parts of the experiment can be conducted with both Prototype tasks and full scale tasks and generate information to show how well feedback between different groups in the Prototype stage improves quality of work when the full HIT is posted.<br />
<br />
== Interpreting the Results ==<br />
<br />
A correlation measure could be used to measure how closely feedback (emails send) is related to experience of the requester. For example, the Pearson coefficient could be calculated between the number of emails sent and experience (measured in months or number of past HITs posted). As per our hypothesis, we expect a Pearson correlation coefficient close to zero (implying little or no correlation between the two data series).<br />
<br />
== Milestone Contributors ==<br />
<br />
@aditya_nadimpalli , @sreenihit</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=Winter_Milestone_4_Team_BPHC_:_Quality_of_Task_Authorship_Research_Proposal_(Science)&diff=16967Winter Milestone 4 Team BPHC : Quality of Task Authorship Research Proposal (Science)2016-02-07T18:40:11Z<p>Sreenihitmunakala: /* Part 2 */</p>
<hr />
<div>Quality of task authorship has emerged as a key issue in the field of crowd sourced work. Workers on crowd sourced platforms like Amazon Mechanical Turk are concerned with how well the requester communicates the task at hand. Often times, workers complain that the Human Intelligence Tasks (HITs) posted by requesters are not clearly outlined (can be vaguely worded or poorly designed). <br />
<br />
Unclear instructions may lead to the HIT being misunderstood by the worker and subsequent rejection of the submitted work by the requester. This is an unfair loss of time, money and approval ratings for the worker. Keeping this problem in mind, workers attempt to resolve any ambiguity about the task, (eg: through email on AMT) prior to beginning work. Using external forums like TurkerNation, workers oftentimes help each other out by identifying and sharing HITs posted by requesters known to offer well designed and quality tasks. Clearly, the quality of the task authorships affects the quality of the work performed. <br />
<br />
== Research Focus : Factors Effecting Quality of Task Authorship==<br />
<br />
In this proposal, we aim to delve deeper into the issue of quality of task authorship by trying to understand what factors constitute a poorly designed task that may require communication/feedback (that should ideally be avoided) from the worker, or tasks that ultimately lead to frequent rejection. <br />
<br />
Specifically, we want to determine if experienced requesters author tasks any differently from new requesters, and similarly if experienced workers perceive the quality of tasks any differently from new workers. We would also like to extend this experiment to the Boomerang System being implemented on the Daemo Platform. In the Boomerang system, in addition to experience, we would like to see the impact of reputation/ranking on the quality of task authorship.<br />
<br />
== Hypothesis ==<br />
<br />
=== Requesters ===<br />
<br />
Our expectation is that experienced requesters and new requesters are equally likely to author tasks of poor quality (i.e, both are equally likely to reject or receive considerable negative feedback from a group of workers). This prediction is based on the existence of an immense variety of HITs posted on platforms like AMT. <br />
<br />
=== Workers ===<br />
<br />
New workers, who stand to suffer a greater decline in approval ratings may be more concerned with rejection, and may more frequently tend to give feedback or communicate. We predict that new and experienced workers will respond similarly to poorly designed tasks, and that new workers will seek clarification and may perceive lack of clarity even in well designed tasks owing to inexperience. <br />
<br />
== Experiment Design ==<br />
<br />
We have split our experiment into two parts: <br />
<br />
1. Studying how task quality is influenced by requester experience in task authorship and worker experience. <br />
<br />
2. Studying how new and experienced workers respond to poorly and well designed tasks.<br />
<br />
=== Part 1 ===<br />
<br />
[[File:Exp1f.png|600px|thumb|left|Part 1]]<br />
<br />
Three groups will be involved in this experiment.<br />
<br />
* There will be two groups of equal size: <br />
<br />
One of new requesters with little or no experience in postings HITs on crowd-sourcing platforms like Amazon Mechanical Turk (example: 0-6 months), and one of experienced requesters who have been using the platform for considerable amount of time (example: greater than 1 year). Alternatively, we may choose to do this classification based on the number of HITs posted. For example, new requesters (<50 HITs posted) and old experienced requesters (>150 HITs posted).<br />
<br />
The third group will be an equal mix of amateur and experienced workers. As experimenters, we will be aware of the experience of the worker, but to requesters posting the HITs, the workers appear as a randomized group. This enables us to generate additional information regarding the relationship between different pairs of group such as (experienced worker, experienced requester). <br />
<br />
* As experimenters, we will give the same task to be posted by both groups of requesters. For example, we will ask the requesters to post a HIT related to image annotation for a given dataset. Each requester, new or old, will post the HIT on the platform.<br />
<br />
* The group of workers will now work on the HITs posted by the two groups of workers. We will give the workers the option to email the requester regarding any unclear/poorly designed HITs. <br />
<br />
* The number of emails received by each requester in each group as well as the number of HITs successfully completed and the number of rejected HITs will be counted for each requester. <br />
<br />
* Steps 2 - 4 will be repeated with different types of tasks.For example, we next assign a translation task to be posted by the requesters, then a multiple choice survey and so on. This is to account for the variety of (and hence different levels of difficulty in) the tasks encountered in an actual crowd-sourcing platform. <br />
<br />
* Using the data generated, we will try to establish the relationship between the experience of the requester and the quality of the HIT posted (measured by number of feedback emails and rejection rate of requester).<br />
<br />
=== Part 2 ===<br />
<br />
[[File:Exp2.png|600px|thumb|left|Part 1]]<br />
<br />
This experiment is designed to understand how the experience of workers affects their response to poorly and well designed tasks. We want to understand if experienced workers are less likely to seek clarity about tasks because they've become accustomed to poorly designed tasks or if such clarifications are sought by new and experienced workers alike. <br />
<br />
The experiment: <br />
<br />
* Identify a task that has been posted in a '''poorly designed''' and a '''well designed''' form. The task can be picked from old tasks posted on Amazon Mechanical Turk. A task that has received much negative feedback (from a source like Turkopticon or Reddit) is chosen as a poorly designed task and a similar task by a requester with a high Turkopticon rating is chosen is the well designed task. There is a significant element of subjectivity involved and the tasks must be chosen carefully. <br />
<br />
* Post the poorly designed task and the well designed task, open to a pool of workers consisting of an equal number of experienced and new workers. (The definition of new and experienced will be done as described in the previous experiment).<br />
<br />
* Record the number of emails sent to the requester from the workers, the number of tasks that have to be rejected (in the context of the experience of the worker).<br />
<br />
* Repeat this experiment for multiple types of tasks (roughly 5), to account for the possible differences in ambiguity/difficulty in different categories of tasks. For example, labeling an image requires a different skill from translation of text.<br />
<br />
Analysis:<br />
<br />
The goal of the experiment is to understand how much the experience of a worker matters with respect to quality of task design. To draw conclusions about this, we intend to analyze any possible relationship between these aspects, for example, new workers may have issues or will have to seek clarification for both well and poorly designed tasks or experienced workers may react much more sharply to poorly designed tasks in comparison to newer workers.<br />
<br />
=== Extension to Boomerang Platform ===<br />
<br />
Apart from running this study on existing crowd-sourcing platforms like AMT, we extend this activity to the Daemo plaform. Here, the experiment can customized to the Boomerang Ranking System by substituting experience with the rank given to requesters and workers. The two parts of the experiment can be conducted with both Prototype tasks and full scale tasks and generate information to show how well feedback between different groups in the Prototype stage improves quality of work when the full HIT is posted. <br />
<br />
== Interpreting the Results ==<br />
<br />
A correlation measure could be used to measure how closely feedback (emails send) is related to experience of the requester. For example, the Pearson coefficient could be calculated between the number of emails sent and experience (measured in months or number of past HITs posted). As per our hypothesis, we expect a Pearson correlation coefficient close to zero (implying little or no correlation between the two data series).<br />
<br />
== Milestone Contributors ==<br />
<br />
@aditya_nadimpalli , @sreenihit</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=Winter_Milestone_4_Team_BPHC_:_Quality_of_Task_Authorship_Research_Proposal_(Science)&diff=16943Winter Milestone 4 Team BPHC : Quality of Task Authorship Research Proposal (Science)2016-02-07T17:37:42Z<p>Sreenihitmunakala: /* Part 2 */</p>
<hr />
<div>Quality of task authorship has emerged as a key issue in the field of crowd sourced work. Workers on crowd sourced platforms like Amazon Mechanical Turk are concerned with how well the requester communicates the task at hand. Often times, workers complain that the Human Intelligence Tasks (HITs) posted by requesters are not clearly outlined/(can be vaguely worded or poorly designed. <br />
<br />
Unclear instructions may lead to the HIT being misunderstood by the worker and subsequent rejection of the submitted work by the requester. This is an unfair loss of time, money and approval ratings for the worker. Keeping this problem in mind, workers attempt to resolve any ambiguity about the task, (eg: through email on AMT) prior to beginning work. Using external forums like TurkerNation, workers oftentimes help each other out by identifying and sharing HITs posted by requesters known to offer well designed and quality tasks. Clearly, the quality of the task authorships affects the quality of the work performed. <br />
<br />
== Research Focus : Factors Effecting Quality of Task Authorship==<br />
<br />
In this proposal, we aim to delve deeper into the issue of quality of task authorship by trying to understand what factors constitute a poorly designed task that may require communication/feedback (that should ideally be avoided) from the worker, or tasks that ultimately lead to frequent rejection. <br />
<br />
Specifically, we want to determine if experienced requesters author tasks any differently from new requesters, and similarly if experienced workers perceive the quality of tasks any differently from new workers. We would also like to extend this experiment to the Boomerang System being implemented on the Daemo Platform. In the Boomerang system, in addition to experience, we would like to see the impact of reputation/ranking on the quality of task authorship.<br />
<br />
== Hypothesis ==<br />
<br />
=== Requesters ===<br />
<br />
Our expectation is that experienced requesters and new requesters are equally likely to author tasks of poor quality (i.e, both are equally likely to reject or receive considerable negative feedback from a group of workers). This prediction is based on the existence of an immense variety of HITs posted on platforms like AMT. <br />
<br />
=== Workers ===<br />
<br />
New workers, who stand to suffer a greater decline in approval ratings may be more concerned with rejection, and may more frequently tend to give feedback or communicate. We predict that new and experienced workers will respond similarly to poorly designed tasks, and that new workers will seek clarification and may perceive lack of clarity even in well designed tasks owing to inexperience. <br />
<br />
== Experiment Design ==<br />
<br />
We have split our experiment into two parts: <br />
<br />
1. Studying how task quality is influenced by requester experience in task authorship and worker experience. <br />
<br />
2. Studying how new and experienced workers respond to poorly and well designed tasks.<br />
<br />
=== Part 1 ===<br />
<br />
[[File:Exp1.png|600px|thumb|left|Part 1]]<br />
<br />
Three groups will be involved in this experiment.<br />
<br />
* There will be two groups of equal size: <br />
<br />
One of new requesters with little or no experience in postings HITs on crowd-sourcing platforms like Amazon Mechanical Turk (example: 0-6 months), and one of experienced requesters who have been using the platform for considerable amount of time (example: greater than 1 year). Alternatively, we may choose to do this classification based on the number of HITs posted. For example, new requesters (<50 HITs posted) and old experienced requesters (>150 HITs posted).<br />
<br />
The third group will be a random mix of amateur and experienced workers.<br />
<br />
* As experimenters, we will give the same task to be posted by both groups of requesters. For example, we will ask the requesters to post a HIT related to image annotation for a given dataset. Each requester, new or old, will post the HIT on the platform.<br />
<br />
* The group of workers will now work on the HITs posted by the two groups of requesters. We will give the workers the option to email the requester regarding any unclear/poorly designed HITs. <br />
<br />
* The number of emails received by each requester in each group as well as the number of HITs successfully completed and the number of rejected HITs will be counted for each requester. <br />
<br />
* Steps 2 - 4 will be repeated with different types of tasks. For example, we next assign a translation task to be posted by the requesters, then a multiple choice survey and so on. This is to account for the variety of (and hence different levels of difficulty in) the tasks encountered in an actual crowd-sourcing platform. <br />
<br />
* Using the data generated, we will try to establish the relationship between the experience of the requester and the quality of the HIT posted (measured by number of feedback emails and rejection rate of requester).<br />
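The variance attribution described above can be sketched as a one-way analysis of variance over per-requester feedback counts; all numbers below are illustrative placeholders, not collected data:<br />

```python
import statistics

# Hypothetical feedback-email counts per requester, grouped by experience
# (all values are illustrative placeholders, not collected data)
groups = {
    "new": [12, 9, 15, 11, 8, 14],
    "experienced": [10, 13, 9, 12, 11, 10],
}

all_counts = [c for g in groups.values() for c in g]
grand_mean = statistics.mean(all_counts)

# One-way ANOVA by hand: between-group vs. within-group sums of squares
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                 for g in groups.values())
ss_within = sum((c - statistics.mean(g)) ** 2
                for g in groups.values() for c in g)

# Eta squared: share of variance attributable to the experience grouping
eta_squared = ss_between / (ss_between + ss_within)
print(f"Variance explained by requester experience: {eta_squared:.1%}")
```

A small eta squared would indicate that requester experience explains little of the variation in feedback volume, consistent with our hypothesis.<br />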
<br />
=== Part 2 ===<br />
<br />
This experiment is designed to understand how the experience of workers affects their response to poorly and well designed tasks. We want to understand if experienced workers are less likely to seek clarity about tasks because they've become accustomed to poorly designed tasks or if such clarifications are sought by new and experienced workers alike. <br />
<br />
The experiment: <br />
<br />
* Identify a task that has been posted in a poorly designed and a well designed form. The task can be picked from old tasks posted on Amazon Mechanical Turk.<br />
<br />
* Post the poorly designed task and the well designed task, open to a pool of workers consisting of an equal number of experienced and new workers. (The definition of new and experienced will be done as described in the previous experiment).<br />
<br />
* Record the number of emails sent to the requesters from the workers, the number of tasks that have to be rejected (in the context of the experience of the worker).<br />
<br />
* Repeat this experiment for multiple types of tasks (roughly 5), to account for the possible differences in ambiguity/difficulty in different categories of tasks. For example, labeling an image requires a different skill from translation of text.<br />
<br />
Analysis:<br />
<br />
The goal of the experiment is to understand how much the experience of a worker matters with respect to the quality of task design. To draw conclusions, we intend to analyze any possible relationship between these aspects: for example, whether new workers have issues or seek clarification for both well and poorly designed tasks, or whether experienced workers react much more sharply to poorly designed tasks than newer workers do.<br />
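Once counts are collected, this comparison could be tabulated per cell of worker experience and task design; a minimal sketch with placeholder numbers:<br />

```python
# Hypothetical clarification-email counts per (worker experience, task design)
# cell; all numbers are placeholders for data the experiment would collect
emails = {
    ("new", "well designed"): 14,
    ("new", "poorly designed"): 31,
    ("experienced", "well designed"): 4,
    ("experienced", "poorly designed"): 22,
}
workers_per_cell = 50  # equal pool sizes, as in the design above

for (experience, design), count in emails.items():
    rate = count / workers_per_cell
    print(f"{experience} workers, {design} task: {rate:.0%} sought clarification")
```

Comparing the per-cell rates would show whether the gap between well and poorly designed tasks differs across worker experience levels.<br />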
<br />
=== Extension to Boomerang Platform ===<br />
<br />
== Interpreting the Results ==<br />
<br />
A correlation measure could be used to quantify how closely feedback (emails sent) is related to the experience of the requester. For example, the Pearson coefficient could be calculated between the number of emails sent and experience (measured in months or number of past HITs posted). As per our hypothesis, we expect a Pearson correlation coefficient close to zero (implying little or no correlation between the two data series).<br />
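A minimal sketch of this calculation; the two series below are invented for illustration:<br />

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented data: requester experience (months) vs. feedback emails received
experience_months = [2, 5, 8, 14, 20, 30]
feedback_emails = [11, 14, 9, 13, 10, 12]

r = pearson(experience_months, feedback_emails)
print(f"Pearson r = {r:.2f}")  # a value near zero would support our hypothesis
```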
<br />
== Milestone Contributors ==<br />
<br />
@aditya_nadimpalli , @sreenihit</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_3_BPHC_RepresentationIdea_:_Community_Based_Third-Party_Arbitrators_For_Conflict_Resolution&diff=15828WinterMilestone 3 BPHC RepresentationIdea : Community Based Third-Party Arbitrators For Conflict Resolution2016-01-31T17:46:41Z<p>Sreenihitmunakala: /* Milestone Contributors */</p>
<hr />
<div>'''Our representation idea is a system to resolve issues between requesters and workers using community elected arbitrators.''' <br />
<br />
One area of concern is how issues and conflicts are resolved between requesters and workers. The most common issue is that of rejection of work by requesters. This can occur due to a requester's dissatisfaction or even honest mistakes. We propose a community-based third-party arbitration solution.<br />
<br />
== Solution ==<br />
<br />
Our solution to this problem is a system where conflicts are resolved by third party members. This system takes ideas from the worker panel discussion. The system would work as follows:<br />
<br />
Workers and Requesters pick third-party persons to serve as arbitrators in an instance of conflict between the requester and workers. Criteria are set for eligibility as an arbitrator, on the basis of experience with the platform, level of participation in the community, reputation and other factors. Workers may vote to elect these arbitrators, and a monetary/reputational incentive may be offered to workers to stand for the position.<br />
<br />
When a task is to be posted, requesters are shown a choice of judges to pick from, with their reputation and (more importantly) the number of tasks completed for them. '''Example''':<br />
<br />
'''Arbitrator 1'''<br />
Worked on '''10''' of your HITS <br />
You rated him ''''Good'''' for '''98'''% of those HITS.<br />
<br />
'''Arbitrator 2'''<br />
Worked on '''14''' of your HITS <br />
You rated him ''''Good'''' for '''88'''% of those HITS.<br />
<br />
The requester will be given the option to pick an arbitrator, sending the message to workers that work will not be arbitrarily rejected. This will give workers the confidence to take up the task. <br />
<br />
<br />
=== Mechanism ===<br />
<br />
The mechanism for conflict resolution is as follows:<br />
<br />
- Worker submits completed HITs.<br />
- Requester decides to reject the HITs, citing his/her reason.<br />
- Worker disagrees and submits for reconsideration. <br />
- Requester rejects HITs '''again'''.<br />
- Worker applies for arbitration.<br />
- Arbitrator makes his decision and sends it to the requester.<br />
Possibilities:<br />
1. Requester abides by decision.<br />
2. Requester disagrees.<br />
The fraction of arbitrator judgements that the requester has abided by '''is displayed on the task page'''. <br />
This serves as an incentive for requesters to respect judgements made by arbitrators, as a '''low ratio will dissuade workers'''.<br />
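The displayed ratio amounts to a per-requester tally of arbitration outcomes; a minimal sketch, with a hypothetical record format:<br />

```python
# Hypothetical arbitration records; the field names are illustrative
arbitrations = [
    {"requester": "R1", "abided": True},
    {"requester": "R1", "abided": True},
    {"requester": "R1", "abided": False},
    {"requester": "R2", "abided": True},
]

def compliance_ratio(requester, records):
    """Fraction of arbitrator judgements this requester abided by."""
    outcomes = [r["abided"] for r in records if r["requester"] == requester]
    return sum(outcomes) / len(outcomes) if outcomes else None

print(f"R1 compliance: {compliance_ratio('R1', arbitrations):.0%}")
```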
<br />
This system's value comes from the fact that the arbitrators will be mutually agreed upon. Requesters that want to attract the best workers can opt to pick an arbitrator, which will make the '''workers less likely to be wary of a requester''', a new requester especially, when it comes to spending time on a HIT.<br />
<br />
=== Limitations: ===<br />
<br />
Voting may not be the best system for identifying arbitrators. The incentive to become an arbitrator has to be good enough to overcome reservations workers may have, such as not willing to become involved in a dispute between workers and requesters, especially when the arbitrator may seek work from the requester at a later point.<br />
<br />
== Milestone Contributors ==<br />
<br />
@sreenihit, @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_3_BPHC_DarkHorseIdea:_Pure_Referral_Based_Structure&diff=15827WinterMilestone 3 BPHC DarkHorseIdea: Pure Referral Based Structure2016-01-31T17:46:26Z<p>Sreenihitmunakala: /* Milestone Contributors */</p>
<hr />
<div>'''Our dark horse proposal is a pure referral based system for worker assignment to requester's tasks.''' <br />
<br />
The assumption behind this system is that requesters often see poor quality of work, which is a waste of time and resources. There is a need to match the specific requirements of requesters with the workers who are capable of executing such tasks. <br />
<br />
== Platform Structure ==<br />
<br />
We propose a system where a requester initially selects a small number (maybe 3-5) of highly reputed workers. These workers, in addition to completing the task, will have the responsibility/privilege of referring or inviting other workers they deem fit to perform the task, who in turn will be allowed to invite a few more workers, and so on, extending to several levels. The number of people a worker can invite depends on his/her level in the structure. In this way, a group of workers will be assembled to work on the given task.<br />
<br />
[[File:stn.jpg]]<br />
<br />
=== Mechanism ===<br />
<br />
Worker '''A''' refers worker '''B'''. <br />
<br />
'''Scenario 1''':<br />
- Worker '''B''' is rated 'good' or highly as per the rating scheme employed.<br />
- Worker '''A''''s ranking increases proportional to '''A''''s position in the structure.<br />
- If worker '''A''' refers a sufficient number of good workers, Worker '''A''' receives monetary benefit and moves up a level in the structure.<br />
<br />
<br />
<br />
'''Scenario 2''':<br />
- Worker '''B''' is rated 'bad' or poorly as per the rating scheme employed.<br />
- Worker '''A''''s ranking decreases proportional to '''A''''s position in the structure.<br />
- If worker '''A''' refers a certain number of poor workers, Worker '''A''' moves down a level in the structure.<br />
<br />
- If worker '''B''' is consistently rated poorly, he/she is removed from the task.<br />
<br />
In this way, the system incentivizes every worker referral. By penalizing workers who refer poor workers and rewarding those who refer good workers, the structure helps ensure that requesters get high quality work.<br />
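The two scenarios above can be sketched as a simple update rule; the level weighting and promotion thresholds here are illustrative assumptions, since the text does not fix exact values:<br />

```python
def update_referrer(referrer, referred_is_good, base_delta=1.0):
    """Adjust a referrer's score in proportion to their level in the structure.

    Level 1 is the requester-picked top layer; deeper levels move less per
    referral. The weighting and thresholds are illustrative assumptions.
    """
    weight = base_delta / referrer["level"]
    referrer["score"] += weight if referred_is_good else -weight
    if referrer["score"] >= 3:            # enough good referrals: move up
        referrer["level"] = max(1, referrer["level"] - 1)
        referrer["score"] = 0
    elif referrer["score"] <= -3:         # enough poor referrals: move down
        referrer["level"] += 1
        referrer["score"] = 0

worker_a = {"level": 2, "score": 0}
update_referrer(worker_a, referred_is_good=True)
print(worker_a)
```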
<br />
This structure is also useful because it can allow requesters to assign responsibility to some of the workers. By picking the first layer of workers, requesters assign the responsibility of picking other workers to the workers themselves. This solves two problems:<br />
<br />
1. The issue of requesters being unable to handle the flood of queries, concerns and other communication from workers is handled <br />
by some workers.<br />
Each worker can handle questions and queries that arise from workers that he/she referred for the task. The worker in turn <br />
receives information from the worker who referred him/her. <br />
<br />
2. The first set of workers can assist the requester in designing a proper task, and can provide sufficient feedback to <br />
resolve any issues. <br />
The worker will refer another worker to this task only if he/she is satisfied with the task's design.<br />
<br />
=== Assumptions/Limitations: ===<br />
<br />
*The system also assumes that workers who excel at a task will be good judges of other people's ability to perform the same task. It also assumes that there is an excess of workers, which is usually the case.<br />
<br />
*The structure may limit tasks to people who can network with or seek out people who can pick them for the task. While it may restrict newcomers, it also allows workers with low reputation to participate in tasks that they normally wouldn't be eligible for, giving them a chance at redemption.<br />
<br />
*If the incentives are poorly designed, the emphasis may move away from the task and to the referrals instead.<br />
<br />
== Milestone Contributors ==<br />
<br />
@sreenihit, @aditya_nadimpalli</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_3_BPHC_ReputationIdea_:_Referral_Network_for_Workers&diff=15826WinterMilestone 3 BPHC ReputationIdea : Referral Network for Workers2016-01-31T17:46:01Z<p>Sreenihitmunakala: /* Milestone Contributors */</p>
<hr />
<div>One of the major needs identified in existing platforms for crowd-sourcing is the ability of the platform to effectively match capable workers with suitable jobs or HITs (Human Intelligence Tasks). The Stanford Crowd Research Team has thus far developed their Daemo Platform, which makes a significant step in meeting this requirement through the Boomerang Ranking System. <br />
<br />
We propose a solution consisting of three design aspects, that attempt to solve needs related to - '''worker reputation''', '''matching requester and worker requirements''' and the '''high bar for entry for newcomers'''.<br />
<br />
<br />
== '''Referral Network''' for Workers ==<br />
<br />
We propose a referral based system that deals with the following reputation and relevant work issues:<br />
<br />
* A problem with a system that prefers assigning tasks to workers with high reputation, is that '''new workers''' may find it difficult to find good HITs. Requesters would tend to allot jobs to the same set of highly rated users. Similarly, existing workers would tend to take up HITs from the same set of requesters with high ratings.<br />
<br />
* Workers rely on external forums (TurkerNation, Reddit, etc.) to find good HITs from reliable requesters. Clearly, any future crowd-sourcing platform should reduce this burden on the worker by allowing dedicated workers to easily share good HITs and help fellow workers find relevant work.<br />
<br />
=== Introduction ===<br />
<br />
The employee referral network in the corporate world is known to be a cost- and time-effective method of recruitment that produces high quality candidates. 92% of the participants in the Global Employee Referral Index 2013 Survey stated that referrals were a top source of recruitment. We propose adapting the referral system to our crowd-sourcing platform design. This would enable workers to recommend or “refer” other workers (existing or new) for good HITs.<br />
=== Implementation of a Referral Network for Crowdsourcing ===<br />
<br />
We illustrate the use of the referral system through the following example.<br />
<br />
1. Requester '''R''' has a HIT to post on the crowd-sourcing platform. '''R''' can view highly rated workers using Boomerang and make the HIT visible to them (or post the HIT publicly for all to see).<br />
<br />
2. Worker '''W''' notices the HIT posted by '''R''' and has the following options:<br />
<br />
* Accept the HIT.<br />
<br />
* Share the HIT with fellow workers. In the referral based scheme, this can be done in the following ways:<br />
<br />
::1. '''W''' refers workers he or she knows to R.<br />
<br />
::2. '''W''' can publicly offer to refer anyone who is interested in the HIT. This is similar to how referrals are shared across social networks like Facebook and Twitter. This referral sharing network can be integrated into the crowd-sourcing platform, removing the need for multiple external forums.<br />
<br />
::3. '''W''' receives a benefit for making a referral. <br />
<br />
The quality of referrals can be incentivized in a number of ways, such as including a '''measure of good referrals in the scores used to rank workers and requesters'''. A separate index could also be used for referrals, or requesters may reward '''good referrals with bonus payments'''.<br />
<br />
=== Benefits of Referral Network ===<br />
<br />
A HIT initially becomes visible to highly rated workers who have worked with that particular requester in the past. The referral network enables sharing of HITs with other (existing and new) workers who would otherwise have been the last in line to see the HIT. This is especially useful when a requester posts a large number of HITs, which can be shared quickly among workers.<br />
<br />
== Assisting Requesters in Finding the '''Right Workers''' at the '''Right Time''' ==<br />
<br />
An often cited challenge for workers is that high paying HITs are quickly completed, and if the worker is not available when HITs are posted, the worker may miss out on lucrative work. A worker panelist mentioned that instances such as skipping lunch to work on a high paying HIT are not uncommon. This issue is compounded by the fact that workers are located, and tasks posted, across all time zones. <br />
<br />
To resolve this issue, we propose a simple feature that lets a requester post work at the optimal time to target the requester's most highly rated workers:<br />
<br />
Although Boomerang attempts to resolve this by creating time-staggered access to work based on reputation, this may not be enough as the number of workers grows. <br />
<br />
Our system overcomes this by allowing workers to opt into a system where the platform keeps track of the time range when they are usually available and working on tasks. A requester is presented with a suggestion mentioning the optimal times to post the task on the platform and the availability percentage of workers the requester has rated highly. <br />
<br />
=== Example ===<br />
<br />
Suggestion: '''73'''% of workers you have rated as ''''Good'''' are usually available between '''9:00 am''' and '''3:00 pm'''. <br />
Would you like us to post the task for you in those hours?<br />
<br />
This has the benefit of letting requesters get the best workers (in addition to the reputation based access) and also allows highly rated workers to not miss out on lucrative work.<br />
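Computing such a suggestion from opt-in activity logs could look like the following sketch (worker IDs and active hours are invented):<br />

```python
from collections import Counter

# Hypothetical opt-in activity logs: worker id -> hours typically active
activity = {
    "w1": range(9, 15),
    "w2": range(10, 16),
    "w3": range(13, 20),
}

# Count how many 'Good'-rated workers are usually active at each hour
hour_counts = Counter(h for hours in activity.values() for h in hours)
best_hour, n_available = hour_counts.most_common(1)[0]
share = n_available / len(activity)
print(f"{share:.0%} of your 'Good' workers are usually active around {best_hour}:00")
```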
<br />
=== Limitations === <br />
<br />
This would require workers to allow the platform to log usage timings which is a potential privacy issue. In addition, if most workers are from a certain timezone, it could skew the tasks in favor of one country.<br />
<br />
== Reputation of '''Newcomers'''==<br />
<br />
Allowing newcomers to quickly join a system, where the rating assigned by a requester can determine if a worker gets work, is of prime importance. Boomerang's approach is to assign a new worker the global average rating, with a time-decay component that opens up tasks as time progresses. While this is better than a fixed value, it may still slow down a new worker's road to becoming a full-fledged worker. <br />
<br />
'''In addition to the referral scheme''' we suggested above, we propose a simple technique to allow a new worker's ability to be evaluated: <br />
<br />
- Similar to how many freelance work websites offer skill tests, a requester who has posted tasks multiple times <br />
will be asked to offer a snippet of an old completed task to be used as a test.<br />
- New workers seeking access to a task offered by that requester complete this sample task.<br />
- New worker's answers are correlated with answers given by a worker that was judged to be good by the requester.<br />
- If there is high correlation between the answers, the new worker may be allowed to perform the latest task.<br />
<br />
This mechanism can be extended to multiple requesters by making a new worker take a series of small task 'tests', which can be used to decide if a worker gets first access to a task, in lieu of a requester's rating. This would serve as a better solution to the problem of newcomers having no requester-assigned rating. <br />
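For categorical answers, the comparison step can be as simple as an agreement rate between the newcomer and a trusted worker on the sample task; a sketch with invented answers and an assumed threshold:<br />

```python
# Invented answers to a sample labeling task; the 0.8 threshold is an
# assumption for illustration, not a prescribed value
trusted_answers = ["cat", "dog", "cat", "bird", "dog"]
newcomer_answers = ["cat", "dog", "dog", "bird", "dog"]

matches = sum(a == b for a, b in zip(trusted_answers, newcomer_answers))
agreement = matches / len(trusted_answers)
grant_access = agreement >= 0.8
print(f"Agreement: {agreement:.0%}, access granted: {grant_access}")
```

A plain agreement rate stands in here for the correlation mentioned above, since it is the natural analogue for categorical labels.<br />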
<br />
=== Assumptions ===<br />
<br />
This will work only if the task is defined such that it is possible to evaluate the similarity of answers given by different people (e.g., image labeling, reading text in an image). This system will work best only if the requester posts similar work (Boomerang works on the assumption that a requester will prefer a worker he/she rated well earlier).<br />
<br />
== Milestone Contributors ==<br />
<br />
@aditya_nadimpalli , @sreenihit</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_3_BPHC_ReputationIdea_:_Referral_Network_for_Workers&diff=15821WinterMilestone 3 BPHC ReputationIdea : Referral Network for Workers2016-01-31T17:44:27Z<p>Sreenihitmunakala: /* Reputation of Newcomers */</p>
<hr />
<div>One of the major needs identified in existing platforms for crowd-sourcing is the ability of the platform to effectively match capable workers with suitable jobs or HITs (Human Intelligence Tasks). The Stanford Crowd Research Team has thus far developed their Daemo Platform, which makes a significant step in meeting this requirement through the Boomerang Ranking System. <br />
<br />
We propose a solution consisting of three design aspects, that attempt to solve needs related to - '''worker reputation''', '''matching requester and worker requirements''' and the '''high bar for entry for newcomers'''.<br />
<br />
<br />
== '''Referral Network''' for Workers ==<br />
<br />
We propose a referral based system that deals with the following reputation and relevant work issues:<br />
<br />
* A problem with a system that prefers assigning tasks to workers with high reputation, is that '''new workers''' may find it difficult to find good HITs. Requesters would tend to allot jobs to the same set of highly rated users. Similarly, existing workers would tend to take up HITs from the same set of requesters with high ratings.<br />
<br />
* Workers rely on external forums (TurkerNation, Reddit etc) to find good HITs from reliable requesters. Clearly any future crowd-sourcing platform should reduce this burden on the worker by allowing dedicated workers to easily share good HITs and help fellow workers to find relevant work .<br />
<br />
=== Introduction ===<br />
<br />
The employee referral network in the corporate world is known to be a cost and time effective method of recruitment that produces high quality candidates. 92% of the participants in the Global Employee Referral Index 2013 Survey stated that referrals were a top source of recruitment. We thought about adapting the referral system to our crowd sourcing platform design. This would enable workers to recommend or “refer” other workers (existing or new) for good HITs.<br />
=== Implementation of a Referral Network for Crowdsourcing ===<br />
<br />
We illustrate the use of the referral system through the following example.<br />
<br />
1. Requester '''R''' has a HIT to post on the crowd-sourcing platform. '''R''' can view highly rated workers using Boomerang and make the HIT visible to them (or post the HIT publicly for all to see).<br />
<br />
2. Worker '''W''' notices the HIT posted by '''R''' and has the following options:<br />
<br />
* Accept the HIT.<br />
<br />
* Share the HIT with fellow workers. In the referral based scheme, this can be done in the following ways:<br />
<br />
::1. '''W''' refers workers he or she knows to R.<br />
<br />
::2. '''W''' can publicly offer to refer anyone who is interested in HIT. This is similar to how referrals are shared across social networks like Facebook, Twitter. This referral sharing network can be integrated into the crowd-sourcing platform, removing the need for multiple external forums.<br />
<br />
::3. '''W''' recieves benefit for making a referral. <br />
<br />
The quality of referrals is incentivized in a number of ways, such as including a '''measure of good referrals in the scores used to rank workers and requesters'''. A separate index could also be used for referrals or requesters may reward '''good referrals with bonus payments'''.<br />
<br />
=== Benefits of Referral Network ===<br />
<br />
A HIT initially becomes visible to highly rated workers who have worked with that particular requester in the past. The referral network enables sharing of HITs with other (existing and new) workers who would otherwise have been the last in line to see the HIT. This is especially useful when a requester posts a large number of HITs, which can be shared quickly among workers.<br />
<br />
== Assisting Requesters in Finding the '''Right Workers''' at the '''Right Time''' ==<br />
<br />
An often cited challenge for workers is that high paying HITs are quickly completed, and if the worker is not available when HITs are posted, the worker may miss out on lucrative work. A worker panelist mentioned that instances such as, skipping lunch to work on a high paying HIT, are not uncommon. This issue is compounded by the fact that workers are available globally, and tasks posted globally. <br />
<br />
Boomerang attempts to mitigate this by staggering access to work based on reputation, but as the number of workers grows, this alone may not be enough. <br />
<br />
To resolve the issue, we therefore propose a simple feature that lets a requester post work at the optimal time to reach the requester's most highly rated workers:<br />
<br />
Our system lets workers opt in to having the platform track the time ranges in which they are usually available and working on tasks. The requester is then shown a suggestion listing the optimal times to post the task and the percentage of the requester's highly rated workers usually available in those hours. <br />
<br />
=== Example ===<br />
<br />
Suggestion: '''73'''% of workers you have rated as '''Good''' are usually available between '''9:00 am''' and '''3:00 pm'''. <br />
Would you like us to post the task for you in those hours?<br />
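<br />
The suggestion above could be computed from the opt-in activity logs roughly as follows (a sketch; the hourly log format and all names are assumptions):<br />

```python
def availability(active_hours, start, end):
    """Fraction of workers whose usual active hours overlap [start, end)."""
    window = set(range(start, end))
    active = sum(1 for hours in active_hours.values() if window & hours)
    return active / len(active_hours)

# usual active hours (0-23) of the workers this requester rated 'Good'
logs = {"w1": {9, 10, 11}, "w2": {13, 14}, "w3": {20, 21}, "w4": {10}}
print(round(100 * availability(logs, 9, 15)))  # 75
```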
<br />
This has the benefit of letting requesters reach the best workers (in addition to the reputation-based access) and also ensures that highly rated workers do not miss out on lucrative work.<br />
<br />
=== Limitations === <br />
<br />
This would require workers to allow the platform to log their usage timings, which is a potential privacy issue. In addition, if most workers are from one timezone, it could skew task postings in favor of a single country.<br />
<br />
== Reputation of Newcomers==<br />
<br />
Allowing newcomers to quickly join a system in which requester-assigned ratings determine whether a worker gets work is of prime importance. Boomerang's approach is to assign a new worker the global average rating, with a time-decay component that opens up tasks as time progresses. While this is better than a fixed value, it may still slow down a new worker's road to becoming a full-fledged worker. <br />
<br />
'''In addition to the referral scheme''' suggested above, we propose a simple technique for evaluating a new worker's ability: <br />
<br />
- Similar to the skill tests offered by many freelance work websites, a requester who has posted tasks multiple times is asked to offer a snippet of an old completed task to be used as a test.<br />
- New workers seeking access to a task offered by that requester complete this sample task.<br />
- The new worker's answers are correlated with the answers given by a worker the requester judged to be good.<br />
- If the correlation between the answers is high, the new worker may be allowed to perform the latest task.<br />
<br />
This mechanism can be extended to multiple requesters by having a new worker take a series of small task 'tests', which can be used to decide whether the worker gets first access to a task in lieu of a requester's rating. This would be a better solution to the problem of newcomers having no requester-assigned rating. <br />
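<br />
The screening steps above can be sketched as a simple agreement check (the 0.8 threshold and all names here are assumptions, not a tested calibration):<br />

```python
def agreement(new_answers, trusted_answers):
    """Fraction of sample-task answers on which the new worker matches a
    worker the requester previously judged to be good."""
    pairs = list(zip(new_answers, trusted_answers))
    return sum(a == b for a, b in pairs) / len(pairs)

def may_access(new_answers, trusted_answers, threshold=0.8):
    """Grant first access to the latest task when agreement is high."""
    return agreement(new_answers, trusted_answers) >= threshold

trusted = ["cat", "dog", "cat", "bird", "dog"]
print(may_access(["cat", "dog", "cat", "bird", "cat"], trusted))  # True (agreement 0.8)
```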
<br />
=== Assumptions ===<br />
<br />
This will work only if the task is defined such that it is possible to evaluate the similarity of answers given by different people (e.g. image labeling, reading text in an image). The system works best when the requester posts similar work (Boomerang assumes that a requester will prefer a worker he/she rated well earlier).<br />
<br />
== Milestone Contributors ==<br />
<br />
@adityanadimpalli , @sreenihit</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_3_BPHC_ReputationIdea_:_Referral_Network_for_Workers&diff=15817WinterMilestone 3 BPHC ReputationIdea : Referral Network for Workers2016-01-31T17:42:49Z<p>Sreenihitmunakala: /* Implementation of a Referral Network for Crowdsourcing */</p>
<hr />
<div>One of the major needs identified in existing platforms for crowd-sourcing is the ability of the platform to effectively match capable workers with suitable jobs or HITs (Human Intelligence Tasks). The Stanford Crowd Research Team has thus far developed their Daemo Platform, which makes a significant step in meeting this requirement through the Boomerang Ranking System. <br />
<br />
We propose a solution consisting of three design aspects that attempt to address needs related to '''worker reputation''', '''matching requester and worker requirements''' and the '''high bar of entry for newcomers'''.<br />
<br />
<br />
== Referral Network for Workers ==<br />
<br />
We propose a referral based system that deals with the following reputation and relevant work issues:<br />
<br />
* A problem with a system that prefers assigning tasks to workers with high reputation is that '''new workers''' may find it difficult to find good HITs. Requesters would tend to allot jobs to the same set of highly rated users. Similarly, existing workers would tend to take up HITs from the same set of highly rated requesters.<br />
<br />
* Workers rely on external forums (TurkerNation, Reddit, etc.) to find good HITs from reliable requesters. Any future crowd-sourcing platform should reduce this burden by allowing dedicated workers to easily share good HITs and help fellow workers find relevant work.<br />
<br />
=== Introduction ===<br />
<br />
The employee referral network in the corporate world is known to be a cost- and time-effective method of recruitment that produces high-quality candidates: 92% of the participants in the Global Employee Referral Index 2013 Survey stated that referrals were a top source of recruitment. We adapt the referral system to our crowdsourcing platform design, enabling workers to recommend or “refer” other workers (existing or new) for good HITs.<br />
<br />
=== Implementation of a Referral Network for Crowdsourcing ===<br />
<br />
We illustrate the use of the referral system through the following example.<br />
<br />
1. Requester '''R''' has a HIT to post on the crowd-sourcing platform. '''R''' can view highly rated workers using Boomerang and make the HIT visible to them (or post the HIT publicly for all to see).<br />
<br />
2. Worker '''W''' notices the HIT posted by '''R''' and has the following options:<br />
<br />
* Accept the HIT.<br />
<br />
* Share the HIT with fellow workers. In the referral based scheme, this can be done in the following ways:<br />
<br />
::1. '''W''' refers workers he or she knows to R.<br />
<br />
::2. '''W''' can publicly offer to refer anyone who is interested in the HIT. This is similar to how referrals are shared across social networks like Facebook and Twitter. This referral-sharing network can be integrated into the crowd-sourcing platform, removing the need for multiple external forums.<br />
<br />
::3. '''W''' receives a benefit for making a referral. <br />
<br />
The quality of referrals can be incentivized in a number of ways, such as including a '''measure of good referrals in the scores used to rank workers and requesters'''. A separate index could also be used for referrals, or requesters could reward '''good referrals with bonus payments'''.<br />
<br />
=== Benefits of Referral Network ===<br />
<br />
A HIT initially becomes visible to highly rated workers who have worked with that particular requester in the past. The referral network enables sharing of HITs with other (existing and new) workers who would otherwise have been the last in line to see the HIT. This is especially useful when a requester posts a large number of HITs, which can be shared quickly among workers.<br />
<br />
== Assisting Requesters in Finding the '''Right Workers''' at the '''Right Time''' ==<br />
<br />
An often-cited challenge for workers is that high-paying HITs are completed quickly, so a worker who is not available when a HIT is posted may miss out on lucrative work. A worker panelist mentioned that instances such as skipping lunch to work on a high-paying HIT are not uncommon. The issue is compounded by the fact that workers, and the tasks they work on, are spread across the globe. <br />
<br />
Boomerang attempts to mitigate this by staggering access to work based on reputation, but as the number of workers grows, this alone may not be enough. <br />
<br />
To resolve the issue, we therefore propose a simple feature that lets a requester post work at the optimal time to reach the requester's most highly rated workers:<br />
<br />
Our system lets workers opt in to having the platform track the time ranges in which they are usually available and working on tasks. The requester is then shown a suggestion listing the optimal times to post the task and the percentage of the requester's highly rated workers usually available in those hours. <br />
<br />
=== Example ===<br />
<br />
Suggestion: '''73'''% of workers you have rated as '''Good''' are usually available between '''9:00 am''' and '''3:00 pm'''. <br />
Would you like us to post the task for you in those hours?<br />
<br />
This has the benefit of letting requesters reach the best workers (in addition to the reputation-based access) and also ensures that highly rated workers do not miss out on lucrative work.<br />
<br />
=== Limitations === <br />
<br />
This would require workers to allow the platform to log their usage timings, which is a potential privacy issue. In addition, if most workers are from one timezone, it could skew task postings in favor of a single country.<br />
<br />
== Reputation of Newcomers==<br />
<br />
Allowing newcomers to quickly join a system in which requester-assigned ratings determine whether a worker gets work is of prime importance. Boomerang's approach is to assign a new worker the global average rating, with a time-decay component that opens up tasks as time progresses. While this is better than a fixed value, it may still slow down a new worker's road to becoming a full-fledged worker. <br />
<br />
'''In addition to the referral scheme''' suggested above, we propose a simple technique for evaluating a new worker's ability: <br />
<br />
- Similar to the skill tests offered by many freelance work websites, a requester who has posted tasks multiple times is asked to offer a snippet of an old completed task to be used as a test.<br />
- New workers seeking access to a task offered by that requester complete this sample task.<br />
- The new worker's answers are correlated with the answers given by a worker the requester judged to be good.<br />
- If the correlation between the answers is high, the new worker may be allowed to perform the latest task.<br />
<br />
This mechanism can be extended to multiple requesters by having a new worker take a series of small task 'tests', which can be used to decide whether the worker gets first access to a task in lieu of a requester's rating. This would be a better solution to the problem of newcomers having no requester-assigned rating. <br />
<br />
=== Assumptions ===<br />
<br />
This will work only if the task is defined such that it is possible to evaluate the similarity of answers given by different people (e.g. image labeling, reading text in an image). The system works best when the requester posts similar work (Boomerang assumes that a requester will prefer a worker he/she rated well earlier).<br />
<br />
== Milestone Contributors ==<br />
<br />
@adityanadimpalli , @sreenihit</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_3_BPHC_ReputationIdea_:_Referral_Network_for_Workers&diff=15816WinterMilestone 3 BPHC ReputationIdea : Referral Network for Workers2016-01-31T17:41:23Z<p>Sreenihitmunakala: /* Implementation of a Referral Network for Crowdsourcing */</p>
<hr />
<div>One of the major needs identified in existing platforms for crowd-sourcing is the ability of the platform to effectively match capable workers with suitable jobs or HITs (Human Intelligence Tasks). The Stanford Crowd Research Team has thus far developed their Daemo Platform, which makes a significant step in meeting this requirement through the Boomerang Ranking System. <br />
<br />
We propose a solution consisting of three design aspects, that attempt to solve needs related to - '''worker reputation''', '''matching requester and worker requirements''' and the '''high bar for entry for newcomers'''.<br />
<br />
<br />
== Referral Network for Workers ==<br />
<br />
We propose a referral based system that deals with the following reputation and relevant work issues:<br />
<br />
* A problem with a system that prefers assigning tasks to workers with high reputation, is that '''new workers''' may find it difficult to find good HITs. Requesters would tend to allot jobs to the same set of highly rated users. Similarly, existing workers would tend to take up HITs from the same set of requesters with high ratings.<br />
<br />
* Workers rely on external forums (TurkerNation, Reddit etc) to find good HITs from reliable requesters. Clearly any future crowd-sourcing platform should reduce this burden on the worker by allowing dedicated workers to easily share good HITs and help fellow workers to find relevant work .<br />
<br />
=== Introduction ===<br />
<br />
The employee referral network in the corporate world is known to be a cost and time effective method of recruitment that produces high quality candidates. 92% of the participants in the Global Employee Referral Index 2013 Survey stated that referrals were a top source of recruitment. We thought about adapting the referral system to our crowd sourcing platform design. This would enable workers to recommend or “refer” other workers (existing or new) for good HITs.<br />
=== Implementation of a Referral Network for Crowdsourcing ===<br />
<br />
We illustrate the use of the referral system through the following example.<br />
<br />
1. Requester '''R''' has a HIT to post on the crowd-sourcing platform. '''R''' can view highly rated workers using Boomerang and make the HIT visible to them (or post the HIT publicly for all to see).<br />
<br />
2. Worker '''W''' notices the HIT posted by '''R''' and has the following options:<br />
<br />
* Accept the HIT.<br />
<br />
* Share the HIT with fellow workers. In the referral based scheme, this can be done in the following ways:<br />
<br />
::1. '''W''' refers workers he or she knows to R.<br />
<br />
::2. '''W''' can publicly offer to refer anyone who is interested in HIT. This is similar to how referrals are shared across social networks like Facebook, Twitter. This referral sharing network can be integrated into the crowd-sourcing platform, removing the need for multiple external forums.<br />
<br />
::3. '''W''' recieves benefit for making a referral. <br />
<br />
The quality of referrals is incentivized in a number of ways, such as including a measure of good referrals in the scores used to rank workers and requesters. A separate index could also be used for referrals or requesters may reward good referrals with bonus payments.<br />
<br />
=== Benefits of Referral Network ===<br />
<br />
A HIT initially becomes visible to highly rated workers who have worked with that particular requester in the past. The referral network enables sharing of HITs with other (existing and new) workers who would otherwise have been the last in line to see the HIT. This is especially useful when a requester posts a large number of HITs, which can be shared quickly among workers.<br />
<br />
== Assisting Requesters in Finding the '''Right Workers''' at the '''Right Time''' ==<br />
<br />
An often cited challenge for workers is that high paying HITs are quickly completed, and if the worker is not available when HITs are posted, the worker may miss out on lucrative work. A worker panelist mentioned that instances such as, skipping lunch to work on a high paying HIT, are not uncommon. This issue is compounded by the fact that workers are available globally, and tasks posted globally. <br />
<br />
To resolve this issue, we propose a simple feature that lets a requester post work at the optimal time to target the requester's most highly rated workers:<br />
<br />
Although Boomerang attempts to resolve this issue by creating a time staggered access to work, based on reputation; as the number of workers grow, this may not be enough to resolve this issue. <br />
<br />
Our system overcomes this by allowing workers to opt into a system where the platform keeps track of the time range when they are usually available and working on tasks. A requester is presented with a suggestion mentioning the optimal times to post the task on the platform and the availability percentage of workers the requester has rated highly. <br />
<br />
=== Example ===<br />
<br />
Suggestion: '''73'''% of workers you have rated as ''''Good'''' are usually available between '''9:00 am''' and '''3:00 pm'''. <br />
Would you like us to post the task for you in those hours?<br />
<br />
This has the benefit of letting requesters get the best workers (in addition to the reputation based access) and also allows highly rated workers to not miss out of lucrative work.<br />
<br />
=== Limitations === <br />
<br />
This would require workers to allow the platform to log usage timings which is a potential privacy issue. In addition, if most workers are from a certain timezone, it could skew the tasks in favor of one country.<br />
<br />
== Reputation of Newcomers==<br />
<br />
Allowing newcomers to quickly join a system, where the rating assigned by a requester can determine if a worker gets work, is of prime importance. Boomerang's approach to this is to assign a new worker with the global average rating and a time decay component that opens up tasks as time progresses. While this is better than a fixed value, it still may slow down a new worker's road to becoming a full fledged worker. <br />
<br />
'''In addition to the referral scheme''' we suggested above, we propose a simple technique to allow a new worker's ability to be evaluated: <br />
<br />
- Similar to how many freelance work websites offer skill tests, a requester who has posted tasks multiple times <br />
will be asked to offer a snippet of an old completed task to be used as a test.<br />
- New workers seeking access to a task offered by that requester complete this sample task.<br />
- New worker's answers are correlated with answers given by a worker that was judged to be good by the requester.<br />
- If there is high correlation between the answers, the new worker may be allowed to perform the latest task.<br />
<br />
This mechanism can be extended to multiple requesters, by making a new worker take a series of small task 'tests', which can be used to decide if a worker gets first access to a task, in lieu of requester's rating. This would serve as a better solution to the problem of newcomers having no requester assigned rating. <br />
<br />
=== Assumptions ===<br />
<br />
This will work only if the task is defined such that it is possible to evaluate similarity of answers by different people (eg. image labeling, reading text in an image). This system will work best only if the requester posts similar work (boomerang works on the assumption that a requester will prefer a worker he/she rated well earlier).<br />
<br />
== Milestone Contributors ==<br />
<br />
@adityanadimpalli , @sreenihit</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_3_BPHC_ReputationIdea_:_Referral_Network_for_Workers&diff=15815WinterMilestone 3 BPHC ReputationIdea : Referral Network for Workers2016-01-31T17:40:55Z<p>Sreenihitmunakala: /* Implementation of a Referral Network for Crowdsourcing */</p>
<hr />
<div>One of the major needs identified in existing platforms for crowd-sourcing is the ability of the platform to effectively match capable workers with suitable jobs or HITs (Human Intelligence Tasks). The Stanford Crowd Research Team has thus far developed their Daemo Platform, which makes a significant step in meeting this requirement through the Boomerang Ranking System. <br />
<br />
We propose a solution consisting of three design aspects, that attempt to solve needs related to - '''worker reputation''', '''matching requester and worker requirements''' and the '''high bar for entry for newcomers'''.<br />
<br />
<br />
== Referral Network for Workers ==<br />
<br />
We propose a referral based system that deals with the following reputation and relevant work issues:<br />
<br />
* A problem with a system that prefers assigning tasks to workers with high reputation, is that '''new workers''' may find it difficult to find good HITs. Requesters would tend to allot jobs to the same set of highly rated users. Similarly, existing workers would tend to take up HITs from the same set of requesters with high ratings.<br />
<br />
* Workers rely on external forums (TurkerNation, Reddit etc) to find good HITs from reliable requesters. Clearly any future crowd-sourcing platform should reduce this burden on the worker by allowing dedicated workers to easily share good HITs and help fellow workers to find relevant work .<br />
<br />
=== Introduction ===<br />
<br />
The employee referral network in the corporate world is known to be a cost and time effective method of recruitment that produces high quality candidates. 92% of the participants in the Global Employee Referral Index 2013 Survey stated that referrals were a top source of recruitment. We thought about adapting the referral system to our crowd sourcing platform design. This would enable workers to recommend or “refer” other workers (existing or new) for good HITs.<br />
=== Implementation of a Referral Network for Crowdsourcing ===<br />
<br />
We illustrate the use of the referral system through the following example.<br />
<br />
1. Requester '''R''' has a HIT to post on the crowd-sourcing platform. R can view highly rated workers using Boomerang and make the HIT visible to them (or post the HIT publicly for all to see).<br />
<br />
2. Worker '''W''' notices the HIT posted by '''R''' and has the following options:<br />
<br />
* Accept the HIT.<br />
<br />
* Share the HIT with fellow workers. In the referral based scheme, this can be done in the following ways:<br />
<br />
::1. '''W''' refers workers he or she knows to R.<br />
<br />
::2. '''W''' can publicly offer to refer anyone who is interested in HIT. This is similar to how referrals are shared across social networks like Facebook, Twitter. This referral sharing network can be integrated into the crowd-sourcing platform, removing the need for multiple external forums.<br />
<br />
::3. '''W''' recieves benefit for making a referral. <br />
<br />
The quality of referrals is incentivized in a number of ways, such as including a measure of good referrals in the scores used to rank workers and requesters. A separate index could also be used for referrals or requesters may reward good referrals with bonus payments.<br />
<br />
=== Benefits of Referral Network ===<br />
<br />
A HIT initially becomes visible to highly rated workers who have worked with that particular requester in the past. The referral network enables sharing of HITs with other (existing and new) workers who would otherwise have been the last in line to see the HIT. This is especially useful when a requester posts a large number of HITs, which can be shared quickly among workers.<br />
<br />
== Assisting Requesters in Finding the '''Right Workers''' at the '''Right Time''' ==<br />
<br />
An often cited challenge for workers is that high paying HITs are quickly completed, and if the worker is not available when HITs are posted, the worker may miss out on lucrative work. A worker panelist mentioned that instances such as, skipping lunch to work on a high paying HIT, are not uncommon. This issue is compounded by the fact that workers are available globally, and tasks posted globally. <br />
<br />
To resolve this issue, we propose a simple feature that lets a requester post work at the optimal time to target the requester's most highly rated workers:<br />
<br />
Although Boomerang attempts to resolve this issue by creating a time staggered access to work, based on reputation; as the number of workers grow, this may not be enough to resolve this issue. <br />
<br />
Our system overcomes this by allowing workers to opt into a system where the platform keeps track of the time range when they are usually available and working on tasks. A requester is presented with a suggestion mentioning the optimal times to post the task on the platform and the availability percentage of workers the requester has rated highly. <br />
<br />
=== Example ===<br />
<br />
Suggestion: '''73'''% of workers you have rated as ''''Good'''' are usually available between '''9:00 am''' and '''3:00 pm'''. <br />
Would you like us to post the task for you in those hours?<br />
<br />
This has the benefit of letting requesters get the best workers (in addition to the reputation based access) and also allows highly rated workers to not miss out of lucrative work.<br />
<br />
=== Limitations === <br />
<br />
This would require workers to allow the platform to log usage timings which is a potential privacy issue. In addition, if most workers are from a certain timezone, it could skew the tasks in favor of one country.<br />
<br />
== Reputation of Newcomers==<br />
<br />
Allowing newcomers to quickly join a system, where the rating assigned by a requester can determine if a worker gets work, is of prime importance. Boomerang's approach to this is to assign a new worker with the global average rating and a time decay component that opens up tasks as time progresses. While this is better than a fixed value, it still may slow down a new worker's road to becoming a full fledged worker. <br />
<br />
'''In addition to the referral scheme''' we suggested above, we propose a simple technique to allow a new worker's ability to be evaluated: <br />
<br />
- Similar to how many freelance work websites offer skill tests, a requester who has posted tasks multiple times <br />
will be asked to offer a snippet of an old completed task to be used as a test.<br />
- New workers seeking access to a task offered by that requester complete this sample task.<br />
- New worker's answers are correlated with answers given by a worker that was judged to be good by the requester.<br />
- If there is high correlation between the answers, the new worker may be allowed to perform the latest task.<br />
<br />
This mechanism can be extended to multiple requesters, by making a new worker take a series of small task 'tests', which can be used to decide if a worker gets first access to a task, in lieu of requester's rating. This would serve as a better solution to the problem of newcomers having no requester assigned rating. <br />
<br />
=== Assumptions ===<br />
<br />
This will work only if the task is defined such that it is possible to evaluate similarity of answers by different people (eg. image labeling, reading text in an image). This system will work best only if the requester posts similar work (boomerang works on the assumption that a requester will prefer a worker he/she rated well earlier).<br />
<br />
== Milestone Contributors ==<br />
<br />
@adityanadimpalli , @sreenihit</div>Sreenihitmunakalahttp://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_3_BPHC_ReputationIdea_:_Referral_Network_for_Workers&diff=15809WinterMilestone 3 BPHC ReputationIdea : Referral Network for Workers2016-01-31T17:38:28Z<p>Sreenihitmunakala: /* Implementation of a Referral Network for Crowdsourcing */</p>
<hr />
<div>One of the major needs identified in existing platforms for crowd-sourcing is the ability of the platform to effectively match capable workers with suitable jobs or HITs (Human Intelligence Tasks). The Stanford Crowd Research Team has thus far developed their Daemo Platform, which makes a significant step in meeting this requirement through the Boomerang Ranking System. <br />
<br />
We propose a solution consisting of three design aspects, that attempt to solve needs related to - '''worker reputation''', '''macthing requester and worker requirements''' and the '''high bar for entry for newcomers'''.<br />
<br />
<br />
== Referral Network for Workers ==<br />
<br />
We propose a referral based system that deals with the following reputation and relevant work issues:<br />
<br />
* A problem with a system that prefers assigning tasks to workers with high reputation, is that '''new workers''' may find it difficult to find good HITs. Requesters would tend to allot jobs to the same set of highly rated users. Similarly, existing workers would tend to take up HITs from the same set of requesters with high ratings.<br />
<br />
* Workers rely on external forums (TurkerNation, Reddit etc) to find good HITs from reliable requesters. Clearly any future crowd-sourcing platform should reduce this burden on the worker by allowing dedicated workers to easily share good HITs and help fellow workers to find relevant work .<br />
<br />
=== Introduction ===<br />
<br />
The employee referral network in the corporate world is known to be a cost and time effective method of recruitment that produces high quality candidates. 92% of the participants in the Global Employee Referral Index 2013 Survey stated that referrals were a top source of recruitment. We thought about adapting the referral system to our crowd sourcing platform design. This would enable workers to recommend or “refer” other workers (existing or new) for good HITs.<br />
=== Implementation of a Referral Network for Crowdsourcing ===<br />
<br />
We illustrate the use of the referral system through the following example.<br />
<br />
1. Requester R has a HIT to post on the crowd-sourcing platform. R can view highly rated workers using Boomerang and make the HIT visible to them (or post the HIT publicly for all to see).<br />
<br />
2. Worker W notices the HIT posted by R and has the following options:<br />
<br />
* Accept the HIT.<br />
<br />
* Share the HIT with fellow workers. In the referral based scheme, this can be done in the following ways:<br />
<br />
::1. W refers workers he or she knows to R.<br />
<br />
::2. W can publicly offer to refer anyone who is interested in HIT. This is similar to how referrals are shared across social networks like Facebook, Twitter. This referral sharing network can be integrated into the crowd-sourcing platform, removing the need for multiple external forums.<br />
<br />
::3 The quality of referrals is incentivized in a number of ways, such as including a measure of good referrals in the scores used to rank workers and requesters. A separate index could also be used for referrals or requesters may reward good referrals with bonus payments.<br />
<br />
=== Benefits of Referral Network ===<br />
<br />
A HIT initially becomes visible to highly rated workers who have worked with that particular requester in the past. The referral network enables sharing of HITs with other (existing and new) workers who would otherwise have been the last in line to see the HIT. This is especially useful when a requester posts a large number of HITs, which can be shared quickly among workers.<br />
<br />
== Assisting Requesters in Finding the '''Right Workers''' at the '''Right Time''' ==<br />
<br />
An often cited challenge for workers is that high paying HITs are quickly completed, and if the worker is not available when HITs are posted, the worker may miss out on lucrative work. A worker panelist mentioned that instances such as, skipping lunch to work on a high paying HIT, are not uncommon. This issue is compounded by the fact that workers are available globally, and tasks posted globally. <br />
<br />
Although Boomerang attempts to mitigate this by creating time-staggered access to work based on reputation, this may not be enough as the number of workers grows. <br />
<br />
To resolve this issue, we propose a simple feature that lets a requester post work at the optimal time to target the requester's most highly rated workers: <br />
<br />
Our system overcomes this by allowing workers to opt into a system where the platform keeps track of the time range when they are usually available and working on tasks. A requester is presented with a suggestion mentioning the optimal times to post the task on the platform and the availability percentage of workers the requester has rated highly. <br />
<br />
=== Example ===<br />
<br />
Suggestion: '''73'''% of workers you have rated as '''Good''' are usually available between '''9:00 am''' and '''3:00 pm'''. <br />
Would you like us to post the task for you in those hours?<br />
<br />
This has the benefit of letting requesters reach their best workers (in addition to the reputation-based access) and also ensures that highly rated workers do not miss out on lucrative work.<br />
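<br />
One way such a suggestion could be computed is sketched below. It assumes the platform already logs each worker's usual active hours, as proposed above; all function and variable names are hypothetical:<br />
<br />

```python
def best_posting_window(active_hours_by_worker, window=6):
    """Return (start_hour, share): the `window`-hour span in which the
    largest share of the given highly rated workers is usually active.

    `active_hours_by_worker` maps worker id -> set of hours (0-23)
    during which that worker is typically online.
    """
    best_start, best_share = 0, 0.0
    total = len(active_hours_by_worker)
    for start in range(24):
        hours = {(start + h) % 24 for h in range(window)}
        # A worker counts as available if any usual hour falls in the window.
        available = sum(1 for hrs in active_hours_by_worker.values()
                        if hrs & hours)
        share = available / total
        if share > best_share:
            best_start, best_share = start, share
    return best_start, best_share


good_workers = {
    "w1": {9, 10, 11},
    "w2": {13, 14},
    "w3": {20, 21},
    "w4": {10, 11, 12},
}
start, share = best_posting_window(good_workers)
print(f"{share:.0%} of your 'Good' workers are usually available "
      f"from {start}:00")
# 75% of your 'Good' workers are usually available from 8:00
```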
<br />
=== Limitations === <br />
<br />
This would require workers to allow the platform to log their usage timings, which is a potential privacy concern. In addition, if most workers are from a certain timezone, it could skew tasks in favor of one country.<br />
<br />
== Reputation of Newcomers ==<br />
<br />
Allowing newcomers to quickly join a system, where the rating assigned by a requester can determine whether a worker gets work, is of prime importance. Boomerang's approach is to assign a new worker the global average rating plus a time-decay component that opens up tasks as time progresses. While this is better than a fixed value, it may still slow down a new worker's road to becoming a full-fledged worker. <br />
<br />
'''In addition to the referral scheme''' we suggested above, we propose a simple technique to allow a new worker's ability to be evaluated: <br />
<br />
* Similar to how many freelance work websites offer skill tests, a requester who has posted tasks multiple times is asked to offer a snippet of an old completed task to be used as a test.<br />
* New workers seeking access to a task offered by that requester complete this sample task.<br />
* A new worker's answers are correlated with the answers given by a worker whom the requester judged to be good.<br />
* If the correlation between the answers is high, the new worker may be allowed to perform the latest task.<br />
<br />
This mechanism can be extended across requesters by having a new worker take a series of small task 'tests', the results of which decide whether the worker gets first access to a task in lieu of a requester-assigned rating. This would be a better solution to the problem of newcomers having no requester-assigned rating. <br />
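<br />
A minimal sketch of the gating check follows. Using a simple answer-agreement fraction (rather than a formal correlation coefficient) and the 0.8 threshold are our own illustrative choices:<br />
<br />

```python
def agreement(new_answers, trusted_answers):
    """Fraction of sample-task questions on which the new worker's
    answers match those of a worker the requester rated highly."""
    assert len(new_answers) == len(trusted_answers)
    matches = sum(a == b for a, b in zip(new_answers, trusted_answers))
    return matches / len(new_answers)


def may_access_task(new_answers, trusted_answers, threshold=0.8):
    """Grant first access to the real task only if agreement on the
    sample task meets the threshold."""
    return agreement(new_answers, trusted_answers) >= threshold


# Image-labeling sample task: 5 images, labels from a trusted worker
trusted = ["cat", "dog", "dog", "cat", "bird"]
newcomer = ["cat", "dog", "dog", "cat", "cat"]
print(agreement(newcomer, trusted))       # 0.8
print(may_access_task(newcomer, trusted))  # True
```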
<br />
=== Assumptions ===<br />
<br />
This will work only if the task is defined such that it is possible to evaluate the similarity of answers given by different people (e.g., image labeling, reading text in an image). The system will work best when the requester posts similar work repeatedly (Boomerang works on the assumption that a requester will prefer a worker he/she rated well earlier).<br />
<br />
== Milestone Contributors ==<br />
<br />
@adityanadimpalli , @sreenihit</div>
Sreenihitmunakala
http://crowdresearch.stanford.edu/w/index.php?title=WinterMilestone_3_BPHC_ReputationIdea_:_Referral_Network_for_Workers&diff=15799
WinterMilestone 3 BPHC ReputationIdea : Referral Network for Workers
2016-01-31T17:32:44Z
<p>Sreenihitmunakala: /* Assisting Requesters in Finding the Right Workers at the Right Time */</p>
<hr />
<div>One of the major needs identified in existing crowd-sourcing platforms is the ability to effectively match capable workers with suitable jobs or HITs (Human Intelligence Tasks). The Stanford Crowd Research Team has thus far developed its Daemo Platform, which takes a significant step toward meeting this requirement through the Boomerang Ranking System. <br />
<br />
We propose a solution consisting of three design aspects that attempt to address needs related to '''worker reputation''', '''matching requester and worker requirements''', and the '''high barrier to entry for newcomers'''.<br />
<br />
<br />
== Referral Network for Workers ==<br />
<br />
We propose a referral based system that deals with the following reputation and relevant work issues:<br />
<br />
* A problem with a system that prefers assigning tasks to workers with high reputation is that '''new workers''' may find it difficult to find good HITs. Requesters would tend to allot jobs to the same set of highly rated users. Similarly, existing workers would tend to take up HITs from the same set of highly rated requesters.<br />
<br />
* Workers rely on external forums (TurkerNation, Reddit, etc.) to find good HITs from reliable requesters. Clearly, any future crowd-sourcing platform should reduce this burden by allowing dedicated workers to easily share good HITs and help fellow workers find relevant work.<br />
<br />
=== Introduction ===<br />
<br />
The employee referral network in the corporate world is known to be a cost- and time-effective method of recruitment that produces high-quality candidates. 92% of the participants in the Global Employee Referral Index 2013 Survey stated that referrals were a top source of recruitment. We thought about adapting the referral system to our crowd-sourcing platform design. This would enable workers to recommend or “refer” other workers (existing or new) for good HITs.<br />
<br />
=== Implementation of a Referral Network for Crowdsourcing ===<br />
<br />
We illustrate the use of the referral system through the following example.<br />
<br />
1. Requester R has a HIT to post on the crowd-sourcing platform. R can view highly rated workers using Boomerang and make the HIT visible to them (or post the HIT publicly for all to see).<br />
<br />
2. Worker W notices the HIT posted by R and has the following options:<br />
<br />
* Accept the HIT.<br />
<br />
* Share the HIT with fellow workers. In the referral based scheme, this can be done in the following ways:<br />
<br />
::1. W refers workers he or she knows to R.<br />
<br />
::2. W can publicly offer to refer anyone who is interested in the HIT. This is similar to how referrals are shared across social networks like Facebook and Twitter. This referral-sharing network can be integrated into the crowd-sourcing platform, removing the need for multiple external forums.<br />
<br />
* The quality of referrals can be incentivized in a number of ways. A rudimentary way would be to include a measure of good referrals in the scores used to rank workers and requesters. A separate index could be used for referrals, or requesters could reward good referrals with bonus payments.<br />
<br />
=== Benefits of Referral Network ===<br />
<br />
A HIT initially becomes visible to highly rated workers who have worked with that particular requester in the past. The referral network enables sharing of HITs with other (existing and new) workers who would otherwise have been the last in line to see the HIT. This is especially useful when a requester posts a large number of HITs, which can be shared quickly among workers.<br />
<br />
== Assisting Requesters in Finding the '''Right Workers''' at the '''Right Time''' ==<br />
<br />
An often-cited challenge for workers is that high-paying HITs are completed quickly, and a worker who is not available when HITs are posted may miss out on lucrative work. A worker panelist mentioned that instances such as skipping lunch to work on a high-paying HIT are not uncommon. This issue is compounded by the fact that workers are spread across the globe and tasks are posted from all time zones. <br />
<br />
Although Boomerang attempts to mitigate this by creating time-staggered access to work based on reputation, this may not be enough as the number of workers grows. <br />
<br />
To resolve this issue, we propose a simple feature that lets a requester post work at the optimal time to target the requester's most highly rated workers: <br />
<br />
Our system overcomes this by allowing workers to opt into a system where the platform keeps track of the time range when they are usually available and working on tasks. A requester is presented with a suggestion mentioning the optimal times to post the task on the platform and the availability percentage of workers the requester has rated highly. <br />
<br />
=== Example ===<br />
<br />
Suggestion: '''73'''% of workers you have rated as '''Good''' are usually available between '''9:00 am''' and '''3:00 pm'''. <br />
Would you like us to post the task for you in those hours?<br />
<br />
This has the benefit of letting requesters reach their best workers (in addition to the reputation-based access) and also ensures that highly rated workers do not miss out on lucrative work.<br />
<br />
=== Limitations === <br />
<br />
This would require workers to allow the platform to log their usage timings, which is a potential privacy concern. In addition, if most workers are from a certain timezone, it could skew tasks in favor of one country.<br />
<br />
== Reputation of Newcomers ==<br />
<br />
Allowing newcomers to quickly join a system, where the rating assigned by a requester can determine whether a worker gets work, is of prime importance. Boomerang's approach is to assign a new worker the global average rating plus a time-decay component that opens up tasks as time progresses. While this is better than a fixed value, it may still slow down a new worker's road to becoming a full-fledged worker. <br />
<br />
'''In addition to the referral scheme''' we suggested above, we propose a simple technique to allow a new worker's ability to be evaluated: <br />
<br />
* Similar to how many freelance work websites offer skill tests, a requester who has posted tasks multiple times is asked to offer a snippet of an old completed task to be used as a test.<br />
* New workers seeking access to a task offered by that requester complete this sample task.<br />
* A new worker's answers are correlated with the answers given by a worker whom the requester judged to be good.<br />
* If the correlation between the answers is high, the new worker may be allowed to perform the latest task.<br />
<br />
This mechanism can be extended across requesters by having a new worker take a series of small task 'tests', the results of which decide whether the worker gets first access to a task in lieu of a requester-assigned rating. This would be a better solution to the problem of newcomers having no requester-assigned rating. <br />
<br />
=== Assumptions ===<br />
<br />
This will work only if the task is defined such that it is possible to evaluate the similarity of answers given by different people (e.g., image labeling, reading text in an image). The system will work best when the requester posts similar work repeatedly (Boomerang works on the assumption that a requester will prefer a worker he/she rated well earlier).<br />
<br />
== Milestone Contributors ==<br />
<br />
@adityanadimpalli , @sreenihit</div>
Sreenihitmunakala