QualityControlInGuilds

Intro

In a Human-Centered Guild system, one of the most important problems to solve is the quality of task outcomes. As seen in [1], "Comparing quality between different pay per label sets (instead of pay per HIT), see rows 11-14 in Table 2, confirms the same trend: quality increases with pay. However, we can also observe evidence of a diminishing return effect, whereby the rate of increase in pay is matched by a slowing (or dropping) rate of increase in quality". In other words, although paying more for a task does improve its outcome, the effect has an upper bound: beyond a certain point, further increases in pay no longer improve quality significantly. With this in mind, we can clearly state the problem addressed in this submission: how can we ensure that the Guild system will increase the overall quality of tasks?

Background

It is well known that one of the major issues in crowdsourcing platforms is the lack of any guarantee that a task will have a high-quality outcome [2]. Several factors come into play here: the qualifications of the worker relative to the skill set required by that specific task, the amount paid, and many others.

Researchers have been seeking a solution to this problem for some time now, and one approach has emerged as the most popular: the Gold Standard Data technique. It consists of injecting tasks with known answers in order to control the quality of the work produced by the workers and to calibrate their output against the desired results. Despite its popularity, there are several issues with relying on gold data for quality control: "1) gold data is expensive to create and each administered gold test has an associated cost, 2) in a crowdsourcing scenario, the distribution of gold tests over workers is likely to be non-uniform, with most workers completing only very few tests, if any, while a few active workers may get frequently tested, and 3) gold tests may be of varying difficulty, e.g., lucky sloppy workers may pass easy tests while unlucky diligent workers may fail on difficult tests" [2]. Other methods have also emerged, such as observing the behaviour of workers on certain tasks in order to predict how they would behave on future tasks [3]. Even though this is a pragmatic and tangible way of applying AI expertise to a human-generated problem, it remains vulnerable to flaws in the dataset: a system that is constantly fed input from bad workers may misclassify workers who would otherwise have had a good track record. Approaches based on probabilistic analysis have also been suggested, using a large amount of data about each worker to build a model of his or her reliability. Besides the complexity of such models, it is easy to argue that this kind of data may simply not be available for every worker on the platform.
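To make the gold-standard idea concrete, the following minimal sketch (Python; the task identifiers, answers, and the gold_ratio parameter are hypothetical, and this illustrates the general technique rather than any particular platform's implementation) mixes a few gold tasks into a batch and estimates each worker's accuracy from them:

import random
from collections import defaultdict

# Hypothetical gold tasks with known correct answers.
GOLD_ANSWERS = {"g1": "cat", "g2": "dog", "g3": "bird"}

def build_batch(regular_task_ids, gold_ratio=0.1):
    """Mix a small proportion of gold tasks into a batch of regular tasks."""
    n_gold = max(1, int(len(regular_task_ids) * gold_ratio))
    gold_ids = random.sample(list(GOLD_ANSWERS), min(n_gold, len(GOLD_ANSWERS)))
    batch = list(regular_task_ids) + gold_ids
    random.shuffle(batch)  # workers should not be able to tell which tasks are gold
    return batch

def estimate_accuracy(submissions):
    """submissions: iterable of (worker_id, task_id, answer) tuples.
    Returns worker_id -> observed accuracy on whatever gold tasks that worker saw."""
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker_id, task_id, answer in submissions:
        if task_id in GOLD_ANSWERS:  # only tasks with known answers can be scored directly
            seen[worker_id] += 1
            correct[worker_id] += int(answer == GOLD_ANSWERS[task_id])
    return {worker: correct[worker] / seen[worker] for worker in seen}

Note that even this sketch exposes the problems quoted above: someone has to create and pay for the gold answers, and a worker who happens to receive few (or only easy) gold tasks is judged on very thin evidence.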

An idea on the quality of a task

Having given a modest but fair overview of the methods currently proposed in the task-quality debate, it is in our best interest to provide a solid basis on which to lay our discussion. What is quality, and how do we define and measure it? Previous discussions about a common definition for the quality of a task have suggested several properties that could serve as parameters for such an analysis, for example "reliability, accuracy, relevancy, completeness, and consistency" [5], as well as repeatability and reproducibility (@anotherhuman). It is therefore logical to abstract those properties into characteristics a worker must exhibit while working on a task in order to produce results aligned with the requester's expectations. Even though they would make good metrics, it could be fairly difficult to measure such parameters and to find an unambiguous definition for each of them. Taking that into account, it is fair to argue that any result that correctly matches the requester's expectation can be considered high-quality work, and we therefore adopt "the extent to which the provided outcome fulfills the requirements of the requester" [6] as our definition when analyzing the quality of any given task.
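As a minimal illustration of this definition (the field names and requirements below are hypothetical, chosen only for the example), the quality check becomes a comparison of the delivered outcome against the requester's stated requirements rather than against abstract worker traits:

# Hypothetical requester requirements for an image-labelling task.
requirements = {
    "min_labels": 3,
    "language": "en",
    "allowed_labels": {"cat", "dog", "bird"},
}

def fulfills_requirements(outcome, requirements):
    """Quality as 'the extent to which the provided outcome fulfills the
    requirements of the requester' [6]: every stated requirement must be met."""
    return (
        len(outcome["labels"]) >= requirements["min_labels"]
        and outcome["language"] == requirements["language"]
        and set(outcome["labels"]) <= requirements["allowed_labels"]
    )

# A submission that meets all three requirements counts as high-quality work.
print(fulfills_requirements({"labels": ["cat", "dog", "bird"], "language": "en"}, requirements))  # True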

Previous research

The examples mentioned in the "Background" section represent many different visions of our problem. Many of them rely on mathematically founded and/or AI-based solutions that tackle the issue with a very computationally intensive mindset. We will not rely solely on that kind of approach, but we will build on some of its results. Even though most of those systems are fairly sophisticated and could lead to a very good solution in the future, it is arguable that they are still in early-stage development, and it could take a long time before they are able to discover, in a scalable manner, the right aspects of human idiosyncrasy to look for when measuring whether a worker consistently delivers high-quality output. However, it is possible to use worker behaviours that are known to influence the outcome of a task to test approaches involving other humans, as exemplified by [2]. In that experiment, the authors use a Gold Judge approach: they measure several behavioural parameters of a Gold Judge, someone who is known for certain to produce high-quality results, and use the data gathered from those people to learn to predict whether a generic worker n is a poor performer or a worker worth keeping on the platform. As a result of this approach, "when we look at the crowd and trained judges' behaviour together, e.g., S/P/L rows in the table, we see an increase in the number of features that significantly correlate with judge quality. This suggests that adding data that represents ideal behavior by trusted judges increases our ability to predict crowd worker's quality". This result gives us the solid insight that it is possible to use data from people we know to be high-quality performers in order to classify a given worker/task as satisfactory or not before sending the results back to the requester.
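As a rough sketch of how such behavioural data could be used (assuming scikit-learn is available; the feature names and numbers are made up, and [2] uses its own feature set and models, so this is only an analogous illustration), per-session behavioural features from the crowd are combined with sessions from trusted Gold Judges, and a simple classifier is trained to flag likely poor performers:

from sklearn.linear_model import LogisticRegression

# Each row holds hypothetical behavioural features for one work session,
# e.g. [seconds_per_task, fraction_of_answers_changed, scroll_events_per_task].
crowd_features = [[12.0, 0.05, 3.1], [3.5, 0.00, 0.2], [15.2, 0.10, 4.0]]
crowd_labels = [1, 0, 1]  # 1 = produced acceptable work, 0 = poor performer

# Sessions from trusted Gold Judges, added as examples of ideal behaviour.
judge_features = [[14.1, 0.08, 3.8], [13.4, 0.07, 3.5]]
judge_labels = [1, 1]

model = LogisticRegression()
model.fit(crowd_features + judge_features, crowd_labels + judge_labels)

# Score a previously unseen worker's session.
print(model.predict([[4.0, 0.01, 0.3]]))  # likely flagged as a poor performer (class 0)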

Our approach

So far, our examples have drawn inspiration from platforms that are already online or that could easily be implemented on top of an existing crowdsourcing platform. However robust those results are, it is a whole other discussion whether they remain valid for a crowdsourcing platform that has a Guild system as a built-in feature. In this submission, therefore, we build upon the foundations already demonstrated, keeping in mind that it is in our best interest to adapt them to our own approach, in which workers can gather with other workers and form guilds led by more experienced peers who hold higher positions (higher levels/ranks) within the guild's internal hierarchy. For the Guild system to consistently provide higher-quality results than a standalone worker, it needs to efficiently implement a peer cooperation system in which work done by newcomers and less experienced people is consistently checked against a standard of quality defined by the guild itself. Several approaches come to mind. We could have each piece of work submitted by a less experienced person reviewed by an older member (or several, like the well-known five peer reviews in MOOCs). Following that mindset, we could also define a peer review system in which people gradually get less of their work reviewed over time as they rise to more senior positions. These examples, however, imply that people are at all times assuming that someone's work is wrong and therefore needs to be corrected. That mindset does not lead to a truly collaborative community in the long run, so we will do our best to keep away from it.
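For example, the second option above, where less of a member's work is reviewed as he or she becomes more senior, could be routed as in the following sketch (the levels and review probabilities are made-up parameters, not values we are proposing):

import random

# Hypothetical fraction of submissions that gets peer-reviewed at each guild level.
REVIEW_PROBABILITY = {1: 1.0, 2: 0.5, 3: 0.25, 4: 0.1}

def needs_review(worker_level):
    """Decide whether this particular submission is sampled for peer review."""
    probability = REVIEW_PROBABILITY.get(worker_level, 0.1)  # very senior levels keep light spot checks
    return random.random() < probability

def route_submission(submission, worker_level, senior_members):
    """Send the submission to a randomly chosen senior reviewer, or accept it directly."""
    if senior_members and needs_review(worker_level):
        return ("review", random.choice(senior_members))
    return ("accept", None)

We include it only for contrast: as argued above, a mechanism like this treats every submission as suspect by default, which is exactly the adversarial framing we want to avoid.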

We should think of a method in which people do get their work checked, because we still need to guarantee the quality of the outcome produced by a guild. But more than that, we also need to encourage them to work as a team and not as adversaries. Following this idea, we note that letting people volunteer to take on bigger responsibilities has worked in the past, and it is the core belief behind our vision. A worker within a guild should be able to perform tasks for as long as they want and only then apply to level up in the guild structure, taking on more responsibilities and possibly more complex tasks. By using this method, we can address both the reputation and the quality issues without creating friction within the community.

System Design

1 -> A worker finds a guild of interest and asks for permission to join it;

2 -> After having his/her background checked against the minimum requirements set by the guild, he/she is accepted as a new member (yay!);

3 -> The worker gets access to the tasks assigned to that guild; he/she chooses one to work on and submits the results to the guild;

4 -> Since he/she is a new member, the work is sent for review to an older member of the guild;

5 -> Once the work is reviewed and accepted, the outcome is stored in the database.

6 -> Whenever the worker wants to move up a level inside the guild, he/she presses a 'level up' button. Once that happens, his/her previous work is shown to more senior members as a task comparing it against work we know to be of high quality. These review tasks are sent anonymously to several senior members, and if they agree that the worker's performance has been satisfactory, he/she moves one level up.

7 -> Since the worker has moved up a level, harder tasks will be suggested to him/her. That way, we can apply the same procedure at each level of the guild without the risk of a worker gaming the system (see the sketch below).
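As a minimal sketch of these steps (Python; the minimum-requirement key and the number of senior approvals required are made-up parameters that each guild would set for itself):

class Guild:
    def __init__(self, name, min_requirements, approvals_needed=3):
        self.name = name
        self.min_requirements = min_requirements  # e.g. {"tasks_completed": 10}
        self.approvals_needed = approvals_needed  # senior votes needed to level up
        self.members = {}                         # worker_id -> level
        self.submissions = {}                     # worker_id -> accepted work

    # Steps 1-2: join after a background check against the guild's minimum requirements.
    def apply(self, worker_id, background):
        if background.get("tasks_completed", 0) >= self.min_requirements["tasks_completed"]:
            self.members[worker_id] = 1
            self.submissions[worker_id] = []
            return True
        return False

    # Steps 3-5: a new member's work is reviewed by an older member before it is stored.
    def submit(self, worker_id, work, reviewer_accepts):
        if reviewer_accepts(work):
            self.submissions[worker_id].append(work)
            return True
        return False

    # Step 6: on 'level up', past work is compared anonymously against known
    # high-quality work by several senior members.
    def request_level_up(self, worker_id, senior_votes):
        if sum(senior_votes) >= self.approvals_needed:
            self.members[worker_id] += 1  # step 7: harder tasks can now be suggested
            return True
        return False

# Example: a newcomer joins, submits work, and later requests a level-up.
guild = Guild("Translators", {"tasks_completed": 10})
guild.apply("worker_42", {"tasks_completed": 25})
guild.submit("worker_42", "translated text", reviewer_accepts=lambda work: True)
guild.request_level_up("worker_42", senior_votes=[True, True, True, False])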

Future work

A question raised by the community (@alipta) is how to ensure that the people reviewing your work are actually in a position to do so. Are they skilled enough? Do they have the proper experience? How can we be sure that a reviewer is properly reviewing someone else's submission?

Contributors

@teomoura @gbayomi @anotherhuman @trygve @pierref @alipta @yoni.dayan

References

[1] - G. Kazai, "In Search of Quality in Crowdsourcing for Search Engine Evaluation".

[2] - "Quality Management in Crowdsourcing using Gold Judges Behavior".

[3] - J. Rzeszotarski and A. Kittur, "CrowdScape: Interactively Visualizing User Behavior and Output", in Proc. of the 25th Annual ACM Symposium on User Interface Software and Technology.

[4] - "Quality Management in Crowdsourcing using Gold Judges Behavior", p. 267.

[5] - "Quality Control in Crowdsourcing Systems: Issues and Directions".

[6] - P. Crosby, "Quality is Free", McGraw-Hill, 1979.