QualityControlInGuilds

Intro

In a Human Centered Guild system, one of the most important problems to solve is the quality of the tasks produced. As seen in [1], "Comparing quality between different pay per label sets (instead of pay per HIT), see rows 11-14 in Table 2, confirms the same trend: quality increases with pay. However, we can also observe evidence of a diminishing return effect, whereby the rate of increase in pay is matched by a slowing (or dropping) rate of increase in quality". In other words, although paying more for a task does improve its quality, the effect is bounded: beyond a certain point, further increases in pay no longer affect quality significantly. With this in mind, we can clearly state the problem addressed in this submission: how can we ensure that the Guild system will increase the overall quality of tasks?

Background

It is widely known that one of the major issues in crowdsourcing platforms is the lack of any guarantee that a task will have a high-quality outcome[2]. Several factors come into play when we think of the reasons for that: the amount paid, the qualification of the worker in relation to the skill set required by that specific task, and many others. Researchers have been seeking a solution to this problem for some time, and one approach arose as the most popular: the Gold Standard Data technique. It consists of injecting tasks with known answers into the work stream in an attempt to control the quality of work produced by the workers. Despite being a very popular approach, "there are several issues with relying on gold data for quality control: 1) gold data is expensive to create and each administered gold test has an associated cost, 2) in a crowdsourcing scenario, the distribution of gold tests over workers is likely to be non-uniform, with most workers completing only very few tests, if any, while a few active workers may get frequently tested, and 3) gold tests may be of varying difficulty, e.g., lucky sloppy workers may pass easy tests while unlucky diligent workers may fail on difficult tests."[2]. Other methods have also emerged, such as observing the behaviour of workers on certain tasks to predict how they would behave on future tasks [3]. Even though this is a pragmatic and tangible way of applying AI expertise to a human-generated problem, it is still vulnerable to flaws in the dataset: a system constantly fed with input from bad workers may misclassify workers who would otherwise have had a good track record. Other methods based on probabilistic analysis have also been suggested, using a large amount of data per worker to build a model of his/her reliability. Besides being fairly complex, such approaches require data that may not always be available for every worker on the platform.
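As a rough illustration of the gold standard idea (a minimal sketch, not tied to any specific platform; the batch sizes, threshold and function names below are hypothetical), a requester can mix a few tasks with known answers into a worker's queue and estimate the worker's accuracy from them:

import random

def build_batch(real_tasks, gold_tasks, gold_ratio=0.1):
    """Mix a small fraction of gold (known-answer) tasks into a batch of real tasks."""
    n_gold = max(1, int(len(real_tasks) * gold_ratio))
    batch = real_tasks + random.sample(gold_tasks, n_gold)
    random.shuffle(batch)
    return batch

def gold_accuracy(worker_answers, gold_answers):
    """Fraction of gold tasks the worker answered correctly.
    worker_answers / gold_answers: dicts mapping task_id -> label."""
    checked = [tid for tid in gold_answers if tid in worker_answers]
    if not checked:
        return None  # the worker saw no gold tasks, so quality cannot be estimated
    correct = sum(worker_answers[tid] == gold_answers[tid] for tid in checked)
    return correct / len(checked)

# Hypothetical usage: flag workers whose gold accuracy falls below a threshold
# chosen by the requester (the 0.8 value here is only an assumption).
ACCURACY_THRESHOLD = 0.8

The sketch also makes the cost issue quoted above concrete: every gold task in a batch is a task the requester pays for without receiving new labels.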

Quality

Even though we've given a modest yet fair overview of the current methods being proposed in the task quality debate, it is in our best interest to provide a solid basis on which to lay our discussion. What is quality, and how do we define and measure it? Previous discussions around reaching a common definition for the quality of a task have suggested several qualities that could serve as parameters for such an analysis, for example "reliability, accuracy, relevancy, completeness, and consistency"[5], as well as repeatability and reproducibility (@anotherhuman). It is, therefore, logical to abstract those qualities into characteristics a worker must have while working on a task in order to produce results aligned with the expectations of the requester. Even though they would provide good metrics, it could be fairly difficult to measure such parameters and to find an appropriate, unambiguous definition for each of them. Taking that into account, it is fair to argue that any result that correctly matches the requester's expectation could be considered high-quality work, and therefore we'll adopt "the extent to which the provided outcome fulfills the requirements of the requester"[6] as our definition when it comes to analysing the quality of any given task.

Previous research on peer review systems for quality control in Crowdsourcing

The examples mentioned in the "Background" section represent several visions of our current problem. Many of them rely on mathematically founded and/or AI-based solutions that tackle the issue with a very computationally intensive mindset. We won't rely solely on that kind of approach, but we will build upon some of its results. Even though most of those systems are fairly complex and could lead to very good solutions in the future, it is arguable that, as of now, they are still in early-stage development, and it could take a long time before they are able to discover, in a scalable manner, which aspects of human idiosyncrasy to look for when trying to measure whether a worker consistently delivers high-quality output. However, it is possible to use behaviours shown by a worker that are known to influence the outcome of a task to test approaches involving other humans, as exemplified by [2]. In that experiment, the authors use a Gold Judge approach in which they measure several behavioural parameters of gold judges, people known for certain to produce high-quality results, and use the data gathered from them to learn how to predict whether a generic worker is a poor performer or a desirable worker to keep on the platform. As a result of this approach, "when we look at the crowd and trained judges' behaviour together, e.g., S/P/L rows in the table, we see an increase in the number of features that significantly correlate with judge quality. This suggests that adding data that represents ideal behavior by trusted judges increases our ability to predict crowd worker's quality". Even though this approach does not constitute a peer-to-peer application, it gives us the solid insight that it is possible to use data from people known to be high-quality performers in order to classify a given worker/task as satisfactory or not before sending the results back to the requester.
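To make the general idea concrete (this is only an illustrative sketch of correlating behavioural features with quality, not the exact method of [2]; the feature names and numbers are invented for illustration), one could check which behavioural signals significantly correlate with a known quality score before using them to predict worker quality:

import numpy as np
from scipy.stats import pearsonr

# Hypothetical behavioural features collected per judge (rows = judges).
# Columns: seconds per task, fraction of tasks revisited, scroll events per task.
features = np.array([
    [42.0, 0.30, 15],
    [12.0, 0.05,  3],
    [55.0, 0.40, 22],
    [20.0, 0.10,  6],
    [48.0, 0.35, 18],
])
# Known quality score for each judge (e.g., agreement with gold labels).
quality = np.array([0.92, 0.55, 0.95, 0.60, 0.90])

feature_names = ["seconds_per_task", "revisit_fraction", "scroll_events"]
for i, name in enumerate(feature_names):
    r, p = pearsonr(features[:, i], quality)
    # Features with a significant correlation could then feed a classifier
    # that flags likely poor performers before their work reaches the requester.
    print(f"{name}: r={r:.2f}, p={p:.3f}")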

Our approach

So far, all of our examples are based on crowdsourcing paradigms currently found online and open to access, and they rely on those to build their arguments and insights. However robust those results are, whether they would remain valid for a crowdsourcing platform with a Guild system as a built-in feature is another discussion entirely. In this submission, therefore, we build upon the foundations already demonstrated, but we adapt them to our own approach, where workers can gather with other workers and form guilds led by more experienced peers who hold higher positions within the guild's internal hierarchy. For the Guild system to consistently provide higher quality results than a standalone worker, it needs to efficiently implement a review system in which the task at hand is reviewed by someone more experienced before being delivered to the requester. To achieve that, we implement a peer-review system between high-level workers and inexperienced/newcomer workers in a guild. Its intent is to make sure that undesirable workers are not only prevented from delivering poor results but also do not last long in the guild, which affects their reputation and impacts their ability to find further highly desirable tasks to work on.

A fairly straightforward way to implement that idea is to have an older guild member review each piece submitted by significantly younger members, to make sure no poor quality results are passed to the requester. In a real-world context, however, keeping the higher-rated workers busy reviewing other people's submissions could be quite costly, since they could instead be producing work with a higher likelihood of being desirable to the requester. Therefore, we must think of a solution that not only ensures that newcomers/less experienced workers deliver desirable output but also leaves higher-ranked members with enough time to work on tasks themselves. A natural way to do that is to provide mentorships for newcomers for an allocated amount of time, making high-ranked guild members act as a "gold judge" for a specified period. After that, the newly "graduated" workers are free to submit output that is reviewed progressively less often as they climb towards the higher positions within the guild's structure, and they only reach a new level once they meet that level's acceptance threshold.
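As a rough sketch of how the review burden could taper off with seniority (the linear schedule, the five-level cap and the floor value below are assumptions for illustration, not prescriptions), a guild might compute the probability that a given submission gets peer-reviewed as a decreasing function of the worker's level:

def review_probability(level: int, max_level: int = 5, floor: float = 0.0) -> float:
    """Probability that a submission is peer-reviewed, decreasing linearly with level.

    Level 0 (newcomer/mentee) is always reviewed; workers at max_level are only
    covered by the occasional gold-task spot check described in the System Design
    below. The linear schedule and max_level are assumptions for illustration.
    """
    if level <= 0:
        return 1.0
    if level >= max_level:
        return floor
    return max(floor, 1.0 - level / max_level)

# Example: with 5 levels, 80% of level-1 work and 40% of level-3 work is reviewed.
for lvl in range(6):
    print(lvl, review_probability(lvl))

Each guild would tune this schedule to balance review coverage against the time its senior members spend on their own tasks.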

System Design

1 -> The worker finds a guild of interest and asks permission to join it;

2 -> After having his/her background checked against the minimum requirements set by the guild, he/she is accepted as a new member (yay!);

3 -> The worker gets access to the tasks given to that guild; he chooses one to work on and submits his results to the guild;

4 -> A higher level member checks his work against a quality standard (the assignment of reviewers to submissions should be randomized so as to decrease the likelihood of false positives), giving him feedback of AC/MR/NA (AC means accepted / MR means minor rejection / NA means not accepted);

5 -> If his outcome is accepted, his ACC (Accepted Task Counter) increases by 1, and he moves closer to the next level in the guild;

6 -> If his outcome is rejected with an MR, it means that he made a minor mistake that wouldn't affect the outcome of the task deeply. He gets another chance to submit it, and his MRC (Minor Rejection Counter) is increased by 1;

7 -> If his outcome is rejected with an NA, his work is definitively rejected and the task is suggested to other workers. His NAC (Not Accepted Counter) increases by 1;

8 -> Whenever the sum of his counters (which adds up to the number of tasks he has attempted) reaches a number x defined by the guild, his scores are checked against the guild's standards; if they are good enough, he moves to the next level in the guild hierarchy. Otherwise, he is faced with two possibilities: start over from the beginning or leave the guild (see the sketch after this list);

9 -> If he succeeds in reaching the next level, a smaller fraction of his work will be reviewed by higher ranked members (the exact fraction should be defined by the guild administrators), and that ratio continues to decrease as he moves higher in the guild's structure;

10 -> After he reaches a high level within the guild, ideally he wouldn't need to have his work reviewed by anyone else. It is recommended, however, that a gold task system continue to evaluate the quality of the outcomes produced by the worker, to make sure that he doesn't get sloppy over time. Because it would be applied to fewer people, this gold task system would mitigate the costs stated in the Background section.
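The sketch below captures the counter and promotion logic of steps 4-8 for a single guild member. The threshold x, the minimum acceptance rate and the field names are placeholders to be set by each guild, not values prescribed by this design.

from dataclasses import dataclass

@dataclass
class GuildMember:
    level: int = 0
    acc: int = 0   # Accepted Task Counter
    mrc: int = 0   # Minor Rejection Counter
    nac: int = 0   # Not Accepted Counter

    def record_review(self, verdict: str) -> None:
        """Update counters from a reviewer's verdict: 'AC', 'MR' or 'NA'."""
        if verdict == "AC":
            self.acc += 1
        elif verdict == "MR":
            self.mrc += 1   # the worker gets another chance to resubmit
        elif verdict == "NA":
            self.nac += 1   # the task goes back to the pool for other workers
        else:
            raise ValueError(f"unknown verdict: {verdict}")

    def attempted(self) -> int:
        # The counters sum to the number of tasks the member has attempted.
        return self.acc + self.mrc + self.nac

def check_promotion(member: GuildMember, x: int = 20, min_accept_rate: float = 0.8) -> str:
    """Once x tasks have been attempted, compare the member's record against the
    guild's standard (here a minimum acceptance rate, an assumed placeholder)."""
    if member.attempted() < x:
        return "keep working"
    if member.acc / member.attempted() >= min_accept_rate:
        member.level += 1
        member.acc = member.mrc = member.nac = 0  # start fresh at the new level
        return "promoted"
    return "start over or leave the guild"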

Future work

A question raised by the community (@alipta) is how to ensure that the people reviewing your work are actually in a position to do so. Are they skilled enough? Do they have the proper experience? How can we be sure that a reviewer is properly reviewing someone else's submission?

Contributors

@teomoura @gbayomi @anotherhuman @trygve @pierref @alipta

References

[1] - G. Kazai, In Search of Quality in Crowdsourcing for Search Engine Evaluation.

[2] - Quality Management in Crowdsourcing using Gold Judges Behavior.

[3] - J. Rzeszotarski, A. Kittur, CrowdScape: Interactively Visualizing User Behavior and Output. In Proc. of the 25th Annual ACM Symposium on User Interface Software and Technology.

[4] - Quality Management in Crowdsourcing using Gold Judges Behavior, p. 267.

[5] - Quality Control in Crowdsourcing Systems: Issues and Directions.

[6] - P. Crosby, Quality is Free, McGraw-Hill, 1979.