From crowdresearch
Revision as of 14:38, 21 February 2016 by Tegenesantnioluzcarvalhodemoura (Talk | contribs) (Our approach)



In a Human Centered Guild system, one of the most important problems to be solved concerns the quality of completed tasks. As reported in [1], "Comparing quality between different pay per label sets (instead of pay per HIT), see rows 11-14 in Table 2, confirms the same trend: quality increases with pay. However, we can also observe evidence of a diminishing return effect, whereby the rate of increase in pay is matched by a slowing (or dropping) rate of increase in quality". Therefore, although higher pay does improve the outcome of a task, the effect is bounded: beyond a certain point, paying more no longer changes the quality significantly. With this in mind, we can state the problem we address in this submission: how can we ensure that the Guild system increases the overall quality of tasks?


It is well known that one of the major issues in crowdsourcing platforms is the lack of any guarantee that a task will have a high-quality outcome [2]. Several factors contribute to this: the amount paid, how well the worker's qualifications match the skill set required by the task, and many others. Researchers have been seeking a solution to this problem for some time, and one approach has emerged as the most popular: the Gold Standard Data technique. It consists of injecting tasks with known answers into the work stream in an attempt to control the quality of the work produced by the workers. Despite being a very popular approach, "there are several issues with relying on gold data for quality control: 1) gold data is expensive to create and each administered gold test has an associated cost, 2) in a crowdsourcing scenario, the distribution of gold tests over workers is likely to be non-uniform, with most workers completing only very few tests, if any, while a few active workers may get frequently tested, and 3) gold tests may be of varying difficulty, e.g., lucky sloppy workers may pass easy tests while unlucky diligent workers may fail on difficult tests."[2]. Other methods have also emerged, such as observing the behaviour of workers on certain tasks to predict how they will behave on future tasks [3]. Although this is a pragmatic and tangible way of applying AI expertise to human-generated problems, it is still vulnerable to flaws in the dataset, for example when a system is constantly fed inputs from bad workers and consequently misclassifies workers who would otherwise have had a good track record. Methods based on probabilistic analysis have also been suggested, which use a large amount of data about each worker to build a model of his/her reliability. Besides being fairly complex, such methods depend on data that may not be available for every worker on the platform.
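To make the Gold Standard Data technique concrete, here is a minimal Python sketch of how gold tasks could be used to estimate a worker's accuracy. The function name `gold_accuracy`, the `(worker, task, label)` record layout and the `min_tests` threshold are assumptions made for illustration only; they do not come from [2]. The threshold is one simple way to acknowledge the second issue quoted above, namely that most workers see very few gold tests.

```python
from collections import defaultdict


def gold_accuracy(answers, gold, min_tests=3):
    """Estimate per-worker accuracy using only the gold (known-answer) tasks.

    answers:   iterable of (worker_id, task_id, label) tuples (hypothetical layout)
    gold:      dict mapping gold task_id -> known correct label
    min_tests: assumed minimum number of gold tests before we trust the estimate

    Returns a dict worker_id -> accuracy in [0, 1], or None when the worker
    has answered too few gold tasks for the estimate to be meaningful.
    """
    seen = defaultdict(int)      # gold tasks answered per worker
    correct = defaultdict(int)   # gold tasks answered correctly per worker
    workers = set()

    for worker, task, label in answers:
        workers.add(worker)
        if task not in gold:
            continue             # ordinary task: no known answer to compare with
        seen[worker] += 1
        if label == gold[task]:
            correct[worker] += 1

    threshold = max(min_tests, 1)  # never divide by zero
    return {
        w: (correct[w] / seen[w] if seen[w] >= threshold else None)
        for w in workers
    }
```

A worker who answered only one gold task gets `None` rather than a misleading 0% or 100%, which illustrates why a non-uniform distribution of gold tests limits what this technique can say about most of the crowd.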


Although we have given a modest yet fair overview of the methods currently proposed in the task quality debate, our discussion still needs a solid basis. What is quality, and how do we define and measure it? Previous discussions aimed at a common definition of task quality have suggested several attributes that could serve as parameters for such an analysis, for example "reliability, accuracy, relevancy, completeness, and consistency"[5], as well as repeatability and reproducibility (@anotherhuman). It is therefore logical to abstract those attributes into characteristics a worker must have while working on a task in order to produce results that are aligned with the requester's expectations. Even though they would provide good metrics, it could be fairly difficult to measure such parameters and to find an appropriate, unambiguous definition for each of them. Taking that into account, it is fair to argue that any result that correctly matches the requester's expectations can be considered high-quality work, and we therefore adopt “the extent to which the provided outcome fulfills the requirements of the requester”[6] as our definition when analysing the quality of any given task.

Previous research on peer review systems for quality control in Crowdsourcing

The examples mentioned in the "Background" section represent several perspectives on our problem. Many of them rely on mathematically founded and/or AI-based solutions that tackle the issue in a computationally intensive way. We won't rely solely on that kind of approach, but we will build upon some of its results. Even though most of those systems are fairly complex and could lead to a very good solution in the future, they are arguably still at an early stage of development, and it could take a long time before they are able to discover, in a scalable manner, which aspects of human idiosyncrasy to look for when measuring whether a worker consistently delivers high-quality output. However, it is possible to take worker behaviours that are known to influence the outcome of a task and use them to test approaches involving other humans, as exemplified by [2]. In that experiment, the authors use a Gold Judge approach: they measure several parameters of Gold Judges, who are people known to produce high-quality results, and use the data gathered from them to learn how to predict whether a generic worker n is a poor performer or a desirable worker to keep on the platform. As a result of this approach, "when we look at the crowd and trained judges' behaviour together, e.g., S/P/L rows in the table, we see an increase in the number of features that significantly correlate with judge quality. This suggests that adding data that represents ideal behavior by trusted judges increases our ability to predict crowd worker's quality". Even though this approach is not a peer-to-peer application, it gives us a solid insight: data from people known to be high-quality performers can be used to classify a given worker or task as satisfactory or not before the results are sent back to the requester.

Our approach

So far, our examples have drawn inspiration from platforms that are already online or from techniques that could easily be implemented in an existing crowdsourcing platform. However robust those results are, it is a whole other discussion whether they would remain valid for a crowdsourcing platform that has a Guild system as a built-in feature. In this submission, therefore, we build upon the foundations already demonstrated while adapting them to our own approach, where workers are able to gather with other workers and form guilds led by more experienced peers who hold higher positions within the guild's internal hierarchy. For the Guild system to consistently provide higher-quality results than a standalone worker, it needs to efficiently implement a peer cooperation system in which work done by newcomers and less experienced members is consistently checked against a standard of quality defined by the guild itself. Several approaches come to mind. We could have each piece of work submitted by a less experienced person reviewed by a more experienced member. Following that line of thought, we could also define a peer review system where people gradually get less of their work reviewed as they rise to more senior positions. These examples, however, imply that people would at all times be assuming that someone's work is wrong and therefore needs to be corrected. This mindset doesn't lead to a truly collaborative community in the long run, so we'll do our best to keep away from it.

We should think of a method in which people do get their work checked, since we still need to guarantee the quality of the outcome produced by a guild. But more than that, we also need to encourage them to work as a team and not as adversaries. Building on these ideas, we note that letting people volunteer to take on bigger responsibilities has worked in the past, and it is the core belief behind our vision. A worker within a guild should be able to perform tasks for as long as he/she wants to and only then apply to level up in the guild structure, taking on more responsibilities and possibly more complex tasks.

System Design

1 -> The worker finds a guild of interest and asks permission to join it;

2 -> After the worker's background is checked against the minimum requirements set by the guild, he/she is accepted as a new member (yay!);

3 -> The worker gets access to the tasks given to that guild, chooses one to do, and submits the results to the guild;

4 -> A higher-level member checks the work against a standard of quality (this selection should be randomized so as to decrease the likelihood of false positives) and gives feedback in the form AC/MR/NA (AC means accepted, MR means a minor rejection, NA means not accepted, i.e. rejected);

5 -> If the outcome is accepted, the worker's ACC (Accepted Task Counter) increases by 1, bringing him/her closer to the next level in the guild;

6 -> If the outcome is rejected with an MR, the worker made a minor mistake that wouldn't deeply affect the outcome of the task. He/she gets another chance to submit it, and his/her MRC (Minor Rejection Counter) increases by 1;

7 -> If the outcome is rejected with an NA, the work is definitively rejected and the task is offered to other workers. The worker's NAC (Not Accepted Counter) increases by 1;

8 -> Whenever the sum of the three counters (which equals the number of tasks the worker has attempted) reaches a number x defined by the guild, the worker's scores are compared against the guild's standards. If they are good enough, he/she moves to the next level in the guild hierarchy. Otherwise, he/she has two options: start again from the beginning or leave the guild;

9 -> After reaching the next level, a smaller share of the worker's output will be reviewed by higher-ranked members (the exact ratio should be defined by the guild administrators), and that ratio continues to decrease as he/she moves higher in the guild's structure;

10 -> Ideally, once a worker reaches a high level within the guild, his/her work no longer needs to be reviewed by anyone else. It is recommended, however, that a gold task system continue to evaluate the quality of the outcomes produced by such workers, to make sure they don't get sloppy over time. Because it would be applied to fewer people, this gold task system would keep the costs described in the Background section to a minimum.
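The counting and promotion logic of steps 4 to 9 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a specification of the system: the class name `Member`, the functions `review_probability`, `needs_review` and `evaluate`, the use of an acceptance ratio (`min_accept_ratio`) as "the standards of the guild", the halving of the review ratio per level, and the resetting of counters after each evaluation are all hypothetical choices that the design above leaves to each guild.

```python
import random
from dataclasses import dataclass


@dataclass
class Member:
    level: int = 0
    acc: int = 0  # accepted task counter (AC verdicts)
    mrc: int = 0  # minor rejection counter (MR verdicts)
    nac: int = 0  # not accepted counter (NA verdicts)

    def record(self, verdict):
        """Steps 5-7: update the counter matching the reviewer's verdict."""
        if verdict == "AC":
            self.acc += 1
        elif verdict == "MR":
            self.mrc += 1
        elif verdict == "NA":
            self.nac += 1
        else:
            raise ValueError(f"unknown verdict: {verdict!r}")

    def attempts(self):
        """Sum of the three counters, i.e. the number of attempts so far."""
        return self.acc + self.mrc + self.nac


def review_probability(level, base=1.0, decay=0.5):
    """Step 9 (assumed rule): the reviewed share halves with each level."""
    return base * decay ** level


def needs_review(member, rng=random):
    """Randomly decide whether this submission goes to a higher-level reviewer."""
    return rng.random() < review_probability(member.level)


def evaluate(member, x, min_accept_ratio):
    """Step 8: once x attempts are reached, compare scores to the guild standard.

    Returns "continue", "promoted" or "restart_or_leave". Counters are reset
    after every evaluation (an assumption; the design does not say so).
    """
    if member.attempts() < x:
        return "continue"
    if member.acc / member.attempts() >= min_accept_ratio:
        member.level += 1
        outcome = "promoted"
    else:
        outcome = "restart_or_leave"
    member.acc = member.mrc = member.nac = 0
    return outcome
```

For example, a new member with four AC verdicts and one MR verdict, evaluated with `x=5` and `min_accept_ratio=0.7`, would be promoted to level 1, after which only about half of his/her submissions would be sent for review under the assumed decay of 0.5. A guild could plug in a different standard, for instance one that weighs NA verdicts more heavily than MR verdicts.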

Future work

A question raised by the community (@alipta) is how to ensure that the people reviewing your work are actually in a position to do so. Are they skilled enough? Do they have the proper experience? How can we be sure that a reviewer is properly reviewing someone else's submission?


@teomoura @gbayomi @anotherhuman @trygve @pierref @alipta


[1] - In Search of Quality in Crowdsourcing for Search Engine Evaluation, Gabriella Kazai

[2] - Quality Management in Crowdsourcing using Gold Judges Behavior

[3] - J. Rzeszotarski, A. Kittur - CrowdScape: Interactively Visualizing User Behavior and Output. In Proc. of the 25th Annual ACM Symposium on User Interface Software and Technology

[4] - Quality Management in Crowdsourcing using Gold Judges Behavior - p.267

[5] - Quality Control in Crowdsourcing Systems - Issues and Directions

[6] - P. Crosby, Quality is Free, McGraw-Hill, 1979.