From crowdresearch
Revision as of 22:46, 20 February 2016 by Tegenesantnioluzcarvalhodemoura (Talk | contribs) (Background)



In a Human Centered Guild system, one of the most important problems to solve is the quality of the work produced for each task. As seen in [1], "Comparing quality between different pay per label sets (instead of pay per HIT), see rows 11-14 in Table 2, confirms the same trend: quality increases with pay. However, we can also observe evidence of a diminishing return effect, whereby the rate of increase in pay is matched by a slowing (or dropping) rate of increase in quality". Therefore, although the amount paid for a task does improve the quality of its outcome, the effect is bounded: beyond a certain point, paying more no longer changes quality significantly. With this in mind, we can clearly state the problem this submission addresses: how can we ensure that the Guilds system will increase the overall quality of tasks?


It is well known that one of the major issues in crowdsourcing platforms is the lack of any guarantee that a task will have a high-quality outcome [2]. Several factors contribute to this, including the amount paid and how well the worker's qualifications match the skill set required by the specific task, among many others. Researchers have been seeking a solution to this problem for some time, and one approach has become the most popular: the Gold Standard Data technique. It consists of injecting tasks with known answers into the work stream in an attempt to control the quality of the work produced by the workers. Despite being a very popular approach, "there are several issues with relying on gold data for quality control: 1) gold data is expensive to create and each administered gold test has an associated cost, 2) in a crowdsourcing scenario, the distribution of gold tests over workers is likely to be non-uniform, with most workers completing only very few tests, if any, while a few active workers may get frequently tested, and 3) gold tests may be of varying difficulty, e.g., lucky sloppy workers may pass easy tests while unlucky diligent workers may fail on difficult tests." [2] Other methods have also emerged, such as observing how workers behave on certain tasks in order to predict how they will behave on future tasks [3]. Although this is a pragmatic and concrete way of applying AI expertise to human-generated problems, it remains vulnerable to flaws in the dataset: for example, a system that is constantly fed inputs from bad workers may misclassify workers who would otherwise have had a good track record [4].
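To make the Gold Standard Data technique discussed above more concrete, here is a minimal sketch in Python. It is not taken from [2] or from any particular platform: the function name `gold_accuracy`, the tuple layout of `responses`, the sample data, and the `min_tests` threshold are all assumptions made for illustration. The sketch scores each worker only on the injected gold tasks and declines to judge workers who saw too few of them, which is one simple way to expose the non-uniform test distribution described in the quote.

```python
from collections import defaultdict


def gold_accuracy(responses, gold_answers, min_tests=3):
    """Estimate each worker's accuracy on gold tasks only.

    responses: iterable of (worker_id, task_id, answer) tuples (assumed layout).
    gold_answers: dict mapping a gold task_id to its known correct answer.
    min_tests: hypothetical threshold; workers with fewer gold tests are not judged.
    Returns a dict worker_id -> (accuracy or None, number of gold tests seen).
    Workers who never saw a gold task do not appear in the result.
    """
    correct = defaultdict(int)
    seen = defaultdict(int)
    for worker_id, task_id, answer in responses:
        if task_id not in gold_answers:
            continue  # ordinary task: no known answer, so it cannot be scored
        seen[worker_id] += 1
        if answer == gold_answers[task_id]:
            correct[worker_id] += 1

    report = {}
    for worker_id, n in seen.items():
        # Too few gold tests means the estimate is unreliable, so report None.
        accuracy = correct[worker_id] / n if n >= min_tests else None
        report[worker_id] = (accuracy, n)
    return report


# Illustrative data only: three gold tasks (g1-g3) mixed with an ordinary task (t7).
gold_answers = {"g1": "relevant", "g2": "not relevant", "g3": "relevant"}
responses = [
    ("alice", "g1", "relevant"),
    ("alice", "t7", "relevant"),        # ordinary task, ignored when scoring
    ("alice", "g2", "not relevant"),
    ("alice", "g3", "not relevant"),    # wrong answer on a gold task
    ("bob", "g1", "relevant"),          # bob saw only one gold test
]
report = gold_accuracy(responses, gold_answers)
```

In this toy run, alice is scored on three gold tasks, while bob receives no accuracy estimate because a single test says little about him. The sketch does not address the other two issues in the quote, namely the cost of creating gold data and the varying difficulty of gold tests.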


@teomoura / @gbayomi


[1] - Gabriella Kazai - In Search of Quality in Crowdsourcing for Search Engine Evaluation
[2] - Quality Management in Crowdsourcing using Gold Judges Behavior
[3] - J. Rzeszotarski, A. Kittur - Crowdscape: interactively visualizing user behavior and output. In Proc. of the 25th annual ACM symposium on User interface software and technology