In a Human Centered Guild system, one of the most important problems to solve concerns the quality of completed tasks. As reported in prior work, "Comparing quality between different pay per label sets (instead of pay per HIT), see rows 11-14 in Table 2, confirms the same trend: quality increases with pay. However, we can also observe evidence of a diminishing return effect, whereby the rate of increase in pay is matched by a slowing (or dropping) rate of increase in quality". Therefore, although the amount paid for a task does improve the quality of its outcome, the effect is bounded: beyond a certain point, paying more no longer improves quality significantly. With this in mind, we can state the problem addressed in this submission: how can we ensure that the Guild system increases the overall quality of tasks?
It is well known that one of the major issues in crowdsourcing platforms is the lack of any guarantee that a task will have a high-quality outcome. Several factors contribute to this, among them the amount paid and how well the worker's qualifications match the skill set the task requires. Researchers have been seeking a solution to this problem for some time, and one approach has emerged as the most popular: the Gold Standard Data technique. It consists of injecting tasks with known answers into the work stream in an attempt to control the quality of the work produced. Despite its popularity, "there are several issues with relying on gold data for quality control: 1) gold data is expensive to create and each administered gold test has an associated cost, 2) in a crowdsourcing scenario, the distribution of gold tests over workers is likely to be non-uniform, with most workers completing only very few tests, if any, while a few active workers may get frequently tested, and 3) gold tests may be of varying difficulty, e.g., lucky sloppy workers may pass easy tests while unlucky diligent workers may fail on difficult tests." Other methods have also emerged, such as observing workers' behaviour on certain tasks to predict how they will behave on future tasks. Although this is a pragmatic and tangible way of applying AI expertise to human-generated problems, it remains vulnerable to flaws in the dataset: a system that is constantly fed inputs from bad workers may, for example, misclassify workers who would otherwise have had a good track record. Probabilistic methods have also been suggested, which use a large amount of data on each worker to build a model of his or her reliability. Besides being fairly complex, these methods depend on data that may not be available for every worker on the platform.
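The Gold Standard Data technique described above can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the procedure of any cited work: the function name `gold_accuracy`, the `(worker, task, answer)` tuple layout and the `min_tests` cut-off are all hypothetical. The cut-off is there to make the second issue in the quote visible: workers who saw too few gold tests get no accuracy estimate at all.

```python
from collections import defaultdict


def gold_accuracy(responses, gold_answers, min_tests=3):
    """Estimate per-worker accuracy from injected gold tasks.

    responses    -- iterable of (worker, task_id, answer) tuples (assumed layout)
    gold_answers -- dict mapping gold task_id -> known correct answer
    min_tests    -- hypothetical minimum number of gold tests needed
                    before an accuracy estimate is reported
    Returns {worker: (accuracy or None, number_of_gold_tests_seen)}.
    """
    seen = defaultdict(int)
    correct = defaultdict(int)
    workers = set()
    for worker, task_id, answer in responses:
        workers.add(worker)
        if task_id in gold_answers:          # only gold tasks are scored
            seen[worker] += 1
            if answer == gold_answers[task_id]:
                correct[worker] += 1

    report = {}
    for worker in workers:
        n = seen[worker]
        # Too few gold tests: refuse to estimate rather than guess.
        accuracy = correct[worker] / n if n >= min_tests else None
        report[worker] = (accuracy, n)
    return report


if __name__ == "__main__":
    gold = {"g1": "A", "g2": "B", "g3": "A"}
    log = [
        ("ann", "g1", "A"), ("ann", "g2", "B"), ("ann", "g3", "C"),
        ("ann", "t1", "X"),
        ("bob", "g1", "A"), ("bob", "t2", "Y"),
    ]
    for worker, (accuracy, n) in sorted(gold_accuracy(log, gold).items()):
        print(worker, accuracy, n)
```

In this toy log one worker answers three gold tasks and receives an estimate, while the other answers only one and receives none, which mirrors the non-uniform test distribution the quote warns about.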
Although we have given a brief overview of the methods currently proposed in the task quality debate, our discussion needs a solid basis. What is quality, and how do we define and measure it? Previous attempts to reach a common definition of task quality have suggested several attributes that could serve as parameters for such an analysis, for example "reliability, accuracy, relevancy, completeness, and consistency", as well as repeatability and reproducibility (@anotherhuman). It is therefore logical to abstract those attributes into characteristics a worker must have while working on a task in order to produce results aligned with the requester's expectations. Although they would provide good metrics, such parameters could be fairly difficult to measure, and it could be hard to find an appropriate, unambiguous definition for each of them. Taking that into account, it is fair to argue that any result that matches the requester's expectations can be considered high-quality work, and we therefore adopt "the extent to which the provided outcome fulfills the requirements of the requester" as our definition when analysing the quality of any given task.
Previous research on peer review systems for quality control in Crowdsourcing
The examples mentioned in the section "Background" represent a range of views on our problem. Many of them rely on mathematically founded and/or AI-based solutions that tackle the issue in a computationally intensive way. We will not rely solely on that kind of approach, but we will build upon some of its results. Although most of those systems are fairly complex and could lead to a very good solution in the future, they are arguably still in early-stage development, and it could take a long time before they are able to discover, in a scalable manner, which aspects of human idiosyncrasy to look for when measuring whether a worker consistently delivers high-quality output. However, it is possible to use behaviours shown by a worker that are known to influence the outcome of a task in approaches that involve other humans, as exemplified by prior work. In that experiment, the authors use a Gold Judge approach: they measure several parameters of a Gold Judge, someone they know for sure will produce high-quality results, and use the data gathered from those people to learn how to predict whether a generic worker n is a poor performer or a desirable worker to keep on the platform. As a result of this approach, "when we look at the crowd and trained judges' behaviour together, e.g., S/P/L rows in the table, we see an increase in the number of features that significantly correlate with judge quality. This suggests that adding data that represents ideal behavior by trusted judges increases our ability to predict crowd worker's quality". Although this approach is not a peer-to-peer application, it gives us the solid insight that data from people known to be high-quality performers can be used to classify a given worker or task as satisfactory or not before the results are sent back to the requester.
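To make the Gold Judge idea concrete, the sketch below compares a worker's behaviour with a profile built from trusted judges. It is a deliberately simple stand-in and not the model used in the cited experiment: the two behavioural features (seconds spent per task, scroll events per task), the mean absolute z-score measure and the `max_deviation` threshold are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def gold_profile(gold_rows):
    """Per-feature (mean, spread) of the trusted judges' behaviour.

    gold_rows -- list of equal-length feature tuples, one per gold judge.
    A zero spread is replaced by 1.0 so later divisions stay defined.
    """
    columns = list(zip(*gold_rows))
    return [(mean(col), pstdev(col) or 1.0) for col in columns]


def deviation(row, profile):
    """Mean absolute z-score of one worker's features against the profile."""
    return mean(abs(x - m) / s for x, (m, s) in zip(row, profile))


def looks_reliable(row, profile, max_deviation=2.0):
    """True if the worker behaves similarly to the gold judges.

    max_deviation is a hypothetical threshold, not a value from the source.
    """
    return deviation(row, profile) <= max_deviation


if __name__ == "__main__":
    # Assumed features: (seconds per task, scroll events per task).
    judges = [(30.0, 10.0), (34.0, 12.0), (32.0, 14.0)]
    profile = gold_profile(judges)
    print(looks_reliable((33.0, 11.0), profile))  # close to the judges
    print(looks_reliable((5.0, 0.0), profile))    # rushed, no scrolling
```

A real system would learn which features matter from labelled data, as the quoted passage suggests; the fixed threshold here only shows where such a learned rule would plug in.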
So far, all of our examples are based on crowdsourcing paradigms that are currently found online and openly accessible, and they rely on those paradigms to build their arguments and insights. However robust those results are, whether they remain valid for a crowdsourcing platform with a built-in Guild system is a separate question. In this submission, therefore, we build upon the foundations already demonstrated but adapt them to our own approach, in which workers can gather with other workers and form guilds led by more experienced peers who hold higher positions within the guild's internal hierarchy. For the Guild system to consistently provide higher-quality results than a standalone worker, it needs an efficient review system in which the task at hand is reviewed by someone more experienced before being delivered to the requester. To achieve that, we implement a peer-review system between high-level workers and inexperienced or newcomer workers in a guild. Its intent is not only to stop undesirable workers from delivering poor results but also to make sure that they do not last long in the guild, by affecting their reputation and their ability to find further highly desirable tasks to work on.
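The review flow described above could look roughly like the following sketch. Everything in it is an assumption made for illustration: the `Member` and `Guild` names, integer levels and reputation points, the rule of picking the most senior available reviewer, and the expulsion threshold are hypothetical and are not specified in this submission.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    name: str
    level: int            # position in the guild hierarchy (higher = senior)
    reputation: int = 5   # assumed starting reputation, in points


@dataclass
class Guild:
    members: dict = field(default_factory=dict)
    min_reputation: int = 2   # assumed: below this, the member is removed

    def join(self, member):
        self.members[member.name] = member

    def pick_reviewer(self, author_name):
        """Return the most senior member ranked above the author, or None."""
        author = self.members[author_name]
        seniors = [m for m in self.members.values() if m.level > author.level]
        if not seniors:
            return None
        return max(seniors, key=lambda m: (m.level, m.reputation))

    def record_review(self, author_name, approved):
        """Apply a review outcome to the author's reputation.

        Returns True if the work may be delivered to the requester.
        A rejected submission costs one point; an approved one earns one.
        """
        author = self.members[author_name]
        author.reputation += 1 if approved else -1
        if author.reputation < self.min_reputation:
            del self.members[author_name]   # worker does not last in the guild
        return approved


if __name__ == "__main__":
    guild = Guild()
    for member in (Member("master", 3), Member("journeyman", 2), Member("newcomer", 1)):
        guild.join(member)
    reviewer = guild.pick_reviewer("newcomer")
    print(reviewer.name)
    print(guild.record_review("newcomer", approved=False))
```

The design choice worth noting is that rejection has two effects at once: the result is held back from the requester, and the author's standing in the guild drops, which is the dual intent stated in the paragraph above.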
@teomoura / @gbayomi
 - In Search of Quality in Crowdsourcing for Search Engine Evaluation, Gabriella Kazai
 - Quality Management in Crowdsourcing using Gold Judges Behavior
 - J. Rzeszotarski, A. Kittur - Crowdscape: interactively visualizing user behavior and output. In Proc. of the 25th annual ACM symposium on User interface software and technology
 - Quality Management in Crowdsourcing using Gold Judges Behavior - p.267
 - Quality Control in Crowdsourcing Systems - Issues and Directions
 - P. Crosby, Quality is Free, McGraw-Hill, 1979.