QualityControlInGuilds

Intro

In a Human Centered Guild system, one of the central problems to solve is the quality of completed tasks. As shown in [1], "Comparing quality between different pay per label sets (instead of pay per HIT), see rows 11-14 in Table 2, confirms the same trend: quality increases with pay. However, we can also observe evidence of a diminishing return effect, whereby the rate of increase in pay is matched by a slowing (or dropping) rate of increase in quality". In other words, while the amount paid for a task does improve its outcome, the effect is bounded: beyond a certain point, paying more no longer raises quality significantly. With this in mind, we can clearly state the problem we are solving in this submission: how can we ensure that the Guilds system will increase the overall quality of tasks?

Background

It is well known that one of the major issues in crowdsourcing platforms is the lack of any guarantee that a task will have a high-quality outcome [2]. Several factors come into play: how well the worker's skill set matches the requirements of the task, the amount paid, and many others.

Researchers have been seeking a solution to this problem for some time, and one approach has emerged as the most popular: the Gold Standard Data technique. It consists of injecting tasks with known answers in an attempt to control the quality of the work produced and to calibrate workers' output against the desired results. Despite its popularity, there are several issues with relying on gold data for quality control: "1) gold data is expensive to create and each administered gold test has an associated cost, 2) in a crowdsourcing scenario, the distribution of gold tests over workers is likely to be non-uniform, with most workers completing only very few tests, if any, while a few active workers may get frequently tested, and 3) gold tests may be of varying difficulty, e.g., lucky sloppy workers may pass easy tests while unlucky diligent workers may fail on difficult tests" [2]. Other methods have also emerged, such as observing workers' behaviour on certain tasks to predict how they will behave on future tasks [3]. Although this is a pragmatic, palpable way of applying AI expertise to a human-generated problem, it remains vulnerable to flaws in the dataset, for example when a system is fed mostly with inputs from bad workers and consequently misclassifies workers who would otherwise have a good track record. Other probabilistic methods have also been suggested, which build a model of each worker's reliability from a large amount of data about that worker. Besides being fairly complex, it is easy to argue that such data may not always be available for every worker on the platform.

An Idea on Quality of a Task

Although we have given a modest yet fair overview of current methods in the task quality debate, it is in our best interest to establish a solid basis for our discussion. What is quality, and how do we define and measure it? Previous discussions around a common definition of task quality have suggested several properties that could serve as parameters for such an analysis, for example "reliability, accuracy, relevancy, completeness, and consistency" [5], as well as repeatability and reproducibility (@anotherhuman). It is therefore logical to abstract those properties into characteristics a worker must exhibit while working on a task in order to produce results aligned with the requester's expectations. Even though they would provide good metrics, it could be fairly difficult to measure such parameters and to find an appropriate, unambiguous definition for each of them. Taking that into account, it is fair to argue that any result that correctly matches the requester's expectation can be considered high-quality work, and we therefore adopt "the extent to which the provided outcome fulfills the requirements of the requester" [6] as our definition when analyzing the quality of any given task.

Previous Research

The examples mentioned in the "Background" section represent many visions of our current problem. Many of them rely on mathematically founded and/or AI-based solutions that tackle the issue with a very computationally intensive mindset. We will not rely solely on that kind of approach, but will build upon some of its results: even though most of those systems are fairly complex and could lead to very good solutions in the future, they are arguably still in early-stage development, and it could take a long time before they can identify, in a scalable way, the right aspects of human idiosyncrasy to look for when measuring whether a worker consistently delivers high-quality output. However, it is possible to use worker behaviours that are known to influence the outcome of a task to test approaches involving other humans, as exemplified by [2]. In that experiment, the authors use a Gold Judge approach: they measure several behavioural parameters of gold judges, people known for certain to produce high-quality results, and use the data gathered from them to learn to predict whether a generic worker is a poor performer or a worker worth keeping on the platform. As a result of this approach, "when we look at the crowd and trained judges' behaviour together, e.g., S/P/L rows in the table, we see an increase in the number of features that significantly correlate with judge quality. This suggests that adding data that represents ideal behavior by trusted judges increases our ability to predict crowd worker's quality" [4]. This result gives us the insight that data from people known to be high-quality performers can be used to classify a given worker/task as satisfactory or not before sending the results back to the requester.
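As a rough, hypothetical illustration of this idea (not the implementation used in [2]), the sketch below augments crowd-worker behavioural data with sessions from trusted gold judges and trains a simple classifier to estimate whether a worker is a high-quality performer. The feature names, values, and threshold are invented for the example.

```python
# Hypothetical sketch: predicting worker quality from behavioural features,
# with training data augmented by trusted "gold judge" sessions.
# Feature names and values are illustrative, not taken from [2].
from sklearn.linear_model import LogisticRegression

# Each row: [seconds_per_item, fraction_of_items_revisited, mouse_moves_per_item]
crowd_features = [[4.2, 0.05, 11], [9.8, 0.30, 40], [3.1, 0.02, 7]]
crowd_labels = [0, 1, 0]            # 1 = high-quality worker (known from gold tests)

gold_judge_features = [[10.5, 0.35, 45], [8.9, 0.28, 38]]
gold_judge_labels = [1, 1]          # trusted judges are high quality by definition

X = crowd_features + gold_judge_features
y = crowd_labels + gold_judge_labels

model = LogisticRegression().fit(X, y)

# Score a new worker's session before releasing results to the requester
new_worker = [[5.0, 0.10, 15]]
print(model.predict_proba(new_worker)[0][1])   # estimated probability of high quality
```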

Our Approach

So far, our examples have drawn inspiration from platforms that are already online or that could easily be implemented on top of an existing crowdsourcing platform. However robust those results are, it is another discussion entirely whether they would remain valid on a crowdsourcing platform with a built-in Guild system. In this submission, therefore, we build upon the foundations already demonstrated, keeping in mind that it is in our best interest to adapt them to our own approach, where workers can gather with other workers and form guilds led by more experienced peers in higher positions (higher levels/ranks) within the guild's internal hierarchy. For the Guild system to consistently provide higher-quality results than a standalone worker, it needs to efficiently implement a peer cooperation system in which work done by newcomers and less experienced people is consistently checked against a standard of quality defined by the guild itself. Several approaches come to mind. Each piece of work submitted by a less experienced person could be reviewed by an older member (or several, like the well-known five peer reviews in MOOCs). Following that mindset, we could also define a peer review system where people gradually get less of their work reviewed over time as they rise to more senior positions. These examples, however, imply that we would always be assuming that someone's work is wrong and therefore needs correcting. That mindset does not lead to a truly collaborative community in the long run, so we do our best to keep away from it.

We should think of a method in which people do get their work checked - we still need to guarantee the quality of the outcome produced by a guild - but one that also encourages them to work as a team and not as adversaries. Building on these ideas, we observe that letting people volunteer for bigger responsibilities has worked in the past, and it is the core belief behind our vision. A worker within a guild should be able to perform tasks for as long as he or she wants and only then apply to level up in the guild structure, gaining more responsibilities and possibly more complex tasks. With this method we address both the reputation and quality issues without creating friction within the community.
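One possible way to express "less review over time as workers rise in seniority" is a rank-based sampling rate for peer review; the rank names follow the guild structure described later, and the rates below are purely an assumption for illustration.

```python
import random

# Hypothetical review rates per rank: apprentices are always reviewed,
# journeymen are spot-checked, masters are only rarely checked.
REVIEW_RATE = {"apprentice": 1.0, "journeyman": 0.25, "master": 0.05}

def needs_review(rank: str) -> bool:
    """Decide whether a submitted task should be routed to a peer reviewer."""
    return random.random() < REVIEW_RATE[rank]

# Example: route a batch of submissions
submissions = [("alice", "apprentice"), ("bob", "journeyman"), ("carol", "master")]
for worker, rank in submissions:
    print(worker, "-> review" if needs_review(rank) else "-> accept directly")
```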

Why Guilds

When discussing Guilds or any organizational design solution for a crowdsourcing platform, one's mental model is influenced by the asymmetry in power, information access, pricing, and reputation on MTurk. That asymmetry is pervasive, and the impulse is to correct the injustice; for us it has proven to be a useful starting point for discussion. Yet, however unsavory the imbalance may seem, MTurk is still operating and functioning. If you were to create a new crowdsourcing platform, how would you inject symmetry into it? In eliciting the idea of symmetry, we look more to systemic balance than to social fairness; in the crowdsourcing world, the invisible hand of the market is more opaque. So we looked to history, distant and more recent, for inspiration. Historically, guilds have represented craftsmen and projected an image of quality. This association is critical: on MTurk, workers are seen as low-cost labor; in a balanced system, they are skilled craftsmen. A mechanism for symmetry is therefore to allow workers to associate within the system to share information, share training, and leverage for better pricing. We see such worker impulses in the form of Turker Nation, Turkopticon, and the like, yet they exist outside the system. They create a sense of community around training and development, but we look to Guilds as balancing agents within the system, economic entities that affect time, cost, and quality. You cannot separate benefits from savings, nor can you separate the social from the training and economic dimensions of the guild.

So why collectives or guilds to address this problem; why not deploy a system-wide computational compatibility model? We feel that the intimacy of clustering workers around a task or a skill ultimately provides community: a cross-cultural, cross-generational collection of like-minded individuals who share common professional/technical interests, professional development goals, and a sense of reasonableness that allows them to act as a network, to leverage collective intelligence, to collaborate and conspire. These are bonds that, when scaled, can create universal skill/compatibility norms and rational dynamic pricing. Admittedly, the need for these community elements at the bottom of the digital pyramid is arguably not as critical as in macro-task environments; nevertheless, such cultural components are critical to the efficiency of the Guild.

In a guild, individual workers are not exposed to the randomness of requesters' reputation whims; they are protected behind a veil by the collective reputation of the guild. This protection allows for a freer exchange of information, for quality and learning to be ensured behind that veil, and for edge costs to be absorbed in the process.

Why would a requester choose a guild over an individual?

The initial engagement of requesters and guilds is a cold-start challenge in its own right. Are individual requesters more inclined to trust an individual or an organization? Through mechanisms of managed risk, requesters are probably more disposed to using an organization to complete tasks than an individual. Guilds will be able to leverage their community and their individual approaches to vetting and ensuring quality, perhaps in the form of guarantees or rebating. However, once we move beyond that initial engagement, it will come down to how a guild positions itself in the market. This is a question of risk and capacity: scale, and the ability to aggregate work. Internal mechanisms (rules, standards, and best practices) guarantee quality, time, and cost savings, and guild members are selected based on a computational compatibility model.


Origins of the Concept of Guild

So how would guilds form if they were not mandated by the system or systemically created? If the system were open enough to allow for worker organization within it, two impulses would drive guild creation: intrapreneurship and homophily. One example is the "Sphere of Confidence" (Pierre): groups created by intrapreneurs who can share their reputation with others.


Historical functions of guilds

  • Formation through apprenticeship and its regulation
  • Quality standard settings
  • Hours of work regulation
  • Diffusion of market information
  • Diffusion of best practices
  • Price fixing
  • Access to credit and financial support

How Do Guilds Work: Candidate Attraction

Guilds have access to a Computational Compatibility tool, which they use to identify new members, thereby validating the model and establishing the internal skills hierarchy.
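As a rough sketch of how such a compatibility score might work (the skill names, weights, and cosine measure below are our assumptions; the actual tool is not specified here):

```python
import math

# Hypothetical compatibility score: cosine similarity between a candidate's
# skill vector and the guild's desired skill profile. Skill names are illustrative.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

GUILD_PROFILE = {"image_labeling": 0.9, "transcription": 0.3, "translation": 0.1}

def compatibility(candidate_skills: dict) -> float:
    keys = sorted(set(GUILD_PROFILE) | set(candidate_skills))
    return cosine([GUILD_PROFILE.get(k, 0.0) for k in keys],
                  [candidate_skills.get(k, 0.0) for k in keys])

print(compatibility({"image_labeling": 0.8, "transcription": 0.5}))  # ~0.97
```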

  • "Facebook" pages: each guild can publish on a dedicated page provided to it by the system.
  • Chat channels: the chat allocates two channels per guild, guild-open (shared by all) and guild-restricted.

An autonomous worker has access to:
  • a general chat, for discussion between platform users;
  • a market chat, a channel where guilds promote themselves and requesters can also promote and announce future job offers.
Such a worker can select open jobs/tasks to work on and post job interests, but most importantly he or she can also apply to an existing guild through direct chat contact with the advertised guild officials.

How Do Guilds Work: Requester Attraction

A requester will have the ability to request a guild to work with or, if they so desire, to work with an individual. As this is a guild conversation, consider a requester who comes to a guild because of the size of the task (say, image processing for 25,000 photos). Instead of having to give instructions and answer emails from a multitude of individuals, they have one point of contact. In exchange for trusting the guild and for the convenience of the transaction they pay a premium, though moving forward we would like to establish relationships with requesters that lead to retained-services pricing structures or subscription-based services. Once terms and conditions have been established, the task is released into the guild.

How Do Guilds Work: Communication and Governance

The guild can set standards for applying; an application means that a worker declares that he or she meets these standards. Within a guild, three ranks organize the relationships between members:
  • Apprentice: new member.
  • Journeyman: confirmed member; contributes by mentoring apprentices and can monitor batch results.
  • Master: senior member; takes part in decisions, has admin access to guild tools, monitors journeymen's activities, and can contract on behalf of the guild.
(We could change these names to a fancier context if necessary.)

A journeyman can and must delegate some tasks to his apprentices and is in charge of checking the produced results. As a side note, apprentices benefit from the performance record of the journeyman: they gain access to tasks that would otherwise have been unavailable because of application restrictions defined by the requester. Journeyman supervision is paid for by a fee collected from the amount due for task completion. Supervision is a mandatory step toward achieving mastership, and a journeyman's management of apprentices is reviewed by a master.
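To illustrate how the supervision fee might be taken from the amount due for task completion, here is a minimal sketch; the 15% rate and the two-way split are purely assumptions, not a guild rule:

```python
# Hypothetical payout split: the journeyman's supervision fee is deducted
# from the payment for the completed task. The 15% rate is illustrative only.
SUPERVISION_FEE_RATE = 0.15

def split_payment(task_payment: float) -> dict:
    fee = round(task_payment * SUPERVISION_FEE_RATE, 2)
    return {"apprentice": round(task_payment - fee, 2), "journeyman": fee}

print(split_payment(10.00))   # {'apprentice': 8.5, 'journeyman': 1.5}
```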

Skills are managed in tree-like structures from general to more specific (cf. Yoni's description). Regarding promotion and skill levelling, an apprentice's skill attribution results from a proposal issued by the journeyman mentor. A journeyman can present his own case to masters to acquire a new skill level, while masters manage their own skill acquisition on a peer-to-peer basis.
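For illustration only, here is a minimal sketch of such a tree-like skill structure; the category and skill names are placeholders, not part of the guild specification:

```python
# Hypothetical skill tree, from general categories down to specific skills.
SKILL_TREE = {
    "data_annotation": {
        "image": ["bounding_boxes", "segmentation"],
        "text": ["sentiment_labeling", "entity_tagging"],
    },
    "content_creation": {
        "writing": ["summarization", "translation"],
    },
}

def leaf_skills(tree) -> list:
    """Flatten the tree into the specific skills a member can be levelled on."""
    if isinstance(tree, list):
        return tree
    return [s for subtree in tree.values() for s in leaf_skills(subtree)]

print(leaf_skills(SKILL_TREE))
```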

How Do Guilds Work: Quality Assurance

While a skill may not be necessary to perform micro tasks, many attributes are required to reliably complete tasks. Attention to detail, timeliness, task focus, immunity to repetition disorder, and pride in work (maybe :) are all elements that can be heuristically measured from resumes/CVs, GitHub, MOOCs, a guild application, prior experience as a crowd worker, etc. Using graph databases, we would extract data/attributes so as to define compatibility proximity to a task. The challenge, when implementing this in a microtask-only platform, is whether there will be enough discernible variation among workers and tasks to cluster meaningfully and/or to obtain a statistically significant distance metric. Without this meaningful variation, the approach is less likely to impact the task feed in a positive way. Only through development of the algorithm and underlying data model (graph vs. relational) will we be able to find the variance thresholds required to determine the effectiveness and appropriate implementation of this design. Until a certain density is achieved in the collective, human intervention is required to validate core competencies, until the system has learned enough to automate; spot checking will occur. Not only does computational compatibility give proximity to tasks; near misses can also be identified and converted into learning opportunities (associations with MOOCs or internal training) for workers.
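As a hedged illustration of "compatibility proximity to a task" and of flagging near misses, the sketch below uses a plain Euclidean distance over a small attribute vector; the attribute names, thresholds, and values are ours, and a real system might use a graph database and a learned metric instead:

```python
# Hypothetical proximity check between a worker's attribute vector and a task's
# requirements; "near misses" are flagged as training opportunities.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Attributes (illustrative): attention_to_detail, timeliness, task_focus
task_requirements = [0.9, 0.7, 0.8]
MATCH_THRESHOLD, NEAR_MISS_THRESHOLD = 0.3, 0.6

def route(worker_attributes):
    d = euclidean(worker_attributes, task_requirements)
    if d <= MATCH_THRESHOLD:
        return "assign task"
    if d <= NEAR_MISS_THRESHOLD:
        return "near miss: suggest training (e.g. a MOOC module)"
    return "do not assign"

print(route([0.85, 0.65, 0.75]))   # assign task
print(route([0.6, 0.4, 0.5]))      # near miss
```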

Minimizing the risk and enhancing the responsibility for returning the task on time, in the manner requested, is tethered to 1) the appropriate matching of the work to the worker going in, and 2) QC checking as the product is completed and delivered. This duality of task in, task out is a dynamic method of training and of adding measurable moments to a worker's profile. It gives cold starters/newbies an opportunity to review work and/or participate in live work with a safety net. In a high-frequency, low-latency environment this double-blind process is standard practice.

[Image: Guild-errors.jpeg]

System Design

1 -> A worker finds a guild of interest and asks permission to join;

2 -> After his/her background is checked against minimum requirements set by the guild, he/she is accepted as a new member (yay!);

3 -> The worker gets access to the tasks given to that guild; he chooses one to do and submits his results to the guild;

4 -> Since he is a new member, his work is submitted for review by an older member of the guild;

5 -> Once his work is reviewed and accepted, his outcome is stored in the database.

6 -> Whenever he wants to move up a level inside the guild, he presses a 'level up' button. Once that happens, his previous work is displayed to more senior members as a task comparing his work against work known to be high quality. These review tasks are assigned anonymously to several senior members, and if they acknowledge that the worker's performance has been satisfactory, he is moved one level up.

7 -> Since he has moved up one level, harder tasks will be suggested to him. In this way, we can apply the same procedure at each level of the guild without risking a worker gaming the system (a sketch of this flow follows below).
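To make the flow above concrete, here is a small sketch of the submission lifecycle it implies; the state names and transition rules are ours, not a prescribed platform design:

```python
# Sketch of the submission lifecycle described in steps 1-7 above.
# State names are illustrative; the actual platform design is not prescribed here.
from enum import Enum, auto

class SubmissionState(Enum):
    SUBMITTED = auto()
    UNDER_PEER_REVIEW = auto()   # step 4: new members' work is reviewed
    ACCEPTED = auto()            # step 5: stored in the guild database
    REJECTED = auto()

def next_state(state, member_is_new: bool, review_passed: bool = True):
    if state is SubmissionState.SUBMITTED:
        return SubmissionState.UNDER_PEER_REVIEW if member_is_new else SubmissionState.ACCEPTED
    if state is SubmissionState.UNDER_PEER_REVIEW:
        return SubmissionState.ACCEPTED if review_passed else SubmissionState.REJECTED
    return state

s = next_state(SubmissionState.SUBMITTED, member_is_new=True)
print(s, "->", next_state(s, member_is_new=True, review_passed=True))
```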

Future works

A question raised by the community (@alipta) is how to ensure that the people reviewing your work are actually in a position to do so. Are they skilled enough? Do they have the proper experience? How can we be sure that a reviewer is properly reviewing someone else's submission?

Contributors

@teomoura @gbayomi @anotherhuman @trygve @pierref @alipta @yoni.dayan

References

[1] - G. Kazai, "In Search of Quality in Crowdsourcing for Search Engine Evaluation".

[2] - "Quality Management in Crowdsourcing using Gold Judges Behavior".

[3] - J. Rzeszotarski, A. Kittur, "CrowdScape: Interactively Visualizing User Behavior and Output", in Proc. of the 25th Annual ACM Symposium on User Interface Software and Technology.

[4] - "Quality Management in Crowdsourcing using Gold Judges Behavior", p. 267.

[5] - "Quality Control in Crowdsourcing Systems: Issues and Directions".

[6] - P. Crosby, "Quality is Free", McGraw-Hill, 1979.