Guilds Experimental Design
Daemo will have features that provide:
- levels for workers based on anonymous peer assessment of the quality of work output
- level segmented worker pools with fair wage recommendations for requesters
- forums encouraging community development, sharing of best practices, and communication between stakeholder groups
For the guilds to be successful, we believe it is important that members, when compared to a control group, feel an increased locus of control, a greater sense of inclusion in the group, and increased self-efficacy, on top of higher average hourly pay.
Does our proposed mechanism impact functional efficiency and introduce social effects, compared to a system without the mechanism?
In other words, when we introduce our guild system, what are the:
- Performance Impacts
- Social Impacts
Depending on pilot results, we expect to conduct a study in which:
- Several hundred MTurkers are randomly split between a guild case (with leveling & forum) or an otherwise identical non-guild case (with no leveling or forum)
- Tasks are posted and reviewed in both cases, and include:
  - Tasks with known answers, to find correlation of work accuracy and level
  - Standard surveys and activities that operationalize desired effects, to monitor psychological and sociological impacts of guilds
  - Tasks from real world requesters, to understand how pricing is impacted by guilds
- We observe activity on the forum & assess it for sentiment changes during the study
- We run the study for as long as we can
Details and Options
This is a list of possible setups. We are not fixed to one and we are looking for more ideas!
1. Guild Trial vs. Control
Set up a simple environment where workers and requesters can experience one of 2 cases. One example from Micheal follows:
We can evaluate this guild peer assessment system by recruiting 200 workers from Amazon Mechanical Turk to perform work over a period of three weeks. A sample of the work posted to the platform will have gold-standard (automatically testable) answers, for example a transcription task whose output can be compared for accuracy against a previously published transcription.
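As a rough illustration of how a transcription task could be scored automatically against a gold standard, the sketch below computes a word-level similarity between a submission and the reference text. The function name and the choice of Python's difflib similarity ratio are assumptions made for illustration; the study does not specify a scoring metric.

```python
from difflib import SequenceMatcher

def transcription_accuracy(submitted, gold):
    """Word-level similarity in [0, 1] between a submission and the gold text.

    Hypothetical metric: case is ignored and the texts are split on whitespace.
    """
    sub_words = submitted.lower().split()
    gold_words = gold.lower().split()
    # autojunk=False keeps frequent words from being discarded in long texts.
    matcher = SequenceMatcher(None, sub_words, gold_words, autojunk=False)
    return matcher.ratio()

score = transcription_accuracy("the quick brown fox", "the quick brown dog")
```

A per-worker accuracy score could then be the mean of this value over all gold-standard tasks the worker completed.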
We could randomize workers into either Daemo's guild system or into a non-guild equivalent. All workers will complete tasks. Workers in the guild condition will perform peer evaluations, receive ladder ranks, and participate in the guild forum boards. Workers in the control condition will have no peer review or ladder ranks, similar to microtask platforms such as Amazon Mechanical Turk and Crowdflower. Because the two approaches involve different payment structures, we will hold the total amount of work constant between conditions and measure the different cost outcomes.
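The random split into conditions could be done along these lines. This is a minimal sketch, assuming an even split and a fixed random seed so the assignment can be reproduced; the worker IDs and the function name are hypothetical.

```python
import random

def assign_conditions(worker_ids, seed=0):
    """Shuffle workers and split them evenly into guild and control groups."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(worker_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"guild": shuffled[:half], "control": shuffled[half:]}

# Hypothetical IDs standing in for the 200 recruited workers.
workers = [f"worker_{i:03d}" for i in range(200)]
groups = assign_conditions(workers, seed=42)
```

Assigning from a single shuffled list guarantees that no worker ends up in both conditions.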
At the conclusion of the study, we will calculate worker accuracy scores on the gold-standard tasks. We will then compute a correlation of workers’ reputation against their accuracy on the gold-standard tasks. Reputation for workers in the guild condition will be determined by their guild ranking; reputation for workers in the control condition will be determined by their work acceptance rate on the platform, corresponding to current best practice. We hypothesize that the guild condition will result in a higher correlation than the control condition. Second, we will measure self-reported inclusion and self-efficacy via survey instruments, hypothesizing that they will be higher in the guild condition. Third, we will measure the differences in overall cost between conditions, as well as the percentage of work completed by each ladder rank in the guild condition.
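One possible way to run the first analysis is sketched below. It assumes a Spearman rank correlation (since ladder ranks are ordinal) and a Fisher z-transform to compare the two conditions; neither choice is specified in the study plan, and the Fisher test is only an approximation for rank correlations. Reputation values are assumed to be coded so that higher means better.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(reputation, accuracy):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(reputation), ranks(accuracy))

def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations.

    Requires |r| < 1 and n > 3 in both groups.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se
```

Here `spearman` would be called once per condition (guild ranking vs. accuracy, and acceptance rate vs. accuracy), and a positive `fisher_z_difference` with the guild condition as the first argument would point in the hypothesized direction.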
Some added thoughts on this
See leveling study related thoughts.
Requesters are expected to find benefit from these guild processes when they receive completed work after worker review. We will incorporate ground truth tasks to assure the accuracy of work submissions. To verify the differences between the groups, we will compare acceptance rates between the guilds and controls.
For the minor research questions, we will review performance attributes such as [...] and conduct surveys with the participants from the perspective of interactionist psychology (person * task), covering self-efficacy (Bandura), locus of control [seeking a good measure], and self-reports about workers' experiences with the guilds. To mitigate potential worker trait effects in the results, we will administer a modified Core Self Evaluation [Bono] scale before, during, and after the three-week period. The scale will serve as a control variable for the analysis of Generalized Self Efficacy and Trait Locus of Control. To control for the proactive behavior associated with self-inclusion, we will incorporate a measure of personal initiative [seeking Frese's work], eliminating the effects of personal traits as far as possible.
Advantage: We learn about the exact effects based on a comparison.
Disadvantage: Might not be possible to iterate dynamically.
2. Conduct a Longitudinal Social Experiment
Set up a simple environment for workers and requesters and evolve it regularly based on their feedback. Document feedback as well as metrics over the course of the study.
Advantage: We can iterate a lot and use a participatory/design thinking methodology.
Disadvantage: No comparative study; our results can only be compared to results from earlier points within the study.
Add more ideas here!