WinterMilestone 4 Team Carpe Noctem - Improve Trust Building On Top Of Daemo

From crowdresearch
Revision as of 16:09, 8 February 2016 by Lucasq (Talk | contribs) (Experiment 1: compare comprehensive rating and specific rating)

At the center of the online labor market is the problem of trust. Without it, neither workers nor requesters are willing to engage in the long run. As in any offline market, trust is one of the most important issues in building a successful, sustainable marketplace. Usually, trust between two people in a work setting is established through a history of working together and knowing each other well. Here, we propose some approaches and experiments to examine a systematic design on top of the existing changes proposed in Daemo [Daemo: A Crowdsourced Crowdsourcing Platform, Stanford Crowd Research Collective]. The goal is to ease the bootstrapping of trust and to maintain that trust through measures that incentivize good behavior and punish bad behavior.


From our previous interviews and research [Daemo, among others], it is clear that we can improve trust between workers and requesters by showing them highly rated users and improving the feedback loop between them. In this proposal, we focus on optimizing the reputation system and feedback loop to improve the overall user experience on the marketplace.

Use subcategory reputations in recommendation

Because the types of work vary widely, we suspect that a single reputation score may not accurately represent a worker’s work ethic, quality of work, areas of expertise, level of experience, etc. Instead, we break the reputation down into subcategories. There are two main types of subcategories: general characteristics and areas of expertise. Each subcategory has its own metrics for computing its reputation score.

General characteristics consist of the universal qualities of a worker, e.g. whether the worker finishes tasks on time (timeliness), general requester satisfaction, work ethic, communication skill/style, etc. For a specific subcategory, e.g. timeliness, we look at all the tasks completed by this worker and the percentage of them completed within their deadlines. Reputation for areas of expertise will be computed based on the number of tasks this worker completed in that area, the difficulty level of the tasks, the size/scope of the tasks, what role the worker took, and so on. These metrics will result in a reputation score for that particular task type. For example, if a worker has consistently received good scores on web development, they would be rated highly in the area of web development.
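The timeliness metric described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CompletedTask` record and its field names are hypothetical and are not part of Daemo, and returning `None` for a worker with no history is one possible design choice among several.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CompletedTask:
    """Hypothetical record of one finished task (not a Daemo data model)."""
    deadline: datetime
    completed_at: datetime


def timeliness_score(tasks):
    """Return the fraction of completed tasks delivered by their deadline.

    Returns None when there are no completed tasks, so that a new worker
    is not displayed as having a 0% on-time rate.
    """
    if not tasks:
        return None
    on_time = sum(1 for t in tasks if t.completed_at <= t.deadline)
    return on_time / len(tasks)


history = [
    CompletedTask(datetime(2016, 2, 1), datetime(2016, 1, 30)),  # early
    CompletedTask(datetime(2016, 2, 3), datetime(2016, 2, 3)),   # exactly on the deadline
    CompletedTask(datetime(2016, 2, 5), datetime(2016, 2, 7)),   # late
    CompletedTask(datetime(2016, 2, 8), datetime(2016, 2, 6)),   # early
]
print(timeliness_score(history))  # 3 of 4 tasks on time -> 0.75
```

An expertise score would need more inputs (task count, difficulty, scope, role), and the proposal does not yet fix how those are combined, so it is left out of this sketch.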

Similar to worker reputation, requester reputation will be divided into subcategories such as task quality, response time, payment timeliness, an easy-to-work-with score, etc.

Our recommendation system will look at both the requirements of the requester and the reputation scores of workers. The system uses machine learning to categorize and rank the experience level of each worker/requester, then computes a match score based on subcategory matching.
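As a rough illustration of the match-score step only, the sketch below takes a weighted average of a worker's subcategory scores, using importance weights supplied by the requester. The function names, the dictionary format, the 0 to 1 score scale, and the weighted-average formula are all assumptions made for this example; the machine-learning categorization step is not modeled here.

```python
def match_score(requirements, reputation, default=0.0):
    """Weighted average of a worker's subcategory scores.

    requirements: dict mapping subcategory -> importance weight (>= 0),
                  as stated by the requester (hypothetical format).
    reputation:   dict mapping subcategory -> worker score in [0, 1].
    default:      score assumed for subcategories with no history.
    """
    total_weight = sum(requirements.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(
        weight * reputation.get(category, default)
        for category, weight in requirements.items()
    )
    return weighted / total_weight


def rank_workers(requirements, workers):
    """Return worker names sorted from best to worst match."""
    return sorted(
        workers,
        key=lambda name: match_score(requirements, workers[name]),
        reverse=True,
    )


# Made-up example: the requester cares three times as much about
# web development skill as about timeliness.
requirements = {"web development": 3, "timeliness": 1}
workers = {
    "alice": {"web development": 0.9, "timeliness": 0.5},
    "bob": {"web development": 0.6, "timeliness": 1.0},
}
print(rank_workers(requirements, workers))  # ['alice', 'bob']
```

With these weights alice scores about 0.8 and bob about 0.7, so the ranking follows what this particular requester asked for instead of an overall average.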

Experiment 1: Compare Comprehensive Reputation and Subcategory Reputation

To compare, we set up a control group with only a comprehensive reputation score, calculated as the mean of all subcategory reputation scores. The experimental group will use our recommendation algorithm, which uses the subcategory reputations described above. We record the satisfaction ratings that workers and requesters give each other and evaluate the results by looking at which group produces a higher overall satisfaction rate.
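For the control group, the comprehensive score is just the unweighted mean of the subcategory scores. A minimal sketch, assuming scores are stored in a dictionary on a 0 to 1 scale (both the storage format and the scale are assumptions of this example):

```python
from statistics import mean


def comprehensive_score(reputation):
    """Unweighted mean of all subcategory scores, or None if there are none."""
    if not reputation:
        return None
    return mean(reputation.values())


# Made-up profiles for illustration only.
specialist = {"web development": 0.9, "timeliness": 0.5}
generalist = {"web development": 0.6, "timeliness": 1.0}

print(round(comprehensive_score(specialist), 2))  # 0.7
print(round(comprehensive_score(generalist), 2))  # 0.8
```

Under a single score the second profile ranks higher, even though the first is stronger in web development. That difference is what the experiment is designed to measure.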

Optimize Feed Ranking With More Weight on Recent History

One issue that came up in our research is that ratings may not accurately reflect what can be expected from a worker/requester. Many highly rated users rely on their high reputation and produce lower-quality work or requests over time, yet the other party may still give a high rating because they lack an understanding of the quality standard for that type of task and are thus influenced by the worker's/requester's other high ratings. One way to counter this effect is to weight the reputation more heavily toward recent performance. We could use an algorithm that models reputation as a time series, with more weight on the ratings from recently completed tasks.
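One simple way to put more weight on recent ratings is exponential decay, where a rating's weight halves every fixed number of days. The sketch below is an assumption chosen for illustration, not the algorithm the proposal commits to; the function name, the `(age_in_days, rating)` input format, and the 30-day half-life are all hypothetical.

```python
def recency_weighted_reputation(ratings, half_life_days=30.0):
    """Average of ratings where newer ratings count more.

    ratings: list of (age_in_days, rating) pairs, where age 0 means
             the task was completed today.
    Each rating gets weight 0.5 ** (age / half_life_days), so a rating
    that is one half-life old counts half as much as a fresh one.
    """
    if not ratings:
        return None
    weights = [0.5 ** (age / half_life_days) for age, _ in ratings]
    weighted_sum = sum(w * rating for w, (_, rating) in zip(weights, ratings))
    return weighted_sum / sum(weights)


# Made-up history: two older 5-star ratings, then a recent 3-star rating.
ratings = [(60, 5.0), (30, 5.0), (0, 3.0)]

plain_mean = sum(r for _, r in ratings) / len(ratings)
print(round(plain_mean, 2))                            # 4.33
print(round(recency_weighted_reputation(ratings), 2))  # 3.86
```

The weighted score falls faster than the plain mean when recent work declines, which is the behavior the paragraph above asks for. The half-life controls how quickly old ratings stop mattering and would need to be tuned.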

Experiment 2: Compare Weighted vs. Unweighted Reputation

We will set up two systems, one using an unweighted algorithm and one using a weighted algorithm to compute reputation. We will ask users how closely they think their client's reputation matches the actual quality of that client's work or requests. Finally, we evaluate the results by comparing which model produces a smaller gap between reputation and actual performance.
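One possible way to quantify the gap is the mean absolute difference between the reputation a user was shown and the quality the other party reported, both on the same scale. This metric and the function name are assumptions made for this sketch, and the numbers are invented for illustration, not experimental results.

```python
def mean_absolute_gap(pairs):
    """Average absolute difference between reputation and perceived quality.

    pairs: list of (displayed_reputation, perceived_quality) tuples,
           both measured on the same rating scale.
    """
    if not pairs:
        raise ValueError("need at least one observation")
    return sum(abs(rep - quality) for rep, quality in pairs) / len(pairs)


# Invented observations on a 1 to 5 scale, for illustration only.
unweighted_system = [(4.5, 3.0), (4.0, 4.0), (5.0, 3.5)]
weighted_system = [(3.5, 3.0), (4.0, 4.0), (4.0, 3.5)]

print(mean_absolute_gap(unweighted_system))           # 1.0
print(round(mean_absolute_gap(weighted_system), 2))   # 0.33
```

Whichever system yields the smaller value would be judged to track actual performance more closely; a real analysis would also need a significance test across participants.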

Public and private evaluation/rating incorporated into overall rating

Experiment 3: Compare Public and Private Rating

Peer and client evaluation/rating incorporated into overall rating

According to studies [Shaw, Horton and Chen, CSCW ’11], social pressure from others can make workers more careful and more motivated about the quality of their work. We thus inform workers that they will be evaluated by their fellow collaborators and requesters on the same tasks during the work process. The feedback will be provided (with an anonymous option) to workers during the process so that they can adjust as they go, instead of waiting to be corrected after mistakes have caused damage. Consistently bad feedback, i.e. no improvement after feedback is provided, will harm a worker's reputation score.

Experiment 4: Compare Public and Private Rating

Milestone Contributors