Winter Milestone 4 Team Carpe Noctem - Improve Trust Building On Top Of Daemo

Standing at the center of the online labor market is the problem of trust. Without it, neither workers nor requesters are willing to engage in the long run. As in any offline market, trust is one of the most important issues in building a successful, sustainable marketplace. Usually, trust between two people in a work setting is established through a history of working together and knowing each other well. Here, we propose some approaches and experiments that examine a systematic design on top of the existing changes proposed in Daemo [Daemo: A Crowdsourced Crowdsourcing Platform, Stanford Crowd Research Collective]. The goal is to ease the bootstrapping of trust and to maintain it through measures that incentivize good behavior and punish bad behavior.


Introduction

Previous research [Daemo, among others] makes it clear that we can improve trust between workers and requesters by showing them highly-rated users and improving the feedback loop between them. In this proposal, we focus on optimizing the reputation system and feedback loop to improve the overall user experience on the marketplace.

Use Subcategory Reputations in Recommendation

Because the marketplace covers a wide range of work types, we suspect that a single reputation score may not accurately represent a worker's work ethic, quality of work, areas of expertise, level of experience, etc. Instead, we break the reputation down into subcategories. There are two main types of categories: general characteristics and areas of expertise. Each subcategory has its own metrics for computing its reputation score.

General characteristics are the universal qualities of a worker, e.g. whether the worker finishes tasks on time (timeliness), general requester satisfaction, work ethic, communication skill/style, etc. For a specific subcategory, e.g. timeliness, we look at all the tasks completed by the worker and the percentage of them finished within their deadlines. Reputation for an area of expertise will be computed based on the number of tasks the worker completed in that area, the difficulty level of the tasks, the size/scope of the tasks, what role the worker took, and so on. These metrics result in a reputation score for that particular task type. For example, if a worker has consistently received good scores on web development, he would be rated highly in the area of web development.
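
As a rough sketch of how such subcategory metrics could be computed, the Python snippet below assumes a hypothetical CompletedTask record (the field names are our own invention, not an existing Daemo data model). It shows a timeliness score as the fraction of on-time submissions, and a simplified expertise score as the average requester rating within one area; task difficulty, size/scope, and role would enter as additional weights.

    from dataclasses import dataclass
    from datetime import datetime
    from typing import List

    @dataclass
    class CompletedTask:
        # Hypothetical record of a finished task; field names are illustrative.
        category: str            # e.g. "web development"
        deadline: datetime
        submitted_at: datetime
        requester_rating: float  # normalized requester score in [0, 1]

    def timeliness_score(tasks: List[CompletedTask]) -> float:
        # Fraction of the worker's tasks submitted on or before the deadline.
        if not tasks:
            return 0.0
        on_time = sum(1 for t in tasks if t.submitted_at <= t.deadline)
        return on_time / len(tasks)

    def expertise_score(tasks: List[CompletedTask], area: str) -> float:
        # Simplified expertise score: average requester rating within one area.
        # Difficulty, size/scope, and role would be added as extra weights.
        in_area = [t for t in tasks if t.category == area]
        if not in_area:
            return 0.0
        return sum(t.requester_rating for t in in_area) / len(in_area)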

Similar to worker reputation, requester reputation will be divided into subcategories such as task quality, response time, payment timeliness, an easy-to-work-with score, etc.

Our recommendation system will look at both the requester's requirements and the workers' reputation scores. The system uses machine learning to categorize and rank the experience level of each worker/requester, then computes a match score based on subcategory matching.
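
The snippet below is a minimal sketch of the subcategory matching step only, assuming the requester's requirements arrive as hypothetical per-subcategory weights; the machine-learning categorization and ranking step is not shown.

    from typing import Dict

    def match_score(worker_scores: Dict[str, float],
                    requirement_weights: Dict[str, float]) -> float:
        # Weighted average of the worker's subcategory scores, using the
        # requester's requirement weights (hypothetical format: subcategory -> weight).
        total_weight = sum(requirement_weights.values())
        if total_weight == 0:
            return 0.0
        weighted = sum(worker_scores.get(sub, 0.0) * w
                       for sub, w in requirement_weights.items())
        return weighted / total_weight

    # Example: a task that mostly needs web development and timeliness.
    worker = {"timeliness": 0.9, "communication": 0.7, "web development": 0.85}
    requirements = {"web development": 0.6, "timeliness": 0.3, "communication": 0.1}
    print(match_score(worker, requirements))  # roughly 0.85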

Experiment 1: Compare Comprehensive Reputation and Subcategory Reputation

For comparison, we set up a control group that sees only a comprehensive reputation score, calculated as the mean of all subcategory reputation scores. The experimental group will use our recommendation algorithm, which matches on the subcategory reputations described above. We record the satisfaction ratings that workers and requesters give each other and evaluate the results by checking which group produces the higher satisfaction rate overall.

Optimizing Feed Ranking With More Weight on Recent History

One issue that came up in our research is that ratings may not accurately reflect what to expect from a worker or requester. Many highly-rated users come to rely on their high reputation and produce lower-quality work or requests over time, yet the other party may still give a high rating: lacking a sense of the quality standard for that type of task, they are swayed by the worker's or requester's other high ratings. One way to counter this effect is to weight the reputation more heavily toward recent performance. We could use an algorithm that simulates a time-series function, giving more weight to the ratings from recently completed tasks.
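
One possible concrete form of such a weighting, sketched below, is exponential decay over rating age; the half-life value is an arbitrary placeholder that the experiments would need to tune.

    from datetime import datetime
    from typing import List, Tuple

    def weighted_reputation(ratings: List[Tuple[datetime, float]],
                            half_life_days: float = 90.0) -> float:
        # Exponentially decayed average: a rating loses half its weight every
        # `half_life_days` days, so recently completed tasks dominate the score.
        if not ratings:
            return 0.0
        now = max(ts for ts, _ in ratings)
        num, den = 0.0, 0.0
        for ts, rating in ratings:
            age_days = (now - ts).total_seconds() / 86400.0
            weight = 0.5 ** (age_days / half_life_days)
            num += weight * rating
            den += weight
        return num / den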

Experiment 2: Compare Weighted vs. Unweighted Reputation

We will set up two systems, one using the unweighted algorithm and one using the weighted algorithm to compute reputation. We will ask users how closely they think their client's reputation matches their work/request quality. Finally, we evaluate the results by comparing which model produces the smaller gap between reputation and actual performance.

Public and Private Feedback Channels

A phenomenon of rating inflation was identified in the paper "Reputation Inflation: Evidence from an Online Labor Market" by John J. Horton & Joseph M. Golden. Although the exact causes are not certain, the authors found that 'when these costs are reduced—namely by allowing buyers to give feedback without the seller knowing it—buyers are substantially more candid. Further, the buyers who had the strongest incentive not to be candid—namely those using the marketplace intensively—showed the biggest “candor gap."' The gap comes from the difference between public and private feedback: public feedback affects future interactions. A similar pattern can be found in labor marketplaces, where people fear that giving negative public feedback will drive away future workers/requesters. Thus, we propose to test the two feedback channels for reputation calculation. Our hypothesis is that private ratings are more accurate and less vulnerable to social pressure or concerns about future interactions. Therefore we want to use only private feedback for the reputation calculation and display only the final aggregated reputation score publicly.
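
Below is a minimal sketch of the proposed flow, assuming a hypothetical Feedback record with a private flag: only private ratings feed the reputation calculation, and only the rounded aggregate is ever shown publicly.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Feedback:
        rating: float   # e.g. 1.0 - 5.0
        comment: str
        private: bool   # True: visible only to the platform, not to the rated party

    def aggregated_reputation(feedback: List[Feedback]) -> float:
        # Reputation is computed from private ratings only; individual ratings
        # are never exposed, just this aggregate.
        private_ratings = [f.rating for f in feedback if f.private]
        if not private_ratings:
            return 0.0
        return round(sum(private_ratings) / len(private_ratings), 1)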

Experiment 3: Compare Public and Private Rating

Here we set up three groups of HITs on three systems. One system uses all public ratings, one uses all private ratings, and one uses both. Each group computes the ratings as a simple average of all feedback ratings. We ask all users whether their client's reputation matches their work/request quality and evaluate the results to see which group scores highest on this question.

Peer, Self, Client Evaluation in Understanding Work Quality

According to prior studies [Shaw, Horton and Chen, CSCW ’11], self-assessment in a social setting can give surprisingly accurate scores on work quality. We ask workers to rate their own performance at the end of their HITs. If many workers work on the same task, they have the chance to rate each other's work quality. When evaluating work quality, we still rely predominantly on the ratings from requesters and peer workers, but we also include the worker's self-assessment, weighted by their trustworthiness score. If, in the long run, workers consistently rate their own work much higher than the client's and peers' evaluations, their trustworthiness reputation will drop heavily.
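
The sketch below illustrates one possible weighting and trustworthiness update; the exact weights, thresholds, and penalties are hypothetical placeholders to be determined experimentally.

    from typing import List

    def work_quality(client: float, peers: List[float],
                     self_rating: float, trustworthiness: float) -> float:
        # Blend of client, peer, and self ratings; the self rating's weight is
        # scaled by the worker's trustworthiness. All weights are illustrative.
        peer_avg = sum(peers) / len(peers) if peers else client
        self_weight = 0.2 * trustworthiness          # at most 20% of the blend
        external_weight = 1.0 - self_weight
        external = 0.6 * client + 0.4 * peer_avg
        return external_weight * external + self_weight * self_rating

    def updated_trustworthiness(trustworthiness: float, self_rating: float,
                                client: float, peer_avg: float,
                                inflation_threshold: float = 1.0) -> float:
        # Penalize trustworthiness when a worker rates themselves well above
        # what the client and peers report; threshold and step sizes are arbitrary.
        external = (client + peer_avg) / 2.0
        if self_rating - external > inflation_threshold:
            return max(0.0, trustworthiness - 0.1)
        return min(1.0, trustworthiness + 0.02)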

Experiment 4: Compare Peer and Self Evaluation

We set up five systems for this experiment: one that uses only peer evaluation, one that uses only self evaluation, one that uses only client evaluation, one that uses both client and peer evaluation, and one that uses all three, with a small weight on self evaluation adjusted by the worker's trustworthiness.

We use tasks that can be measured quantitatively for this experiment: since we cannot use the evaluations to evaluate themselves, we need objective results to measure against. We evaluate the results by comparing which group's work-quality evaluation is closest to the actual objective results.

Further Thoughts

We think a good reputation system, like a good search engine algorithm, needs to be calibrated over time with various factors that may be discovered during testing. The ideas above are merely some that came to mind and that we wanted to discuss here. The goal is to give users confidence so they can enjoy the benefits that come with their hard-earned reputation.

Milestone Contributors

  • Lucas Qiu  : @lucasq
  • Michelle Chan : @michellechan
  • Manoj Pandey : @manojpandey
  • Mengnan Wang : @mengnan