Milestone 6 TuringMachine
See also: Milestone 6 TuringMachine Code
Trust Rank, Sustainable Reputation Framework for the Volatile Crowdsourcing Marketplace
Snehal (Neil) Gaikwad
In the current economic downturn, the Crowdsourcing Marketplace has shown strong potential to support our social and professional lives. However, the lack of trust between market participants has not only increased volatility in the marketplace but also raised serious concerns about its sustainability in the long run. The market participants, requestors and workers, are highly exposed to the risk of poor-quality work or payment defaults. At present, no mechanism exists to help participants quantify the risk involved in crowdsourced projects. In this paper, we analyze the major trust issues in the marketplace. Based on practices in structured finance and credit ratings, we propose Trust Rank, a Sustainable Reputation Framework for the Volatile Crowdsourcing Marketplace. Trust Rank uses the reputation score to categorize participants into different tranches (risk buckets). The ecosystem around the reputation score helps sustain the trust level by rewarding good practices and punishing bad ones. Furthermore, we build a recommender system to help market participants diversify their work/task-force portfolio based on their risk appetite. We validate the framework via a one-month field study and demonstrate that groups exposed to the Trust Rank Framework were able to produce high-quality work and followed fair practices as compared to groups working in traditional marketplaces.
Over the past decade, Mechanical Turk (MTurk), oDesk, and Task Rabbit have created a new paid Crowdsourcing Marketplace, which has helped many households meet their daily financial needs in tough economic conditions. The MTurk workers' median household income of $50,000 per year is on par with the median US household income (MTurk Demographics; Ipeirotis 2010). Unfortunately, despite having been around for a long time, the crowdsourcing marketplace has been experiencing a large degree of broken trust between the market participants, i.e. workers and requestors.
Requestors don't easily trust the results they get back from workers, whereas workers face uncertainty about the requestor's intention to pay and treat them fairly. In Being a Turker (Martin et al. 2014), one of the workers highlights: "Got a mass rejection from some hits I did for them! Talked to other turkers that I know in real life and the same thing happened to them. There rejection comments are also really demeaning. Definitely avoid!" Another wrote: "We can be rejected yet the requestors still have our articles and sentences.. Not Fair." On the other hand, in Kittur et al. 2008, requestors described how spammers can significantly degrade the quality of work and waste the requestors' resources. This lack of trust between the participants may break the marketplace rather than advance it.
The distribution of trust, power, and reputation plays a key role in sustaining online collaborations. There is inherent risk and volatility involved when strangers collaborate in virtual environments. Therefore, it is essential to understand and quantify such uncertainty. This understanding will help us establish trust and reputation in the marketplace.
The work presented in the paper builds upon research from Crowd-Computing, Organizational Behavior, Economics, and Finance.
Requestors often don't trust the quality of the results they get back from workers. Various approaches have been invented to improve the quality of work. Bernstein et al. 2010 proposed aggregating results from multiple workers. Kittur et al. 2011 implemented a map-reduce framework. Kittur et al. 2008 introduced additional questions that have verifiable quantitative answers. Gupta et al. 2012 proposed majority agreement in language digitization. Mason & Watts 2009 demonstrated the anchoring effect and claimed that increased financial incentives do not affect the quality of the work produced. Chandler & Kapelner 2010 provided context about the task to improve the task results. Dow et al. 2012 introduced a rubric for self and peer assessment. As crowdsourcing becomes more complex (see Retelny et al. 2014), additional standards are required to ensure the quality of results. Because of the wide variety of tasks involved, it is often challenging to introduce common standards that will help produce high-quality results. However, new mechanisms can be designed to complement existing standards and improve the quality of results.
Workers, on the other hand, don't believe in the requestor's intention to pay and treat them fairly. Several forums (Turker Nation, MTurk Forum, MTurk Grind) and platforms have been created to allow workers to voice their opinions. Turkopticon (Irani et al. 2013) allows workers to evaluate requestors on AMT. However, most of the review forums are isolated from the requestors' reach, so requestors never get feedback about their behavior.
From the workers' and requestors' perspective, it is crucial to estimate the amount of time required to complete a task. Cheng et al. 2015 proposed the Error-Time Area, which measures the effort required in crowdsourced tasks and determines a fair price for them. This research opens various opportunities to estimate and quantify the risk involved in crowdsourcing work.
Organizational Behavior and Economics
Dasgupta 2000 highlights the relationship between Trust and Reputation. Reputation is a capital asset that can help build trust between strangers. Reputation can be earned or destroyed by pursuing certain courses of action. According to Lewis & Weigert 1985 and McAllister 1995, trust is a multidimensional construct that has cognitive and affective dimensions. The cognitive dimensions consist of competence, reliability, and professionalism, whereas the affective dimensions describe caring, benevolence, and emotional connection to each other. Workers and requestors in the traditional Crowdsourcing Marketplace operate in isolation with minimal communication, which leads to a lack of caring and emotional connection. Over time, this imbalance in the affective elements becomes a recipe for breaking the trust equilibrium.
Credit Ratings, Risk Management, & Reputation in the Financial Systems
Trust & Reputation: Over the centuries we have developed strong trust in financial and economic systems. Credit Ratings play a vital role in sustaining the trust between market participants. The ratings demonstrate the reputation and financial power of Nations, Institutions, Corporate Companies, and Individual Citizens. In the case of a citizen, credit history provides a record of the borrower's repayment of loans. The Credit Score is a function of the citizen's credit history. A good Credit Score can help a citizen make a fortune, but a bad Credit Score can make life harder, even costing a job. For example, the TSA, a leading US government agency, doesn't hire applicants with "delinquent debt" (Bad Credit).
Risk Management: In structured finance, securities are categorized based on their ratings and the investment risk involved. An investor uses various strategies to optimize a portfolio and reduce the non-systematic risk. The investor can diversify the portfolio by investing in different assets. This is similar to saying: Don't put all your eggs in one basket.
Trust Rank, Sustainable Reputation Framework for the Volatile Crowdsourcing Marketplace
Reputation Score and Ranking as indicators of the trust
Based on proven practices in financial systems, we created the Reputation Score for participants in the Crowdsourcing Marketplace. The Reputation Score is a function of the participant's historical activities. Using the Reputation Score, we built the Global Ranking of the participants.
Risk Diversification, a first for the crowdsourcing market: With the Trust Rank Framework, market participants can ask the following questions and maximize the returns on their task portfolio. A Worker is concerned about: What is the likelihood that a requestor will neither default on payment nor reject the task unfairly?
- We build a diversified Task Portfolio for Workers that helps them select a set of tasks that evenly distributes the risk of payment defaults and maximizes the gain. Workers can set a goal, and the system will recommend tasks based on their risk appetite and time horizon.
A Requestor is concerned about: What is the likelihood that workers will produce great work?
- We build a diversified Worker Portfolio for Requestors that helps them select workers so as to diversify the risk of quality default and maximize the gain. Here the gain might be budget/resource allocation or optimization.
Advantages of the Trust Rank Framework
- The Trust Rank Framework enables participants to trust each other using the Reputation Scores and Ranks.
- The Trust Rank Framework allows participants to foresee the risk involved in working with someone (quality) or working for someone (fairness).
- The Trust Rank Framework leverages a leaderboard to push participants to engage in open and fair practices.
We propose Trust Rank, a Reputation & Trust Framework for Volatile Crowd Marketplaces based on three components: first, Reputation Score Based Social Ranks; second, Task Portfolio Risk Diversification; and third, a Task & Risk specific Recommendation Engine. In what follows, we discuss the system in further detail.
1) REPUTATION SCORE BASED SOCIAL RANKS
- Reputation Score & Mechanism to build the Ranking Profile: The Reputation Score is similar to the Credit Score or Credit Ratings used in financial systems. Workers rate requestors on various ranking parameters. Generosity, Fairness, Promptness, and Communicativity are widely used ranking parameters (Irani et al. 2013). In addition, we introduce the Probability of Payment Defaults (PPD) as a parameter that indicates the historical track record of the requestor. The Ranking Function maps the ranking parameter vector to an overall Reputation Score and Category, which indicate the likelihood of misconduct by the requestor. The higher the Reputation Score, the lower the chance of misconduct.
- Leaderboard Social Reputation: The Reputation Score is used to encourage and reward the top performers and punish the bad ones (see Fig. 2). Depending on their scores, the workers/requestors are ranked and categorized into the following buckets:
- A, HALL OF FAME: score above 90%. Perks: 1) Membership to the Crowdsourcing Standards Institute, which defines the best practices to sustain the marketplace. 2) Ability to post various tasks infrequent number of times 3) Ability to select the top performing crowd for the work 4) Allowed to manage/lead communities 5) Minimum Wage: Will need to pay historic average payment for the task
- B, FAIR: score between 70%-90%. Perks: 1) Manage/lead certain number of communities 2) Ability to post various tasks infrequent number of times 3) Temporary Membership to the Crowdsourcing Standards Institute 4) Certified to run task design workshops for other requestors 5) Minimum Wage: Will need to pay historic average payment for the task
- C, GOOD: score between 40%-70%. Perks: 1) Allowed to volunteer and serve the marketplace 2) Allowed to post various tasks a certain number of times per week 3) Minimum Wage: Will need to pay some percent higher than the historic average payment for the task
- D, POOR: score between 10%-40%. Perks: 1) Can post a limited number of tasks per week 2) Allowed to volunteer 3) Encouraged to attend the task design workshops and improve performance 4) Minimum Wage: Will need to pay higher than the historic average payment for the task
- E, HALL OF SHAME: score below 10%. Serious violators among the requestors, banned from the community.
- Fig. 2 shows the Leaderboard Interface, which provides extrinsic motivation for workers and requestors to participate and engage in high-standard practices. We propose a similar mechanism for ranking the workers.
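The mapping from ranking parameters to a Reputation Score and bucket can be sketched as follows. This is only an illustration: the source does not specify the Ranking Function, so the equal weights, the `payment_reliability` feature (taken here as 1 - PPD), the 0-1 rating scale, and the handling of boundary scores (70 and 40 go to the higher bucket) are all assumptions.

```python
# Hypothetical weights: the source does not define the Ranking Function.
WEIGHTS = {
    "generosity": 0.2,
    "fairness": 0.2,
    "promptness": 0.2,
    "communicativity": 0.2,
    "payment_reliability": 0.2,  # assumed to be 1 - PPD
}


def reputation_score(ratings, ppd):
    """Return a score in [0, 100].

    ratings: dict with the four worker-rated parameters, each in [0, 1].
    ppd: Probability of Payment Defaults, in [0, 1].
    """
    features = dict(ratings)
    features["payment_reliability"] = 1.0 - ppd
    return 100.0 * sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


def bucket(score):
    """Map a score (in percent) to the buckets A-E listed above."""
    if score > 90:
        return "A"  # HALL OF FAME
    if score >= 70:
        return "B"  # FAIR
    if score >= 40:
        return "C"  # GOOD
    if score >= 10:
        return "D"  # POOR
    return "E"      # HALL OF SHAME


requestor = {"generosity": 0.9, "fairness": 0.8,
             "promptness": 0.7, "communicativity": 0.6}
score = reputation_score(requestor, ppd=0.1)
print(bucket(score))
```

A weighted sum is the simplest choice that keeps "higher score, lower chance of misconduct" monotone in every parameter; a learned function could replace it without changing the bucket step.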
2) TASK PORTFOLIO RISK DIVERSIFICATION
- Risk diversification allows a worker to choose tasks depending on his risk appetite. Imagine a worker who wants to earn a certain amount of money within a given timeframe. Based on the Reputation Scores, he builds the task portfolio and maximizes the profit by diversifying the risk.
- As shown in the Portfolio Manager Interface (see Fig. 3), the worker uses risk parameters and selects tasks depending on the risk associated with each. The system presents workers with the Live Portfolio Health and indicates the probability of default.
- Workers & requestors will be able to calculate the expected value of Returns before they decide to add the task to their portfolio.
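As a rough illustration of the expected-return calculation, the sketch below treats each task as a (payment, probability of default) pair and assumes defaults are independent across tasks. The function names and the independence assumption are ours, not part of the framework.

```python
def expected_return(tasks):
    """Probability-weighted payout of a portfolio.

    tasks: list of (payment, default_probability) pairs.
    """
    return sum(pay * (1.0 - p) for pay, p in tasks)


def return_variance(tasks):
    """Variance of the payout, assuming independent defaults."""
    return sum(pay ** 2 * p * (1.0 - p) for pay, p in tasks)


def prob_any_default(tasks):
    """Probability that at least one task in the portfolio defaults."""
    prob_all_paid = 1.0
    for _, p in tasks:
        prob_all_paid *= 1.0 - p
    return 1.0 - prob_all_paid


# One $10 task versus two $5 tasks from different requestors,
# each with a 20% chance of payment default.
single = [(10.0, 0.2)]
split = [(5.0, 0.2), (5.0, 0.2)]
print(expected_return(single), return_variance(single))
print(expected_return(split), return_variance(split))
```

Both portfolios have the same expected return, but under the independence assumption the split portfolio has half the variance, which is the "eggs in one basket" argument in numbers.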
3) WORKER-TASK & REQUESTOR-WORKER RECOMMENDATIONS FOR THE PORTFOLIO
- To help workers select tasks, we combine a recommendation engine with Reputation Scores. The recommendation engine is built using the baseline method with a latent factor model and avoids the classic cold-start problem.
- Diversified Task Portfolio for the Workers: Recommend the combination of tasks that would diversify the risk of payment defaults and maximize the gain. Workers can submit the diversification criteria at run time.
- Diversified Worker Portfolio for the Requestors: Recommend the combination of workers that would diversify the risk of quality default and maximize the gain. Requestors can submit the diversification criteria at run time. This will help requestors use their budget effectively and select a team of workers with different levels of experience.
The main goal of Trust Rank is to restore the broken trust between requestors and workers.
- We hypothesize that Trust and Social Reputation will have a positive influence on the quality of results submitted and prevent unjustified rejections.
- The REPUTATION SCORE will help Workers and Requestors diversify the risk associated with tasks and accomplish their targets.
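A minimal sketch of the baseline part of such a recommender is shown below, assuming historical ratings are available as (worker, requestor, rating) triples. It predicts a rating as global mean plus worker bias plus requestor bias and falls back to the global mean for unseen workers or requestors, which is one simple way to sidestep cold start. The latent factor term is omitted, and the data layout and function names are assumptions.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """Fit global mean and per-worker / per-requestor biases.

    ratings: list of (worker, requestor, rating) triples.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Requestor bias: average deviation from the global mean.
    by_requestor = defaultdict(list)
    for _, requestor, r in ratings:
        by_requestor[requestor].append(r - mu)
    b_req = {k: sum(v) / len(v) for k, v in by_requestor.items()}

    # Worker bias: average residual after removing the requestor bias.
    by_worker = defaultdict(list)
    for worker, requestor, r in ratings:
        by_worker[worker].append(r - mu - b_req[requestor])
    b_wrk = {k: sum(v) / len(v) for k, v in by_worker.items()}

    return mu, b_wrk, b_req


def predict(model, worker, requestor):
    """Predicted rating; unknown ids contribute a bias of zero."""
    mu, b_wrk, b_req = model
    return mu + b_wrk.get(worker, 0.0) + b_req.get(requestor, 0.0)


history = [("w1", "r1", 5.0), ("w1", "r2", 3.0), ("w2", "r1", 4.0)]
model = fit_baseline(history)
print(predict(model, "w2", "r2"))          # pair never seen together
print(predict(model, "new_worker", "r1"))  # cold-start worker
```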
Study 1: Controlled Experiment & Survey
- We will run a field deployment with two groups of randomly selected requestors and workers for a duration of one month. In the first group, participants will be exposed to MTurk, where no REPUTATION RANKING is present. The second group will be exposed to the Trust Rank Framework.
- We will build the requirements for 10 tasks of various complexities, skill levels, and durations. Requestors participating in the groups will then design the HITs and publish them to the crowd.
- We will measure the results submitted by the workers in both groups based on three categories: 1) Above Par 2) At Par 3) Below Par. We will also record the time required to complete the task and assign scores for submissions in each category.
- We will measure the quality of rejections by requestors in both groups based on three categories: 1) Fair Rejection 2) Unfair Rejection 3) Disputable Rejection. We will record the time required to complete the review and assign scores for rejections in each category. Scores are based on the quality of feedback given by requestors.
- We will derive the dependent variable, REPUTATION SCORE, from the results of the above two factors.
- We will conduct a survey on Trust Rank and MTurk: a 5-point Likert scale based on four fundamental aspects: Trust, Reputation, Risk Appetite, and Motivation behind participation.
- Analysis with t-test (Welch's)
- Initially, we can test the data for normality. We will plot histograms and Q-Q graphs and then run both the Shapiro-Wilk and Kolmogorov-Smirnov tests on the data (both tests are run because the sample size is small).
- With Welch's test we can determine whether the REPUTATION SCORE condition significantly outperforms the current marketplaces. We will report statistics in the form (t(28)=2.20, p<0.05, Cohen's d=0.95); these numbers are made up for illustration.
- To understand the effect of portfolio diversification, we will expose group 1 to the REPUTATION RANKING and run the study for 2 more weeks. An automated algorithm can calculate the probability-weighted average of the returns that workers/requestors will earn if they decide to include a particular task in their portfolio.
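For reference, the quantities reported by the planned analysis can be computed as in the sketch below, which uses only the standard library. It returns Welch's t statistic, the Welch-Satterthwaite degrees of freedom, and Cohen's d with a pooled standard deviation; the p-value is left out because it needs the Student's t distribution (for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`). The sample values are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b))
                  / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


# Invented quality scores for the two groups.
trust_rank_group = [2.0, 4.0, 6.0]
mturk_group = [1.0, 2.0, 3.0]
t, df = welch_t(trust_rank_group, mturk_group)
print(round(t, 3), round(df, 3), round(cohens_d(trust_rank_group, mturk_group), 3))
```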
Study 2: Evaluation of the Ranking Function
- The Ranking Function maps various ranking parameters to the Ranking Score and Ranking Categories A, B, C, D, and E.
- We will run Principal Component Analysis on the ranking parameters and select the top 3 components that explain over 85% of the variance in the ranking data.
- We will then apply Multinomial Logistic Regression or a Neural Network Classifier to predict the Ranking Class of the requestor.
- We can measure the performance of the classifier on the Test Dataset. We store the historical rating data in databases.
- Performance of the Recommendation Engine can be evaluated in a similar fashion using historical data.
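Whether three components are enough to pass the 85% mark can be checked from the eigenvalues of the covariance matrix of the ranking parameters. The sketch below assumes those eigenvalues have already been computed by a PCA routine; the values used are invented.

```python
def components_needed(eigenvalues, threshold=0.85):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches the threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)


# Invented eigenvalues for five ranking parameters.
eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.5]
print(components_needed(eigenvalues))
```

If the function returns more than 3 on real data, the plan to keep only the top 3 components would need revisiting.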
- Aniket Kittur, Ed H. Chi, and Bongwon Suh. 2008. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '08). ACM, New York, NY, USA, 453-456. DOI=10.1145/1357054.1357127 http://doi.acm.org/10.1145/1357054.1357127
- Aniket Kittur, Boris Smus, Susheel Khamkar, and Robert E. Kraut. 2011. CrowdForge: crowdsourcing complex work. In Proceedings of the 24th annual ACM symposium on User interface software and technology (UIST '11). ACM, New York, NY, USA, 43-52. DOI=10.1145/2047196.2047202 http://doi.acm.org/10.1145/2047196.2047202
- Bernstein, M.S., Little, G., Miller, R.C., Hartmann, B., Ackerman, M.S., Karger, D.R., Crowell, D., and Panovich, K. Soylent: a word processor with a crowd inside. Proc. of ACM Symposium on User Interface Software and Technology (2010), 313–322.
- Chandler, D. and Kapelner, A. Breaking monotony with meaning: Motivation in crowdsourcing markets. University of Chicago mimeo, (2010).
- Daniela Retelny, Sébastien Robaszkiewicz, Alexandra To, Walter S. Lasecki, Jay Patel, Negar Rahmati, Tulsee Doshi, Melissa Valentine, and Michael S. Bernstein. 2014. Expert crowdsourcing with flash teams. In Proceedings of the 27th annual ACM symposium on User interface software and technology (UIST '14). ACM, New York, NY, USA, 75-85. DOI=10.1145/2642918.2647409 http://doi.acm.org/10.1145/2642918.2647409
- Dasgupta, Partha (2000) ‘Trust as a Commodity’, in Gambetta, Diego (ed.) Trust: Making and Breaking Cooperative Relations, electronic edition, Department of Sociology, University of Oxford, chapter 4, pp. 49-72
- David Martin, Benjamin V. Hanrahan, Jacki O'Neill, and Neha Gupta. 2014. Being a turker. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing (CSCW '14). ACM, New York, NY, USA, 224-235. DOI=10.1145/2531602.2531663 http://doi.acm.org/10.1145/2531602.2531663
- DJ McAllister Affect-and cognition-based trust as foundations for interpersonal cooperation in organizations - Academy of management journal, 1995
- Gupta A, Thies W, Edward Cutrell, and Ravin Balakrishnan. 2012. mClerk: enabling mobile crowdsourcing in developing regions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '12). ACM, New York, NY, USA, 1843-1852. DOI=10.1145/2207676.2208320 http://doi.acm.org/10.1145/2207676.2208320
- Langohr, Herwig M. The rating agencies and their credit ratings : what they are, how they work and why they are relevant, 2008
- Lewis, J. D., & Weigert, A. (1985). Trust as a social reality. Social Forces, 63, 967–985.
- Lilly C. Irani and M. Six Silberman. 2013. Turkopticon: interrupting worker invisibility in amazon mechanical turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '13). ACM, New York, NY, USA, 611-620. DOI=10.1145/2470654.2470742 http://doi.acm.org/10.1145/2470654.2470742
- MTurk Demographics http://demographics.mturk-tracker.com/#/gender/all
- Panagiotis G. Ipeirotis. 2010. Analyzing the Amazon Mechanical Turk marketplace. XRDS 17, 2 (December 2010), 16-21. DOI=10.1145/1869086.1869094 http://doi.acm.org/10.1145/1869086.186909
- Steven Dow, Anand Kulkarni, Scott Klemmer, and Björn Hartmann. 2012. Shepherding the crowd yields better work. In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work (CSCW '12). ACM, New York, NY, USA, 1013-1022. DOI=10.1145/2145204.2145355 http://doi.acm.org/10.1145/2145204.2145355
- Justin Cheng, Jaime Teevan, and Michael Bernstein. Measuring Crowdsourcing Effort with Error-Time Curves. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 2015), Seoul, Korea, April 2015.
- Winter Mason and Duncan J. Watts. 2009. Financial incentives and the "performance of crowds". In Proceedings of the ACM SIGKDD Workshop on Human Computation (HCOMP '09), Paul Bennett, Raman Chandrasekar, Max Chickering, Panos Ipeirotis, Edith Law, Anton Mityagin, Foster Provost, and Luis von Ahn (Eds.). ACM, New York, NY, USA, 77-85. DOI=10.1145/1600150.1600175 http://doi.acm.org/10.1145/1600150.1600175