Milestone 7 TuringMachine
Trust Rank, Sustainable Reputation Framework for the Crowdsourcing Marketplace
Snehal (Neil) Gaikwad
In the current economic downturn, crowdsourcing marketplaces have shown strong potential to support our social and professional lives. Requestors participate in the marketplace to get high-quality results quickly, whereas workers are motivated to earn money, gain reputation, and improve the skills required to build their careers. Fulfillment of these fundamental expectations makes a crowdsourcing operation successful. However, the lack of trust between the market participants, requestors and workers, has not only increased volatility in the marketplace but also raised serious concerns about its sustainability in the long run. In this paper, we analyze the rise of various trust issues in the marketplace. We find that market participants are highly exposed to the risk of poor-quality work or payment defaults. At present, there exists no mechanism to help participants quantify the risk involved in crowdsourced projects. Based on practices in structured finance and credit ratings, we propose Trust Rank, a Sustainable Reputation Framework for the Volatile Crowdsourcing Marketplace. Trust Rank uses the reputation score to categorize participants into different tranches (risk buckets). The ecosystem around the reputation score helps sustain the trust level by rewarding good practices and punishing bad ones. Furthermore, we build a recommender system to help market participants diversify their work/task-force portfolio based on their risk appetite. We validate the framework via a one-month field study and demonstrate that groups exposed to the Trust Rank Framework were able to produce high-quality work and followed fair practices as compared to groups working in traditional marketplaces.
Micro and Expert Crowdsourcing Marketplace
The rise of Mobile Computing and the World Wide Web has helped us harness the Collective Wisdom of the Crowd at a large scale. Over the past decade, crowdsourcing platforms have been used to solve various problems (see Exhibit 1, Malone et al.). In paid crowdsourcing, Mechanical Turk (MTurk), oDesk, and Task Rabbit have created a new marketplace, which has helped many households meet their daily financial needs in tough economic conditions. The MTurk workers' median household income of $50,000 per year is on par with the median US household income (MTurk Demographics; Ipeirotis 2010). Unfortunately, despite having been around for a long time, the crowdsourcing marketplace has been experiencing a large degree of broken trust between the market participants, i.e. workers and requestors. It has become essential to address the core trust-related issues to sustain the crowdsourcing ecosystem.
Trust Issues impacting the fundamental expectations of the Marketplace Participants
Requestors participate in the marketplace to get high-quality results quickly, whereas workers are motivated to earn money, gain reputation, and improve the skills required to build their careers (see Exhibit 2). Fulfillment of these fundamental expectations makes a crowdsourcing operation successful. However, due to broken trust, it has become hard for requestors to trust the results they get back from workers, whereas workers face uncertainty about the requestor's intention to pay and treat them fairly. In Being a Turker (Martin et al. 2014), one of the workers highlights: "Got a mass rejection from some hits I did for them! Talked to other turkers that I know in real life and the same thing happened to them. There rejection comments are also really demeaning. Definitely avoid!" Another wrote: "We can be rejected yet the requestors still have our articles and sentences.. Not Fair." On the other hand, in Kittur et al. 2008, requestors wrote about how spammers can significantly degrade the quality of work and waste the requestors' resources. This lack of trust between the participants may break the marketplace rather than advance it.
Need for reestablishing the trust
Kittur et al. 2013 argue that Reputation, Credential, Motivation, and Reward are among the 12 fundamental blocks of a successful crowdsourcing marketplace. The distribution of trust, power, and reputation plays a key role in sustaining online collaborations. There is inherent risk & volatility involved when strangers collaborate in virtual environments. Therefore, it is essential to understand and quantify such uncertainty. This understanding will help us reestablish trust and reputation in the marketplace.
The work presented in the paper builds upon research from Crowd-Computing, Organizational Behavior, Economics, and Finance.
Establishing the requestors' trust in the workers
Requestors often don't trust the quality of the results they get back from workers. Various approaches have been invented to improve the quality of work. Bernstein et al. 2010 proposed the aggregation of results from multiple workers. Kittur et al. 2011 implemented a map-reduce framework. Kittur et al. 2008 introduced additional questions that have verifiable quantitative answers. Gupta et al. 2012 proposed majority agreement for language digitization. Mason & Watts 2009 demonstrated the anchoring effect and claimed that increased financial incentives do not affect the quality of the work produced. Chandler & Kapelner 2010 provided context about the task to improve the task results. Dow et al. 2012 introduced rubrics for self and peer assessment.
As crowdsourcing becomes more complex (see Retelny et al.), additional standards are required to ensure the quality of results. Due to the diverse nature of the micro & expert tasks involved, it is often challenging to introduce common standards that will help produce high-quality results. However, new mechanisms can be designed to complement existing standards and improve the quality of results.
Establishing the workers' trust in the requestors
Workers, on the other hand, don't believe in the requestor's intention to pay and treat them fairly. Several forums (Turker Nation, MTurk Forum, MTurk Grind) and platforms have been created to allow workers to voice their opinions. Turkopticon (Irani & Silberman 2013) allows workers to evaluate requestors on AMT. However, most of the review forums are isolated from the requestors' reach, and requestors rarely get to see direct feedback about their work.
From the workers' & requestors' perspectives, it is crucial to estimate the amount of time required to complete a task. Cheng et al. 2015 proposed the Error-Time Area, which measures the effort required in crowdsourced tasks and determines their fair price. This research opens various opportunities to estimate & quantify the risk involved in crowdsourcing work.
Missing dimensions of the trust and reputation, the fundamental problem associated with the current crowdsourcing marketplaces
Dasgupta 2000 highlights the relationship between Trust and Reputation. Reputation is a capital asset that can help build trust between strangers. Reputation can be earned or destroyed by pursuing certain courses of action. According to Lewis & Weigert 1985 and McAllister 1995, trust is a multidimensional construct that has cognitive and affective dimensions. The cognitive dimensions consist of competence, reliability, and professionalism, whereas the affective dimensions describe caring, benevolence, & emotional connection to each other. Workers and requestors in the traditional Crowdsourcing Marketplace operate in isolation with minimal communication, which leads to a lack of caring and emotional connection. Over time, this imbalance in the affective elements becomes a recipe for breaking the trust equilibrium.
In this paper, we propose the Trust Rank framework, which helps establish trust using a social ranking & reputation system. In addition, it can help address the following key questions:
- How should I select potential employers and tasks that will help me earn money and advance my career?
- How should I select potential employees to work on my important project? I am under budget constraints and looking for high-quality results.
- As a market maker/designer, how should I attract and motivate talented requestors & workers to join the crowdsourcing marketplace?
Credit Ratings, Branding, & Reputation in the Financial Systems & Other Industries
- Trust & Reputation
Over the centuries we have developed a strong trust in the financial and economic systems. Credit Ratings play a vital role in sustaining the trust between the market participants. The ratings demonstrate the reputation & financial power of Nations, Institutions, Corporate Companies, and Individual Citizens. In the case of a citizen, the credit history provides a record of the borrower's repayment of loans. The Credit Score is a function of the citizen's credit history. A good Credit Score can help a citizen make a fortune, whereas a bad Credit Score can make life harder and even cost them a job. For example, the TSA, a US government agency, does not hire applicants with "delinquent debt" (Bad Credit).
- Risk Management
In structured finance, securities are categorized based on their ratings and the investment risk involved. An investor uses various strategies to optimize a portfolio and reduce non-systematic risk, for example by diversifying the portfolio across different assets.
- Reputation System in Glassdoor, Job Search Platform
In addition to the financial industry, rating systems are widely used by leading job search companies such as Glassdoor. The site uses ratings on a scale of 0 to 5 and covers various aspects including Reviews, Salaries, and Interview Experiences. The ranking-based reputation helps people trust companies that will offer good salaries and fulfill their career goals.
Trust Rank, Sustainable Reputation Framework for the Crowdsourcing Marketplace
- Reputation Score and Ranking as indicators of the trust
Based on proven practices in the financial systems, we created the Reputation Score for participants in the Crowdsourcing Marketplace. The Reputation Score is a function of the historical activities of the participants. Using the Reputation Score, we built the Global Ranking of the participants.
- Balancing Trust with Risk Diversification, first time used for the crowdsourcing market
The Trust Rank Framework allows market participants to gauge how much trust they want to place in the other party. In addition, it helps them ask the following questions, maximize the returns on their task portfolio, and fulfill the expectations listed in Exhibit 1.
- A Worker is concerned about: What is the likelihood that a requestor will neither default on payment nor reject the task unfairly?
We build a diversified Task Portfolio for Workers that helps them select a set of tasks that distributes the risk of payment defaults and maximizes the gain. Workers can set a goal, and the system will recommend tasks based on their risk appetite and time horizon.
- A Requestor is concerned about: What is the likelihood that workers will produce great work?
We build a diversified Worker Portfolio for Requestors that helps them select workers so as to diversify the risk of quality default and maximize the gain. Here the gain might be budget/resource allocation or optimization.
Score Parameters, how to rate along different axes?
- Rating Parameters for Requestors: Generosity, Fairness, Promptness, and Communicativity are widely used ranking parameters (Irani & Silberman 2013). In addition, we introduce the Probability of Payment Defaults (PPD) as a parameter that indicates the historical track record of the requestor.
- Finally, we associate a weight with each parameter and determine the final score as a function of the parameters. In the long run, as we collect more parameters, we can reduce the dimension of the data using PCA. We can then select the components that explain high variance in the ranking and derive loadings (weights) from them.
- A latent factor recommendation system can automatically learn the features required for rating prediction.
- We can also incorporate extra parameters used on recruitment sites such as Glassdoor.
- Workers: Similarly, ratings for workers can be derived from honesty, professionalism, past performance, commitment, and skill sets, which are widely used parameters when hiring workers.
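The weighted combination of rating parameters can be illustrated with a minimal sketch. This is not the framework's actual ranking function: the parameter names, the 0-to-1 rating scale, the equal weights, and the choice to feed PPD in as (1 - PPD) are all assumptions made for the example.

```python
# Hypothetical sketch: weighted Reputation Score for a requestor.
# Ratings are assumed to lie in [0, 1]; the weights are illustrative only.

def reputation_score(ratings, weights):
    """Return a 0-100 score as the weighted average of the rating parameters."""
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must cover the same parameters")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[name] * ratings[name] for name in ratings)
    return 100.0 * weighted / total_weight


requestor_ratings = {
    "generosity": 0.8,
    "fairness": 0.9,
    "promptness": 0.7,
    "communicativity": 0.6,
    # PPD is a default probability, so it enters as (1 - PPD):
    # a lower default risk yields a higher score (assumed PPD = 0.1).
    "payment_reliability": 1.0 - 0.1,
}
equal_weights = {name: 1.0 for name in requestor_ratings}

score = reputation_score(requestor_ratings, equal_weights)
print(round(score, 1))  # 78.0
```

Replacing `equal_weights` with loadings derived from PCA, as suggested above, would leave the scoring function unchanged.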
How will market participant understand the ranking mechanism?
- Information about the ranking parameters can be given when the users sign up for the platform.
- Credit Score prediction algorithms are a black box to citizens, but we still know what makes a good or bad credit score. Similarly, the crowdsourcing system can explicitly communicate the fundamental values expected of participants operating in the marketplace.
Advantages of the Trust Rank Framework
- The Trust Rank Framework enables participants to trust each other using the Reputation Scores and Ranks.
- The Trust Rank Framework allows participants to foresee the risk involved in working with someone (quality) or working for someone (fairness).
- The Trust Rank Framework leverages a leaderboard to push participants to engage in open & fair practices. Leaderboards have been widely used in gaming and crowdsourcing and have helped participants stay motivated.
- The Trust Rank Framework allows community members to shine and inspire other workers and future generations, see Exhibit 5. In addition, participants will have a feeling of belonging to the community.
We propose Trust Rank, a Reputation & Trust Framework for Volatile Crowdsourcing Marketplaces, based on three components: first, Reputation Score Based Social Ranks; second, Task Portfolio Risk Diversification; and third, a Task & Risk specific Recommendation Engine. In what follows, we discuss the system in further detail.
1) REPUTATION SCORE BASED SOCIAL RANKS
- Reputation Score & Mechanism to build the Ranking Profile: The Reputation Score is similar to the Credit Score or Credit Ratings used in the financial systems. Workers rate the requestors based on various ranking parameters. Generosity, Fairness, Promptness, and Communicativity are widely used ranking parameters (Irani & Silberman 2013). In addition, we introduce the Probability of Payment Defaults (PPD) as a parameter that indicates the historical track record of the requestor. The Ranking Function maps the ranking parameter vector to an overall Reputation Score & Category, which indicate the likelihood of misconduct by the requestor. The higher the Reputation Score, the lower the chance of misconduct.
- Leaderboard Social Reputation: The Reputation Score is used to encourage and reward top performers and penalize bad actors, see Exhibit 4. Depending on their scores, workers/requestors are ranked and categorized into the following buckets:
- A, HALL OF FAME: score above 90%. Perks: 1) Membership to the Crowdsourcing Standards Institute, which defines the best practices to sustain the marketplace. 2) Ability to post various tasks infrequent number of times 3) Ability to select the top performing crowd for the work 4) allowed to manage/lead communities 5) Minimum Wage: Will need to pay historic average payment for the task
- B, FAIR: score between 70%-90%. Perks: 1) Manage/lead certain number of communities 2) Ability to post various tasks infrequent number of times 3) Temporary Membership to the Crowdsourcing Standards Institute 4) Certified to run task design workshops for other requestors 5) Minimum Wage: Will need to pay historic average payment for the task
- C, GOOD: score between 40%-70%. Perks: 1) Allowed to volunteer and serve the marketplace 2) Allowed to post various tasks a certain number of times per week 3) Minimum Wage: Will need to pay some percent higher than the historic average payment for the task
- D, POOR: score between 10%-40%. Perks: 1) Can post a limited number of tasks per week 2) Allowed to volunteer 3) Encouraged to attend the task design workshops and improve performance 4) Minimum Wage: Will need to pay higher than the historic average payment for the task
- E, HALL OF SHAME: score below 10% of the requestors- serious violators, banned from the community.
- Exhibit 4 shows the Leaderboard Interface, which provides extrinsic motivation for workers and requestors to participate and engage in high-standard practices. We propose a similar mechanism for ranking the workers.
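The mapping from a Reputation Score to the five buckets listed above can be sketched as a simple threshold function. The thresholds come from the list; how the boundary values (90, 70, 40, 10) are assigned is an assumption, since the text does not specify which side they fall on.

```python
# Hypothetical sketch of the leaderboard buckets.
# Thresholds follow the bucket list; boundary handling is an assumption.

def reputation_bucket(score):
    """Map a 0-100 Reputation Score to a leaderboard category."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score > 90:
        return "A, HALL OF FAME"
    if score >= 70:
        return "B, FAIR"
    if score >= 40:
        return "C, GOOD"
    if score >= 10:
        return "D, POOR"
    return "E, HALL OF SHAME"


for example_score in (95, 78, 55, 20, 5):
    print(example_score, reputation_bucket(example_score))
```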
2) TASK PORTFOLIO RISK DIVERSIFICATION
- Risk diversification allows a worker to choose tasks depending on their risk appetite. Imagine a worker who wants to earn a certain amount of money within a given timeframe. Based on the Reputation Scores, the worker builds a task portfolio and maximizes profit by diversifying the risk.
- As shown in the Portfolio Manager Interface (see Exhibit 6), the worker uses risk parameters and selects tasks depending on the risk associated with each. The system presents workers with Live Portfolio Health and indicates the probability of default.
- Workers & requestors will be able to calculate the expected value of the Returns before they decide to add a task to their portfolio.
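The expected value of returns for a task portfolio can be computed as a probability-weighted sum. The sketch below is illustrative only: the `Task` fields, the example payments and default probabilities, and the assumptions that a defaulted task pays nothing and that defaults are independent are not taken from the framework.

```python
# Hypothetical sketch: expected return of a worker's task portfolio.
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    payment: float       # amount paid if the work is accepted
    default_prob: float  # assumed probability of non-payment or unfair rejection


def expected_return(tasks):
    """Probability-weighted payment; a defaulted task is assumed to pay nothing."""
    return sum(t.payment * (1.0 - t.default_prob) for t in tasks)


def prob_all_paid(tasks):
    """Chance that every task pays, assuming defaults are independent."""
    p = 1.0
    for t in tasks:
        p *= 1.0 - t.default_prob
    return p


portfolio = [
    Task("transcription", payment=10.0, default_prob=0.05),
    Task("image tagging", payment=4.0, default_prob=0.25),
    Task("survey", payment=6.0, default_prob=0.50),
]

print(expected_return(portfolio))  # 15.5
print(prob_all_paid(portfolio))
```

A worker with a low risk appetite might drop the survey task: the expected return falls from 15.5 to 12.5, but the chance that every remaining task pays roughly doubles.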
3) WORKER-TASK & REQUESTOR-WORKER RECOMMENDATIONS FOR THE PORTFOLIO
- To help workers select tasks, we combine a Recommendation Engine with the Reputation Scores. The recommendation engine is built using the baseline method with a latent factor model, which helps avoid the classic cold-start problem.
- Diversified Task Portfolio for the Workers: Recommend the combination of tasks that would diversify the risk of payment defaults and maximize the gain. Workers can submit the diversification criteria at run time.
- Diversified Worker Portfolio for the Requestors: Recommend the combination of workers that would diversify the risk of quality default and maximize the gain. Requestors can submit the diversification criteria at run time. This will help requestors use their budgets effectively and select a team of workers with different levels of experience.
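A baseline predictor combined with latent factors can be sketched as follows. This is a generic illustration of the technique rather than the engine described here: the model form (global mean plus worker bias plus task bias plus a factor dot product), the SGD hyperparameters, and the toy ratings are all assumptions. It also shows why the baseline term helps with cold start: an unseen worker or task simply falls back to the biases that are known.

```python
# Hypothetical sketch: baseline predictor plus latent factors, trained with SGD.
# Names, hyperparameters, and ratings are illustrative assumptions.
import random


def train(ratings, n_factors=2, epochs=200, lr=0.01, reg=0.05, seed=0):
    """Fit r_hat(u, i) = mu + b_u + b_i + p_u . q_i on (worker, task, rating) triples."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    b_u, b_i, p, q = {}, {}, {}, {}
    for u, i, _ in ratings:
        if u not in p:
            b_u[u] = 0.0
            p[u] = [rng.gauss(0, 0.1) for _ in range(n_factors)]
        if i not in q:
            b_i[i] = 0.0
            q[i] = [rng.gauss(0, 0.1) for _ in range(n_factors)]
    for _ in range(epochs):
        for u, i, r in ratings:
            dot = sum(a * b for a, b in zip(p[u], q[i]))
            err = r - (mu + b_u[u] + b_i[i] + dot)
            # Regularized gradient steps on biases and factors.
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for k in range(n_factors):
                pu, qi = p[u][k], q[i][k]
                p[u][k] += lr * (err * qi - reg * pu)
                q[i][k] += lr * (err * pu - reg * qi)
    return {"mu": mu, "b_u": b_u, "b_i": b_i, "p": p, "q": q}


def predict(model, u, i):
    """Unseen workers or tasks contribute nothing, so the estimate falls back to the baseline."""
    score = model["mu"] + model["b_u"].get(u, 0.0) + model["b_i"].get(i, 0.0)
    if u in model["p"] and i in model["q"]:
        score += sum(a * b for a, b in zip(model["p"][u], model["q"][i]))
    return score


# Toy data: (worker, task, rating on a 1-5 scale).
ratings = [
    ("w1", "t1", 5), ("w1", "t2", 1),
    ("w2", "t1", 4), ("w2", "t3", 2),
    ("w3", "t2", 2), ("w3", "t3", 3),
]
model = train(ratings)
print(predict(model, "w1", "t1"), predict(model, "new_worker", "t1"))
```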
The main goal of Trust Rank is to re-establish the broken trust between requestors and workers.
- We hypothesize that Trust and Social Reputation will have a positive influence on the quality of the results submitted and will prevent unjustified rejections.
- REPUTATION SCORE will help Workers & Requestors diversify the risk associated with the tasks & accomplish their targets.
Study 1: Controlled Experiment & Survey
- We will run a field deployment with two groups of randomly selected requestors and workers for a duration of one month. In the first group, participants will be exposed to MTurk, where no REPUTATION RANKING is present. The second group will be exposed to the Trust Rank Framework.
- We will build the requirements for 10 tasks of varying complexity, skill level, and duration. Requestors participating in the groups will then design the HITs and publish them to the crowd.
- We will measure the results submitted by the workers in both groups based on three categories: 1) Above Par 2) At Par 3) Below Par. We will also record the time required to complete the task and assign scores to the submissions in each category.
- We will measure the quality of rejections by requestors in both groups based on three categories: 1) Fair Rejection 2) Unfair Rejection 3) Disputable Rejection. We will record the time required to complete the review and assign scores to the rejections in each category. Scores are based on the quality of the feedback given by requestors.
- We will derive the dependent variable, REPUTATION SCORE, from the results of the two measures above.
- We will conduct a survey on Trust Rank and MTurk using a 5-point Likert scale covering Trust, Reputation, Risk Appetite, & Motivation behind participation.
- Analysis with t-test (Welch's)
- Initially, we will test the data for normality. We will plot histograms and QQ graphs and then run both the Shapiro-Wilk and Kolmogorov-Smirnov tests on the data (both tests are run because the sample size is small).
- With Welch's test we can determine whether the REPUTATION SCORE of the Trust Rank group is significantly higher than that of the current marketplace group. We will report statistics in the form (t(28)=2.20, p<0.05, Cohen's d=0.95); these numbers are made up and serve only as a reporting template.
- To understand the effect of portfolio diversification, we will expose group 1 to the REPUTATION RANKING and run the study for 2 more weeks. An automated algorithm can calculate the probability-weighted average of the returns that workers/requestors will earn if they decide to include a particular task in their portfolio.
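The Welch's t statistic, its degrees of freedom, and Cohen's d can be computed as in the following sketch. The two score lists are made-up placeholders, not study data, and the pooled-standard-deviation form of Cohen's d is one common convention chosen for the example. Obtaining the p-value requires the t-distribution CDF, which a statistics library such as SciPy provides (`scipy.stats.ttest_ind` with `equal_var=False`).

```python
# Hypothetical sketch: Welch's t-test statistic and Cohen's d with the standard library.
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def cohens_d(a, b):
    """Effect size using the pooled standard deviation (one common convention)."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


# Made-up REPUTATION SCORE samples, for illustration only.
trust_rank_group = [2, 4, 6, 8]
mturk_group = [1, 2, 3, 4]

t, df = welch_t(trust_rank_group, mturk_group)
d = cohens_d(trust_rank_group, mturk_group)
print(round(t, 3), round(df, 2), round(d, 3))
```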
Study 2: Evaluation of the Ranking Function
- The Ranking Function maps the ranking parameters to the Ranking Score and to the Ranking Categories A, B, C, D, and E.
- We will run Principal Component Analysis on the ranking parameters and select the top 3 components that explain over 85% of the variance in the ranking data.
- We will then apply Multinomial Logistic Regression or a Neural Network Classifier to predict the Ranking Class of the requestor.
- We can measure the performance of the classifier on the Test Dataset. The historical rating data is stored in the databases.
- The performance of the Recommendation Engine can be evaluated in a similar fashion using historical data.
- Aniket Kittur, Ed H. Chi, and Bongwon Suh. 2008. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '08). ACM, New York, NY, USA, 453-456. DOI=10.1145/1357054.1357127 http://doi.acm.org/10.1145/1357054.1357127
- Aniket Kittur, Boris Smus, Susheel Khamkar, and Robert E. Kraut. 2011. CrowdForge: crowdsourcing complex work. In Proceedings of the 24th annual ACM symposium on User interface software and technology (UIST '11). ACM, New York, NY, USA, 43-52. DOI=10.1145/2047196.2047202 http://doi.acm.org/10.1145/2047196.2047202
- Bernstein, M.S., Little, G., Miller, R.C., Hartmann, B., Ackerman, M.S., Karger, D.R., Crowell, D., and Panovich, K. Soylent: a word processor with a crowd inside. Proc. of ACM Symposium on User Interface Software and Technology (2010), 313–322.
- Chandler, D. and Kapelner, A. Breaking monotony with meaning: Motivation in crowdsourcing markets. University of Chicago mimeo, (2010).
- Daniela Retelny, Sébastien Robaszkiewicz, Alexandra To, Walter S. Lasecki, Jay Patel, Negar Rahmati, Tulsee Doshi, Melissa Valentine, and Michael S. Bernstein. 2014. Expert crowdsourcing with flash teams. In Proceedings of the 27th annual ACM symposium on User interface software and technology (UIST '14). ACM, New York, NY, USA, 75-85. DOI=10.1145/2642918.2647409 http://doi.acm.org/10.1145/2642918.2647409
- Dasgupta, Partha (2000) ‘Trust as a Commodity’, in Gambetta, Diego (ed.) Trust: Making and Breaking Cooperative Relations, electronic edition, Department of Sociology, University of Oxford, chapter 4, pp. 49-72
- David Martin, Benjamin V. Hanrahan, Jacki O'Neill, and Neha Gupta. 2014. Being a turker. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing (CSCW '14). ACM, New York, NY, USA, 224-235. DOI=10.1145/2531602.2531663 http://doi.acm.org/10.1145/2531602.2531663
- McAllister, D. J. 1995. Affect- and cognition-based trust as foundations for interpersonal cooperation in organizations. Academy of Management Journal.
- Aakar Gupta, William Thies, Edward Cutrell, and Ravin Balakrishnan. 2012. mClerk: enabling mobile crowdsourcing in developing regions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '12). ACM, New York, NY, USA, 1843-1852. DOI=10.1145/2207676.2208320 http://doi.acm.org/10.1145/2207676.2208320
- Langohr, Herwig M. The rating agencies and their credit ratings : what they are, how they work and why they are relevant, 2008
- Lewis, J. D., & Weigert, A. (1985). Trust as a social reality. Social Forces, 63, 967–985.
- Lilly C. Irani and M. Six Silberman. 2013. Turkopticon: interrupting worker invisibility in amazon mechanical turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '13). ACM, New York, NY, USA, 611-620. DOI=10.1145/2470654.2470742 http://doi.acm.org/10.1145/2470654.2470742
- MTurk Demographics http://demographics.mturk-tracker.com/#/gender/all
- Panagiotis G. Ipeirotis. 2010. Analyzing the Amazon Mechanical Turk marketplace. XRDS 17, 2 (December 2010), 16-21. DOI=10.1145/1869086.1869094 http://doi.acm.org/10.1145/1869086.186909
- Steven Dow, Anand Kulkarni, Scott Klemmer, and Björn Hartmann. 2012. Shepherding the crowd yields better work. In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work (CSCW '12). ACM, New York, NY, USA, 1013-1022. DOI=10.1145/2145204.2145355 http://doi.acm.org/10.1145/2145204.2145355
- Justin Cheng, Jaime Teevan, and Michael Bernstein. Measuring Crowdsourcing Effort with Error-Time Curves. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 2015), Seoul, Korea, April 2015.
- Winter Mason and Duncan J. Watts. 2009. Financial incentives and the "performance of crowds". In Proceedings of the ACM SIGKDD Workshop on Human Computation (HCOMP '09), Paul Bennett, Raman Chandrasekar, Max Chickering, Panos Ipeirotis, Edith Law, Anton Mityagin, Foster Provost, and Luis von Ahn (Eds.). ACM, New York, NY, USA, 77-85. DOI=10.1145/1600150.1600175 http://doi.acm.org/10.1145/1600150.1600175
- Malone, Thomas W. and Laubacher, Robert and Dellarocas, Chrysanthos, Harnessing Crowds: Mapping the Genome of Collective Intelligence (February 3, 2009). MIT Sloan Research Paper No. 4732-09
- Kittur, Aniket and Nickerson, Jeffrey V. and Bernstein, Michael S. and Gerber, Elizabeth M. and Shaw, Aaron and Zimmerman, John and Lease, Matthew and Horton, John J., The Future of Crowd Work (December 18, 2012). 16th ACM Conference on Computer Supported Cooperative Work (CSCW 2013)