Summer Milestone 9 Reputation Systems research and exploration


Evaluated Papers

Papers related to reputation systems and a summary of the pros and cons of the implementation.

Note: Professor Michael Bernstein gives the following advice:

"Reading papers efficiently is an important skill. First, figure out what the main research question is that’s being asked. That’s usually in the Abstract. Then read the intro to make sure you understand the high level approach they’re taking. Skim the related work to make sure you understand the general space they’re working in, then pop over to the conclusion to make sure you understand what they think they did. Only then do you dig into the meat of the paper. On your first read, don’t try to understand every detail. Aim for a general understanding. We can always come back and pick apart the details later."

Reputation Inflation: Evidence from an Online Labor Market [1]

William Dai and Nisha K.K. (Team Pumas)


  1. The average public feedback given to workers and requesters is highly inflated.
  2. The main reasons are fear of retaliation, threats, or bribery.
  3. There is a difference between the public and the private feedback provided, which shows that the feedback given publicly on the platform is less honest.
  4. Experiments were also conducted to study the effect of the private feedback feature on getting evaluated, interviewed, or hired for a task.


  1. The implementation of the private feedback feature was helpful in analysing the issue.
  2. The authors were able to show the difference between the feedback given publicly and the genuine feedback given privately.
  3. With a high private feedback score, workers were hired more frequently for a number of tasks.


  1. There is no definitive way to prevent workers and requesters from discussing mutual plans to rate each other outside channels regulated by the platform
  2. One conclusion has been that the more truthful ratings are publicly available, the higher the likelihood of lies to cover occasional drops in quality
  3. Since employers mainly judge workers by the feedback given to them and their prior experience on the platform, workers new to the platform would be disadvantaged.

Detailed Wiki: see [2] or Horton Reputation Paper Analysis - Team Pumas

Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text [3]

Juechi and @alfonsoxw


Traditional recommendation systems predict user ratings based on numeric ratings, but they ignore the other form of user feedback: review text. In this paper, the authors combine numeric ratings and review text to understand the latent user and product dimensions (they name this method ‘Hidden Factors as Topics’, the HFT model). There are two versions of the method: one uncovers the ‘preferences’ of the user and the other uncovers the ‘properties’ of the product. HFT works by aligning hidden factors in product ratings with hidden topics in product reviews. These topics act as regularisers for the latent user and product parameters, which helps fit those parameters accurately with only a few reviews.
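The alignment of rating factors with review topics can be illustrated with a minimal sketch. It assumes the usual latent-factor predictor (global offset, user and item biases, and a dot product of factor vectors) and a softmax that turns item factors into a topic distribution; the function names, the `kappa` parameter, and all numbers are illustrative assumptions, not taken from the authors' code.

```python
import math

def predict_rating(alpha, beta_user, beta_item, gamma_user, gamma_item):
    """Latent-factor prediction: offset + biases + dot product of factors."""
    dot = sum(u * i for u, i in zip(gamma_user, gamma_item))
    return alpha + beta_user + beta_item + dot

def factors_to_topics(gamma_item, kappa=1.0):
    """Softmax mapping item factors to a topic distribution (assumed link).

    A larger factor value yields a larger share of the matching topic;
    kappa (hypothetical name) controls how peaked the distribution is.
    """
    scaled = [kappa * g for g in gamma_item]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up example values for one user and one product with two factors.
rating = predict_rating(3.5, 0.2, -0.1, [1.0, 0.0], [0.5, 2.0])
topics = factors_to_topics([0.5, 2.0])
```

Because the same item vector drives both the rating prediction and the topic distribution, review text can constrain the factors of a product that has very few ratings.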


a) Can solve the ‘cold-start’ problem. When a user or a product is new to the platform, we can accurately analyze the ‘preferences’ of the user and the ‘properties’ of the product.

b) Better prediction of ratings by harnessing the information present in review text. Especially for new products and users, the text of even a single review provides substantial information.

c) Facilitating novel tasks such as genre discovery, to identify useful and representative reviews

d) The method readily scales to very large datasets. The authors apply their model to novel corpora consisting of over forty million reviews, from nine million users and three million products.


a) The time spent on training and fitting the model is far more than for other models such as the latent factor model and latent Dirichlet allocation.

b) Although this method can address the ‘cold start’ problem, the authors did not verify how well it works for newcomers who have only a few reviews.

Inefficient Hiring in Entry-Level Labor Markets [4]



  1. This article provides an evaluation of whether inexperienced workers would benefit from obtaining a job on a crowdsourcing platform to give them a chance to demonstrate their abilities.
  2. A field experiment was done on oDesk in which 952 randomly selected workers were hired and given either detailed or coarse (vague, not detailed) public evaluations.
  3. The paper assesses the impact of 'i) giving jobs to inexperienced workers and ii) giving the market more information about workers' job performance on their future employment outcomes and the market as a whole'
  4. The researcher identified that workers benefited from getting an experimental job because the hiring of inexperienced workers generates information about their abilities
  5. Novice workers were not being hired in sufficient numbers


  1. Workers with more detailed feedback on their completed tasks were more likely to be employed again
  2. Workers with more detailed feedback on their work were more likely to request higher wages. In fact the inexperienced workers' earnings approximately tripled as a result of obtaining a job
  3. Providing workers with more detailed evaluations increases how much they make and how much they ask for
  4. Giving more detailed information about worker quality makes workers more valuable to firms
  5. The researcher states that “detailed performance evaluations helped those who performed well and hurt those who performed poorly”. However, I’m not sure how she can draw that conclusion, because she had previously mentioned that she wasn’t allowed to provide detailed evaluations to workers with low ratings.


  1. Inefficiently low hiring of novice workers led to diminished employment and output in oDesk
  2. This was not outlined in the article, but with a reputation system like this there is an inherent risk for requesters in hiring workers with no previous ratings or experience. What other mechanisms need to be in place to identify reputation?
  3. Discovering a worker’s ability is similar to general skills training; both produce future productivity benefits but require up-front investments.

Angela Richmond-Fuller (ARF) asks who would foot the bill for this. Should requesters be encouraged to hire newbies in some way, or should it be the role of the platform to provide the first task for novice workers to complete? ARF proposes that Daemo include some sort of incubator space where newbies take on tasks whilst supervised or in small groups with other newbies, so they can gain their first reputation ratings, if star ratings are something we go ahead with implementing.

Rating Friends Without Making Enemies [5]

Surabhi + Kajal


The paper focuses on the concept of rating amongst peers on ‘Couchsurfing’, an online social networking platform. ‘Couchsurfing’ is a platform where travelers can meet hosts who are willing to lend them a couch to stay on. Users of such a platform have high concerns about trust and friendship amongst their network. The platform addresses these concerns with a rating system based on ‘friendship’ and ‘trust’. The paper discusses the following:
- Any relation and or pattern between ‘friendship’ and ‘trust’
- Reciprocal ratings
- Reaction to ratings
- Public versus private rating system and its effect on the reciprocal ratings
- Quantification of relationships on a predefined scale
- The effect of language and culture on ratings
- Effect of ratings on others' reputation
- Timing of ratings relative to the evolution of a relationship


  • Close friendship involves trust but high trust levels do not necessarily need close bonds
  • Publicly shown ratings are found to be more reciprocal than ratings that were held in private
  • Trust scores between users are correlated with the similarity of the users’ profiles and preferences
  • Trust was found to be more amenable to quantify in ‘levels’ than friendship


  • Too little incentive to provide an honest opinion
  • Difficulty in quantifying relationships on a predefined scale
  • Concern over a friend’s reaction to a rating
  • Lower private trust levels were observed on the basis of lower reciprocating of vouches
  • Lack of incentive for leaving negative references

Detailed wiki: [6]

Reputation Transferability in Online Labor Markets [7]

Claudia Flores Saviaga (@claudiasaviaga)


  • In online marketplaces such as oDesk, AMTurk and TaskRabbit, employers post tasks on which contractors work and deliver the product of their work online.
  • In these marketplaces, reputation systems play an important role in instilling trust and are often used by employers to predict a worker's future performance. However, the tasks available in such marketplaces span a variety of different categories, which leaves the employer guessing how reputation earned in other task categories maps to the category at hand.
  • This paper analyzes how past, task-specific reputation can be used to predict future performance on different types of task.
  • The paper explores the following questions:

- Are reputations transferable across categories and predictive of future performance?

- How can we estimate task affinity and use past information to best estimate expectations of future performance?

  • To answer these questions, the authors use a set of real transactional oDesk data consisting of over a million real transactions across six different categories from the oDesk marketplace (Software Development, Web Development, Design and Multimedia, Writing, Administration, and Sales and Marketing). The data was collected between September 1 and September 21 of 2012.
  • It is important to remember that in oDesk, the employer, upon task completion, supplies feedback scores -- integers between 0 and 5 in the following six fields: Availability, Communication, Cooperation, Deadline, Quality, and Skills. The average of these scores represents the final 'star rating' of the worker.
  • The authors analyzed the information using a binomial model and a multinomial model from the assumption that the latent qualities of the workers are static, in the sense that they assign equal weights to past ratings. Then they use a linear dynamical system (LDS), to take into account that as users complete more and more tasks, their more recent tasks are more predictive than their initial and older completed tasks. As expected due to its simplicity, the binomial model performs worse than the multinomial model, which in turn performs worse than the LDS.
  • The paper concludes that reputation systems can be improved if the feedback scores of the participating users are adjusted to take into account the type of task that a worker is expected to complete (or has completed), as well as the user's past category-specific performance history.
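The star rating and the idea that recent tasks should count more can be sketched as follows. The six field names come from the description above; the exponential decay is only a simple stand-in for the recency effect and is not the linear dynamical system the authors use. The `decay` parameter and the sample scores are assumptions.

```python
FIELDS = ("availability", "communication", "cooperation",
          "deadline", "quality", "skills")

def star_rating(scores):
    """Average of the six feedback fields for one completed task."""
    return sum(scores[field] for field in FIELDS) / len(FIELDS)

def recency_weighted(ratings, decay=0.8):
    """Weighted mean of star ratings, oldest first.

    The newest rating gets weight 1, the one before it `decay`, then
    decay**2, and so on. decay=1.0 reproduces the static assumption in
    which all past ratings carry equal weight.
    """
    n = len(ratings)
    weights = [decay ** (n - 1 - k) for k in range(n)]
    return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)

# Made-up feedback for a single task.
task = {"availability": 5, "communication": 4, "cooperation": 5,
        "deadline": 4, "quality": 5, "skills": 4}
print(star_rating(task))
print(recency_weighted([2.0, 5.0], decay=0.5))
```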


  1. The paper shows a clear approach for analyzing the correlations between different task categories and, as a result, provides a more accurate estimate of a worker's performance in a new category. This can allow employers to make safer and better-informed decisions about which workers to hire.
  2. The authors also suggest that their approach can also be used to recommend workers to apply for tasks that are seemingly out of their scope but for which these contractors are highly likely to provide successful outcomes.


  1. It does not address the problem that new workers face when they first join these online markets and have no reputation at all.

Teapot - Trust Network (Stanford U. research project) [8]

Aditi Mithal and @dilrukshi


Teapot : "transactions and exchanges along paths of trust"

This platform was developed to enable trusted interactions on the web, as a Stanford research Project. Every aspect of transaction and communication involves trust, it is an important factor while buying and selling a product or assigning a task and getting the right results. The question is, how can one analyse the trust factor existing between people? Teapot is a trust network which decides who to trust and who not to by analyzing the online interactions and social network of the users.

Teapot finds a "Trust Path" between the users and establishes a trust network which determines how much a user is likely to trust the user on the far end. It captures the idea of transitivity of trust wherein it determines the level of connectivity that exists between the two users on the basis of mutual friends. This also determines the trust score. Having a "shared background" enables stronger trust relations because one is likely to trust someone having the same background. Any references made by the users for building stronger trust relations are also taken into account by this platform.
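How Teapot actually computes trust paths and scores is not described here, so the following is only a sketch of the general idea: a breadth-first search over a friendship graph that returns the shortest chain of mutual acquaintances between two users. The graph layout and all names are hypothetical.

```python
from collections import deque

def trust_path(friends, source, target):
    """Return the shortest chain of friends from source to target, or None.

    `friends` maps each user to the users they are directly connected to.
    """
    if source == target:
        return [source]
    parents = {source: None}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for friend in friends.get(person, ()):
            if friend in parents:
                continue  # already reached by a path at least as short
            parents[friend] = person
            if friend == target:
                # Walk back through the parents to rebuild the path.
                path = [friend]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(friend)
    return None

# Hypothetical network: "you" reach the designer through "ann".
network = {
    "you": ["ann", "bob"],
    "ann": ["you", "designer"],
    "bob": ["you"],
    "designer": ["ann"],
}
print(trust_path(network, "you", "designer"))
```

A shorter path suggests a closer connection, so path length could serve as one input to a trust score alongside shared background and references.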


  1. Everyone likes to trust someone they know, and this works like a powerful recommendation system. Teapot uses this feature of connectivity and trust existing in relationships to make user's transaction and experience better.
  2. In a trust network, one would not likely indulge in fraudulent activities as it would disrupt all his/her connections. So, this eliminates unwanted users and online fraudulent practices.
  3. Teapot solves the cold-start problem by making trust portable across marketplaces. (paper ref.) [Also, Teapot provides access to its reputation system through simple, easy-to-integrate web-based APIs.]
  4. Transactions based on such a trust network reduce anxiety and help the trust network grow.


  1. There is a possibility that weaker intermediate relations are found between the users which is not enough to establish healthy trust relations.
  2. Social Network: Not everyone is a member of a social networking site (like Facebook). The platform takes into consideration the social circle existing on Facebook only, which eliminates the possibility of more existing connections.

Dilrukshi's view

My takeaway from this project is very similar to what we envision during our group hangouts in Team iceLearn. We all work on the basis of trust and knowing each other.

In this case with an example: let's say you want to hire a web designer. You log on to the site and add your task description. When you get the results from workers interested in doing this job, imagine that you found a person known by your friend or friend's friend. It's human nature to trust and therefore to tend to give the work to such person even if he/she is new to the task.

If we can integrate to render who is known to whom in our platform as one factor for the reputation, I think it may carry a certain value to the reputation. (It is one of the factors we should look at while having many other factors to decide reputation)

Teapot trust.png

Have you Done Anything Like That? Predicting Performance Using Inter-category Reputation [9]

Rahul Sheth


  1. In the current reputation system, the reputation of a worker when he is exploring a new category is determined by uniformly averaging all of his/her ratings in the previous categories that he/she has been involved in.
  2. The researchers believe this method is inherently flawed because a worker may have completed tasks in categories that have no relevance to the new category. For example, the researchers believe that it makes no sense that reputation is based on a completed task in language translation and reputation based on a completed task in website design have the same weight when a worker attempts to complete his first task in mobile application development.
  3. To solve this, the researchers created a new model that assigns weighting to previous tasks based on their relevance to the new category.
  4. Note that ratings would now have to be category specific if we intend on using this model.
  5. The model performed well: the researchers report a 47% increase in accuracy in an isolated scenario, and a 43.5% increase in accuracy using oDesk data.
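The difference between uniform averaging and relevance weighting can be sketched as below. The relevance weights are made-up numbers; in the paper they would be learned from data, and the exact model is not reproduced here.

```python
def uniform_reputation(history):
    """Status quo: plain average of all past ratings."""
    return sum(rating for _, rating in history) / len(history)

def weighted_reputation(history, relevance, target):
    """Average past ratings, weighted by relevance to the target category.

    `history` is a list of (category, rating) pairs. `relevance` maps
    (past_category, target_category) to a weight in [0, 1]; pairs that are
    missing count as irrelevant. Returns None when nothing is relevant.
    """
    numerator = 0.0
    denominator = 0.0
    for category, rating in history:
        weight = relevance.get((category, target), 0.0)
        numerator += weight * rating
        denominator += weight
    return numerator / denominator if denominator > 0 else None

# Hypothetical worker moving into mobile application development.
history = [("translation", 5.0), ("web design", 3.0)]
relevance = {("translation", "mobile dev"): 0.25,
             ("web design", "mobile dev"): 0.75}
print(uniform_reputation(history))
print(weighted_reputation(history, relevance, "mobile dev"))
```

The `None` case corresponds to a worker with no relevant history, which is the cold-start situation raised in the cons below.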


  1. There is evidence of increased accuracy from the baseline (status quo methods) using their model, both on a random scale and on a real-world application scale.
  2. This works with both sparse and dense data, so the approach should hold up regardless of how many data points are available.
  3. Filters out information that may be irrelevant to the task at hand.


  1. Can only be used to modify rating scores shown to employers. Essentially may be a limited approach for our diverse website.
  2. May filter out some important information as well.
  3. If a worker has never completed a task in a relevant category before, he/she will be disincentivized from completing a new task since there is a low chance that he/she will be hired. A decrease in worker freedom in this way could push people away from our interface.

Expansion on Con 2: Although categories like language translation may have no relevance to a programming task in terms of expertise, other factors are still relevant, such as a worker's dedication, communication skills, and trustworthiness. In this setup, these factors are not reflected because they are part of a seemingly irrelevant category.

Content-Driven Reputation for Collaborative Systems [10]

Ryan Compton


  • Main goal is to judge users by their actions, rather than by the word of other users. Users gain or lose reputation according to how their contributions fare; users whose work is undone lose reputation.
  • In content driven reputation systems every user is turned into an active evaluator of other users’ work by the simple act of contributing to the system. Furthermore, a content driven reputation system is resistant to bad mouthing and attacks.
  • The authors' main contribution is the idea and formulation of a pseudometric which automatically judges users' reputation. They demonstrate a pseudometric on document versions, specifically document versions of Wikipedia files and git repositories. Pseudometrics satisfy two requirements:
  1. Outcome preserving: If two versions look similar to users, the pseudometric should consider them close. In particular, it assigns a distance of 0 to versions that look identical.
  2. Effort preserving: If a user can transform one version of a document into another via a simple transformation, the pseudometric should consider the two versions close.
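The idea that an edit is judged by what later contributors do with it can be sketched with a simple distance function. Word-level edit distance is used here only as a stand-in for the authors' pseudometric, and the quality formula is an assumption that captures "work that is undone loses reputation"; neither is claimed to be the paper's exact definition.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two document versions."""
    a_words, b_words = a.split(), b.split()
    previous = list(range(len(b_words) + 1))
    for i, word_a in enumerate(a_words, 1):
        current = [i]
        for j, word_b in enumerate(b_words, 1):
            current.append(min(
                previous[j] + 1,                       # delete word_a
                current[j - 1] + 1,                    # insert word_b
                previous[j - 1] + (word_a != word_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]

def edit_quality(before, after, later):
    """Score an edit from `before` to `after`, judged by a `later` version.

    Positive when the edit moved the document toward the later version,
    negative when the later version undid it.
    """
    work = edit_distance(before, after)
    if work == 0:
        return 0.0
    gain = edit_distance(before, later) - edit_distance(after, later)
    return gain / work

before = "the cat sat"
after = "the black cat sat"
print(edit_quality(before, after, "the black cat sat down"))  # edit kept
print(edit_quality(before, after, "the cat sat"))             # edit reverted
```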


  • Overcomes known issues with user generated ratings which are:
  1. User ratings can be quite sparse
  2. Gathering the feedback and the ratings requires secondary systems outside of the main system
  • Their pseudometric was mathematically proven to be resistant to known types of attacks (Sock Puppet accounts)
  • Reputation is preserved in accordance to a set timeline


  • No field data on running this reputation system
  • Two main requirements for such a system:
  1. Ability to embed document versions in a metric space, so that distance is both effort-preserving and outcome-preserving.
  2. Presence of patrolling mechanisms that ensure that the system does not have “dark corners” (when all edits are viewed in timely fashion by honest users)
  • While the pseudometric is shown to work well for documents as well as Git repository contributions, a general model is a hard problem, because more complex collaborative systems (they cite SolidWorks) would first need a fixed definition of a measurable contribution.

Liquidity in Credit Networks: A Little Trust Goes a Long Way [11]

Tejas Sarma

Summary and Main Points presented by the paper

  1. The paper compares and analyses some existing credit networks and provides mathematical models for them, from which the authors prove that the transaction failure probability in these networks is independent of the path along which transactions are routed.
  2. One of the main questions the authors answer is: how long can the network sustain liquidity, i.e., how long can the network support the routing of payments before credit dries up? They answer this question in terms of the long-term failure probability of transactions for various network topologies and credit values.
  3. The paper begins with a simple example of a credit network involving three nodes. The credit paths have been analysed and illustrated efficiently to explain the fact that routing payments in credit networks is identical to routing residual flows in general flow networks.
  4. This analysis is followed by answers to a few more questions that emerged. If the network is sufficiently well-connected and has sufficient credit to begin with, can we sustain transactions in perpetuity without additional injection of credit? How does liquidity depend upon network topology and transaction rates between nodes, and how does it compare with the centralized model described above? The paper presents analysis and simulations to support its points. Apart from these factors, other parameters such as the node bankruptcy probability in credit networks have been analysed with suitable proof.
  5. In the next step the authors provide their own model of a credit network, including a comparison of failure probability with equivalent centralized currency systems for the Star, Line, Cycle, Complete, Erdős-Rényi, and Barabási-Albert network topologies.
  6. The main analysis begins with an explanation of the Combinational Structure along with the various theorems associated under it.
  7. After this, the paper presents the steady-state analysis, with which the authors prove that the Markov chain induced by a symmetric transaction rate matrix Λ has a uniform steady-state distribution over C. This proof is based on the individual proofs of three theorems. The steady-state analysis is then extended to the various network topologies discussed earlier.
  8. After introducing credit networks as an alternative to a centralized currency infrastructure, the focus shifts to the next question: how does the steady-state failure probability in a credit network compare with that in a centralized infrastructure? This consists of an analysis of the equivalence between credit networks and centralized currency infrastructure, and a detailed comparison of liquidity between the various network topologies.
  9. After the analysis, the authors have provided quantitative simulation data as the results of simulating repeated transactions on credit networks from two well-studied families of random graphs.
  10. The simulation covers the analysis of the effect of Variation in Network Density, Credit Capacity, and Network Size which has been supported with ample graphical and pictorial data sets.
  11. Conclusions and further research paths: The paper generalizes the credit network model by allowing every node to print its own currency, and formulates and studies the question of long-term liquidity in this network under a simple model of repeated transactions. Using the notion of cycle-reachability, the authors have shown that routing payments in credit networks has a path-independence property. They also have shown that the Markov chain induced by a symmetric transaction regime has a uniform steady-state distribution over the cycle-reachable equivalence classes. They used this fact to derive the node bankruptcy probability in general graphs and the transaction failure probability in a number of network topologies. They have also shown that, except in cycles, these probabilities are within a constant factor of the corresponding values for an equivalent centralized payment infrastructure. Their analysis and simulations show that for a number of well-known graph families, the steady-state failure probability under reasonable transaction regimes is comparable to that in equivalent centralized currency systems. Thus, they conclude that, in return for the robustness and decentralized properties, this model does not lose much liquidity compared to a centralized model. They have also addressed the open problems related to liquidity in these networks.
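The statement that routing payments in a credit network is like routing residual flows can be illustrated with a three-node sketch. The dictionary layout and the function name are assumptions for illustration: `credit[(payer, payee)]` is taken to mean the amount the payer can still send directly to the payee.

```python
def pay_along_path(credit, path, amount):
    """Route `amount` from path[0] to path[-1] through the given nodes.

    Each hop uses up credit in the payment direction and frees the same
    amount in the reverse direction, as in a residual flow network.
    Returns False and changes nothing if any hop lacks enough credit.
    """
    hops = list(zip(path, path[1:]))
    if any(credit.get(hop, 0) < amount for hop in hops):
        return False
    for payer, payee in hops:
        credit[(payer, payee)] = credit.get((payer, payee), 0) - amount
        credit[(payee, payer)] = credit.get((payee, payer), 0) + amount
    return True

# Three nodes in a line, one unit of credit on each hop (made-up values).
credit = {("a", "b"): 1, ("b", "c"): 1}
print(pay_along_path(credit, ["a", "b", "c"], 1))  # succeeds
print(pay_along_path(credit, ["a", "b", "c"], 1))  # fails: credit used up
print(pay_along_path(credit, ["c", "b", "a"], 1))  # reverse payment restores it
```

A failed payment here corresponds to the transaction failure whose long-term probability the paper studies; payments in the opposite direction replenish credit, which is why liquidity can persist without new credit being injected.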


  1. The paper has covered most of the aspects and factors of credit networks that are required to support their conclusions.
  2. They have provided quantitative comparisons of various network topologies based on several factors that have been supplemented with detailed proofs of the theorems involved in their analysis.
  3. They also have managed to convince the readers that their proposed network model surpasses the conventional ones used presently.
  4. They have provided the necessary graphical and analytical data sets as a part of the basis of their research.
  5. On the whole, the paper has aimed to address the various conflicts present in conventional network models and provide alternatives for these models.


  1. The conclusions drawn have been based entirely on mathematical proofs and datasets. It fails to provide sufficient practical evidence to stabilize its conclusions.
  2. The paper could have been more descriptive in the analysis of the network topologies instead of maintaining their focus on only the mathematical aspects.
  3. In some parts, the paper seems to deviate from its main goal and analytical parameters.
  4. There are some hypothetical models provided, that could be modified for them to be practically applicable.

Opinion Mining Using Econometrics: A Case Study on Reputation Systems [12]

Rijul Magu (@rijul)


The paper analyzes the effect of textual feedback on the pricing power of merchants on Amazon from an economics point of view. Econometrics is used to determine the economic value of opinion phrases in order to evaluate the strength and polarity of opinions. It is noted that certain sellers are able to sell a given product at a higher price than their competitors as a direct result of greater apparent reputation. This difference in price is termed the "price premium".

A dataset of 9500 transactions was collected from Amazon Web Services between October 2004 and March 2005. Reputation profiles were constructed using numeric and textual feedback. A matrix of dimensions (features such as "packaging" assigned by numeric scores converted from adjective or adverb modifiers used by buyers) and the total number of postings was created for each seller. Corresponding to this matrix, a reputation score was obtained. To evaluate the level of association between the price premium and reputation score, ordinary-least-squares (OLS) regression with fixed effects was carried out. This allowed the authors to estimate the economic "values" of modifiers when used for specific dimensions.

The R-squared fit of the regression was examined with and without the use of text variables, to check whether text carries extra information over the traditional star rating system. The R-squared value without text variables was found to be 0.35, while the value obtained by using only text-based regressors was 0.63, indicating an increase in information. An experiment was carried out to predict, given two sellers selling the same product, which seller would make the sale. Accuracy rises by 15 percentage points (from 74% to 89%) when text reputation is used along with price and numeric reputation.


  • Evaluates the strength of opinions, rather than just the polarity from an economics standpoint.
  • A significant increase in accuracy of prediction implies that the approach of the authors better measures reputation than the traditional star system.


  • The authors do not specify the region over which the study was held. A particular phrase may express a considerably different strength and polarity of opinion in a different area.
  • The relative positioning of sellers on the website is not accounted for in the algorithm, which may affect results.
  • The approach does not factor in the effects of astroturfing.

I rate you. You rate me. Should we do so publicly? [13]



This paper describes how reciprocity plays an important role in building a reputation system. The author analyzes 3 data sets, that is, , Epinions and. The analysis shows how a rating can change a user's mind, whether about buying a product or trusting another user, and how anonymous and non-anonymous ratings work. Non-anonymous ratings cannot be neglected, but anonymous ratings are equally or even slightly more helpful in building trust. The paper also shows how factors like gender, age, and geographic location affect ratings. Whether anonymous or non-anonymous, ratings should never be taken at face value; one should examine the context in which they were given before making a decision.


  • Reciprocity generally helps in building reputation system.
  • Vouching is one good way of building reputation system.
  • Anonymous users tend to give good and genuine reviews since their name is not involved.


  • Best friends tend to give good ratings to each other, compared to the unknown person despite seeing the latter's work.
  • Inaccurate ratings can result in trusting a user or product despite the possibility of poor quality.
  • Non-anonymous users can give a bad rating to a person as revenge even though the person is good.

Trust Among Strangers in Internet Transactions: Empirical Analysis of eBay’s Reputation System [14]

Aditi Nath + Ankita Sastry


  • This paper talks about the reputation system that exists for the eBay platform and tries to explain how this system works with empirical data from 1999.
  • It compares internet-based reputation systems with the reputation mechanisms of traditional systems.
  • It states that a reputation system must meet three challenges:
  1. Provide information that allows buyers to distinguish between trustworthy and non-trustworthy sellers
  2. Encourage sellers to be trustworthy, and
  3. Discourage participation from those who aren’t trustworthy
  • Some results of analysis of the eBay dataset taken from 1999 are:
  1. Despite incentives to free ride, feedback was provided more than half the time.
  2. Well beyond reasonable expectation, it was almost always positive.
  3. Reputation profiles were predictive of future performance.
  4. Although sellers with better reputations were more likely to sell their items, they enjoyed no boost in price.
  5. There was a high correlation between buyer and seller feedback, suggesting that the players reciprocate and retaliate.
  • Although the reputation system is in place, nobody seems to be aware of exactly how it works. This knowledge is not necessary for the system to succeed; participants just need to believe that it works.

How eBay works

  1. Register with email id and username (can be any name/moniker)
  2. Neither buyer nor seller can see real name or address information
  3. Initially anybody was allowed to give feedback about a buyer or seller; later the rule was changed so that feedback has to be tied to a particular transaction, i.e. only the seller and the winning bidder can give feedback about each other
  4. Reputation of buyer is far less important as sellers can hold goods until they are paid and it would do no good for the seller to sell based on buyer’s reputation since it is not possible to exclude buyers with bad reputation from their auctions.


  1. What matters is not how the system works, but how its participants believe it works, or even whether they believe it works without any concern about why. In the paper the authors make an analogy drawn from grander considerations: the behavior of man in a world without a God might be fully moral and God-fearing if its denizens believed there was a God who would judge them and possibly punish them in the hereafter.
  2. Internet auction sites have developed an ingenious feedback system which enables sellers to build reputations from satisfied customers, thus making up for the lack of traditional feedback mechanisms.
  3. The system is better off the way it currently is i.e. mildly dissatisfied buyers do not record their dissatisfaction. If this were done honestly by each buyer, it might destroy the overall faith that people have in the marketplace.


  1. Significant trust is required to conduct transactions, and the instruments normally available to sellers and buyers in traditional markets are not available in internet transactions.
  2. Getting feedback is not a problem, since it is given frequently enough; the concern is the value of that feedback.
  3. This paper examines a dataset from 1999, which might no longer be relevant.
  4. The disincentive to give negative feedback might be far stronger, since users fear lawsuits and retaliatory feedback.
  5. One concern of ours that the paper does not mention: if negative feedback is not given honestly, the ratings will not reflect the true value of sellers and buyers and might mislead the people who use the platform.

Detailed wiki:[15]

A System for Scalable and Reliable Technical-Skill Testing in Online Labor Markets [16]

Vineet Sethia


The paper focuses on scalable and reliable technical-skill testing in online labor markets. To keep skill testing reliable, the system mines questions from Q/A sites like Stack Overflow and selects those that could serve as good test questions for a particular skill. It algorithmically identifies threads that are promising for generating high-quality assessment questions, then uses a crowdsourcing system to edit these threads and transform them into multiple-choice test questions. To assess the quality of the generated questions, the authors employ Item Response Theory and examine not only how predictive each question is of the internal consistency of the test, but also how it correlates with future real-world market-performance metrics. To keep skill testing reliable over the long run, the platform re-purposes questions from online question banks and sites to generate tests.
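To make the Item Response Theory step more concrete, here is a minimal sketch of how a question's quality can be quantified. It assumes the common two-parameter logistic (2PL) model; this summary does not say which IRT model the paper actually uses, and the item names and parameter values below are made up for illustration.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL model: probability that a test taker with ability `theta`
    answers correctly an item with discrimination `a` and difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability `theta`.
    Higher values mean the item separates nearby ability levels better."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical items (not from the paper): same difficulty, different discrimination.
items = {"q_sharp": (2.0, 0.0), "q_flat": (0.3, 0.0)}

for name, (a, b) in items.items():
    # Gap in success probability between a strong (+1) and a weak (-1) test taker.
    gap = p_correct(1.0, a, b) - p_correct(-1.0, a, b)
    info = item_information(0.0, a, b)
    print(f"{name}: gap={gap:.2f}, information at theta=0: {info:.4f}")
```

Under this assumption, a question like `q_sharp` is a better test question than `q_flat`: strong and weak test takers answer it very differently, so it carries more information about ability.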


1. The platform can decrease the impact of cheating.
2. It creates questions that are closer to the real problems the skill holder is expected to solve in real life.
3. It uses Item Response Theory to evaluate the quality of the questions.
4. It uses external signals about the quality of the workers to examine the external validity of the generated test questions.
5. Questions that have external validity also have strong predictive ability for identifying, early on, the workers who have the potential to succeed in online job marketplaces.
6. The cost is lower than that of producing such questions from scratch using experts.


1. Licensing is required for using questions from existing test banks.
2. A qualitative assessment of the features is required.
3. Experts may be needed to verify the problems and solutions from the test banks before using them in the platform.
4. The reputation of the platform is directly affected by the reputation of the question banks/sites.

Detailed wiki: A System for Scalable and Reliable Technical-Skill Testing in Online Labor Markets Paper Summary

The Utility of Skills in Online Labor Markets [17]



I brought a lot of baggage to this article, primarily in the form of a preconceived notion. I felt that the best way to build a reputation system was through mechanisms of self-awareness: steps that would reduce the dissonance between perception and reality. Even with the limitations of scale and user adoption, I was convinced that was the path to uniqueness and fairness. I was wrong. This article and the critical conversations I had with Alison around complexity and ML not only helped me translate the article but also slowly convinced me of the value of quantitatively driven big-data approaches as the optimal scalable solution to our reputation system problem.

This particular article appeals to me because it proposes that reputation is tied to a relationship between skills and experience, a very tangible and identifiable relationship we see all the time. Drawing on the factors mentioned (certification, feedback score, hire rate, rehire rate, wage), which can be expanded to include timeliness, accuracy, volume/activity, and other factors, the correlation between skill and experience translates the expectations one has of such a relationship into a standardized set of metrics: metrics that predict behavior, satisfaction, success, etc., and that can be deployed to align workers and requesters based on task completion criteria.

Listening to the Rep Sys call on Monday (August 3, 2015), I noticed that the effort to integrate opinion/feedback into the system required a significant amount of manipulation, caveats, or mechanisms to temper the impulses of reputation inflation. Why do we need that component in the system at all? If a worker can do the job or did the job (within the terms of the exchange), shouldn't that be the emphasis of the system? And can't that be measured quantitatively/algorithmically? It may be ironic, but the less human interaction/intervention we have in the system, might there be more trust in the system? If we must engage workers and requesters, we can engage them using questions: would you rehire/work with this person again, would you recommend them to another requester/worker, etc. If we focus on a set of factors that can be applied to both workers and requesters (certification, feedback score, hire rate, rehire rate, wage, timeliness, accuracy, volume/activity, etc.), the system can align workers and requesters with a high level of accuracy. For example, when a requester is entering a project, the system can ask for the requester's priorities (timeliness, accuracy, price, etc.). Based on the skills needed for the job, workers who possess those skills and are timely, accurate, and within the price range would be presented as the first options. This gives the requester what they want and encourages the workers to focus on what is important: getting the work done.
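The matching idea in the example above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Worker` fields, the 0..1 metric scales, the linear weighting, and the sample workers are all hypothetical and do not come from the paper or from any existing platform.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    skills: set
    timeliness: float   # share of jobs delivered on time, 0..1 (hypothetical metric)
    accuracy: float     # share of work accepted without rework, 0..1 (hypothetical metric)
    hourly_rate: float


def rank_workers(workers, required_skill, priorities, max_rate):
    """Return workers who have the skill and fit the price range, best match first.

    `priorities` maps "timeliness", "accuracy", and "price" to the weights
    the requester gave when entering the project.
    """
    total = sum(priorities.values())
    if total <= 0:
        raise ValueError("at least one priority weight must be positive")
    eligible = [w for w in workers
                if required_skill in w.skills and w.hourly_rate <= max_rate]

    def score(w):
        metrics = {
            "timeliness": w.timeliness,
            "accuracy": w.accuracy,
            "price": 1.0 - w.hourly_rate / max_rate,  # cheaper scores higher
        }
        return sum(priorities.get(k, 0.0) * v for k, v in metrics.items()) / total

    return sorted(eligible, key=score, reverse=True)


workers = [
    Worker("ana", {"python"}, 0.95, 0.80, 20.0),
    Worker("ben", {"python"}, 0.70, 0.98, 25.0),
    Worker("cy", {"design"}, 0.99, 0.99, 15.0),   # lacks the required skill
    Worker("dee", {"python"}, 0.90, 0.90, 60.0),  # outside the price range
]

wants_speed = {"timeliness": 3, "accuracy": 1, "price": 1}
wants_quality = {"timeliness": 1, "accuracy": 3, "price": 1}
print([w.name for w in rank_workers(workers, "python", wants_speed, 30.0)])
print([w.name for w in rank_workers(workers, "python", wants_quality, 30.0)])
```

The same pool of workers is ordered differently depending on what the requester says matters most, and no opinion-based feedback is involved in the ranking.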


It is dynamic and takes into account the accumulation of skills, which can help with aligning newbies as well as with the tricky proposition of leveling people up. The complexity of the initial research can be supplemented with additional factors. It's flexible, in that, if we so decide, the data sets and processing algorithms can be re-purposed for other modules in the system (pricing, leveling, learning). It relies on data, not opinion and feedback; hence no reputation inflation. Removing that dynamic increases trust in the system, which in turn improves the trust relationship between worker and requester.


It is complicated! It will take a lot of time, human power and data to make it work. But when it does, it will be a game changer.

On Assigning Implicit Reputation Scores in an Online Labor Marketplace [18]


Acronyms used

WR = WorkerRank reputation system

ER = explicit reputation approach (5-star rating --current reputation system used in oDesk, etc.)

JAEp = job application evaluation phase

JC = job completion

IS = implicit signal/s


This paper proposes WorkerRank, an alternative reputation system that lets employers provide implicit signals (weighted, anonymous ratings) during JAEp, rather than the typical explicit signal (5-star feedback used in ER) after JC. Using live oDesk data, WR's experimental results showed broader coverage (more workers received reputation ratings compared to ER) and employers' hiring decisions were facilitated.

This paper echoes my idea of using an anonymous feedback model and cements my belief that we should try using such a model, or even integrating it, to make our platform's reputation system more robust and comprehensive.


  • I think the best benefits of using WR are improving coverage and tackling one of the cold-start problems (that is, a worker has no reputation upon joining an online jobs platform). WR gives ratings on significantly more workers than ER, increasing the reputation information of the platform's worker population. WR provides reputation to workers even if they don't get hired or have never been hired on the platform (except workers who don't apply).
  • IS are anonymous and therefore put no pressure on employers to provide rosy or rosier-than-normal feedback.
  • If employers give accurate and thorough feedback by way of IS, WR could help improve the quality of reputation on a platform. Unlike explicit ratings (feedback after JC, as used in ER), implicit signals during the JAEp could show reduced or no skew towards high or perfect ratings.
  • oDesk does allow employers to make such signals during the JAEp, but they are explicit, vague, inaccurate, and/or limited (oDesk asks the client during JAEp, and among the several feedback options, the only relevant one that unhired/rejected workers would receive is: "There is no qualified applicant for this project"). WR builds upon this existing approach and improves it by being anonymous, implicit, relevant, reusable, and more detailed.


  • Employers might not have the time, or might deliberately not take the time, to provide the needed IS during the JAEp, since providing feedback on more applicants (beyond those an employer would hire and/or interview) could take up considerable time.
  • IS during JAEp would not be an adequate or accurate representation of a worker's actions within a job/project, since IS may not have a consistent or strong correlation with worker performance. Employer-derived IS might not be accurate, and worker performance could be better or worse throughout the project (and can regularly change); none of this would be reflected if there is no feedback at JC.


The choice of criteria and weighting scheme applied to WR would determine whether this reputation system is better or worse than current reputation systems.
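To illustrate why the weighting scheme matters, here is a minimal sketch of turning employer actions during the JAEp into a per-worker score. It is not the paper's WorkerRank formula: the action names, the weights in `SIGNAL_WEIGHTS`, and the simple averaging are assumptions chosen only to show the idea.

```python
from collections import defaultdict

# Hypothetical weights per employer action during the JAEp (not from the paper).
SIGNAL_WEIGHTS = {
    "offer": 1.0,
    "interview": 0.6,
    "shortlist": 0.4,
    "ignore": 0.0,
    "reject": -0.5,
}


def implicit_scores(events, weights=SIGNAL_WEIGHTS):
    """Average the weights of all implicit signals each worker received.

    `events` is an iterable of (worker_id, action) pairs, one per
    employer action on a job application.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for worker_id, action in events:
        totals[worker_id] += weights[action]
        counts[worker_id] += 1
    return {worker_id: totals[worker_id] / counts[worker_id] for worker_id in totals}


events = [
    ("w1", "shortlist"),
    ("w1", "interview"),
    ("w2", "reject"),
    ("w2", "ignore"),
    ("w3", "offer"),
]

scores = implicit_scores(events)
for worker_id, value in sorted(scores.items()):
    print(worker_id, round(value, 2))
```

In this toy data only `w3` received an offer, yet all three applicants end up with a score, which is the coverage benefit described above. Changing the numbers in `SIGNAL_WEIGHTS` changes every score, so the weights would need to be validated against real outcomes.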

Jsilver's ideas

  1. We should find a way to incentivize employers to give implicit feedback on as many applicants as possible during the JAEp. Some employers do not reject, interview, shortlist, ignore, or make offers to workers.
  2. If we use IS (as stated in the paper) for workers, we could also apply a similar IS approach to rate and identify top employers. My idea is, employers who do take the time to give implicit ratings during the JAEp would be implicitly rated algorithmically, and those who don't would be rated accordingly as well.

Strategic Formation of Credit Networks [19]










See relevant links at [20]