From crowdresearch

Problem Identified

Reputation systems on platforms like oDesk and Amazon Mechanical Turk summarize a worker's past reviews. A problem arises when a worker attempts a task in a new category: with no completed tasks or reviews in that category, the system must predict the worker's reputation. The current method of prediction is to average the worker's reputation across all categories. This is problematic because some categories have little relevance to the new one. For example, a worker taking on their first programming task should not have their predicted reputation based on English-to-Spanish translation work.


The research study "Have you Done Anything Like That? Predicting Performance Using Inter-category Reputation" proposes a solution to this problem: assign each previous category a weight based on its relevance to the new category, then use the weighted average of the worker's reputation across categories as the predicted reputation for the new category.
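The weighted-average idea can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation; the category names and relevance weights are made up, and the fallback to a plain average when no weights are available stands in for the baseline method described above.

```python
# Sketch of relevance-weighted reputation prediction (illustrative only).

def predict_reputation(past_scores, relevance):
    """past_scores: {category: average rating for this worker};
    relevance: {category: weight in [0, 1] w.r.t. the new category}."""
    total_weight = sum(relevance.get(c, 0.0) for c in past_scores)
    if total_weight == 0:
        # No relevance information: fall back to the plain average (the baseline).
        return sum(past_scores.values()) / len(past_scores)
    return sum(score * relevance.get(c, 0.0)
               for c, score in past_scores.items()) / total_weight

# Hypothetical worker history and relevance weights for a "programming" task:
history = {"translation": 4.9, "data_entry": 4.2, "web_scraping": 4.6}
weights = {"translation": 0.1, "data_entry": 0.3, "web_scraping": 0.9}
print(round(predict_reputation(history, weights), 3))  # 4.531
```

Note how the high translation score barely moves the prediction, while the more relevant web-scraping score dominates it, which is exactly the behavior the plain average lacks.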


Summary

      1. Reputation systems are important because they instill trust and are often predictive of future satisfaction. 
      2. Labor markets are heterogeneous, so current methods of calculating reputation are inadequate. 
      3. With the proposed solution, accuracy improves by 47% over the baseline. 


Introduction

      1. Work is an experience good, so its quality is very hard to predict in advance. (Note that we address this with our prototype task design.) 
      2. Reputation systems exist to solve this problem. 
      3. However, it is hard to infer the likely outcome of a programming task from a worker's history as, say, a writer. 
      4. Basis of the project: using predictors such as ratings, we can build models that connect past performance to future performance. 
      5. The correlation between a previous task category and the new one determines that category's weight. 


Related Work

      1. Typically, reputation mechanisms either average out past performance or determine weights based on the size of each task. 
      2. The following approaches are used to estimate the helpfulness of online reviews: 
          a. Review length. 
          b. Unigrams: a probabilistic language model that scores each word independently of its context. 
          c. Product rating. 
          d. Readability tests: formulae for evaluating the readability of text, based on syllable and word counts. 
          e. Topical relevancy. 
          f. Reviewer expertise.
          g. Writing style.
          h. Timeliness.
          i. Author background. 
          j. Tsaparas's algorithms for producing a ranking of reviews. 
      3. The following approaches are used to estimate answer quality in community question answering: 
          a. Ratings. 
          b. Answer quality (a semi-supervised approach).
          c. Relevance.
          d. First identifying answer quality with MTurk workers, then training classifiers. 

Problem Formulation

      1. The researchers base their predictions on a Bayesian model that is binary in nature: a review is either positive or negative. 
      2. "Qij" is the predicted quality of work by user "i" in category "j". 
      3. Their first approach simply adjusts "Qij" up or down based on whether feedback is good or bad, which is often inadequate because real ratings are usually multinomial, like 5-star scales. 
      4. For their equation, they need concrete values of "Qij". One way is to take the mean of the distribution over "Qij"; another is to sample randomly from that distribution. 
      5. A common problem is data sparseness: as more categories appear, there are fewer data points connecting any given pair of categories. 
      6. To address this, categories can be grouped into L abstract groups. 
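The binary Bayesian model and the two ways of extracting a value of "Qij" (posterior mean vs. random sampling) can be sketched with a Beta-Bernoulli posterior. This is an illustrative stand-in, not the paper's exact formulation; the uniform Beta(1, 1) prior and the counts below are assumptions for the example.

```python
import random

# Illustrative sketch: treat Q_ij, the probability that user i receives
# positive feedback in category j, as a Beta random variable updated by
# binary (positive/negative) review outcomes.

def beta_posterior(positives, negatives, alpha=1.0, beta=1.0):
    """Return Beta parameters after observing binary feedback counts
    (starting from a uniform Beta(1, 1) prior)."""
    return alpha + positives, beta + negatives

def point_estimate(a, b):
    """Posterior mean of Q_ij -- the 'mean of the distribution' option."""
    return a / (a + b)

def sample_estimate(a, b, rng=random):
    """A random draw from the posterior -- the 'random sampling' option."""
    return rng.betavariate(a, b)

a, b = beta_posterior(positives=8, negatives=2)
print(point_estimate(a, b))  # 0.75 under the uniform prior
```

The mean gives a stable, repeatable estimate, while sampling preserves uncertainty: a worker with 8 of 10 positive reviews and one with 800 of 1000 share the same mean but produce very different spreads of samples.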

Synthetic Data Experiments

      1. They ran two different trials. 
      2. The first was built around dense data, which made the trial straightforward: all of the model's coefficients could be estimated directly, with no need for category-specific clusters.
      3. The second used sparse data, where they applied the hierarchical model to form clusters. 
      4. With the dense-data model, the results show the approach remains effective even as the number of categories grows. 
      5. The improvement is sustained in the sparse-data portion of the experiment. 
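A toy version of such a synthetic trial can show why correlation makes a sensible relevance weight. The setup below is entirely made up for illustration (it is not the paper's experimental design): workers' quality in the "new" category shares a latent skill with category A but not with category B, and the simulation recovers that A correlates far more strongly with the new category.

```python
import random

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

workers = []
for _ in range(500):
    skill = random.gauss(0, 1)          # latent ability shared with category A
    a = skill + random.gauss(0, 0.3)    # category A: strongly related
    b = random.gauss(0, 1)              # category B: unrelated
    new = skill + random.gauss(0, 0.3)  # the new category
    workers.append((a, b, new))

a_col, b_col, new_col = zip(*workers)
print(pearson(a_col, new_col) > pearson(b_col, new_col))  # True
```

In a weighted-average predictor, these recovered correlations would serve as the relevance weights, so category A's history dominates the prediction while category B's is mostly ignored.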

Real Data Experiments

      1. They then ran experiments on oDesk, an example of a crowdsourcing platform. 
      2. They found that the score distribution was skewed toward high ratings, because only highly rated workers tend to stay on the platform. 
      3. The experimental results show that in the best case accuracy increases by 43.5%, and in the worst case it increases by 13%. 

Discussion and Future Work

      1. One problem that could affect the results is that the model does not take time into account: a task completed five years ago carries the same weight as a task completed a week ago in the same category. 
      2. This likely reduces accuracy. 
      3. Because it was built on completed training tasks, the algorithm is best suited to adjusting the rating scores shown to employers, not to helping workers choose new tasks. 


Strengths

      1. There is evidence of an increase in accuracy. 
      2. The approach works with both sparse and dense data, so it remains applicable regardless of how many data points are available. 
      3. It filters out information that may be irrelevant to the task. 


Limitations

      1. It can only be used to modify the rating scores shown to employers. 
      2. It may also filter out important information. Although a category like language translation may have little relevance to a programming task in terms of technical expertise, it can still demonstrate transferable qualities such as dedication, effective communication, and trustworthiness. In this setup, that information is discarded along with the seemingly irrelevant category.