Summer Milestone 10 Vineet Sethia Reputation Systems Research And Exploration


A System for Scalable and Reliable Technical-Skill Testing in Online Labor Markets Paper Summary



Online labor markets face an increasing need to reliably evaluate the skills of participating users in a scalable way.

Current problems platforms are facing
1. Cheating is very common in unsupervised online testing, as test questions often “leak” and become easily available online along with their answers.
2. Technical skills, such as programming, require the tests to be frequently updated in order to reflect the current state of the art.
3. There is very limited evaluation of the tests themselves, and of how effectively they measure the skill that users are tested for.

Solutions our platform provides

We present a platform that continuously generates test questions and evaluates their quality as predictors of user skill level. Our platform leverages content that is already available on question-answering sites such as Stack Overflow and re-purposes these questions to generate tests. By continuously generating new questions, we decrease the impact of cheating, and we also create questions that are closer to the real problems that the skill holder is expected to solve in practice.
Our platform leverages Item Response Theory to evaluate the quality of the questions. We also use external signals about the quality of the workers to examine the external validity of the generated test questions: questions that have external validity also have a strong predictive ability for identifying early the workers who have the potential to succeed in online job marketplaces.

Experimental Evaluation Results

Our experimental evaluation shows that our system generates questions of comparable or higher quality than existing tests, at a cost of approximately $3 to $5 per question. This is lower than the cost of licensing questions from existing test banks, and an order of magnitude lower than the cost of producing such questions from scratch using experts.

Key Points

Dominant Problems in Online Labour Markets

Reputation systems are widely used for instilling trust among participants. A reputation system for an online labor market computes a reputation score for each worker based on a collection of ratings by employers who have hired them in the past. However, existing reputation systems are better suited for markets where participants engage in a large number of transactions (e.g., selling electronics, where a merchant may sell tens or hundreds of items in a short period of time). Online labor inherently suffers from data sparseness: most work engagements require at least a few hours of work, and many last for weeks or months. As a result, many participants have only a minimal number of feedback ratings, which is a very weak reputation signal. This lack of reputation signals creates a cold-start problem, forcing the departure of high-quality participants and leaving only low-quality workers as potential entrants.
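The data-sparseness problem above can be made concrete with a small sketch. This is not the paper's method, just a common illustration: a Bayesian-smoothed reputation score that shrinks a worker's average rating toward a platform-wide prior, so that a single lucky rating cannot dominate. `PRIOR_MEAN` and `PRIOR_WEIGHT` are assumed values for illustration.

```python
# Sketch (not from the paper): Bayesian smoothing to soften the
# cold-start problem caused by sparse feedback ratings.
PRIOR_MEAN = 4.0   # assumed platform-wide average rating (1-5 scale)
PRIOR_WEIGHT = 10  # assumed weight: prior counts as 10 "virtual" ratings

def smoothed_reputation(ratings):
    """Shrink a worker's average rating toward the platform prior.

    With few ratings the score stays near the prior; with many
    ratings it converges to the worker's raw average.
    """
    n = len(ratings)
    return (PRIOR_WEIGHT * PRIOR_MEAN + sum(ratings)) / (PRIOR_WEIGHT + n)

# A newcomer with a single 5-star rating is not ranked above a
# veteran with fifty ratings averaging 4.8:
newcomer = smoothed_reputation([5.0])
veteran = smoothed_reputation([4.8] * 50)
```

Even so, a newcomer needs many ratings before the score becomes informative, which is exactly the sparseness that motivates skill testing as an additional signal.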

In global online markets, skills credentialing is much trickier: verifying educational background is difficult, and knowledge of the quality of educational institutions on a global scale is limited. So today most online labor markets offer their own certification mechanisms. The goal of these tests is to certify that a given worker indeed possesses a particular skill. For example, eLance-oDesk and vWorker allow workers to take online tests that assess their competency across various skills (e.g., Java, Photoshop, Accounting) and then allow the contractors to display the achieved scores and rankings in their profiles. Unfortunately, online certification of skills is still problematic for a number of reasons, with cheating and leakage of questions being one of the biggest challenges: the reliability of tests whose answers are easily available through a web search is questionable.

Furthermore, it is common, even for expert organizations, to create questions with errors or ambiguities, especially if the test questions have not been properly assessed and calibrated with relatively large samples of test takers. Such problematic questions introduce noise into the user-evaluation process, hindering the correct assessment of the users' skill, and therefore need to be identified and excluded from the user-evaluation process. Finally, many people question the value of the existing tests as long-term predictors of performance, indicating that questions are calibrated only for internal validity (how predictive a question is of the final test score) and not for external validity (how predictive the question is of the long-term performance of the test taker).
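The internal/external validity distinction can be sketched with the classic point-biserial correlation: the same 0/1 item responses are correlated once with the total test score (internal) and once with a market signal such as hourly wage (external). The data below is made up for illustration; this is a standard psychometric statistic, not necessarily the paper's exact computation.

```python
# Illustrative sketch: internal vs. external validity of one test
# question via point-biserial correlation (made-up data).
import statistics

def point_biserial(binary_responses, continuous_metric):
    """Correlation between 0/1 item responses and a continuous metric.

    Uses the form r = (M1 - M) / s * sqrt(p / q), where M1 is the
    metric mean among correct responders, M the overall mean, s the
    population standard deviation, and p the fraction correct.
    """
    n = len(binary_responses)
    mean_all = statistics.mean(continuous_metric)
    sd = statistics.pstdev(continuous_metric)
    p = sum(binary_responses) / n
    if sd == 0 or p in (0.0, 1.0):
        return 0.0
    mean_correct = statistics.mean(
        m for r, m in zip(binary_responses, continuous_metric) if r == 1
    )
    return (mean_correct - mean_all) / sd * (p / (1 - p)) ** 0.5

# Responses of six test takers to one question:
item = [1, 1, 1, 0, 0, 0]
total_scores = [9, 8, 7, 4, 3, 2]        # internal signal: total test score
hourly_wages = [30, 28, 25, 15, 12, 10]  # external signal: market wage

internal_validity = point_biserial(item, total_scores)
external_validity = point_biserial(item, hourly_wages)
```

In this toy data the item correlates strongly with both signals; a question with high internal but low external validity would be one that predicts the test score without predicting marketplace performance.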

What our system offers

Our system mines questions from Q/A sites like Stack Overflow and selects questions that could serve as good test questions for a particular skill. It algorithmically identifies threads that are promising for generating high-quality assessment questions, and then uses a crowdsourcing system to edit these threads and transform them into multiple-choice test questions. To assess the quality of the generated questions, we employ Item Response Theory and examine not only how predictive each question is of the internal consistency of the test, but also its correlation with future real-world market-performance metrics.
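A minimal sketch of the thread-selection step, using hypothetical signals and weights (the paper's actual spotting algorithm is not detailed here): score each Q/A thread by features that plausibly make it a good question seed, then keep the top candidates for crowd editing.

```python
# Hypothetical "question spotter" heuristic: field names and weights
# are assumptions for illustration, not the paper's algorithm.
def spot_score(thread):
    """Return a promise score for a Q/A thread (higher = better seed)."""
    score = 0.0
    if thread.get("has_accepted_answer"):
        score += 2.0                                       # canonical correct answer
    score += min(thread.get("answer_count", 0), 5) * 0.5   # material for distractors
    score += min(thread.get("votes", 0), 50) * 0.1         # community validation
    if thread.get("body_length", 0) > 2000:
        score -= 1.0                                       # long threads are hard to edit
    return score

def select_seeds(threads, top_k=10):
    """Pick the top-k most promising threads for crowd editing."""
    return sorted(threads, key=spot_score, reverse=True)[:top_k]

good = {"has_accepted_answer": True, "answer_count": 4,
        "votes": 30, "body_length": 800}
weak = {"has_accepted_answer": False, "answer_count": 1,
        "votes": 2, "body_length": 3000}
```

The selected threads would then flow into the crowdsourced editing step, where workers turn each thread into a multiple-choice question with plausible distractors.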

Essentially, our system is composed of two main parts that can also function independently: the question generation and the question evaluation component. We introduce the following novel aspects for question generation:
• By utilizing Q/A threads as question seeds, we can continuously update our question bank with up-to-date questions related to fast evolving technical topics.
• By using actual Q/A threads as inspiration, we are testing for concepts that are proven to be non-trivial in the real world.
• By leveraging Q/A threads into test-questions, we achieve much lower costs to generate a question compared to employing experts.
• By continuously monitoring the Internet for leaked questions, we can quickly eliminate opportunities for cheating.

We also introduce the following novel aspects for question evaluation:
• By utilizing exogenous ability metrics, such as wages, we evaluate questions as predictors of market performance metrics.
• By continuously evaluating the test-questions we also find questions that have been leaked, since such questions suddenly lose their ability to discriminate between users with different ability levels.
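The leak-detection idea in the last bullet can be sketched as follows: a leaked question suddenly stops discriminating, because low-ability users start answering it correctly. The classic discrimination index and the drop threshold below are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: flag questions whose discrimination drops sharply over time,
# which is the signature of a leaked question.
def discrimination(responses):
    """Classic discrimination index: P(correct | high ability) minus
    P(correct | low ability). `responses` is a list of
    (ability_is_high, answered_correctly) pairs."""
    high = [c for h, c in responses if h]
    low = [c for h, c in responses if not h]
    if not high or not low:
        return 0.0
    return sum(high) / len(high) - sum(low) / len(low)

def looks_leaked(before, after, drop_threshold=0.3):
    """Flag a question whose discrimination drops by drop_threshold
    or more between two time windows."""
    return discrimination(before) - discrimination(after) >= drop_threshold

# Before the leak, low-ability users rarely answer correctly;
# afterwards they mostly do, and discrimination collapses:
before = [(True, 1)] * 8 + [(True, 0)] * 2 + [(False, 1)] * 2 + [(False, 0)] * 8
after = [(True, 1)] * 8 + [(True, 0)] * 2 + [(False, 1)] * 8 + [(False, 0)] * 2
```

Continuously recomputing such statistics over recent response windows is what lets the system retire leaked questions quickly.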

Tools & Workflow

The architecture, components, and workflow of our platform

1. Question Ingestion Component
2. Question Editor
3. Question Reviewer
4. Question Bank: Experimental and Production
5. Quality Analysis
6. Cheating/Leak Detection

Question Generation Process
1. Stack Exchange
2. Question Spotter

Question Quality Evaluation

Our system can scalably generate a large number of questions for skill testing. This section discusses how our question evaluation component works. The question analysis component generates a set of metrics to evaluate the quality of the questions in the question banks. We compute these metrics using standard methods from Item Response Theory (IRT). IRT is a field of psychometrics employed for evaluating the quality of tests and surveys that measure abilities, attitudes, and so on.

Item Response Theory
1. Question Analysis based on Endogenous Metrics
2. Question Analysis based on Exogenous Metrics
3. Experimental Evaluation
All these mathematical techniques based on Item Response Theory are used for the quality evaluation.
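At the core of the IRT analysis is the item response function. A common form is the three-parameter logistic (3PL) model, P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b))), where a is the item's discrimination, b its difficulty, and c a guessing floor; whether the paper fits the 2PL or 3PL variant is not stated in this summary, so this is a general illustration.

```python
# 3PL item characteristic curve; with c = 0 it reduces to the 2PL model.
import math

def p_correct(theta, a, b, c=0.0):
    """Probability that a test taker of ability theta answers correctly.

    a: discrimination (steepness), b: difficulty (location),
    c: guessing floor (asymptote for very low ability).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# A well-calibrated item separates abilities around its difficulty b:
low = p_correct(theta=-1.0, a=1.5, b=0.0, c=0.25)   # low-ability taker
high = p_correct(theta=1.0, a=1.5, b=0.0, c=0.25)   # high-ability taker
```

Fitting a, b, and c from observed responses is what lets the quality-analysis component keep highly discriminating items and discard ambiguous or leaked ones.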


Conclusion

The paper presents a scalable testing and evaluation platform. The platform leverages content from user-generated question-answering websites to continuously generate test questions, allowing the tests to always stay “fresh” and minimizing the question leakage that unavoidably leads to cheating. The system also shows how to leverage Item Response Theory to perform quality control on the generated questions; furthermore, the platform uses marketplace-derived metrics to evaluate the ability of test questions to assess and predict the performance of contractors in the marketplace, making it even more difficult for cheating to have an actual effect on the test results. One important direction for the future is to build tests that have higher discrimination power for top-ranked users than for low-ranked ones. Adaptive testing is expected to be useful in that respect: tests can terminate early for low-ranked users, while top-ranked users may be asked more questions until the desired level of measurement accuracy is reached. STEP could also be applied to generate tests for non-programming skills by leveraging non-technical Q/A sites, and even to generate tests for MOOCs by analyzing the contents of discussion boards, where students ask questions about the course content, the homework, and so on. Such a methodology should allow the tests to be more tailored to the student population and to better measure the skills that are expected in the marketplace.