Summer Milestone 10 Vineet Sethia Reputation Systems Research And Exploration
A System for Scalable and Reliable Technical-Skill Testing in Online Labor Markets Paper Summary
Online labor markets have an increasing need to reliably evaluate the skills of their participating users.
Current problems platforms are facing
1. First, cheating is very common in unsupervised online testing, as test questions often “leak” and become easily available online along with their answers.
2. Second, technical skills, such as programming, require tests to be updated frequently in order to reflect the current state of the art.
3. Third, there is very limited evaluation of the tests themselves and of how effectively they measure the skill being tested.
Solutions our platform provides
We present a platform that continuously generates test questions and
evaluates their quality as predictors of the user's skill level. Our platform leverages content
that is already available on question-answering sites such as Stack Overflow and
repurposes these questions to generate tests. Because we continuously generate new questions, we decrease the impact of cheating, and we
also create questions that are closer to the real problems that the skill holder is expected
to solve in real life.
Our platform uses Item Response Theory to evaluate the quality of the questions. We also use external signals about the quality of the workers to examine the external validity of the generated test questions: questions that have external validity also have a strong predictive ability for identifying, at an early stage, the workers who have the potential to succeed in the online job marketplaces.
Experimental Evaluation Results
Our experimental evaluation shows that our system generates questions of comparable or higher
quality compared to existing tests, at a cost of approximately $3 to $5 per
question, which is lower than the cost of licensing questions from existing test banks
and an order of magnitude lower than the cost of producing such questions from scratch.
Dominant Problems in Online Labour Markets
Reputation systems are widely used for instilling trust among the participants.
A reputation system for an online labor market computes a reputation score for each
worker based on a collection of ratings by employers that have hired them in the past.
However, existing reputation systems are better-suited for markets where participants
engage in a large number of transactions (e.g., selling electronics, where a merchant
may sell tens or hundreds of items in a short period of time). Online labor inherently
suffers from data sparseness: most work engagements require at least a few hours of
work, and many last for weeks or months. As a result, many participants
have only a minimal number of feedback ratings, which is a very weak reputation
signal. This lack of reputation signals creates a cold-start problem that forces the departure of high-quality participants, leaving only low-quality workers as potential entrants.
In global online markets, skills credentialing is much
trickier: verifying educational background is difficult, and knowledge of the quality
of educational institutions on a global scale is limited. So today most online labor markets offer
their own certification mechanisms. The goal of these tests is to certify that a given
worker indeed possesses a particular skill. For example, eLance-oDesk and vWorker allow workers to take online tests that assess their competency across
various skills (e.g., Java, Photoshop, Accounting) and then allow the contractors
to display the achieved scores and rankings in their profiles. Unfortunately, online certification of skills is still problematic for a number of reasons,
with cheating and the leaking of questions being among the biggest challenges. The reliability of tests whose answers are easily available through a web
search is questionable.
Furthermore, it is common, even for expert organizations, to create questions with errors or ambiguities, especially if the test questions have not been properly assessed and calibrated with relatively large samples of test takers. Such problematic questions introduce noise into the user-evaluation process, hindering the correct assessment of the users' skills, and therefore need to be identified and excluded from the user-evaluation process. Finally, many people question the value of the existing tests as long-term predictors of performance, indicating that questions are calibrated only for internal validity (how predictive a question is of the final test score) and not for external validity (how predictive the question is of the long-term performance of the test taker).
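The distinction between internal and external validity can be made concrete with a small sketch. The code below is only an illustration under assumptions: it uses a plain Pearson correlation between a question's 0/1 outcomes and (a) the score on the remaining questions, as a stand-in for internal validity, and (b) an external metric such as hourly wage, as a stand-in for external validity. The function names and the toy data are hypothetical and do not come from the paper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def question_validity(responses, question, external_metric):
    """Return (internal, external) correlations for one question.

    responses: one dict per test taker, mapping question id -> 0 or 1.
    external_metric: one number per test taker (hypothetical, e.g. wage).
    """
    item = [r[question] for r in responses]
    # Score on all other questions, so the item is not correlated with itself.
    rest_score = [sum(r.values()) - r[question] for r in responses]
    return pearson(item, rest_score), pearson(item, external_metric)

# Made-up toy data: four test takers, three questions, and hourly wages.
responses = [
    {"q1": 1, "q2": 1, "q3": 1},
    {"q1": 1, "q2": 1, "q3": 0},
    {"q1": 0, "q2": 1, "q3": 0},
    {"q1": 0, "q2": 0, "q3": 1},
]
wages = [40.0, 35.0, 15.0, 10.0]
internal, external = question_validity(responses, "q1", wages)
print(f"q1 internal={internal:.2f} external={external:.2f}")
```

A question can score well on the first number and poorly on the second, which is exactly the case the critics of existing tests describe.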
What our system offers
Our system mines questions from Q/A sites like Stack Overflow and selects questions that could serve
as good test questions for a particular skill. Our system algorithmically identifies
threads that are promising for generating high-quality assessment questions, and then
uses a crowdsourcing system to edit these threads and transform them into multiple-choice
test questions. To assess the quality of the generated questions, we employ Item
Response Theory and examine not only how predictive each question is regarding
the internal consistency of the test, but also its correlation with future
real-world market-performance metrics.
Essentially, our system is composed of two main parts that can also function independently:
the question generation component and the question evaluation component. We
introduce the following novel aspects for question generation:
• By utilizing Q/A threads as question seeds, we can continuously update our question bank with up-to-date questions related to fast-evolving technical topics.
• By using actual Q/A threads as inspiration, we are testing for concepts that are proven to be non-trivial in the real world.
• By converting Q/A threads into test questions, we generate questions at a much lower cost than by employing experts.
• By continuously monitoring the Internet for leaked questions, we can quickly eliminate opportunities for cheating.
We also introduce the following novel aspects for question evaluation:
• By utilizing exogenous ability metrics, such as wages, we evaluate questions as predictors of market performance metrics.
• By continuously evaluating the test questions, we also find questions that have been leaked, since such questions suddenly lose their ability to discriminate between users with different ability levels.
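The leak-detection idea in the last bullet can be sketched as follows. This is a simplified illustration, not the paper's method: instead of an IRT discrimination parameter it uses the classical upper-lower discrimination index (proportion correct among higher-ability takers minus proportion correct among lower-ability takers), and the `max_drop` threshold and function names are assumptions made for the example.

```python
def discrimination_index(records):
    """Classical upper-lower index for one question.

    records: list of (ability, correct) pairs, with correct as 0 or 1.
    Test takers are split at the median ability; with an odd count the
    middle test taker is left out of both groups.
    """
    ordered = sorted(records, key=lambda rec: rec[0])
    half = len(ordered) // 2
    if half == 0:
        return 0.0
    low_group = ordered[:half]
    high_group = ordered[-half:]
    p_low = sum(correct for _, correct in low_group) / half
    p_high = sum(correct for _, correct in high_group) / half
    return p_high - p_low

def looks_leaked(baseline, recent, max_drop=0.3):
    """Flag a question whose discrimination fell by more than max_drop.

    max_drop is a hypothetical threshold chosen for illustration only.
    """
    drop = discrimination_index(baseline) - discrimination_index(recent)
    return drop > max_drop

# Made-up example: early on, only strong users answer correctly;
# later, everyone does, which is what a leaked answer would look like.
baseline = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
recent = [(0.1, 1), (0.2, 1), (0.8, 1), (0.9, 1)]
print(looks_leaked(baseline, recent))
```

Comparing a recent window against an earlier baseline is what turns "suddenly lose their ability to discriminate" into a check that can run continuously.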
Tools & Workflow
1. Question Ingestion Component
2. Question Editor
3. Question Reviewer
4. Question Bank: Experimental and Production
5. Quality Analysis
6. Cheater Leaker
Question Generation Process
1. Stack Exchange
2. Question Spotter
Question Quality Evaluation
Our system can scalably generate a large number of questions for skill testing. This section discusses how our question evaluation
component works. The question analysis component generates a set of metrics to
evaluate the quality of the questions in the question banks. We compute these metrics
using standard methods from Item Response Theory (IRT). IRT is a field of psychometrics
employed for evaluating the quality of tests and surveys that measure abilities,
attitudes, and so on.
Item Response Theory
1. Question Analysis based on Endogenous Metrics
2. Question Analysis based on Exogenous Metrics
3. Experimental Evaluation
All of these mathematical theorems and techniques, based on Item Response Theory, are used for the quality evaluation.
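To give a feel for the kind of metric IRT provides, here is a minimal sketch that assumes the two-parameter logistic (2PL) model, one common IRT model. This summary does not state which IRT variant the paper uses, so the model choice, the function names, and the parameter values below are assumptions. In the 2PL model, a question has a discrimination `a` and a difficulty `b`, and a test taker has an ability `theta`.

```python
from math import exp

def p_correct(theta, a, b):
    """2PL probability that a test taker of ability theta answers correctly."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Two hypothetical questions of equal difficulty but different discrimination.
weak_item = (0.5, 0.0)   # (a, b)
sharp_item = (2.0, 0.0)
for theta in (-1.0, 0.0, 1.0):
    print(
        theta,
        round(item_information(theta, *weak_item), 3),
        round(item_information(theta, *sharp_item), 3),
    )
```

A question with higher discrimination carries more information about test takers whose ability is near its difficulty, which is why discrimination is a natural quality metric for a question bank.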
The paper presents a scalable testing and evaluation platform. The platform leverages content from user-generated question-answering websites to continuously generate test questions, keeping the tests always “fresh” and minimizing the problem of question leakage that unavoidably leads to cheating. The paper also shows how to leverage Item Response Theory to perform quality control on the generated questions. Furthermore, the platform uses marketplace-derived metrics to evaluate the ability of test questions to assess and predict the performance of contractors in the marketplace, making it even more difficult for cheating to have an actual effect on the results of the tests.
One important direction for the future is to build tests that have higher discrimination power for the top-ranked users than for the low-ranked ones. Adaptive testing is expected to be useful in that respect: tests can terminate early for the low-ranked users while asking the top-ranked users more questions, until reaching the desired level of measurement accuracy.
Also, STEP may be applied to generate tests for non-programming skills by leveraging non-technical Q/A sites, and even to generate tests for MOOCs by analyzing the contents of the discussion boards, where students ask questions about the content of the course, the homework, etc. Such a methodology is believed to allow the tests to be more tailored to the student population and to better measure the skills that are expected in the marketplace.
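The adaptive-testing direction mentioned above could look roughly like the sketch below. It is an illustration under several assumptions that are not taken from the paper: a 2PL item model, a standard normal prior on ability, a grid-based posterior estimate, a `target_se` accuracy goal, and a `cutoff` rule that ends the test early once a user is confidently below the cutoff. All names and values are hypothetical.

```python
from math import exp, sqrt

GRID = [i / 10 for i in range(-40, 41)]  # candidate ability values

def p_correct(theta, a, b):
    """2PL probability of a correct answer."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate(answers):
    """Posterior mean and standard deviation of ability over GRID.

    answers: list of ((a, b), correct) pairs; the prior is standard normal.
    """
    weights = []
    for t in GRID:
        w = exp(-t * t / 2.0)
        for (a, b), correct in answers:
            p = p_correct(t, a, b)
            w *= p if correct else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, sqrt(var)

def adaptive_test(items, answer_fn, target_se=0.4, cutoff=0.0):
    """Ask the most informative remaining item until a stopping rule fires.

    Stops when the ability estimate is accurate enough (se <= target_se),
    when the user is confidently below cutoff, or when items run out.
    Returns (ability estimate, standard error, number of questions asked).
    """
    remaining = list(items)
    answers = []
    theta, se = estimate(answers)
    while remaining and se > target_se:
        if theta + 2.0 * se < cutoff:
            break  # early termination for a clearly low-ranked user
        item = max(remaining, key=lambda it: information(theta, it[0], it[1]))
        remaining.remove(item)
        answers.append((item, answer_fn(item)))
        theta, se = estimate(answers)
    return theta, se, len(answers)

# Hypothetical item bank: equal discrimination, difficulties from -2 to 2.
items = [(1.5, b / 2) for b in range(-4, 5)]
theta_low, se_low, n_low = adaptive_test(items, lambda item: False)
theta_high, se_high, n_high = adaptive_test(items, lambda item: True)
print(n_low, n_high)
```

The early-exit rule is what gives low-ranked users a shorter test, while users above the cutoff keep receiving questions until the accuracy goal is met or the bank is exhausted.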