Winter Milestone 4 - westcoastsfcr
We’ve synthesized some of the most popular ideas for each area. Grab at least one area, and an idea (not necessarily yours), and develop it further into a concrete research proposal! There are two types of proposals you can write, systems or science.
- What is the problem you are solving?
To improve the ranking of tasks so that workers receive tasks relevant to them. We would also like workers to have a ranking scheme that requesters can use when deciding whom to give tasks.
- Related work
"Atelier: Repurposing Expert Crowdsourcing Tasks as Micro-internships" tried to improve workers' performance by pairing them with mentors, so workers can improve their skills and get paid while doing so.
- What is the high-level insight?
We would like such mentors to provide a ranking scheme for workers, through which workers can earn points for the different skills they have. This ranking will also help requesters gain trust and assign their work to workers more reliably. We could give workers labels and score them on these labels/topics, as Stack Overflow and similar sites do. Mentors could contribute to this process as workers learn new skills during mentoring, and the rating or score can be refined by the requester when the worker submits the task.
In recent years crowdsourcing has become a common means for businesses to accomplish mundane, simple tasks that cannot be accomplished by computers. These tasks are known as “microtasks,” and they typically require innate knowledge that is hard to teach computers. An example would be editing a paper to make sure there are no spelling or grammatical errors, as well as shortening wordy sentences while still conveying the originally intended meaning. Crowdsourcing has also made it possible to collect large amounts of data in a short amount of time. A researcher can upload a survey to a crowdsourcing site such as Mechanical Turk, pay each person who takes the survey a not-so-handsome fee, and call it a day. These fees can be as low as a penny. It would make sense for such low payments to be specific to certain tasks on these crowdsourcing sites, but they aren’t: nearly all tasks pay a low fee, and there is little variation in payment across different tasks. While these tasks are mundane and simple, so the requester may think such a small fee is sufficient, the tasks still take time. And the only way to eventually get paid more is to do an excellent job on a vast number of low-paying tasks in order to gain access to higher-paying ones. Under the assumption that these workers have no skills whatsoever, it makes logical sense that they must prove themselves before receiving higher pay. After all, this is how most entry-level jobs function: you get hired with minimal requirements, if any; you gain experience doing the same job over and over; and eventually you get a promotion or a raise. But to assume that all workers possess so few skills, and such low intelligence, that they need to perfect their ability to perform HITs before getting paid more is absurd.
Many workers are capable of various skills, but how exactly can trust be instilled in the requester that those working on their task are capable of producing acceptable work? As mentioned above, there is currently no way for workers to move up besides completing a vast number of high-quality tasks. I propose the implementation of placement exams. These exams would allow workers to prove themselves without having to grind through a mass of tasks that are borderline insulting to their intelligence. When someone signs up as a worker on Daemo, they would select, from a list of categories, what kinds of tasks they are interested in doing. They would then be presented with one placement exam for each category they marked, and their performance on the exam would determine which tasks in those topics they have access to. Each worker, for each category, would be assigned a level from 1 to 10 that determines which tasks they can do. For example, if someone marked Python programming as one of their interests and then scored 100% on the exam, they would be labeled a level 10 Python programmer; if someone scored 30% on the placement exam, they would be labeled a level 3 Python programmer. People could later take more specific exams if they wished to level up, say from a level 3 Python programmer to a level 4. Precautions would need to be taken to make sure no cheating takes place; the main issue is figuring out which method would be most effective.
Since certification alone cannot validate a worker's expertise, and performance on real crowdsourcing tasks matters as well, we could make a worker's score a combination of their placement-exam results and the number of tasks they performed correctly and had approved by a requester.
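The combined score described above can be sketched as a simple weighted blend. This is a minimal sketch: the 50/50 weighting, the function name, and how a worker with no task history is handled are all assumptions for illustration, not part of the proposal.

```python
def worker_score(exam_score, approved, submitted, exam_weight=0.5):
    """Blend a placement-exam score (0.0-1.0) with the worker's task
    approval rate (approved / submitted).

    exam_weight=0.5 is an assumed default, not a design decision."""
    if submitted == 0:
        # No task history yet: only the exam component counts (assumption).
        return exam_weight * exam_score
    approval_rate = approved / submitted
    return exam_weight * exam_score + (1 - exam_weight) * approval_rate
```

For example, a worker with an 80% exam score and 9 of 10 tasks approved would score 0.85 under this weighting.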
People have tried various methods to prevent cheating on these crowdsourcing certification exams. One approach is STEP (Scalable Testing and Evaluation Platform). STEP uses experimental questions and production questions. The experimental questions make up 10-20% of the exam but are not used to evaluate the user; after being run past test takers, they are sent to quality analysis to decide whether or not they should appear on the exam. The production questions make up the other 80-90%, are what the test taker is evaluated on, and are periodically run through quality analysis. This makes sure that questions are fair: not too easy and not too hard. STEP also includes a leak detector, which uses various search engines to look for exam questions on various websites. Once a question is located, it is examined to see whether it actually contains the material that is on the exam. Once it is confirmed to be an exam question, it is reworded significantly so that a test taker cannot easily search for its answer. The primary goal of this is to prevent users from searching for the answers to these questions and cheating. To generate these questions, five workers sift through a Q&A thread and decide whether its questions are promising for creating exam questions that embody valuable concepts. A classifier model then sifts through the questions in the feed based on the number of votes for the question and each answer, the entropy of the vote distribution among answers, the number of comments, the number of tags assigned, etc. These were run through a classifier that minimized the bad threads. This showed that a large number of upvotes is a negative predictor for questions, since highly upvoted questions are usually about arcane topics (Ipeirotis). While this method nearly solves the problem and creates quality tests, it is not financially feasible.
This is an issue in the crowdsourcing market, since it is not run by a large corporation. In another study, Moten Jr. et al. examined the methods by which students typically cheat when taking online courses and ways to prevent it. This became an issue when universities first offered online courses: testing no longer had supervision, which made cheating easier than ever and all the more enticing. Among the various online cheating methods, the most common and prevalent were waiting for answers, fraudulent error messages, and collusion. To reduce the frequency of this cheating, various defensive tactics were proposed. The first was a strict test-taking timeline: users would have only a short amount of time to answer each question, not enough to search online for the answers. Another proposed idea was a cheating trap: a website claiming to have the answers to the questions but providing false information, so that if a student submitted those answers it would be evident they had been cheating. Randomizing exam questions and answers was a tactic that did exactly what it says: not all students would have the same exam. This is commonly seen in college exams, where several versions of an exam are created. Lastly, there was statistical analysis to detect common errors. This analyzed the EEIC (exact errors in common, i.e., identical wrong answers) and the EIC (questions both students answered incorrectly), and stated that if the EEIC/EIC ratio was above 0.75, the students were considered to be cheating. This method was later revised to revolve around a probability index: if a probability index below .001 is obtained, the students are considered to be cheating, since there is less than a one-in-a-thousand chance that these identical wrong answers came about independently. These methods are all excellent ways to prevent cheating from occurring, but when exactly is the right time, and what is the right way, to implement them?
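The statistical-analysis heuristic described above (flagging a pair of test takers when their ratio of identical wrong answers to shared wrong answers exceeds 0.75) can be sketched as follows. The answer-sheet representation as parallel lists is an assumption; only the 0.75 threshold comes from the study.

```python
def eeic_eic_ratio(answers_a, answers_b, key):
    """Compare two students' answer sheets against the answer key.

    EIC  = questions both students answered incorrectly.
    EEIC = questions where both gave the *same* incorrect answer."""
    eic = eeic = 0
    for a, b, k in zip(answers_a, answers_b, key):
        if a != k and b != k:
            eic += 1
            if a == b:
                eeic += 1
    return eeic / eic if eic else 0.0

def flag_pair(answers_a, answers_b, key, threshold=0.75):
    """Flag the pair as suspected of collusion per the 0.75 rule."""
    return eeic_eic_ratio(answers_a, answers_b, key) > threshold
```

A pair whose shared wrong answers are all identical (ratio 1.0) would be flagged; a pair with no shared wrong answers would not.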
Bernstein et al. proposed a system called Atelier to lower the amount of effort needed for tasks on crowdsourcing marketplaces. They repurpose tasks, with mentoring, as ‘micro-internships.’ The platform breaks a task down into smaller milestones, which are solved by ‘crowd interns’ with guidance and mentorship, so the problem is solved together by the mentor and the intern. They also conducted a study showing that Atelier helped interns develop new skills and gain real-world experience. This system would be a good approach for both requesters and workers. When workers are not sure how to proceed with a task, they generally give up and search for another one; the worker loses an opportunity, and the requester does not get the work done. With this system, workers have the opportunity to seek help and guidance from a mentor. In this way, workers not only accomplish the task but also gain expertise in new areas. Overall, this is a good system that accomplishes tasks by building trust among people.
References:
Christoforaki, M., and Ipeirotis, P. G. “STEP: A Scalable Testing and Evaluation Platform.” New York University.
Moten Jr., J., Fitterer, A., Brazier, E., Leonard, J., and Brown, A. “Examining Online College Cyber Cheating Methods and Prevention Measures.” Electronic Journal of e-Learning, Volume 11, Issue 2, 2013.
Suzuki, R., Salehi, N., Lam, M. S., Marroquin, J. C., and Bernstein, M. S. “Atelier: Repurposing Expert Crowdsourcing Tasks as Micro-internships.” University of Colorado Boulder and Stanford University, 2016.
- What's the system?
- What's the phenomenon you're interested in?
- Preventing placement/certification exam questions from being leaked.
- The puzzle
- Making this possible without constantly recreating the exam, since that is not feasible.
- The experimental design
- This study involves 100 participants, who will be randomly divided into two groups.
- Each group will consist of 50 participants, men and women of varying ages.
- The age range of the participant pool ranges from 18-65 years old.
- A different test will be given to each group. However, each test will cover similar material at the same difficulty level.
- The first group will consist of 50 participants. During the exam, their screens will not be locked. This will allow them to open other internet tabs.
This is to observe the prevalence of cheating.
- The second group will also consist of 50 participants. During the exam, their screens will be locked.
This will prevent them from opening up any other internet tabs while taking the exam.
- Two websites will be created beforehand containing all of the questions that will be asked on each of the exams. The website with the answers for the unlocked-screen group will have answers that are all incorrect; that way, if a participant submits the wrong answer provided on this site, we know they cheated. The website with the answers for the locked-screen group will have answers that are all correct. This site is created so that anyone who tries to search for a question on the internet during the exam will be led to it. During the exam we can then track the IP addresses used to access the site and compare them with the IP address of each participant taking the exam; that way, if a participant uses another wi-fi-enabled device, we can tell they were cheating. We can also tell how many people cheated, and whether locking the screens made a difference at all, by comparing the number of exam questions answered correctly by each group and the number of cheaters in each group.
- In general, both exams will also include weeder questions. If a weeder question reaches an 80% or higher success rate, it can be swapped out for a question from the question bank that covers the same concept, for future experiments or for a real-life test.
- This will allow us to see how effective replacing weeder questions is, and whether it's necessary.
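The IP-comparison step of the design above could be sketched as a simple set membership check. The data shapes (a set of honeypot-visitor IPs and a per-participant exam IP) are assumptions; note also that participants sharing a network (e.g., behind one NAT) could produce false positives, which the design would need to account for.

```python
def find_cheaters(honeypot_visits, exam_sessions):
    """honeypot_visits: set of IP addresses that hit the planted answer
    site during the exam window.
    exam_sessions: dict mapping participant id -> IP address used to
    take the exam.
    Returns the ids of participants whose exam IP also visited the
    honeypot (a heuristic, not proof of cheating)."""
    return sorted(pid for pid, ip in exam_sessions.items()
                  if ip in honeypot_visits)
```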
- The result
- There will be a stronger curve when people's screens are locked.
When someone signs up for Daemo they choose topics of interest for the tasks they wish to take on: CS, math, engineering, etc. For each of these topics they take a placement exam to see how skilled they are. If they score at least 10% but under 20%, they are level one; at least 20% but under 30%, level two; and so on. Keep in mind this is just the initial placement exam.
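The score-to-level mapping above amounts to integer division of the percentage by ten (matching the earlier examples of 30% giving level 3 and 100% giving level 10). A minimal sketch; how sub-10% scores are handled is an assumption, since the writeup doesn't specify:

```python
def placement_level(score_percent):
    """Map a placement-exam score (0-100) to a level from 1 to 10.

    Scores below 10% are clamped to level 1, which is an assumption."""
    return max(1, min(10, int(score_percent // 10)))
```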
1. When you start taking the exam, the page will lock so that you can't leave it without leaving the exam (no searching the web for information on that device)
2. Each question will have a one-minute time limit
3. If a large share of people are getting a certain question right (say it has an 80% success rate), it can be reviewed to see whether it is just a simple question or whether it is actually more complex
- a We could possibly give each question a tag so that it could be sent to an algorithm; if the question receives a certain tag, the algorithm pulls a new question from a pool of questions that covers the same concept. I.e., say we make 100 questions for each topic and each exam is only 20 questions; we could then swap out weeder questions whenever the success rate rises above 80%.
- b If it is simple, it doesn't get replaced
4. I believe these exams should be given in chunks, similar to how the BAR and MCAT (and really all standardized and certification tests) are given
5. After workers have been on Daemo for a while and feel they should be able to access higher-level tasks, they can take certification exams specific to each level, but these must be taken in sequential order to give less of an incentive to cheat. I.e., a level 3 Python programmer can't take an exam to become a level 6 Python programmer without taking the level 4 and 5 exams first.
6. These "Level Up Exams" will follow the same system except they must be taken sequentially.
7. If someone performs poorly on three tasks within a certain level of task categorization, they get bumped down a level
8. This helps prevent requesters from receiving poor work and gives workers less of an incentive to cheat.
9. You can only retake a "Level Up Exam" a maximum of three times.
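The weeder-question rotation sketched in steps 3a/3b could look like the following. The question/tag representation and the function name are assumptions for illustration; only the 80% threshold and the same-concept pool come from the steps above.

```python
import random

def rotate_weeder(question, pool, success_rate, threshold=0.80):
    """If a weeder question's success rate exceeds the threshold,
    replace it with a random same-tag question from the pool
    (step 3a); otherwise keep it (step 3b)."""
    if success_rate > threshold:
        candidates = [q for q in pool
                      if q["tag"] == question["tag"] and q is not question]
        if candidates:
            return random.choice(candidates)
    return question
```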
Future Considerations, Future Work, and Potential Solutions
- Limitation on the amount of time for the questions: variable-length questions might be better for ascertaining ability, since we can have longer, harder questions for experts. We can include questions that nearly everyone gets right, which correspond to the ability of a level 1. This could be more reflective of the test taker's ability.
- How can we ensure the quality of the exam questions, if the worker's level depends on how many questions they get right? Some questions are likely to be easier than others, which is not necessarily a bad thing, so there might not necessarily be a need to swap them out.
- What types of questions would we ask the test taker? How would those questions relate to the level they are designated? (i.e. would a level 1 question for a programming-language placement test ask about syntax?)
- How would we guide the types of questions asked by the workers who write the questions? (i.e. they could all ask difficult questions that only experts would know, so we would be more likely to get Level 10 experts and Level 1 novices rather than a more evenly distributed spread)
- If others at a higher level are creating the new questions, how can we be sure those questions are appropriate for each "leveling up" quiz? And if experts are making the questions, they might not realize how complex a question is, since it could seem very simple to them given their experience in the area.
- Potential punishments for people who cheat. These punishments must be heavy enough to keep people from considering cheating as an option (i.e. banning from the platform, which is possibly too heavy a punishment, or suspension for a certain amount of time).
- How can we determine when someone is cheating or if someone is promoting cheating? People could post answers on the Internet anonymously, etc.
- If we figure out someone is posting answers for these tests online, how will they be punished?
- Interviewer-type setting: a person monitoring test takers to see whether they're cheating, and also, potentially, someone the workers can ask questions about the exam. Test takers could be grouped up and paired with a proctor.
- What types of questions? Free answer, or multiple choice? Multiple choice is more conducive to cheating when the test can be taken multiple times.
- Limitations on the number of times a worker can take the tests, similar to the limit on the "Level Up" exams, and limitations based on time frame, i.e. one test a week. Can the initial placement test be taken only once? How long before the worker can take another test to level up?
- And if someone is retaking a level-up exam, how can we ensure they are asked different questions than they were originally presented with? If they take the exam a second or third time with the exact same questions, they might resort to memorization.
- Instead of basing the worker's level on the amount they got correct, we could have tests for however many levels we want, from the beginning rather than only after they've been on the platform for a while: a Level 1 test for people who want to reach Level 1, a Level 2 test for those who want to reach Level 2, etc. Problems with this would include more questions to write, and people who fail a higher level having to take a lower-level test again.
- What does each level correspond to in terms of ability, and how do we decide this across the many fields there are to certify? To decide what ability a worker of a given level corresponds to, before the platform launches we could hold some kind of round-table discussion on a forum, run a survey, or hire a variety of workers in their respective fields and talk to them about their level of expertise.
- We could have a pool of questions, and just provide each worker with a random set from that pool. How big is this pool? How often does it get new questions? Do we replace the pool occasionally or just add new questions to it?
- Categorization within the placement tests, so that a fair distribution of questions is representative of the field.
- Warnings when a worker is about to be bumped down a level, rather than abruptly bumping them down.
- Possibilities for the worker to level up in a field based on the work they do. (Tests are limited in time and tend to make people nervous; some workers might not do as well as others on the basis of tests.)
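Two of the points above (a random question pool, and categorization within placement tests) could be combined in a sketch like the following. The pool representation, per-category counts, and function name are all assumptions for illustration.

```python
import random

def build_exam(pool, per_category, seed=None):
    """Draw a random exam from a categorized question pool, taking
    per_category[c] questions from category c so that the exam is
    representative of the field, then shuffling the result."""
    rng = random.Random(seed)  # seed only for reproducible tests
    exam = []
    for category, n in per_category.items():
        questions = [q for q in pool if q["category"] == category]
        exam.extend(rng.sample(questions, n))
    rng.shuffle(exam)
    return exam
```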
@alexstolzoff, @dmajeti, @clao, @tgotfrid7