WinterMilestone 5 Algorithmic HummingBirds
SCIENCE PROPOSAL – TASK AUTHORSHIP
IMPROVING REPUTATION SYSTEMS FOR CROWD-SOURCING PLATFORMS
SUMMARY OF WINTER MEETING 5
We build on the research process of week 4 by detailing specific aspects of the implementation of our ideas and organizing them into full-fledged research proposals under three themes (task authorship, task ranking and open governance) as we inch closer to the conference deadlines.
To summarise the discussion in all of the three domains:
1. TASK AUTHORSHIP
We looked at an experiment on task quality and price, testing whether having non-experts do a task and experts then revise it would be cheaper than having only experts do the task.
We also looked at releasing a task to a specific domain of workers at a lower price, and then increasing the price or opening it up to a larger group of people if it is not satisfactorily completed within the given time frame, which is also the underlying idea of Boomerang. The concern with this approach is that it limits the range of workers who get to see a task. If, for example, 80% of the tasks were successfully completed by the smaller domain within the stipulated time, wouldn't we essentially be denying the larger community an opportunity, which would potentially be unfair?
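The staged release described above could be sketched roughly as follows. This is only an illustration: the `ReleaseStage` structure, the audience names, the prices and the deadlines are all assumptions made for the example and are not taken from Boomerang or Daemo.

```python
from dataclasses import dataclass


@dataclass
class ReleaseStage:
    audience: str          # who can see the task during this stage (hypothetical labels)
    price: float           # reward offered during this stage
    deadline_hours: float  # hours after launch at which this stage ends


def current_stage(stages, hours_since_launch, completed):
    """Return the active stage, or None if the task is done or has expired."""
    if completed:
        return None
    for stage in stages:
        if hours_since_launch < stage.deadline_hours:
            return stage
    return None


# Hypothetical schedule: domain specialists first, then everyone at a higher price.
STAGES = [
    ReleaseStage("domain_workers", price=1.00, deadline_hours=24),
    ReleaseStage("all_workers", price=1.50, deadline_hours=72),
]
```

Logging which stage each task was completed in would also give us the number needed for the fairness question, that is, the share of tasks that never reach the wider community.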
We were also concerned about the feedback that workers and requesters give each other, which would be a crucial component of a new platform like Daemo.
We also thought about certification tasks to define worker proficiency, where we could make use of gold standard tasks or use sentiment analysis on tasks done by workers in order to accurately determine their proficiency. We could also tabulate statistics on how many tasks the user signed up for, how many he/she successfully completed, how much time they took and so on.
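As a minimal sketch of the statistics mentioned above, assuming a hypothetical `TaskRecord` for every task a worker signs up for (the field names are ours and not part of any existing platform):

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    completed: bool        # did the worker finish the task he/she signed up for?
    minutes_spent: float   # time spent on the task
    is_gold: bool = False  # True if the correct answer was known in advance
    correct: bool = False  # only meaningful for gold standard tasks


def summarize_worker(records):
    """Tabulate sign-ups, completions, time taken and gold standard accuracy."""
    signed_up = len(records)
    done = [r for r in records if r.completed]
    gold = [r for r in done if r.is_gold]
    return {
        "signed_up": signed_up,
        "completed": len(done),
        "completion_rate": len(done) / signed_up if signed_up else 0.0,
        "mean_minutes": sum(r.minutes_spent for r in done) / len(done) if done else 0.0,
        # None means the worker has not completed any gold standard task yet.
        "gold_accuracy": sum(r.correct for r in gold) / len(gold) if gold else None,
    }
```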
We could also examine the influence of the requester on task quality because, to our knowledge, no one has accounted for the variance introduced by the requester, and the result of a particular task is almost always attributed entirely to the worker. So, is the variance introduced by the requester really negligible? For example, we could consider how novice and experienced participants respond to poorly designed tasks.
We could test the effectiveness of adding categorical tags and time frame estimates to tasks, in the hope of improving the quality of the results that the user generates.
We could also design customized templates for workers and requesters and see if this reduces the time spent designing tasks or working on them.
We might also have to seriously consider the repercussions of ambiguous instructions, which are a major issue given that what counts as ambiguous is relative and task-specific.
2. TASK RANKING
We looked at categorizing tasks according to a large number of attributes such as skills, task difficulty, interest, timing patterns and so on. We will need to modify the Boomerang system so that it stores the required statistics and prioritizes incoming tasks for each user accordingly.
However, on the requester's side, it would be helpful if they could see some information about the workers, so that they can author their tasks to target and appeal to a particular segment of workers. This information could include who the top workers in the domain are, what their success percentages are, and what kind of expertise, experience or qualifications they have.
We could also bring in new pricing models where requesters fix a base price but also take into account factors like experience, geography and rejection rate, and modify the base price accordingly.
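One possible shape for such a pricing model is sketched below. Every weight here (5% per experience level capped at five levels, a cost-of-living index for geography, a penalty of half the rejection rate) is a made-up placeholder that would have to be set by experiment; none of it comes from an existing platform.

```python
def adjusted_price(base_price, experience_level, cost_of_living_index, rejection_rate):
    """Scale a requester's base price by worker attributes (hypothetical weights).

    experience_level: non-negative integer, capped at 5 for the bonus
    cost_of_living_index: 1.0 for the reference region
    rejection_rate: fraction of the worker's past submissions that were rejected
    """
    if not 0.0 <= rejection_rate <= 1.0:
        raise ValueError("rejection_rate must be between 0 and 1")
    experience_bonus = 1.0 + 0.05 * min(experience_level, 5)  # up to +25%
    reliability_factor = 1.0 - 0.5 * rejection_rate           # down to -50%
    price = base_price * experience_bonus * cost_of_living_index * reliability_factor
    return round(max(price, 0.0), 2)
```

Keeping each factor as a separate multiplier makes it easy to switch one off (by setting it to 1.0) when testing which factors workers and requesters actually consider fair.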
We would also have to look at macro tasks in very specific domains, say XML problems or debugging of Android APIs (an API, or Application Programming Interface, is a set of subroutines and protocols used to facilitate software application development), and at how to break them down into logical components to be assigned to different users and later integrate the individual pieces. We might have to give workers overlapping modules in order to connect them at a later point in time, and also find a way to reduce this redundancy.
Other interesting ideas that were generated include configuring the platform itself to prevent negative outcomes: for instance, too few keystrokes or too little interaction would suggest that the worker probably didn't put in enough effort, so the probability that the result is wrong is high. Also, could we extend this into the domain of education as well? We could have a “pay to learn” arrangement, an opportunity to gain hands-on experience in some domain, through which requesters could offer jobs or evaluate work and the intern could earn university credits. The intern could also work with an experienced worker instead of a requester.
Could we also serve a larger social cause through this platform? Consider the concept of motivation crowding: the theory suggests that external or extrinsic motivators like money could undermine internal or intrinsic motivators like passion. Here, we are exploring the option of letting workers choose to donate the money they earn to a general or a specific social cause, with the funds routed through a specific internal mechanism.
3. OPEN GOVERNANCE
We could also have workers communicate with each other to help them understand the task better and to provide peer feedback prior to submission to the requester. This would be more effective because it would not require the requester to scale up help for evaluation. But maybe this is not as simple as it sounds. We may need to implement a Stack Overflow kind of model, but the requester would have to moderate discussions, because workers could end up confusing each other and making the results go haywire, or could discuss the answers themselves. This may mean that the requester gets spammed, and we may need to go back on our idea and just have the requester clarify directly to the workers.
We might need some kind of leadership dashboard consisting of workers, requesters and Daemo platform engineers for the effective running of the platform in general and of its mechanisms in particular.
Also, with respect to peer feedback, we may need to restrict this to workers of the same domain and of a similar expertise level, because review across all levels and depths would make no sense. But then, can level 1 and level 2 workers work together to evaluate level 3 workers? If yes, what specific mechanism would be in place for such a collaboration, and how would the discussion be monitored? We could group workers working in a specific domain or for a particular requester, or even group requesters domain-wise, which would allow them to focus on their respective needs.
We could use A/B tests for developing or configuring guilds for different organizational structures. We could synthesize this by refining the idea to build a model testable by CSCW (ACM Conference on Computer Supported Cooperative Work).
The current generation of crowd-sourcing platforms is surprisingly flawed, and the flaws are often overlooked or sidelined. Some of these include poor interface design, difficulty in task hunting, unfair payment systems, poor worker-requester communication and representation and so on.
So, the idea is to channel efforts in this direction by solving, or at least minimizing, some of these issues by building a next-generation crowd-sourcing platform, Daemo, integrated with a reputation system, Boomerang. The improved platform would focus on auto-tagging or auto-classification of tasks using machine learning and artificial intelligence algorithms, on clustering through guild systems, and on other issues. The approach is expected to reduce redundancy and ensure better communication not only between requesters and workers but also among co-requesters and co-workers, which should yield better results in limited time (as compared to existing platforms) and more efficient and representative crowd platforms.
Keywords: Crowd-sourcing; Daemo; Boomerang
Crowd-sourcing is typically defined as the process of obtaining services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers. So, what exactly happens on these crowd-sourcing platforms? For example, Amazon Mechanical Turk (popularly MTurk) is a crowd-sourcing Internet marketplace that enables individuals and requesters to coordinate the use of human intelligence, through Human Intelligence Tasks (HITs), to perform tasks that computers are currently unable to do. In the current scenario, neither are the requesters able to ensure high quality results, nor are the workers able to work conveniently. The current generation of crowd-sourcing platforms, like TaskRabbit, Amazon Mechanical Turk and so on, do not ensure high quality results, produce inefficient tasks and suffer from poor worker-requester relationships. In order to overcome these issues, we propose a new standard (the next de facto platform), Daemo, which includes Boomerang, a reputation system that introduces alignment between ratings and likelihood of collaboration. The results we hope to achieve are to show that increased communication between requesters and workers, and also among co-requesters and co-workers, yields higher quality results because the efforts of either side would be better channeled, and that new pricing models may attract more workers. We also look at accounting for variance in task quality with respect to requesters.
BRIDGING THE COMMUNICATION GAP
As suggested above (and by me, in the milestone 3 submission - []), requesters and workers could go through an intermediary who holds payment until the work is completed and the task is verified. The involvement of a third party would ensure fairer payment systems (and hopefully better involvement and higher quality results). But could this intermediary be crowd sourced as well, with professional requesters and professional workers chosen, integrated into the system, and together deciding on the payment policy of a particular task in the best interests of everyone? We could experiment on such a system first, and correlate (and extrapolate) with the results obtained.
Does communication really correlate with much better results? One way to check whether this expectation is met is to experiment. For example, consider two groups of online crowd workers with similar levels of experience and, where possible, similar interests, age and gender distribution. One group has all the communication privileges: the workers are free to communicate with the requesters and vice versa, and also among themselves. The other group cannot communicate directly with peers or requesters. Running this procedure under specified conditions for a sufficient amount of time, tabulating the results separately and independently, and finally comparing the two should show whether communication leads to fewer rejections, happier workers (and requesters) and, hence, higher quality results.
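One way the final comparison could be made, assuming we simply count rejected and total submissions in each group, is a two-proportion z-test. The choice of test and the function name are our assumptions for illustration; other analyses (for example, ones that account for repeated submissions by the same worker) may be more appropriate.

```python
import math


def two_proportion_z_test(rejected_a, total_a, rejected_b, total_b):
    """Compare the rejection rates of two groups; returns (z, two-sided p-value)."""
    p_a = rejected_a / total_a
    p_b = rejected_b / total_b
    # Pooled rejection rate under the hypothesis that both groups behave the same.
    pooled = (rejected_a + rejected_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        raise ValueError("rejection rates are all 0 or all 1; the test is undefined")
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Here group A would be the communicating group and group B the non-communicating one; a negative z with a small p-value would support the expectation that communication reduces rejections.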
DEALING WITH UNFAIR REJECTIONS
In order to deal with unfair rejections, we need to build a system that is resistant to spammers, so that requesters can be convinced that the worker did actually try his/her best and that perhaps the task was just not clear enough or some technical issue persisted. When a new worker joins the platform, he/she has enough resources and documentation to become well acquainted with the interface, the policies of the platform and so on. The new worker is first given simpler tasks or gold standard tasks (where the answers are already known), and his/her behavior is tracked in real time with respect to the number of keystrokes per minute, the movement of the mouse and so on. His/her answers are matched with the known solutions and a ranking is assigned to that worker. Consistently low scores may result in the worker being flagged. (Here, we are hoping that the interaction is not heavily dependent on the nature of the task or the task itself.)
Whenever a requester marks a task as easy (or a particular task is low-paid), it first reaches the top of the task feed of new users. As the worker gains experience (that is to say, he/she successfully completes x number of HITs), he/she climbs up the ranking feed and is assigned intermediate or difficult tasks. This ensures that new workers don't have to compete unfairly with professionals.
In order to ensure that a rejection is justified, each HIT is submitted along with these statistics, so that the requester can be convinced that the result was not due to a lack of effort on the part of the worker. If the requester decides to reject the work, it is mandatory for him/her to justify why the work was rejected. He/she may or may not allow the worker to re-attempt the task, at his/her discretion. A worker who feels he/she is being wrongly flagged may appeal to the intermediary, whose decision shall be final and binding.
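The onboarding steps above (scoring against gold standard answers, flagging on consistently low scores, and moving workers up the feed with experience) could look roughly like this. The thresholds, the window of three scores and the tier boundaries are placeholders we made up for the sketch; the "x number of HITs" in the text is deliberately left as a parameter.

```python
def gold_score(answers, gold_answers):
    """Fraction of gold standard tasks answered correctly.

    Both arguments map a task id to an answer; unanswered tasks count as wrong.
    """
    if not gold_answers:
        raise ValueError("need at least one gold standard task")
    correct = sum(1 for task_id, expected in gold_answers.items()
                  if answers.get(task_id) == expected)
    return correct / len(gold_answers)


def should_flag(recent_scores, threshold=0.5, window=3):
    """Flag only on consistently low scores: the last `window` scores are all below threshold."""
    if len(recent_scores) < window:
        return False
    return all(score < threshold for score in recent_scores[-window:])


def feed_tier(completed_hits, intermediate_after=20, difficult_after=100):
    """Pick which difficulty tier tops the worker's task feed (hypothetical cut-offs)."""
    if completed_hits >= difficult_after:
        return "difficult"
    if completed_hits >= intermediate_after:
        return "intermediate"
    return "easy"
```

Requiring several low scores in a row before flagging is one simple guard against penalizing a worker for a single unclear task.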
ACCOUNTING FOR REQUESTER VARIANCE
There are two types of variance: the variance due to the worker (was he/she having a bad day?) and the variance due to the requester (how effectively was the task authored?). To study worker variance, we could crowd source tasks to a varied group of people with distinctly different attributes, obtain data and perform data analytics to arrive at hypotheses. But more often than not, the variance due to the requester is ignored. So, we are going to look at partitioning this variance so as to say that X% of the variance is due to the worker and Y% is due to the requester, and then study whether X>Y, X<Y, X=Y or Y is completely negligible with respect to X, and what this could really mean.
Now, can we think of linear regression? Consider a plot where the x axis represents requester quality and the y axis represents result quality. What kind of graph can we expect, and why? Is it as simple as y = mx + c with a coefficient of determination r^2 ~ 1 (which would account for nearly 100% of the variance)? Note that the greater the variance, the less the grand mean can explain. We need to explain variance across not one but two variables, so we should think of multilevel regression.
Take top quality requesters (with experience and qualifications, who have generated a high percentage of expected results, which would mean they are effective at task authoring), so that requester variance is minimized. Now, using this, study worker variance. Let us say this explains X% of the total variance. Next, take top quality workers (with qualifications and a high percentage of success in prototyping or gold standard tasks), so that worker variance is minimized. Now study their interaction with all sorts of tasks and task designs; let us say this explains Y% of the total variance. Combining the results, we can explain some Z% of the variance, which has some relation to X and Y that needs to be studied through experimentation. Alternatively, we could proceed serially: study worker variance first (X), then both together (X and Z), and finally X, Y and Z.
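To make the idea of partitioning concrete, here is a small sketch that splits the total sum of squares of a fully crossed requester-by-worker score table into a worker share, a requester share and a residual. Note that this is a plain two-way sum-of-squares decomposition, which we are using as a simpler stand-in for a multilevel regression; it assumes every worker attempts a task from every requester, and the table layout is our own assumption.

```python
def variance_shares(scores):
    """Split the total sum of squares of a requester x worker score table.

    scores[i][j] is the result quality when worker j does the task authored
    by requester i (hypothetical layout; every cell must be filled).
    """
    if not scores or not scores[0]:
        raise ValueError("need at least one requester and one worker")
    n_req = len(scores)
    n_wrk = len(scores[0])
    if any(len(row) != n_wrk for row in scores):
        raise ValueError("every requester needs a score for every worker")

    grand = sum(sum(row) for row in scores) / (n_req * n_wrk)
    req_means = [sum(row) / n_wrk for row in scores]
    wrk_means = [sum(row[j] for row in scores) / n_req for j in range(n_wrk)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    if ss_total == 0:
        raise ValueError("all scores are identical; there is no variance to split")
    ss_req = n_wrk * sum((m - grand) ** 2 for m in req_means)
    ss_wrk = n_req * sum((m - grand) ** 2 for m in wrk_means)
    ss_res = ss_total - ss_req - ss_wrk  # interaction plus noise

    return {
        "worker": ss_wrk / ss_total,      # the X of the text, as a fraction
        "requester": ss_req / ss_total,   # the Y of the text, as a fraction
        "residual": ss_res / ss_total,
    }
```

If the requester share turned out to be close to zero across many such tables, that would support treating requester variance as negligible; a large share would argue against attributing results entirely to the worker.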
We might be able to reuse a few methods and ideas that we have already generated. Given a task, we prototype it and then launch two separate versions: the original task itself and the prototyped version. Each incoming worker is randomly assigned to one of the two versions. Over a period of time, we obtain two groups of submissions. We then compare the obtained results with the required results. In such an experiment, needless to say, the prototyped tasks actually did well. Picking tasks for such an experiment might be quite tricky. We could, however, choose tasks from Amazon Mechanical Turk, as these have been around for quite some time and dependencies and variances are controlled.
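The random assignment and the later comparison could be sketched as below. The version labels and the `(version, is_correct)` submission format are assumptions made for the example.

```python
import random


def assign_version(rng):
    """Randomly assign an incoming worker to one of the two task versions."""
    return rng.choice(["original", "prototyped"])


def accuracy_by_version(submissions):
    """Fraction of submissions matching the required result, per version.

    submissions: iterable of (version, is_correct) pairs.
    """
    totals, correct = {}, {}
    for version, is_correct in submissions:
        totals[version] = totals.get(version, 0) + 1
        correct[version] = correct.get(version, 0) + int(is_correct)
    return {version: correct[version] / totals[version] for version in totals}


# Example: a seeded generator makes the assignment reproducible for auditing.
rng = random.Random(42)
assignments = [assign_version(rng) for _ in range(10)]
```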
In order to effectively study the variance correlation, we could look at the following approaches.
We could have the same task worded differently (this could be crowd sourced or maybe even outsourced at nominal pay) without altering its meaning, assign these versions to different users of varying expertise levels and study their interaction with them. Over a period of time we could hope to eventually find a pattern or some hypothesis that we could extrapolate to a larger population, accounting for the generated variance.
Given any task, we could try to convert the given statement to some mathematical form (using open source tools like Dafny) to reduce ambiguity (and answer questions explicitly along the way). Then, when the task is assigned to a group of workers, we can be sure that ambiguity has been reduced to a great extent, so if a worker gets wrong results, we can be more convinced that it was due to a lack of effort on his/her part.
We form the task and then show examples, sample input and output forms and screenshots, so that workers have a better understanding of what is expected of them and we can be more confident of better quality results. Failure to get the required output would then more likely point towards a lack of effort on the worker's side. We could even take screenshots of worker output and match the images internally to determine whether they got it right, which saves time and energy in evaluation.
Up to 60% of the people in the world know more than one language (40% know 1 language, 43% know 2, 13% know 3, 3% know 4 and less than 1% know 5 or more). So, we could translate the given task into another language using crowd sourcing procedures or using Google Translate (whichever works better; I personally expect the former to work better), which should hopefully result in a better understanding of the task and hence better results.
So, the procedure for the entire process would be to design a task, crowd source it to ensure better task design, rework some parts of it, get the work done, verify and match results, compile the data and then perform data analysis on it to arrive at better task design.
Task -> task design (recursive) -> get work done -> verify -> compilation -> data analytics
This part of the paper mainly deals with task quality design aspects such as pricing models and failure analysis, and also with requester variance. Pricing models are fixed with the help of an intermediary body (consisting of representatives of both parties) in consultation with the requester and his/her team (if any). Failure to complete the given task, or failure to meet expectations, can be accounted for through communication between the people involved within the system. We account for the variance generated by requesters and generate less ambiguous instructions to ensure higher result quality. We sincerely hope that the methods suggested above will help build a better crowd platform, for a better world where crowd workers are represented and respected.