WinterMilestone 8 Algorithmic HummingBirds
SUMMARY OF WINTER MEETING 8
We have a little less than a month to go (April 3) before the internal deadline for the UIST paper submissions, and probably a month after that for the CSCW deadline. So the goals and discussion this week, in all three domains, focus on the finer aspects of design and implementation.
1. TASK AUTHORSHIP
The folks ran a pilot study in which crowd members actively contributed to authoring tasks (more specifically: image tagging of restaurants, sentiment analysis of Twitter feeds, classifying web pages as pro-marijuana or against it, summarising YouTube videos, extracting text from web pages and so on), which were later released on Amazon Mechanical Turk.
The results obtained were later parsed and analysed. The lessons learnt are as follows:
a. The variance due to the requester in a crowd platform is pretty low.
b. Simple tasks are found to be robust compared to tasks that rank at intermediate or advanced levels. This is probably because, no matter how badly or ambiguously a simple task is authored, the details should be fairly obvious even to an average worker. That is not the case with intermediate or advanced tasks, which require clear instructions and considerable interpretation or some experience on the part of the worker.
c. Priming: the requester authors the task, gets some feedback, adds additional details, reruns the task and analyses the incoming results, while brainstorming the study design process.
The problem with the above approach is that we don't have base answers or ground truth for these tasks, so the results leave room for assumptions and interpretation. We could probably rank workers in order to downplay this effect to some extent, but eliminating it with this kind of workaround would be too ambitious. We could hire experienced requesters to judge the task results, but we would have to pay them, and this can't be an individual activity (one requester rating one worker). So we would have a panel of requesters rate each task ('x' evaluations), and we would then have to compute inter-rater reliability metrics and so on. Another possible workaround could be to use ground-truth based dark matter tasks, but this is a context-dependent approach (we could possibly configure it for specific guilds, however!). Could we use supervised learning techniques, where the system itself interprets needs and makes a considerable number of the low-level interpretations? Would that make life easier for the average worker, and also ensure better quality results?
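To make the inter-rater reliability step concrete, here is a minimal sketch using Fleiss' kappa, one common agreement metric for a fixed-size panel. The choice of metric is our assumption (the meeting did not settle on one), and the function name and input layout are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a panel of raters.

    counts[i][j] is the number of raters who put task result i into
    rating category j. Every row must sum to the same panel size.
    (Input layout is an assumption made for this sketch.)
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    n_categories = len(counts[0])
    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1 - p_expected)
```

A value near 1 means the panel agrees far more than chance would predict. A value near 0 or below suggests that the ratings say more about the individual requesters than about the work.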
The goal for this week (week 8) is to run a second pilot study.
2. TASK RANKING
The goal this week is to write out a specific system description of extended Boomerang and its effect on the task feed.
We aligned on a specific incentive-based, informative task feed. The various aspects intertwined with it are:
a. Reputation: If you are a requester, you influence which workers get first access to your future tasks. If you are a worker, you influence which requesters show up at the top of your task feed. The individual perspective is that you get better colleagues, for example. The global social perspective is that reputation more directly impacts more or less all levels of incentivisation.
b. Hourly rate: It estimates the earnings per hour. Once the task is completed, it asks the workers how long it took them, and their answer has a direct impact at both global and local levels. Once 4 or more workers (that's the sample size) have attempted the task, it estimates the amount of time it takes to complete it. If fewer than 4 workers have attempted the task, we take the mean of the time reports and the prototype task data. The individual incentive is that the soul of the model is based on worker responses. The global win is that worker responses (visible only to the individual) are an integral part of the estimates produced (which are globally visible; from a deeper backend point of view we are talking about selective data hiding mechanisms here, which are vital in prototype tasks).
The problem with the hourly rate approach is that we look only at the number of workers attempting the task. We do not correlate each report with the quality of the output that worker produced (we do not even know whether the work was accepted or rejected), and we do not take the worker's relative ranking within or outside the guild system in Boomerang into consideration.
c. Rejection information
It estimates the percentage of workers getting future tasks based on past rejections. The individual incentive is that the more of a worker's work a requester accepts, the earlier that worker gets access to the requester's future tasks. The global win is that the global rejection rate is expected to drop, as workers with low motivation levels won't attempt tasks they know they will be rejected for. (Boomerang, literally!)
It results in an organic worker-requester pairing, demonstrated by the number of tasks offered to the same worker by the same requester over a period of time. Natural pairings will also be generated over time between workers and requesters who have not worked together in the past.
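The hourly rate estimate in item b above can be sketched as follows. This is a minimal sketch under stated assumptions: the text does not say which aggregate is used once 4 or more workers have reported, so a plain mean is assumed, and the function names and the use of minutes and dollars are hypothetical.

```python
from statistics import mean

MIN_SAMPLE = 4  # sample size named in the text


def estimate_minutes(reported_minutes, prototype_minutes):
    """Estimate task duration from worker time reports.

    With MIN_SAMPLE or more reports, use the reports alone (a mean is
    assumed here). With fewer, pool the reports with the prototype task
    data and take the mean of both, as described above.
    """
    reported = list(reported_minutes)
    if len(reported) >= MIN_SAMPLE:
        return mean(reported)
    pooled = reported + list(prototype_minutes)
    if not pooled:
        return None  # nothing to estimate from yet
    return mean(pooled)


def hourly_rate(reward_dollars, reported_minutes, prototype_minutes):
    """Estimated earnings per hour for a task paying reward_dollars."""
    minutes = estimate_minutes(reported_minutes, prototype_minutes)
    if not minutes:
        return None
    return reward_dollars * 60 / minutes
```

For example, a task paying 0.50 with two reports of 10 and 20 minutes and a prototype time of 30 minutes gets a pooled estimate of 20 minutes, so `hourly_rate(0.5, [10, 20], [30])` gives 1.5 per hour. Weighting reports by accepted or rejected work, which the text flags as missing, is deliberately left out of this sketch.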
Previous work in this domain includes task recommender systems, which optimise attributes like task design, price, availability and intrinsic interests. There are also algorithms which have been developed to estimate how frequently work needs to be checked for its quality and so on.
One possible approach is to include additional information like quality, likelihood of rejection, effective wages etc., which could be found by incentivisation, and to compute scores for cascading and task ranking, to be augmented with other additional details. So, we scale our time estimates according to prior estimates by the particular worker.
Can we possibly have timers for each task? For example, as soon as a worker accepts a particular task the timer starts, and as soon as the submit button is pressed, a timer script runs which stops it. This would prevent inaccurate estimates and also ensure a larger sample size, hence greater accuracy. A new task will restart the timer. We could consider control options to pause or stop timers (for example, when a guest unexpectedly drops in while a worker is working). So, we watch the worker's interaction: if there is no interaction for about 3 minutes, the timer is automatically paused, and it automatically resumes when interaction resumes. The error levels are hence expected to be pretty low. We then report the average time taken and the sample size. If a timer for each task proves too much of an overhead (especially at this scale), we could take the time from the user's system at the beginning and at the end of the task and use the difference of the two (but the system clock could be manipulated). We could work the same way with server time.
But if a worker repeatedly gets rejected by a requester, he is disenfranchised. This is especially to get rid of spammers.
We take into consideration 3 categories: suggested (the average of what you did best in and Boomerang feedback, to preserve overall simplicity), preferred (convergence of interest domains) and latest (relative ranking of performance).
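The auto-pausing timer could be approximated on the server from interaction timestamps alone. The sketch below is a minimal illustration, assuming the client sends a timestamp for each interaction. The function name is hypothetical, and counting the first 3 idle minutes before the pause kicks in is our reading of the idea rather than something the notes fix.

```python
IDLE_LIMIT = 180  # seconds without interaction before the timer pauses


def active_seconds(accepted_at, interactions, submitted_at, idle_limit=IDLE_LIMIT):
    """Time spent on a task, with long idle gaps capped.

    accepted_at, submitted_at: timestamps in seconds (ideally server time).
    interactions: timestamps of worker interactions (clicks, key presses).
    Each gap between consecutive events counts in full while it is
    shorter than idle_limit. Longer gaps count only idle_limit seconds,
    because the timer pauses then and resumes at the next interaction.
    """
    events = sorted(t for t in interactions if accepted_at <= t <= submitted_at)
    timeline = [accepted_at] + events + [submitted_at]
    return sum(
        min(later - earlier, idle_limit)
        for earlier, later in zip(timeline, timeline[1:])
    )
```

For a task accepted at t=0 with interactions at 60, 120 and 1000 seconds and submitted at 1060, the 880-second gap is capped at 180, so the measured time is 60 + 60 + 180 + 60 = 360 seconds rather than 1060.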
3. OPEN GOVERNANCE
A guild is a socio-technical infrastructure with a built-in competitive component, which increases scalability, trust and reliability, and hence human interaction, through Boomerang feedback. Although we all agree that this solves a part of the problem, other problems are best tackled by automated system services. So, we could use an automated guild system with human intervention wherever required.
We could use probabilistic agreement checks on work as a proxy for quality levels in order to reduce rejection levels (although this increases time and cost).
As the guild would be a highly competitive environment, it is going to be hard for workers to keep their motivation levels up all the time, especially for people who already have part-time or even full-time jobs. So, we introduce some financial incentives.
The goal for this week is going to be storyboarding the entire design process.
The current generation of crowdsourcing platforms is surprisingly flawed, and these flaws are often overlooked or sidelined.
So, the idea is to channel efforts in the direction of guilds and experiment to see to what extent this helps in minimizing some of the issues, by building a next-generation crowdsourcing platform, Daemo, integrated with a reputation system, Boomerang. The improved platform is expected to yield better results within limited time bounds (as compared to existing platforms) and to lead to more efficient and representative crowd platforms.
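A probabilistic agreement check of this kind could look like the sketch below. It is illustrative only: the sampling probability, the use of majority agreement as the quality proxy and both function names are assumptions, not decisions from the meeting.

```python
import random
from collections import Counter


def should_check(check_probability, rng=random):
    """Decide at random whether a submission gets an agreement check.

    A higher check_probability means more checks, which is where the
    extra time and cost mentioned above come from.
    """
    return rng.random() < check_probability


def agreement(answers):
    """Return the majority answer and the share of workers who gave it.

    answers: the answers several workers gave to the same task item.
    """
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)
```

For instance, `agreement(["cat", "cat", "dog"])` returns `"cat"` with an agreement share of two thirds. A low share on a checked task could trigger a closer look before any rejection is issued.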
Crowd-source; Daemo; Boomerang; Guilds; sub guild systems
“BUILD THE GUILD”
A lot of us were confused as to what guilds really mean or what they really do. The moment the topic came up, many of us had confused expressions like - “Guild what?” or “Guild? I thought we were talking about crowdsourcing”!!
So, let us resolve the guild issue by delving into a little more detail. Let us take a bottom-up approach to the whole guild system: we first focus on the fundamentals (what guilds really are, what problem they are trying to solve, how they would fit into crowdsourcing), and later we can advance to the implementation details, the pros and cons inherent in such a system, and so forth.
A guild is basically defined as a medieval association of craftsmen or merchants, often having considerable power. In our context, a guild would be a highly connected or networked group of people working together, preferably in a single domain or on similar tasks.
Now, what problem is the guild really trying to solve? Guilds must be configured for different levels of members. Good quality workers would be at the top of the guild, which in itself could serve as a validator of the work. The guild could potentially look at task rejections (if a worker with a high rating in the guild has done the work, it is unlikely that he didn't put in the required effort, and feedback on rejection is compulsory), multiple iterations, fair wages (per industry standard), safeguard mechanisms for creative work, especially design prototypes (work reaches the requester directly; what's done within the guild essentially stays within the guild), work security (you compete and collaborate within the guild, which essentially means within your domain), equal opportunity (as explored below), job satisfaction and so on.
Who can really form a guild? One suggestion was that when 100 or more requesters author tasks in a similar or closely related field, it becomes a guild. Another thing we could do is add a worker to the guild of a domain when he does well in that domain repeatedly and proves his potential, i.e., automatic user classification based on domain performance. The users don't control the guild; rather, they evolve with the guild and don't define expertise amongst themselves.
Can we have multiple competing guilds? That idea would seem baseless on the surface, but think of guilds where the population soars. So, we could have multiple guilds, but we will have to ensure equal opportunities and constant shuffling of members to encourage collaboration.
So, do all workers need to belong to a guild? I would say, not necessarily. They could move into and up a guild once they prove the approved domain knowledge or experience. But if we treat guilds like a company or organisational structure, it's not collective open governance anymore.
Needless to say, this system can be implemented in Boomerang and Daemo in myriad ways. One possible variation is described below.
Consider a guild system to be a connected group of workers working in a similar domain. When a task is first published, we assume that there is an autotagging engine in place and the task is tagged. The autotagging feature would work something like this: it is built using machine learning (for example, neural networks). We ask requesters to tag tasks manually at first, the machine "learns" from these tags, and then manual tagging is scrapped, as the machine automatically knows what domain a particular task belongs to. Now, it's not possible or even correct to assume that this system is a hundred percent accurate. In order to work around this, we introduce a few design interventions. The default tag of any given task is general. If the autotagging system fails to recognise the domain of a particular task, or the author specifies no explicit constraints on the qualification of the worker who attempts the task, then the task is open to anyone who wishes to attempt it (given that Boomerang is going to be improved to filter out spam workers by studying their user interaction). If the task is tagged under one or more domains, then we open up the task to the guild or the channel of that domain first. It moves out of that periphery only if the task fails to be completed in the specified time frame.
An experiment in this direction may quantify or falsify the hypothesis about the effect on output quality. The clear disadvantage is that one may think the distribution of opportunity is unfair or the restriction of tasks is prejudiced, but let us assume for now that domain-specific and general category tasks are more or less equal in number, as they should eventually turn out to be. Also, what if a task requires collaboration between two or more domains, or is tagged under multiple domains but doesn't really require that sort of collaboration?
These ideas are explored later in this document.
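The guild-first routing rule described above can be sketched as a simple eligibility check. This is a minimal sketch under stated assumptions: the autotagger is taken as given, the `Task` fields, the `can_attempt` name and the fixed guild window are hypothetical, and task completion is not modelled.

```python
from dataclasses import dataclass, field

GENERAL = "general"  # default tag of any task, per the text


@dataclass
class Task:
    posted_at: float      # time the task was published (seconds)
    guild_window: float   # how long the tagged guilds get first access
    tags: set = field(default_factory=lambda: {GENERAL})


def can_attempt(task, worker_guilds, now):
    """Whether a worker may pick up a task at time `now`.

    Tasks with only the general tag are open to everyone. Domain-tagged
    tasks go to members of a matching guild first, and open up to all
    workers once the guild window has passed without completion.
    """
    domain_tags = task.tags - {GENERAL}
    if not domain_tags:
        return True
    if domain_tags & set(worker_guilds):
        return True
    return now - task.posted_at >= task.guild_window
```

With a one-hour window, a task tagged `design` is visible to design guild members immediately and to everyone else only after 3600 seconds. A task tagged under several domains is open to members of any of those guilds, which leaves the multi-domain collaboration question raised above unanswered.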
Talking about guilds and their computational capability, we can have requesters interact with one representative of a guild community (but does an equal distribution of power work better? What about guilds with a larger population?). Tasks are distributed among guild members. Collaboration and transparency are encouraged within the guild (of course, interactions need to be monitored to prevent security issues or sharing of answers).
Daemo workers can essentially visualise levelling up as they gain expertise and experience.
Could we use the whole system to reproduce online or social learning on Daemo?
CLONING STACK-OVERFLOW KIND OF SYSTEM INTO GUILDS
We can configure the guild to clone a Stack Overflow kind of interface, where a pre-established system helps users manage complicated tasks within the guild. Essentially, the major guild would be composed of subguilds (which are just smaller units of guilds) that work in collaboration and are also quite manageable.
We said we would encourage collaboration within the guild. But wouldn't too many people from one or more domains being involved in a single task cause confusion and hence delay? Can we optimise some ratio in this regard? Can we implicitly or explicitly weigh the tags before optimising this ratio?
BOOMERANG WITHIN THE GUILD
You give feedback to members within your guild and to members across domains with respect to a particular task you worked on. Then, we apply Boomerang within the guild, where the feedback rating you give a member directly impacts your chances of working together with him/her again.
POSSIBLE STORY BOARDING APPROACHES
A new worker joins. He is new to the entire system, so we produce enough documentation to help him through the entire process and make his journey easy and smooth. He doesn't belong to any guild as of now. As he keeps working on tasks, we track his interests and the domains he seems to excel at (performing above the average at times, attempting tasks labelled advanced and so on). We soon notify him of the various guilds that are open to him. He can agree and join a guild, he can choose to stay away from guilds completely, or, if he has not yet made up his mind, he can choose to get a few notifications about them occasionally.
If he joins a guild, then he continues to function normally (implicitly agreeing to the terms and conditions of the guild). But his work quality is expected to improve over time as competition thrives, and he will be subjected to random checks or evaluations. If he is found to be functioning correctly within the guild, his reputation increases. If his functioning is below expectations, his reputation decreases. This fluctuation in reputation directly impacts his pay. Guild members are ranked relative to each other on this basis, and people in the lower half, fighting it out, are subjected to more frequent checks than their higher-ranked counterparts. Of course, he always has the option to quit the guild.
If he chooses to stay away from the guild, he continues to function normally. To ensure price and social compatibility, non-guild workers will have to compete with guild members. But they work more in isolation than in collaboration and may have multiple domain interests (which would otherwise mean way too many guilds). There will be relative rankings outside the guild as well, with lower ranks earning less than their higher-ranked peers.
When a new task comes in, it is given to the guild and also to all workers who have previously demonstrated interest in the tagged domain. We could have rungs of promotion which are well defined at each level, so there will be specific milestone jumps for the cluster. Whoever completes a given task would be given due credit for it.
When a new worker joins the platform, we produce enough documentation to help him through the entire process and make his journey easy and smooth. He fills in a profile with all his interests and we suggest guilds he could be a part of. He chooses one; in this variant there are no explicit non-guild members. Workers can work in isolation within the guild itself, simulating non-guild behaviour. That gives the system a lot of flexibility. He is added at the end of the ranking, and he is put through optional training with sample tasks and so on. This could either be for a specific period of 2 weeks, or it could be an open system, where you just have to complete the material, in any amount of time, in order to move on. Then he is open to tasks in the domain. It is up to him to fight his way to the top.
This week really helped us to understand and gain clarity about the entire system (from minute details to the big picture) from many perspectives (back-end development, front-end implementation, as a worker, as a requester and so on).