Winter Milestone 12: Algorithmic HummingBirds

From crowdresearch



It's been a full 12 weeks into the 'Stanford Crowd Research Collective', and we inch closer to the internal UIST deadlines. As this is a one-of-its-kind online research collective, authorship works a little differently here: we summarise our contributions on a global spreadsheet, and each person receives 100 credits to split among other potential contributors, giving more credit (though no more than 15 per person) to those with the most influence. We then run PageRank on this data, which downplays any internal collusion by people voting for each other in the mistaken belief that it will make them top authors undeservedly.
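The credit-splitting scheme above lends itself to a small sketch. The names, graph shape, and credit splits below are invented for illustration, and the PageRank is a plain power iteration, not the collective's actual pipeline.

```python
# Sketch (not the collective's actual pipeline): each contributor splits
# 100 credits among others (<= 15 to any one person is the stated rule);
# running PageRank on the resulting directed credit graph means reciprocal
# vote-trading rings gain little over genuinely influential contributors.

def pagerank(credits, damping=0.85, iters=100):
    """credits: {giver: {receiver: amount}}; returns a rank per person."""
    nodes = set(credits)
    for row in credits.values():
        nodes.update(row)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for giver, row in credits.items():
            total = sum(row.values())
            if total == 0:
                continue
            for receiver, amount in row.items():
                nxt[receiver] += damping * rank[giver] * amount / total
        # rank mass held by people who gave no credits is spread uniformly
        dangling = sum(rank[v] for v in nodes if not credits.get(v))
        for v in nodes:
            nxt[v] += damping * dangling / n
        rank = nxt
    return rank

votes = {  # hypothetical credit splits
    "ana": {"bo": 15, "cy": 10, "di": 75},
    "bo":  {"ana": 15, "di": 85},
    "cy":  {"di": 100},
    "di":  {"ana": 15, "bo": 15, "cy": 70},
}
scores = pagerank(votes)
print(max(scores, key=scores.get))  # "di" attracts the most weighted credit
```

Because each giver's row is normalised before it is propagated, inflating the raw numbers you hand to a friend does not inflate their rank unless your own rank is already high.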

This week we build upon the platform by refining our research process, detailing specific aspects of the core idea and configuring it for particular functionality in the following three domains: task authorship, task ranking, and open governance.

To summarise the discussion in all three domains:


The folks here considered reviews of workers (rated by volunteers) in order to study requester variance. A third pilot study is being launched, with the goal of studying variance with respect to quality rather than task difficulty. The Mechanical Turk task is to write a consumer review of an iPod based on a rubric.

A rubric is a scoring tool that explicitly represents the performance expectations for an assignment or piece of work. A rubric divides the assigned work into component parts and provides clear descriptions of the characteristics of the work associated with each component, at varying levels of mastery.

We would have to design the task such that it is based more heavily on characteristics than on content.


The folks worked on tasks imported to Daemo and estimated completion times for those tasks. We need to correlate the obtained ratings with the Boomerang ranking system. The results of the pilot study are ready but have yet to be analysed. The focus now is on implementing a feed study for requesters and on starting to write (once we make sure the data supports the hypothesis) a functionally complete beta paper for this week, focussed mainly on the literature survey and related work, along with system descriptions.


We need to consider guild scenarios in detail and how they play out with respect to new and experienced workers, as well as requesters.


A lot of us were confused about what guilds really mean or what they really do. The moment the topic came up, many of us wore confused expressions: "Guild what?" or "Guild? I thought we were talking about crowdsourcing!"

So, let us resolve the guild issue by delving into a little more detail. Let us take a bottom-up approach to the whole guild system: first we focus on the fundamentals, namely what guilds really are, what problem they are trying to solve, and how they would fit into crowdsourcing; later we can advance into the implementation details, the pros and cons inherent in such a system, and so forth.

A guild is classically defined as a medieval association of craftsmen or merchants, often having considerable power. In our context, a guild would be a highly connected, networked group of people working together, preferably in a single domain or on similar tasks.

Now, what problem is the guild really trying to solve? Guilds must be configured for different levels of members, so good-quality workers would sit at the top of the guild, which in itself could serve as a validator of the work. The guild could potentially address task rejections (if a highly rated worker in the guild did the work, it is unlikely they did not put in the required effort; and feedback on rejection is compulsory), multiple iterations, fair wages (per industry standard), safeguard mechanisms for creative work, especially design prototypes (work reaches the requester directly; what's done within the guild essentially stays within the guild), work security (you compete and collaborate within the guild, which essentially means within your domain), equal opportunity (as explored below), job satisfaction, and so on.

Who can really form a guild? One suggestion: when 100 or more requesters author tasks in a similar or closely related field, it becomes a guild. Another option: when a worker repeatedly does well in a particular domain and proves potential, they are added into a guild of that domain, i.e. automatic user classification based on domain performance. The user doesn't control the guild; rather, members evolve with the guild and don't define expertise amongst themselves.

Can we have multiple competing guilds? That idea might seem baseless on the surface, but think of guilds whose populations soar. So, we could have multiple guilds, but we would have to ensure equal opportunities and constant shuffling of members to encourage collaboration.

So, do all workers need to belong to a guild? I would say not necessarily. They could move into, and up, a guild once they prove approved domain knowledge or experience. But if we treat guilds like a company or organisational structure, it is not collective open governance anymore.

Needless to say, this system can be implemented in Boomerang and Daemo in myriad ways. One possible variation is described below.

Consider a guild system to be a connected group of workers working in a similar domain. When a task is first published, we assume an autotagging engine is in place and the task is tagged. The autotagging feature would work roughly like this: it is built using machine learning; we ask requesters to manually tag tasks initially, the machine "learns" from these, and then manual tagging is scrapped because the machine automatically infers which domain a particular task belongs to.

It is not possible, or even theoretically correct, to assume that this system is 100% accurate, so we introduce a few design interventions. The default tag of any given task is "general". If the autotagging system fails to recognise the domain of a task, or the author specifies no explicit constraints on the qualifications of workers who may attempt it, the task is open to any audience who wishes to attempt it (given that Boomerang is going to be improved to filter out spam workers by studying their user interaction). If the task is tagged under one or more domains, we open it up to the guild or channel of that domain first; it moves outside that periphery only if it is not completed within the specified time frame. An experiment in this direction could confirm or falsify the hypothesis about the effect on output quality.

The clear disadvantages are that the distribution of opportunity may seem unfair, or the restriction of tasks prejudiced, but let us assume for now that domain and general-category tasks are more or less equal in number, as they should eventually turn out to be. Also, what if a task requires collaboration between two or more domains, or is tagged under multiple domains but doesn't really require that sort of collaboration?
These ideas are explored later in this document.
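The routing rules above (a default "general" tag, guild-first visibility, falling back to everyone after a timeout) can be sketched as follows. The window length, field names, and return values are assumptions for illustration, not the platform's actual API.

```python
# Sketch of the guild-first routing rule: a tagged task opens to the
# matching guild first and falls back to the general audience only after
# its guild window expires uncompleted. All names here are hypothetical.

GUILD_WINDOW_HOURS = 48  # assumed timeout before a task leaves the guild

def audience_for(task, hours_since_publish):
    """Return which pool of workers may see the task right now."""
    tags = task.get("tags") or ["general"]   # default tag is 'general'
    if tags == ["general"]:
        return "everyone"                    # no domain recognised
    if task.get("completed"):
        return "closed"
    if hours_since_publish < GUILD_WINDOW_HOURS:
        return "guild:" + "+".join(tags)     # domain guild(s) see it first
    return "everyone"                        # window expired: open it up

print(audience_for({"tags": ["translation"]}, 10))  # guild:translation
print(audience_for({"tags": ["translation"]}, 60))  # everyone
print(audience_for({"tags": None}, 5))              # everyone
```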

Talking about guilds and their computational capability, we could have requesters interact with one representative of a guild community (but would equal power distribution work better? What about guilds with larger populations?). Tasks are distributed among guild members. Collaboration and transparency are encouraged within the guild (of course, interactions need to be monitored to prevent security and answer-sharing issues).

Daemo workers can essentially visualise levelling up as they gain expertise and experience.

Using the whole system, could we reproduce online or social learning on Daemo?


Let's say we fix the rate for a task at $x/hr. Would this affect quality? Might a worker rush and try to finish earlier to focus on other tasks (effectively earning more than $x/hr), or slow down and stretch the task to get more money for spending more time on it?

One option to resolve this is to crowdsource the price. The requester authors the task and the tagging system tags it. Only the task description (a brief) is released to the guild workers of the corresponding domains, and they are asked "how much is the task worth?"; let's suppose they say $x (this may be the average of guild and non-guild workers, depending on implementation). Now we ask the requester "how much is the task worth?" and suppose they say $y. We take the average of $x and $y, and this is the price of the task. This would bring in the equal-representation aspect that is the main essence of the open governance forum.

Another variation is to consider the guild and non-guild scenarios separately. If the guild members say $x, the non-guild members say $y, and the requesters say $z, the final price is either an average of all three or a weighted combination depending on the representation ratios.

Another option we could explore is as follows. When we ask for feedback about how much a task is really worth, we take the estimates individually, x1, x2, x3, ..., xn (within the guild), and weight each by the worker's reputation within the guild. A worker at the top of the guild (with a higher ranking) would thus have a greater say than a lower-ranked worker. This would motivate workers to climb the reputation rankings.
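Both pricing variants above reduce to weighted averages. A minimal sketch, with illustrative weights and reputation values (none of these numbers are proposed policy):

```python
# Two pricing rules from the text, as weighted averages. All inputs are
# hypothetical; representation weights and reputations are illustrative.

def blended_price(guild_est, non_guild_est, requester_est,
                  weights=(1.0, 1.0, 1.0)):
    """Average $x, $y, $z, optionally scaled by representation ratios."""
    vals = (guild_est, non_guild_est, requester_est)
    return sum(w * v for w, v in zip(weights, vals)) / sum(weights)

def reputation_weighted_price(estimates):
    """estimates: [(dollars, reputation)]; higher-ranked workers count more."""
    total_rep = sum(rep for _, rep in estimates)
    return sum(price * rep for price, rep in estimates) / total_rep

print(blended_price(6.0, 4.0, 5.0))   # plain average: 5.0
print(reputation_weighted_price([(6.0, 0.9), (3.0, 0.1)]))  # close to 5.7
```

The reputation-weighted price is pulled toward the estimate of the higher-ranked worker, which is exactly the incentive to climb the rankings that the text describes.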

Another option is to fix the price at $x/hr. We time all the individual workers and calculate the average time spent on the task (considering everyone, within the guild or outside, who has attempted it). We correlate quality with time spent and map the workers onto a scale relative to the average. We then pay all the workers whose tasks have been accepted, tweaking x for each worker based on how much time they spent and the quality of the work they produced.


[Illustration: workers mapped onto a scale relative to the average time, e.g. Worker #k at 0.03, with others at 0.2 and 0.5. Not to scale; for illustration purposes only.]

Now there are 4 possible scenarios:

  • Poor quality & more time spent: the worst-case scenario.
  • Poor quality & less time spent: the average cause-and-effect scenario.
  • High quality & more time spent: a relatively well-paid scenario.
  • High quality & less time spent: the best scenario.

So, workers would try to optimise the time factor without letting quality slip, and the high-quality, less-time quadrant is the league workers would strive to fit into (which is exactly what we want).
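A minimal sketch of the per-worker rate tweak implied by the four quadrants. The base rate and the multipliers are invented placeholders, not proposed values:

```python
# Hypothetical rate adjustment from the quality/time quadrants above.
# Quality is a 0..1 score; times are in minutes. Multipliers are made up.

BASE_RATE = 10.0  # $x/hr, fixed up front

def adjusted_rate(quality, time_spent, avg_time, avg_quality=0.5):
    """Scale the base rate from where the worker sits relative to averages."""
    fast = time_spent <= avg_time
    good = quality >= avg_quality
    if good and fast:        # best case: high quality, less time
        return BASE_RATE * 1.2
    if good and not fast:    # high quality but slow: still well paid
        return BASE_RATE * 1.1
    if not good and fast:    # average cause-and-effect case
        return BASE_RATE * 0.9
    return BASE_RATE * 0.7   # worst case: poor quality, more time

print(adjusted_rate(quality=0.8, time_spent=20, avg_time=30))  # top tier
print(adjusted_rate(quality=0.3, time_spent=40, avg_time=30))  # bottom tier
```

Note the ordering of the multipliers encodes the ranking of the four scenarios in the list above; an experiment would be needed to pick sensible magnitudes.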


Consider a subset of the tasks (which would be 100% of all prototype tasks); we ask the worker how much time they actually spent.

Let's analyse this now.


  • Workers would feel represented because they get a chance to explain their situation.
  • This prevents speedster behaviour, at least to some extent.


  • Workers might feel the need to rush as the clock is ticking, and it might start to feel like a video game.

One workaround for this (at least to some extent) is a hide/unhide button for the timer.

  • We might need to consider the impact on cognitive load.

However, this is expected to be pretty low.

General Analysis

Let's say you are a worker x. You belong to some guild system and you tried your hand at a task; the timer recorded a time of y, but you felt it took you z (greater or less than y).

a. How do we verify z? (We could track the worker, but that could be an invasion of privacy.)

b. Do we really need z? (I say this because we already have a built-in workaround: the timer can be paused whenever the worker is not working on the task, and the task is also disabled during this time.)

c. For the Boomerang prediction model, are we going to take y, z, or an average? If we take an average, how far off is it going to be? How would the various parameters measure up in this case?

d. How exactly would this affect the worker? I understand that the worker's task feed would be optimised to their abilities to help them earn the maximum amount. Suppose we build the task feed from y, from z, or from the average of y and z: how different are these feeds really going to be? Is there a significant difference, or a negligible deviation?

BUT this doesn't exactly change their actual earnings. Still, over- or under-reporting would make the whole system less useful. If we find (somehow) that a worker is doing this consistently, we could block them from prototype tasks or something of the sort.

But we will only know through experimentation on different types of tasks, like dark matter, prototype, gold standard, and so on.


One way would be to display the recorded time, say y, and ask the user whether we got it right or whether it was greater or less, and if so by how much; from this we compute z.

Rather than asking the worker for the exact value, we would ask for a range. That would be less precise, but the chance of them accidentally getting it wrong would be minimised. We could give them options to choose from (scientifically designed, equally spaced intervals).

We could separately report the averages of the unedited time, the edited time, and a combination of the two.
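The ranged self-report could look something like this. The interval width, the number of options, and the midpoint-averaging rule for combining the timer's y with the worker's reported range are all assumptions for illustration:

```python
# Sketch of the ranged self-report: instead of an exact figure, the worker
# picks from equally spaced intervals bracketing the recorded time y.
# Interval width and count are illustrative assumptions.

def time_options(recorded_minutes, step=5, n_each_side=2):
    """Equally spaced (start, end) ranges around the timer's reading."""
    lo = max(0, recorded_minutes - step * n_each_side)
    return [(start, start + step)
            for start in range(lo, recorded_minutes + step * n_each_side, step)]

def reconcile(recorded, chosen_range):
    """Combine the timer's y with the ranged z: average y with the midpoint."""
    z = sum(chosen_range) / 2
    return (recorded + z) / 2

opts = time_options(20)
print(opts)                     # [(10, 15), (15, 20), (20, 25), (25, 30)]
print(reconcile(20, (25, 30)))  # 23.75
```

Reporting the unedited y, the reconciled value, and their difference per worker would let us answer question (d) above empirically.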


We can configure the guild to clone a Stack Overflow kind of interface, where a pre-established system helps users manage complicated tasks within the guild. Essentially, the major guild would be composed of sub-guilds (smaller units of guilds) that work in collaboration and remain quite manageable.


There could be automatic threshold barricades that a worker needs to cross in order to get promoted.

  • PROMOTION #i+1: x points, y tasks
  • PROMOTION #i: x points, y tasks
  • PROMOTION #i-1: x' points, y' tasks
  • PROMOTION #i-2: x points, y tasks

and so on and so forth.
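The threshold ladder might be checked like this; the point and task cutoffs are made-up placeholders, not agreed values:

```python
# Hypothetical threshold ladder for the automatic promotions sketched above.
# Each rung requires BOTH a point barricade and a task-count barricade.

LADDER = [
    (0,    0,   0),  # (min_points, min_tasks, level)
    (100,  20,  1),
    (400,  60,  2),
    (1200, 150, 3),
]

def level_for(points, tasks_done):
    """Highest rung whose point AND task barricades are both crossed."""
    level = 0
    for min_pts, min_tasks, lvl in LADDER:
        if points >= min_pts and tasks_done >= min_tasks:
            level = lvl
    return level

print(level_for(450, 60))  # 2
print(level_for(450, 30))  # 1 (points suffice for level 2, tasks do not)
```

Requiring both barricades prevents a worker from teleporting through levels on points alone, one of the concerns raised later in this document.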

We could also have tasks reviewed and peer-reviewed, with promotion based on the ranking and the points and badges received. Promotion here would be more of a social decision (reviewers would be paid for it).

Or we could have automatic ranking systems, where promotions are also automatic (for higher-ranked workers) and manifest as reputation within the system.

Or we could have third-party decisions, where a person performing well consistently would be considered for promotion.

Any of these could be adopted, but we might need to make a collective decision on which, ideally supported with substantial experimentation and data.


We could explore a professional-association model that builds and classifies users like a skill tree: people are organised into a pyramid-like structure, with the highest skill levels at the top and the lowest skills (or perhaps new workers) somewhere near the bottom. We could expect to find more people at intermediate levels than at the top or bottom. Of course, if that's true, it wouldn't be a pyramid structure anymore, but the analogy is used here for illustration purposes only.


We said we would encourage collaboration within the guild. But wouldn't too many people from one or more domains being involved in a single task cause confusion and hence delay? Can we optimise some ratio in this regard? Can we implicitly or explicitly weigh the tags before optimising this ratio?


You give feedback to members within your guild, and to members across domains, with respect to a particular task you worked on. Then we apply Boomerang within the guild, where the feedback rating directly impacts your chances of working with that person again.


In addition to having Boomerang along with Daemo, we plan to fold in our insights from pilot studies and recent feedback in order to develop Daemo 2.0, or e-Daemo (enhanced Daemo). The new features it would account for include: 1. leveling for workers based on anonymous peer feedback or assessment of work quality (having effective mechanisms for dealing with anonymity would be a challenge); 2. level-segmented worker pools with fair-wage recommendations for requesters, an approach that would help workers feel included and represented in the platform design; 3. forums encouraging community development, sharing of best practices, and communication between stakeholder groups, which would take us a long way.


We have run a few pilot studies across all three domains to study variances, get feedback on how the platform will eventually play out, and so on. We may still need critical feedback, for which we develop two different designs and ask workers to give us relative feedback.

Obviously, feedback includes the pros and cons of the platform. But as they say, "a chain is only as strong as its weakest link"; we need to work on the weakest links and strengthen them, which is why only those are focused on here.

1. Workers said they are likely to choose to do reviews if the reviews pay approximately a fair hourly wage for their experience level.

2. Workers felt that it is unfair to review purely for the good of the system; however, reports on standardized rates for reviewing tasks were inconclusive.

Accountability for crowdworkers is not quite the same as for the rest of the working professions, and trying to hold virtual workers to the same standards and structures would require at least a double-checked eradication of anonymity.

However, the concern is about maintaining anonymity between the reviewer and reviewee. This problem can probably be solved by modelling the platform as a weighted graph: all the people on the platform are mapped as nodes, a relationship between two people is mapped as an edge, and the number of messages they have shared (on chat) represents the weight of that relationship.

So, when we have to review the work of a worker A, we use a randomization algorithm to pick another worker from the guild A is associated with; call this worker B. We check the weight of the direct edge between A and B: the more messages, the more likely they are acquainted (even just professionally), and the less the anonymity. If the number of messages isn't too high, they are effectively anonymous. We then check that B's level is greater than or equal to A's, and ask B to review A's work only if that condition holds. If there are too many messages (too much weight) or the level condition is not satisfied, we run the randomization algorithm again to pick another person and repeat the process. To make sure offline communication is not involved (they might appear anonymous when they are not!), we might prefer people from geographically different locations.

Another way of doing this: when you have to review the work of A, find A on the graph, look for the people with whom A has had the least interaction (the lowest edge weights), and get the work reviewed by them.
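The reviewer pick on the weighted graph can be sketched as below. The chat cutoff, the data shapes, and the choice to take the least-acquainted eligible peer directly (rather than repeated random sampling) are assumptions for illustration:

```python
# Sketch of anonymity-preserving reviewer selection on the chat graph.
# Edges are unordered pairs keyed by frozenset; weights are message counts.

CHAT_CUTOFF = 5  # more shared messages than this => too acquainted

def pick_reviewer(author, guild, levels, chat_weight):
    """Least-acquainted guild peer at author's level or above, else None."""
    candidates = [
        w for w in guild
        if w != author
        and levels[w] >= levels[author]
        and chat_weight.get(frozenset((author, w)), 0) <= CHAT_CUTOFF
    ]
    if not candidates:
        return None
    # fewest shared messages first: maximum anonymity
    return min(candidates,
               key=lambda w: chat_weight.get(frozenset((author, w)), 0))

guild = ["A", "B", "C", "D"]
levels = {"A": 2, "B": 3, "C": 2, "D": 1}
chat = {frozenset(("A", "B")): 40,   # A and B talk a lot: not anonymous
        frozenset(("A", "C")): 2}
print(pick_reviewer("A", guild, levels, chat))  # C
```

B is excluded by the weight check despite the higher level, and D by the level check; returning None when nobody qualifies is the signal to widen the search (e.g. to geographically distant guild members, as suggested above).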

3. The cheating aspect needs to be worked on. We will have to put some security mechanisms and authentication checks in place, because people will eventually figure out which HITs are sampled for promotion. To deal with this at least partially, peers who find co-workers cheating should be able to report them anonymously (anonymously only with respect to the reported worker). A board of representatives (of both communities) will then investigate the allegation, probably by studying all the details: they will talk to the worker who reported it, figure out why they think so and what evidence they have, and come to a conclusion about whether the accused should be blocked or demoted. If a requester finds a scammer, they can flag that worker from all their future work, refuse to pay them, etc.

4. External factors, like unfocused workers, may reduce the likelihood of Daemo being a successful platform. The point above takes care of one aspect of this. As many issues would come to light only during testing of the design, it is recommended that we run a biweekly randomized experimental study (separately for guild and non-guild structures) to study the designs and see how they turn out. A design can fail at the design stage itself, at the prototyping stage, and so on; we would pull ourselves together and work it out iteratively!

5. In order to help workers feel respected and accounted for, we allow a worker to reject a task if they feel it is priced too low with respect to the effort or time they would need to put in to complete it. If enough people (more than some threshold) reject/report a task (with optional comments), it is automatically flushed out of their task feeds and does not reappear until the requester republishes it with a higher price, for a lower level, etc.
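The rejection threshold could be applied to a task feed like this; the threshold value and data shapes are assumptions for illustration:

```python
# Sketch of the rejection-threshold flush: tasks reported by enough
# workers drop out of the feed until the requester republishes them.

FLUSH_THRESHOLD = 10  # assumed number of reports before a task is flushed

def visible_tasks(feed, reports):
    """Drop any task whose report count has crossed the threshold."""
    return [t for t in feed if reports.get(t, 0) < FLUSH_THRESHOLD]

feed = ["tag-images", "transcribe-audio", "write-review"]
reports = {"transcribe-audio": 12, "write-review": 3}
print(visible_tasks(feed, reports))  # ['tag-images', 'write-review']
```

Republishing at a higher price would reset the report count, letting the task re-enter feeds.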

6. Ratings should be relative to the level, and this needs care, as teleporting through levels may be an issue. We need to correlate ratings with the level at which the task is open. This shouldn't pose an issue when a task is open only to workers at one level. When a task is open to workers at many levels, we could take different ratings for different levels, or use some fixed representation format for the average.

7. There may be private tasks that need special handling, as these cannot be shared (fixed or limited visibility, perhaps with copyright). We ask the requester whether a particular task is private. If it is, we ask them for the desired visibility. We form a virtual guild of all the people who satisfy the requisite conditions (perhaps of a particular level, belonging to a particular organization, or with certain expertise) and make the task visible only to them. All information stays strictly within this guild network. Once the task is complete, the requester can choose to dissolve the guild.

But what if we need selective visibility across levels? For example, assume we have a complex task (involving multiple subproblems) where we need to transcribe the audio of a YouTube video, then translate and summarize it. Instead of opening the entire task to one worker who would demand quite a fee, the requester could decide that, say, workers of level i transcribe, workers of level j translate, and workers of level k summarize. This might work out to be cheaper.

We would have to ask the requester whether they need selective visibility and of what kind. We would need to develop an interface to support this, asking for the number of levels and what needs to be visible at each level; we then form virtual guilds for each level and proceed.

But if companies pounce on this and use the platform for that purpose alone, it will be highly ineffective and not serve the purpose it was truly meant for. To keep this in check, we can limit the number of virtual guilds at any given point in time. Once the threshold is reached, a new virtual network can be formed only when an existing requester dissolves an existing one to make space.

8. Workers would have the freedom and flexibility to choose to work in a guild or a non-guild environment. Leveling could be problematic due to power concerns, and pitting people against each other outside the guild would defeat the purpose, so we would have to figure out an absolute metric for this.

9. Review of topical tasks may not be straightforward, so this aspect needs work.

10. There is a potential level-inflation issue if ratings for superior or good-quality tasks are pitched high against people who have been earnestly working.

11. Pay. We need a mechanism or body in place to make sure payment reaches the appropriate person at the appropriate time (say, the 1st of the month). Since this is a platform with global outreach, there might be currency issues and so on. While many platforms have been using PayPal, we might explore other options like direct deposit. We could use IP or local GPS to geolocate people and adjust payments, but proxy systems are a problem there. And could we pay based on accepted scales? For example, people in places with higher standards of living might be paid slightly more than their counterparts.

12. Requesters might be hard on new workers, strangling their confidence and leaving them depressed or frustrated. So, when a new worker attempts a task, their submission goes along with a 'new worker' tag, and they are not pitted against workers at a much higher level. If the guild could also stand in support, it would be a wonderful thing!

13. As other crowdsourcing platforms like Turker Nation and Amazon Mechanical Turk are well established in this sphere, migration to Daemo might be slow. So, we need good documentation, clear policies, and, of course, better pay to make the transition smooth.

14. If a requester specifies that a task is open to a specific level, or requires a certain qualification, how is that justified? What happens to workers outside the guild?

15. For requesters with a lot of experience, do they level up? Do they have monitored reputation? Is Boomerang the answer?

ADDITIONAL THOUGHTS

1. Can the mechanism of guilds be automated with the use of artificial intelligence, where groups form around commonly repeated task patterns in addition to being created manually? If you try to manually create a guild that already exists automatically, the system notifies you and doesn't allow it.

2. Can we auto-link profiles so that your promotions show on your LinkedIn profile?

3. We could have separate news feed systems for general and guild/domain-related content, notifying you about current research in the field, available MOOCs, etc. This might help several workers jump levels, gain expertise, and so on.


This part of the prototype mainly deals with open governance design aspects. We sincerely hope the methods suggested above help build a better crowd platform, for a better world where crowd workers are represented and respected.