WinterMilestone 1 YourTeamName

From crowdresearch



We are a full 12 weeks into the 'Stanford Crowd Research Collective' and inching closer to the internal deadlines for UIST. As this is a one-of-its-kind online research collective, authorship works a little differently here. We summarise our contributions on a global spreadsheet, and each of us receives 100 credits to split among the other potential contributors (more credit, but no more than 15 per person, goes to those with the most influence). We then run PageRank on this data, which downplays the influence of people who vote for each other in an attempt to appear as top authors undeservingly.
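To make the credit scheme concrete, here is a minimal sketch of running PageRank over credit allocations. The function names, the damping factor of 0.85, and the sample names are assumptions for illustration; the collective's actual spreadsheet format and PageRank settings are not described here.

```python
MAX_PER_PERSON = 15            # cap per recipient, as stated in the text
CREDITS_PER_CONTRIBUTOR = 100  # credits each contributor may hand out


def validate(allocations):
    """Check every contributor's allocation against the rules in the text."""
    for giver, given in allocations.items():
        if giver in given:
            raise ValueError(f"{giver} may only credit other contributors")
        if sum(given.values()) > CREDITS_PER_CONTRIBUTOR:
            raise ValueError(f"{giver} handed out more than 100 credits")
        if any(c > MAX_PER_PERSON for c in given.values()):
            raise ValueError(f"{giver} gave one person more than 15 credits")


def credit_pagerank(allocations, damping=0.85, iterations=100):
    """Power-iteration PageRank where edge weights are the credits given.

    `damping` and `iterations` are assumed values, not taken from the text.
    """
    people = sorted(set(allocations) | {p for g in allocations.values() for p in g})
    n = len(people)
    rank = {p: 1.0 / n for p in people}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in people}
        for giver in people:
            given = allocations.get(giver, {})
            total = sum(given.values())
            if total == 0:
                # A contributor who gave no credit spreads their rank evenly.
                for p in people:
                    new_rank[p] += damping * rank[giver] / n
            else:
                for receiver, credits in given.items():
                    new_rank[receiver] += damping * rank[giver] * credits / total
        rank = new_rank
    return rank


if __name__ == "__main__":
    sample = {
        "alice": {"bob": 15},
        "bob": {"alice": 15, "carol": 5},
        "carol": {"alice": 15},
    }
    validate(sample)
    for person, score in sorted(credit_pagerank(sample).items()):
        print(person, round(score, 3))
```

Because a person's score depends on the scores of those who credit them, two people who only credit each other gain little unless the rest of the group credits them too.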

This week we build upon the platform and refine our research process by detailing specific aspects of the core idea and configuring it for particular functionality in three domains: task authorship, task ranking, and open governance.

To summarise the discussion in all of the three domains:


The folks here considered reviews of workers (rated by volunteers) in order to study requester variance. A third pilot study is being launched with the goal of studying the variance with respect to quality rather than task difficulty. The Mechanical Turk task is to write a consumer review of an iPod based on a rubric.

A rubric is a scoring tool that explicitly represents the performance expectations for an assignment or piece of work. A rubric divides the assigned work into component parts and provides clear descriptions of the characteristics of the work associated with each component, at varying levels of mastery.

We would have to work on task design so that it is based more heavily on characteristics than on content.


The folks worked on tasks imported to Daemo, and the completion times of those tasks were estimated. We need to correlate the obtained ratings with the Boomerang ranking system. The results of the pilot study are ready but have yet to be analysed. The focus now is on implementing a feed study for requesters. The team has also started writing a functionally complete beta paper for this week, focused mainly on the literature survey and related work along with system descriptions (we need to make sure that the data supports the hypothesis first).


We need to consider guild scenarios in detail and how they play out with respect to new, experienced workers as well as requesters.


A lot of us were confused as to what guilds really mean or what they really do. The moment the topic came up, many of us had confused expressions like - “Guild what?” or “Guild? I thought we were talking about crowdsourcing”!!

So, let us resolve the guild issue by delving into a little more detail. We take a bottom-up approach to the whole guild system: first we focus on the fundamentals (what guilds really are, what problem they are trying to solve, and how they would fit into crowdsourcing), and later we move on to the implementation details, the pros and cons inherent in such a system, and so on.

A guild is basically defined as a medieval association of craftsmen or merchants, often having considerable power. In our context, a guild would be a highly connected or networked group of people working together, preferably in a single domain or on similar tasks.

Now, what problem is the guild really trying to solve? Guilds must be configured for different levels of members. Good-quality workers would be at the top of the guild, which in itself could serve as a validator of the work. The guild could potentially look at task rejections (if a worker with a high rating in the guild has done the work, it is unlikely that he or she didn't put in the required effort; feedback on rejection is compulsory), multiple iterations, fair wages (per industry standard), a safeguard mechanism for creative work, especially design prototypes (work reaches the requester directly; what's done within the guild essentially stays within the guild), work security (you compete and collaborate within the guild, which essentially means within your domain), equal opportunity (as explored below), job satisfaction, and so on.

Who can really form a guild? One suggestion was that when 100 or more requesters author tasks in the same or a closely related field, it becomes a guild. Another thing we could do is add a worker to the guild of a domain when he or she repeatedly does well in that domain and proves potential, i.e., automatic user classification based on domain performance. The users don't control the guild; rather, they evolve with the guild and don't define expertise amongst themselves.

Can we have multiple competing guilds? That idea would seem baseless on the surface, but think of guilds where the population soars. So we could have multiple guilds, but we would have to ensure equal opportunities and constant shuffling of members to encourage collaboration.

So, do all workers need to belong to a guild? I would say not necessarily. They could move into and up a guild once they prove approved domain knowledge or experience. But if we treat guilds like a company or organisational structure, it's not collective open governance anymore.

Needless to say, this system can be implemented in Boomerang and Daemo in myriad ways. One possible variation is described below.

Consider a guild system to be a connected group of workers working in a similar domain. When a task is first published, we assume that there is an autotagging engine in place and that the task is tagged. The autotagging feature would work something like this: it is built using machine learning, artificial intelligence, neural networks, and related techniques. We ask requesters to tag tasks manually at first, the machine “learns” from these tags, and then manual tagging is scrapped once the machine can automatically tell which domain a particular task belongs to.

Now, it is not possible, or even correct, to assume that this system is a hundred percent accurate. To work around this, we introduce a few design interventions. The default tag of any given task is general. If the autotagging system fails to recognise the domain of a particular task, or the author specifies no explicit constraints on the qualification of the worker who attempts the task, then the task is open to anyone who wishes to attempt it (given that Boomerang is going to be improved to filter out spam workers by studying their user interaction). If the task is tagged under one or more domains, then we open up the task to the guild or channel of that domain first. It moves out of that periphery only if the task is not completed within the specified time frame.

An experiment in this direction may support or falsify the hypothesis about the effect on output quality. The clear disadvantage is that one may think the distribution of opportunity is unfair or that the restriction of tasks is prejudiced, but let us assume for now that all domains and general-category tasks are more or less equal in number, as they should eventually turn out to be. Also, what if a task requires collaboration between two or more domains, or is tagged under multiple domains but doesn't really require that sort of collaboration? These ideas are explored later in this document.
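The routing rule described above (default tag general, guild first, everyone after the time frame expires) could be sketched as follows. The `Task` fields, the 24-hour window, and the `"guild:<tag>"` labels are hypothetical names chosen for illustration; the autotagger itself is assumed to exist and is not modelled.

```python
from dataclasses import dataclass, field

GENERAL_TAG = "general"  # default tag of any task, as stated in the text


@dataclass
class Task:
    title: str
    tags: list = field(default_factory=list)  # output of the assumed autotagger
    hours_since_published: float = 0.0
    guild_window_hours: float = 24.0          # assumed "specified time frame"
    completed: bool = False


def eligible_audience(task):
    """Return who may currently attempt the task."""
    domain_tags = [t for t in task.tags if t != GENERAL_TAG]
    if not domain_tags:
        # Untagged or general tasks are open to anyone.
        return {"everyone"}
    window_expired = task.hours_since_published > task.guild_window_hours
    if window_expired and not task.completed:
        # The guild did not finish in time, so the task leaves the guild.
        return {"everyone"}
    # Otherwise the guild (or channel) of each tagged domain gets it first.
    return {f"guild:{t}" for t in domain_tags}


if __name__ == "__main__":
    print(eligible_audience(Task("Label photos")))
    print(eligible_audience(Task("Design a logo", tags=["design"])))
```

A task tagged under several domains is simply offered to all of the matching guilds here; whether such a task actually needs cross-domain collaboration is left open, as in the text.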

Talking about guilds and their computational capability, we can have requesters interact with one representative of a guild community (but does an equal distribution of power work better? What about guilds with a larger population?). Tasks are distributed among guild members. Collaboration and transparency are encouraged within the guild (of course, interactions need to be monitored to prevent security issues or the sharing of answers).

Daemo workers can essentially visualise levelling up as they gain expertise and experience.

Using the whole system, could we reproduce online or social learning on Daemo?


Let's say we fix the rate for a task at $x/hr. Would this affect quality? A worker might rush to finish early and move on to other tasks (perhaps ones paying more than $x/hr), or slow down and stretch the task out to get more money for spending more time on it.

One option to resolve this is to crowdsource the price. The requester authors the task and the tagging system tags it. Only a brief task description is then released to the guild workers of the corresponding domains, and they are asked “how much is the task worth?” Let's suppose they say $x (this may be taken as the average over guild and non-guild workers, depending on the implementation). Now let's ask the requester the same question, and suppose he or she says $y. We take the average of $x and $y, and this is the price of the task. This brings in equal representation, which is the essence of the open governance forum.

Another variation could be to consider the guild and non-guild scenarios separately. Suppose the guild members say $x, the non-guild members say $y, and the requesters say $z. The final price is then either the average of all three or a weighted combination based on the representation ratios.
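The two averaging variants above amount to simple arithmetic. The sketch below shows both; the group names and the representation shares are made-up values for illustration.

```python
def average_price(*estimates):
    """Plain average, e.g. of the workers' $x and the requester's $y."""
    return sum(estimates) / len(estimates)


def representation_weighted_price(estimates, representation):
    """Weight each group's estimate by its share of representation.

    `estimates` and `representation` are dicts keyed by group name;
    the representation values need not sum to 1.
    """
    total = sum(representation[group] for group in estimates)
    return sum(estimates[g] * representation[g] for g in estimates) / total


if __name__ == "__main__":
    # Two-party case: workers say $12, the requester says $8.
    print(average_price(12, 8))

    # Three-party case with assumed representation shares.
    estimates = {"guild": 12.0, "non_guild": 8.0, "requester": 10.0}
    shares = {"guild": 50, "non_guild": 30, "requester": 20}
    print(average_price(*estimates.values()))
    print(representation_weighted_price(estimates, shares))
```

With equal shares the weighted version reduces to the plain average, so the choice between the two comes down to how representation is defined.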

Yet another option we could explore is as follows. When we ask workers how much a task is really worth, we could take the answers individually, as x1, x2, x3, ..., xn (within the guild), and then weigh each by the reputation of the worker within the guild. So a worker at the top of the guild (with a higher ranking) would have a greater say in the price than a worker who is ranked lower. This would motivate workers to climb higher in the reputation rankings.
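One way to read this option is as a reputation-weighted average of the individual estimates x1 ... xn. The sketch below assumes reputation is a non-negative number per worker; how reputation is actually computed within a guild is not specified here.

```python
def reputation_weighted_price(votes):
    """Combine price estimates, giving higher-reputation workers more say.

    `votes` is a list of (price_estimate, reputation) pairs. Using the
    reputation directly as the weight is an assumption for illustration.
    """
    total_reputation = sum(reputation for _, reputation in votes)
    if total_reputation <= 0:
        raise ValueError("at least one voter needs a positive reputation")
    return sum(price * reputation for price, reputation in votes) / total_reputation


if __name__ == "__main__":
    # A top-ranked worker (reputation 3) says $10; a newer one (reputation 1) says $20.
    print(reputation_weighted_price([(10.0, 3.0), (20.0, 1.0)]))
```

The result sits closer to the estimate of the higher-ranked worker than a plain average would.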

One other option is to fix the price at $x/hr. We time all the individual workers and calculate the average time spent on the task (considering everyone, within the guild or outside it, who has attempted the task). We correlate the quality with the time spent, and map the workers onto a scale with respect to the average. We then pay all those workers whose work has been accepted, tweaking x for each individual based on how much time he or she spent and the quality of the work produced.


[Figure: a scale around the average on which Worker #k is placed; the sample values marked on it are 0.2, 0.03, and 0.5.]


Please note that the above attempt is not to scale and is for illustration purposes only.

Now there are 4 possible scenarios:

  • Poor quality & more time spent: This would be the worst-case scenario.
  • Poor quality & less time spent: This could be considered the average cause-and-effect kind of scenario.
  • High quality & more time spent: This would be a relatively well-paid scenario.
  • High quality & less time spent: This could be considered the best scenario.

So all workers would aim for the high-quality, less-time league, optimising the time they spend rather than compromising on quality (which is exactly what we want).
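The four scenarios can be expressed as a small classification against the average time and a quality cut-off. The 0-to-1 quality scale and the 0.5 threshold are assumptions; the text does not say how quality is scored.

```python
def classify_submission(quality, time_spent, average_time, quality_threshold=0.5):
    """Place one accepted submission into one of the four scenarios.

    `quality` is assumed to lie between 0 and 1, and `quality_threshold`
    is a hypothetical cut-off between "poor" and "high" quality.
    """
    high_quality = quality >= quality_threshold
    less_time = time_spent <= average_time
    if high_quality and less_time:
        return "high quality, less time (best)"
    if high_quality:
        return "high quality, more time (relatively well paid)"
    if less_time:
        return "poor quality, less time (average)"
    return "poor quality, more time (worst)"


if __name__ == "__main__":
    average_minutes = 30.0
    for quality, minutes in [(0.9, 20.0), (0.9, 45.0), (0.2, 20.0), (0.2, 45.0)]:
        print(quality, minutes, classify_submission(quality, minutes, average_minutes))
```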


Consider a subset of the tasks (which would include 100% of the prototype tasks). We ask the worker how much time he or she actually spent.

Let's analyse this now.


  • The worker would feel represented because he or she gets a chance to explain the situation.
  • This prevents speedster behaviour, at least to some extent.


  • Worker would feel the need to rush as the clock is ticking and it might become more like a video game.

One workaround for this (at least to some extent) is a hide/unhide button for the timer.

  • We might need to consider the impact on the cognitive load

However this is expected to be pretty low.

General Analysis

Let's say you are worker x. You belong to some guild, and you tried your hand at a task. The timer recorded a time of y, but you felt it took you z (greater or less than y).

a. How do we verify z? (we could track but could be an invasion of privacy)

b. Do we really need z? (I say this because we already have a built-in workaround: the timer can be paused when the worker is not working on the task, and the task is also disabled during this time.)

c. For the Boomerang prediction model, are we going to take y, z, or an average? If we take an average, how far off is it going to be? How would the various parameters measure up in this case?

d. How exactly would this affect the worker? I understand that the worker's task feed would be optimised to his or her abilities to help him or her earn the maximum amount. Let's say we build the task feed using y, then using z, and then using the average of y and z. How different are these really going to be? Is there a significant difference, or only a negligible deviation?

But this doesn't exactly change the worker's actual earnings. Over- or under-reporting would, however, make the whole system less useful to that worker. If we find (somehow) that this is happening consistently, we could block that worker from prototype tasks or something of the sort.

But we will know only by experimentation on different types of tasks, like dark matter, prototype, gold standard and so on.


One way would be to display the recorded time, say y, and ask the user whether we got it right or whether the actual time was greater or less, and if so, by how much. From this we compute z.

Rather than asking the worker for the exact value, we would ask for a range. That would be less precise, but the chances of him or her accidentally getting it wrong would be minimised. We could give them options to choose from (these would be scientifically designed, equally spaced intervals).

We could separately report the averages of unedited time and edited time and a combination of the two.
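A minimal sketch of the range-based correction and the three averages follows. The 5-minute step, the number of ranges, taking the midpoint of the chosen range as the correction, and reading "a combination of the two" as the per-task mean of y and z are all assumptions for illustration.

```python
from statistics import mean


def offset_ranges(step_minutes=5.0, ranges_per_side=2):
    """Equally spaced correction ranges (in minutes) around the recorded time."""
    edges = [i * step_minutes for i in range(-ranges_per_side, ranges_per_side + 1)]
    return list(zip(edges[:-1], edges[1:]))


def edited_time(recorded_minutes, chosen_range):
    """Estimate z from the recorded time y and the range the worker picked."""
    low, high = chosen_range
    return max(0.0, recorded_minutes + (low + high) / 2)


def summarise(recorded_times, edited_times):
    """Report the unedited mean (y), edited mean (z), and their combination."""
    combined = [(y + z) / 2 for y, z in zip(recorded_times, edited_times)]
    return {
        "unedited_mean": mean(recorded_times),
        "edited_mean": mean(edited_times),
        "combined_mean": mean(combined),
    }


if __name__ == "__main__":
    options = offset_ranges()
    recorded = [30.0, 20.0]
    # The first worker says the task took 0 to 5 minutes longer than recorded;
    # the second confirms the recorded time.
    edited = [edited_time(30.0, (0.0, 5.0)), 20.0]
    print(options)
    print(summarise(recorded, edited))
```

Comparing the three means across task types would show whether the gap between y and z is large enough to matter for the prediction model.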


We can configure the guild to clone a Stack Overflow kind of interface, bringing a pre-established system into the guild that helps users manage complicated tasks. Essentially, the major guild would be composed of subguilds (which are just smaller units of guilds) that work in collaboration and are also quite manageable.


There could be automatic threshold barricades that a worker needs to cross in order to get promoted.

x points, y tasks: PROMOTION #i+1

x points, y tasks: PROMOTION #i

x' points, y' tasks: PROMOTION #i-1

x points, y tasks: PROMOTION #i-2

and so on and so forth.
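The threshold ladder could be checked automatically along these lines. The concrete point and task thresholds are invented for illustration; the text leaves x and y unspecified.

```python
# (minimum points, minimum tasks) required for each level, lowest first.
# These numbers are hypothetical placeholders for the x and y in the text.
THRESHOLDS = [(0, 0), (100, 10), (300, 30), (600, 60)]


def level_for(points, tasks, thresholds=THRESHOLDS):
    """Return the highest level whose point AND task thresholds are both met.

    A worker must clear every lower barrier before a higher one counts.
    """
    level = 0
    for index, (min_points, min_tasks) in enumerate(thresholds):
        if points >= min_points and tasks >= min_tasks:
            level = index
        else:
            break
    return level


if __name__ == "__main__":
    print(level_for(350, 20))   # enough points for level 2, but too few tasks
    print(level_for(600, 60))
```

Requiring both thresholds means a worker cannot be promoted on points alone from a handful of tasks.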

We could also have tasks reviewed and peer reviewed, and then you get promoted based on the ranking, points, and badges received. Promotion here would be more like a social decision (they would be paid for it).

Or we could have automatic ranking systems where promotions are also pretty automatic (higher ranked) and would manifest as reputation within the system.

Or we could have third-party decisions, where a person who performs well consistently would be considered for promotion.

This could be considered, but we might need to make a collective decision on which option to choose, ideally supported by substantial experimentation and data.


We could explore a professional association model that builds and classifies users like a skill tree, where people are organised into a pyramid kind of structure: people with the highest skill levels sit at the top of the pyramid, and people with the lowest skills (or perhaps new workers) come somewhere near the bottom. We could expect to find more people at intermediate levels than at the top or bottom. Of course, if that's true, then it wouldn't be a pyramid structure anymore, but the analogy is used here for illustration purposes only.


We said we would encourage collaboration within the guild. But wouldn't too many people being involved in a single task, from one or more domains, cause confusion and hence delay? Can we optimise some ratio in this regard? Can we implicitly or explicitly weigh the tags before optimising this ratio?


You give feedback to members within your guild, and to members across domains, with respect to a particular task you worked on. Then we apply Boomerang within the guild, where the feedback rating you give a member directly impacts your chances of working together with that member again.
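As a rough sketch of ratings affecting the chance of working together, the ratings you have given could be turned into pairing probabilities. This is an illustrative assumption, not Boomerang's actual algorithm; the function names and the proportional weighting are hypothetical.

```python
import random


def pairing_probabilities(ratings):
    """Turn the ratings you gave to peers into chances of being paired again.

    `ratings` maps peer name to a non-negative rating. Weighting in direct
    proportion to the rating is an assumption made for illustration.
    """
    if not ratings:
        raise ValueError("no peers have been rated yet")
    total = sum(ratings.values())
    if total == 0:
        # No positive ratings: fall back to equal chances.
        return {peer: 1.0 / len(ratings) for peer in ratings}
    return {peer: rating / total for peer, rating in ratings.items()}


def pick_collaborator(ratings, rng=None):
    """Draw one peer at random, favouring those you rated highly."""
    rng = rng or random.Random()
    probabilities = pairing_probabilities(ratings)
    peers = list(probabilities)
    weights = [probabilities[peer] for peer in peers]
    return rng.choices(peers, weights=weights, k=1)[0]


if __name__ == "__main__":
    my_ratings = {"asha": 3.0, "ben": 1.0}
    print(pairing_probabilities(my_ratings))
    print(pick_collaborator(my_ratings))
```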


In addition to having Boomerang along with Daemo, we plan to bring in our insights from the pilot studies and recent feedback to develop Daemo 2.0, or e-Daemo (enhanced Daemo). Its new features would include:

1. Levelling for workers based on anonymous peer feedback or assessment of work quality (having effective mechanisms for dealing with anonymity would be a challenge).
2. Level-segmented worker pools with fair wage recommendations for requesters. This approach would help workers feel included and represented in the platform design.
3. Forums that encourage community development, sharing of best practices, and communication between stakeholder groups, which would take us a long way.


We have run a few pilot studies in all three domains in order to study variances, get feedback on how the platform will eventually play out, and so on. We might still need critical feedback, for which we would develop two different designs and ask workers to give us relative feedback.


This part of the prototype mainly deals with open governance design aspects. We sincerely hope that the methods suggested above will help build a better crowd platform, for a better world where crowd workers are represented and respected.