Winter Milestone 6 Algorithmic Hummingbirds




We are six weeks into the Stanford Crowd Research Collective. This week we build on the research process of week 5 by detailing specific aspects of the core idea and configuring it for particular functionality in three domains: task authorship, task ranking, and open governance.

To summarize the discussion in all three domains:


The synthesis on task authorship covers how much of the variation in task design and in the results obtained is contributed by the requester and by the worker, respectively, and what kinds of interventions can help in this regard.

The goal here is to align and converge visions about what the tasks are and how we measure quality (subjective feedback), and to explore auto-grading policies.

We thought about the variance with respect to the worker and the requester. But are there external factors or dependencies that affect task quality and this variance? For example, do workers who rely on platforms like Daemo for their livelihood (such as people who have lost well-paying jobs or who are differently-abled) necessarily do better than people who use the platform as a source of side income?

How would we scale requester variance?

a. We look at tasks that can be evaluated in a binary way (correct/incorrect, true/false, etc.). These are tested against a pre-generated set of possible answers used for auto-grading, and we expand further once the foundations of the system are firmly set.

b. How much requester variance exists with respect to very common tasks?

c. We look at systems designed to extract answers from a page.

d. We also looked at a back-off strategy that applies sentiment analysis to data (for which we can take advantage of our own crowd and check empirical variance before testing outside this crowd body). For this purpose we use square benchmarked tasks (tasks used to standardize crowd-sourcing algorithms, so to speak). So, can poorly designed tasks give us insight into human interpretation and psychology that could be used recursively to refine task design and understand the workers better, and can this data provide crucial information to other research domains? Some of the major challenges would be storing such a huge amount of data, applying inferential theories to it (which might require preprocessing to bring it into a structured form), and the overhead involved in making it work not just for people of a particular geography or literacy level but for a global audience.
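The auto-grading idea in item (a) can be sketched as follows. This is a minimal illustration, assuming each task has a pre-generated set of accepted answers; the names `normalize`, `auto_grade`, and `accuracy` are hypothetical and not part of Daemo.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def auto_grade(submission: str, accepted_answers: set[str]) -> bool:
    """Return True if the submission matches any pre-generated accepted answer."""
    return normalize(submission) in {normalize(a) for a in accepted_answers}


def accuracy(submissions: dict[str, str], answer_key: dict[str, set[str]]) -> float:
    """Fraction of submitted tasks graded correct; tasks missing from the key are skipped."""
    graded = [
        auto_grade(answer, answer_key[task_id])
        for task_id, answer in submissions.items()
        if task_id in answer_key
    ]
    return sum(graded) / len(graded) if graded else 0.0
```

Exact matching after normalization only suits binary or closed-answer tasks; open-ended tasks would need a different grading policy.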

One experiment that was done had many requesters author several different tasks and then rate the submissions according to which authoring brought results closest to the expected output, to study how authorship affects the experience of different types of users. Variations of this general idea include:

a. Have a group of 10 workers and 10 requesters, where each requester authors 5 tasks, for a total of 50 tasks. Each worker attempts all the tasks, and we then study the variance using this data.

b. Consider a group of 20 requesters and 4 random primitive tasks. Increase the level of guidance on each of these tasks: no guidance on the first task, the first level of guidance on the second task, and so on. Use this data to study variance and author tasks better.
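For the fully crossed design in variation (a), a rough first look at the two sources of variance could compare how much the mean score differs across requesters with how much it differs across workers. The sketch below assumes a numeric quality score per submission and a hypothetical `scores` layout; it is a descriptive comparison, not a full ANOVA or mixed-effects analysis.

```python
from statistics import mean, pvariance


def variance_by_factor(scores: dict[tuple[str, str], list[float]]) -> dict[str, float]:
    """Compare the spread of per-requester means with the spread of per-worker means.

    `scores` maps (requester, worker) to the quality scores that worker
    obtained on that requester's tasks (hypothetical layout).
    """
    by_requester: dict[str, list[float]] = {}
    by_worker: dict[str, list[float]] = {}
    for (requester, worker), values in scores.items():
        by_requester.setdefault(requester, []).extend(values)
        by_worker.setdefault(worker, []).extend(values)

    requester_means = [mean(v) for v in by_requester.values()]
    worker_means = [mean(v) for v in by_worker.values()]
    return {
        "requester": pvariance(requester_means),
        "worker": pvariance(worker_means),
    }
```

If the requester figure dominates, task authoring explains more of the spread in results than worker ability does, and vice versa.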

We could further explore the idea of design templates (preferably different ones for different types of tasks). We could experiment with two groups, one that uses this feature and one that doesn't, and see whether result quality varies significantly. Further, we could allow requesters to reuse their previous designs and templates for future tasks.

But the underlying question is: can we stick to these fundamentals and design the whole outline or structure of Daemo, or do we need to base it on more complex ideas?

To answer part of the above question: simple tasks will show little variation (no matter how little guidance a requester gives or how badly they put their point across, an average worker will be able to figure it out). Authoring should be critical for more difficult tasks. However, when we conduct a study of this sort, we should consider tasks across varying levels of difficulty (accounting for a relative perspective). Note that we identify good and bad requesters using the coefficient of determination, r^2, of a linear regression.

From our experience with Amazon Mechanical Turk and similar platforms, we know that workers must meet certain qualification requirements to attempt certain types of tasks. Would it make sense to have similar qualification or experience constraints for requesters as well? Then only a domain expert (with specified, or rather verified, qualifications or work experience in the domain) could post an advanced task. This could discourage uninterested requesters who do not follow up on a task (in which case the worker would have wasted a lot of time, given that the task is hard and domain-specific). Would that improve requester quality?
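The r^2 statistic mentioned above can be computed directly for a simple one-variable linear regression. The text does not say which variables are regressed, so the sketch below assumes paired observations `xs` and `ys` per requester (for example, guidance level against result quality); that pairing is an assumption made for illustration only.

```python
def r_squared(xs: list[float], ys: list[float]) -> float:
    """Coefficient of determination for a least-squares line fitted to (xs, ys)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # If either variable is constant, the fit explains nothing by convention here.
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)
```

For a single predictor, r^2 equals the squared Pearson correlation, which is what the last line computes.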


We looked at designing an interactive task feed: what might it look like, and what data would feed it? Would it be driven by algorithms, machine learning, or artificial intelligence?

Many static features of the task feed, including category, text, price, etc., have already been used to match worker profiles and to build recommendation systems. So what can we do differently in Daemo? (We could certainly look at dynamic features, including user interaction statistics.)

Some of the suggestions include:

a. A worker's overall rating determined by peer feedback, so that requesters can use this rating to find appropriate workers for their tasks. But the prevailing problem is how to ensure fairness. For example, in my personal experience I haven't fared very well in MOOCs (massive open online courses) involving peer feedback, for reasons that remain mysterious to me. Could we instead consider an average of the peer rating and the requesters' rating?

b. Prioritize HITs of the kind that workers have successfully completed in the past. In addition, we can generalize or extrapolate to tasks in similar or closely related domains.

c. We could base the feed on requester task quality, wages, and interest.

d. A feed for requesters, giving them early access to workers they rated highly in the past.

e. We can have requesters tag tasks while they author them. Alternatively, could we use machine learning techniques such as deep convolutional neural networks, with previous or pre-generated data as a training set, to build an auto-tagging system integrated with Boomerang for Daemo?

f. We extend Boomerang to choose workers with high ranks in the given task category.
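Suggestions (a) and (b) can be combined into a small sketch: average the peer rating with the requester rating, and order a worker's feed so that categories they have completed before come first. The task fields (`category`, `wage`) and function names are hypothetical, and this is not how Boomerang itself ranks tasks.

```python
from statistics import mean


def combined_rating(peer_ratings: list[float], requester_ratings: list[float]):
    """Average the mean peer rating and the mean requester rating.

    If only one source has ratings, that source is used alone;
    returns None when there are no ratings at all.
    """
    source_means = [mean(r) for r in (peer_ratings, requester_ratings) if r]
    return mean(source_means) if source_means else None


def rank_feed(tasks: list[dict], completed_categories: set[str]) -> list[dict]:
    """Put tasks in previously completed categories first, then sort by wage (highest first)."""
    return sorted(
        tasks,
        key=lambda task: (task["category"] not in completed_categories, -task["wage"]),
    )
```

Averaging the two source means, rather than pooling all ratings, keeps a large number of peer ratings from drowning out a handful of requester ratings.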


Refine the guild to a minimalistic structure designed in such a way that it has features testable by a conference.

With respect to open governance, a lot remains to be discussed about guild structures: what a guild is, what problem it is trying to solve, the implementation details, and so on.


The current generation of crowd-sourcing platforms is surprisingly flawed, and these flaws are often overlooked or sidelined.

So, the idea is to channel efforts in the direction of guilds and experiment to see to what extent this helps minimize some of these issues by building a next-generation crowd-sourcing platform, Daemo, integrated with a reputation system, Boomerang. The improved platform is expected to yield better results within limited time bounds (compared to existing platforms) and to lead to more efficient and representative crowd platforms.

Author Keywords

Crowd-source; Daemo; Boomerang; Guilds; sub guild systems


Crowd-sourcing is typically defined as the process of obtaining services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers. So, what exactly happens on these crowd-sourcing platforms? For example, Amazon Mechanical Turk (popularly MTurk) is a crowd-sourcing Internet marketplace that enables individuals and businesses (requesters) to coordinate the use of human intelligence to perform tasks, known as Human Intelligence Tasks (HITs), that computers are currently unable to do. In the current scenario, requesters are unable to ensure high-quality results, and workers are unable to work conveniently. The current generation of crowd-sourcing platforms, such as TaskRabbit and Amazon Mechanical Turk, does not ensure high-quality results, produces inefficient tasks, and suffers from poor worker-requester relationships. To overcome these issues, we propose a new standard (the next de facto platform), Daemo, which includes Boomerang, a reputation system that introduces alignment between ratings and likelihood of collaboration.


A lot of us were confused about what guilds really mean or what they really do. The moment the topic came up, many of us had confused expressions like “Guild what?” or “Guild? I thought we were talking about crowd-sourcing!”

So, let us resolve the guild issue by delving into a little more detail. We take a bottom-up approach to the guild system: we first focus on the fundamentals (what guilds really are, what problem they are trying to solve, and how they would fit into crowd-sourcing), and later advance to the implementation details, the pros and cons inherent in such a system, and so on.

A guild is basically defined as a medieval association of craftsmen or merchants, often having considerable power. In our context, a guild would be a highly connected, networked group of people working together, preferably in a single domain or on similar tasks.

Now, what problem is the guild really trying to solve? Guilds must be configured for different levels of members, so good-quality workers would be at the top of the guild, which in itself could serve as a validator of the work. The guild could potentially address task rejections (if a worker with a high rating in the guild has done the work, it is unlikely that they didn't put in the required effort; and feedback on rejection is compulsory), multiple iterations, fair wages (per industry standard), a safeguard mechanism for creative work, especially design prototypes (work reaches the requester directly; what's done within the guild essentially stays within the guild), work security (you compete and collaborate within the guild, which essentially means within your domain), equal opportunity (as explored below), job satisfaction, and so on.

Who can really form a guild? One suggestion was that when 100 or more requesters author tasks in a similar or closely related field, it becomes a guild. Another option is that when a worker repeatedly does well in a particular domain and proves their potential, they are added to a guild of that domain, i.e., automatic user classification based on domain performance. Users don't control the guild; rather, they evolve with the guild and don't define expertise among themselves.

Can we have multiple competing guilds? That idea seems baseless on the surface, but think of guilds where the population soars. So we could have multiple guilds, but we would have to ensure equal opportunities and constant shuffling of members to encourage collaboration.

So, do all workers need to belong to a guild? I would say not necessarily. They could move into and up a guild once they demonstrate approved domain knowledge or experience. But if we treat guilds like a company or organizational structure, it's not collective open governance anymore.
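The automatic classification idea could look roughly like the sketch below: a worker is placed in a domain's guild once they have repeatedly done well there. The thresholds (`min_tasks`, `min_mean_rating`) and the `(domain, rating)` history format are assumptions for illustration; the text does not specify them.

```python
from statistics import mean


def guild_memberships(
    history: list[tuple[str, float]],
    min_tasks: int = 5,            # hypothetical: how many tasks count as "repeatedly"
    min_mean_rating: float = 4.0,  # hypothetical: what counts as "doing well"
) -> list[str]:
    """Return the domains whose guild a worker would be added to automatically.

    `history` is a list of (domain, rating) pairs, one per completed task.
    """
    ratings_by_domain: dict[str, list[float]] = {}
    for domain, rating in history:
        ratings_by_domain.setdefault(domain, []).append(rating)

    return sorted(
        domain
        for domain, ratings in ratings_by_domain.items()
        if len(ratings) >= min_tasks and mean(ratings) >= min_mean_rating
    )
```

Requiring both a task count and a mean rating keeps a single lucky submission from granting membership.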

Needless to say, this system can be implemented into Boomerang and Daemo in myriad ways. One possible variation could be as described below.

Consider a guild system as a connected group of workers working in a similar domain. When a task is first published, we assume that an auto-tagging engine is in place and the task is tagged. The auto-tagging feature would work something like this: it is built using machine learning, artificial intelligence, neural networks, and related techniques. We ask requesters to tag tasks manually at first; the machine “learns” from these tags, and then manual tagging is scrapped, because the machine automatically knows what domain a particular task belongs to.

It is not possible, or even theoretically correct, to assume that this system is one hundred percent accurate. To work around this, we introduce a few design interventions. The default tag of any given task is general. If the auto-tagging system fails to recognize the domain of a particular task, or the author specifies no explicit constraints on the qualifications of the worker who attempts the task, then the task is open to anyone who wishes to attempt it (given that Boomerang is going to be improved to filter out spam workers by studying their user interaction). If the task is tagged under one or more domains, then we open up the task to the guild or channel of that domain first. It moves outside that guild only if it is not completed within the specified time frame. An experiment in this direction may quantify or falsify the hypothesis about the effect on output quality.

The clear disadvantage is that one may think the distribution of opportunity is unfair or that the restriction of tasks is prejudiced, but let us assume for now that all domains and general-category tasks are more or less equal in number, as they should eventually turn out to be. Also, what if a task requires collaboration between two or more domains, or is tagged under multiple domains but doesn't really require that sort of collaboration? These ideas are explored later in this document.
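The routing rule described above (default tag general, guild first, then everyone once the time frame lapses) can be sketched as a single function. The function name, the hour-based time frame, and the representation of audiences as a set of tags are all assumptions made for illustration.

```python
GENERAL = "general"  # default tag for any task


def eligible_audience(
    task_tags: list[str],
    hours_since_posted: float,
    guild_window_hours: float,  # hypothetical: the "specified time frame"
) -> set[str]:
    """Return which audiences may currently attempt a still-uncompleted task.

    Untagged (or general-only) tasks are open to everyone. Domain-tagged
    tasks go to the matching guilds first and open up to the general
    audience only after the guild window has passed.
    """
    domains = {tag for tag in task_tags if tag != GENERAL}
    if not domains:
        return {GENERAL}
    if hours_since_posted >= guild_window_hours:
        return domains | {GENERAL}
    return domains
```

A task tagged under several domains is simply offered to all of the matching guilds at once here; whether it truly needs cross-domain collaboration is left open, as in the text.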

Talking about guilds and their computational capability, we can have requesters interact with one representative of a guild community (but do equal power distributions work better? What about guilds with a larger population?). Tasks are distributed among guild members. Collaboration and transparency are encouraged within the guild (of course, interactions need to be monitored to prevent security issues or the sharing of answers).

Daemo workers can essentially visualize leveling up as they gain expertise and experience.

Using the whole system, could we reproduce online or social learning on Daemo?


We can configure the guild to clone a Stack Overflow kind of interface, where a pre-established system within the guild helps users manage complicated tasks. Essentially, the major guild would be composed of sub-guilds (which are just smaller units of guilds) that work in collaboration and are also quite manageable.


We could explore a professional association model that builds and classifies users like a skill tree, where people are organized into a pyramid-like structure: people with the highest skill levels sit at the top and people with the lowest skills (or perhaps new workers) come somewhere near the bottom. We could expect to find more people at intermediate levels than at the top or bottom. Of course, if that's true, then it wouldn't be a pyramid structure anymore, but the analogy is used here for illustration purposes only.


We said we would encourage collaboration within the guild. But wouldn't involving too many people from one or more domains in a single task cause confusion and hence delay? Can we optimize some ratio in this regard? Can we implicitly or explicitly weight the tags before optimizing this ratio?


You give feedback to members within your guild, and to members across domains, with respect to a particular task you worked on. Then we apply Boomerang within the guild, where the feedback rating you give someone directly affects your chances of working with them again.


This part of the paper mainly deals with open governance design aspects, especially guilds. We sincerely hope that the methods suggested above will help build a better crowd platform, for a better world where crowd workers are represented and respected.