WinterMilestone 9 AlgorithmicHummingBirds


SUMMARY OF WINTER MEETING 9

This week, the goals and discussion across all three domains focused on refining the front-end design and the back-end implementation.

1. TASK AUTHORSHIP

The goal was to figure out why the pilot study results were not favorable. A second pilot study was launched in which everyone together authored one dark matter task, and this showed high variance. The task dataset has been released to requesters. If the variance turns out to be low, the plan is to design high-quality HITs covering all the edge cases and then rank them something like:

[Illustrative scale: tasks ranked from HIGH QUALITY at the top, through AVERAGE QUALITY, down to POOR QUALITY.]

Those in locations where Mechanical Turk is unavailable can reach out to their counterparts in other parts of the world and use their logins.

For the next pilot study, a gold standard task will be authored so we can then study the role of variance.
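To make the variance criterion concrete, here is a minimal sketch (with a hypothetical 1-5 rating scale and made-up thresholds, not values anyone has agreed on) of how HITs could be bucketed into the quality tiers above once their rating variance is low enough to trust the mean:

  # Minimal sketch with hypothetical data shapes: bucket authored HITs into
  # quality tiers once their rating variance is low enough to trust the mean.
  from statistics import mean, pvariance

  def rank_hits(ratings_by_hit, max_variance=1.0):
      # ratings_by_hit maps hit_id -> list of numeric quality ratings (1-5).
      # Returns hit_id -> tier, flagging HITs whose raters disagree too much.
      tiers = {}
      for hit_id, ratings in ratings_by_hit.items():
          if pvariance(ratings) > max_variance:
              tiers[hit_id] = "UNRELIABLE (high variance)"
              continue
          avg = mean(ratings)
          if avg >= 4.0:
              tiers[hit_id] = "HIGH QUALITY"
          elif avg >= 2.5:
              tiers[hit_id] = "AVERAGE QUALITY"
          else:
              tiers[hit_id] = "POOR QUALITY"
      return tiers

  print(rank_hits({"hit-1": [5, 4, 5], "hit-2": [1, 5, 3], "hit-3": [2, 2, 1]}))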

2. TASK RANKING

The goal last week was to write out a specific system description of extended Boomerang and its effect on the task feed. The focus was on designing the information-gathering aspect of the front-end interface and also thinking through its back-end implementation. This group has been doing a great job working on the timer, the data collection mocks, the back-end rejection rate, and the continued refinement of the system and methodology.

3. OPEN GOVERNANCE

The goal is to pull together a concrete design.

4. DESIGN TEST FLIGHT

We have a timer that measures the approximate amount of time it takes to complete a particular task, and Boomerang would use this information to sort future tasks for the maximum benefit of the end user.

As of now, the features of the timer are as follows:

  • It can be hidden so it is not constantly in the worker's face, creating unnecessary pressure to finish quickly with low quality, which would hurt their reputation within the system.
  • It can be paused, during which the task is also disabled.
  • It starts as soon as the task is accepted and stops as soon as the task is submitted.
  • Workers can alter the recorded duration after submitting the task. However, this alteration is no longer possible once the requester accepts the submission. This is simply a feedback mechanism toward requesters.
  • A value statement explains why exactly we need the time and what we are doing with it.

The goal this week is to work on the data models, back end, front end, and so on.
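As a starting point for the data-model discussion, here is a rough sketch of one way the timer state could be modelled, reflecting the features listed above; the class, field, and method names are hypothetical, not Daemo's actual schema:

  # Hypothetical sketch of a timer data model (names are not Daemo's schema).
  import time
  from dataclasses import dataclass

  @dataclass
  class TaskTimer:
      task_id: str
      hidden: bool = False                   # timer can be hidden from the worker
      started_at: float | None = None        # set when the task is accepted
      stopped_at: float | None = None        # set when the task is submitted
      paused_seconds: float = 0.0            # time spent paused (task disabled meanwhile)
      pause_started: float | None = None
      worker_reported: float | None = None   # worker's post-submission correction
      requester_accepted: bool = False

      def start(self):
          self.started_at = time.time()

      def pause(self):
          self.pause_started = time.time()

      def resume(self):
          self.paused_seconds += time.time() - self.pause_started
          self.pause_started = None

      def stop(self):
          self.stopped_at = time.time()

      def recorded_seconds(self):
          return (self.stopped_at - self.started_at) - self.paused_seconds

      def report_duration(self, seconds):
          # Workers may adjust the duration, but only until the requester accepts.
          if self.requester_accepted:
              raise ValueError("duration can no longer be edited")
          self.worker_reported = seconds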

ABSTRACT

The current generation of crowdsourcing platforms is surprisingly flawed, and these flaws are often overlooked or sidelined.

So the idea is to channel effort toward guilds and experiment to see to what extent this helps minimize some of these issues, by building a next-generation crowdsourcing platform, Daemo, integrated with a reputation system, Boomerang. The improved platform is expected to yield better results within limited time bounds (compared to existing platforms) and, definitely, to be a more efficient and representative crowd platform.


Author Keywords

Crowdsourcing; Daemo; Boomerang; guilds; reputation; timer


“TIME & PAY”

Let's say we fix the rate for a task at $x/hr. Would this affect quality? Would a worker rush to finish it earlier in order to move on to other tasks (perhaps ones paying more than $x/hr), or would they slow down and stretch the work out so they get more money for spending more time on the same task?

One option to resolve this is to crowdsource the price. The requester authors the task and the tagging system tags it. Only the brief task description is then released to the guild workers of the corresponding domains, who are asked “how much is the task worth?”; let's suppose they say $x (this may be considered the average over guild and non-guild workers, depending on the implementation). Now let's ask the requester “how much is the task worth?” and suppose he/she says $y. We then take the average of $x and $y, and that is the price of the task. This would bring in the equal-representation aspect that is the main essence of the Open Governance forum.

Another variation could be to consider the guild and non-guild scenarios separately. Say the guild members say $x, non-guild members say $y, and the requester says $z. The final price is then either an average of all three, or a weighted combination based on the representation ratios.
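As a concrete illustration of these two variants, here is a small sketch; the representation-ratio weights are placeholders, not values anyone has decided on:

  # Sketch of the two pricing variants described above.
  def simple_average_price(worker_estimate, requester_estimate):
      # Variant 1: the task price is the average of the crowd's estimate ($x)
      # and the requester's estimate ($y).
      return (worker_estimate + requester_estimate) / 2.0

  def weighted_price(guild_estimate, non_guild_estimate, requester_estimate,
                     weights=(1/3, 1/3, 1/3)):
      # Variant 2: combine guild ($x), non-guild ($y), and requester ($z)
      # estimates using representation-ratio weights that sum to 1.
      wg, wn, wr = weights
      return wg * guild_estimate + wn * non_guild_estimate + wr * requester_estimate

  print(simple_average_price(6.0, 4.0))                   # 5.0
  print(weighted_price(6.0, 5.0, 4.0, (0.4, 0.2, 0.4)))   # 5.0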

Another option we could explore is as follows. When we ask for feedback about how much money a task is really worth, we could take the estimates individually as x1, x2, x3, ..., xn (within the guild) and weight each by the reputation of the worker within the guild. A worker at the top of the guild (with a higher ranking) would then have a bigger say in the platform than a worker who is ranked lower, which would motivate workers to climb higher in the reputation rankings.
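A minimal sketch of this reputation-weighted option, assuming each guild member's reputation is a positive score (the numbers below are invented):

  # Each guild member's price estimate counts in proportion to their
  # reputation score, so higher-ranked workers have a bigger say.
  def reputation_weighted_price(estimates, reputations):
      # estimates: [x1, x2, ..., xn]; reputations: matching positive weights
      return sum(x * r for x, r in zip(estimates, reputations)) / sum(reputations)

  # The worker with reputation 9 pulls the price toward their estimate of $4:
  print(reputation_weighted_price([4.0, 5.0, 8.0], [9, 3, 1]))   # about 4.54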

One other option is to fix the price at $x/hr. We then time each individual worker and calculate the average time spent on the task (considering everyone, within the guild or outside, who has attempted it). We correlate quality with time spent and map the workers onto a scale relative to the average. We then have to pay all the workers whose work has been accepted, tweaking x for each individual based on how much time he/she spent and the quality of the work produced.


[Illustrative scale: workers mapped relative to the average time, with Worker#i above average (offset of roughly 0.2), Worker#k slightly above (roughly 0.03), the AVERAGE CASE in the middle, and Worker#c below average (roughly 0.5).]

Please note that the above attempt is not to scale and is for illustration purposes only.

Now there are 4 possible scenarios:

  • Poor quality & more time spent: this would be the worst-case scenario.
  • Poor quality & less time spent: this could be considered the average cause-and-effect kind of scenario.
  • High quality & more time spent: this would be a relatively well-paid scenario.
  • High quality & less time spent: this could be considered the best scenario.

So workers would try to optimize the time factor without sacrificing quality, and the best-case league is the one they would work to fit into (which is exactly what we want).
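Putting these pieces together, here is an illustrative sketch of the per-worker pay tweak; the speed and quality scaling factors below are invented for illustration and would have to be chosen (and validated) experimentally:

  # Start from a fixed base rate ($x/hr), then scale it by where the worker
  # sits relative to the average completion time and by the quality of the
  # accepted work. Both scaling factors are made up for illustration.
  def adjusted_rate(base_rate, time_spent, average_time, quality):
      # quality is assumed to be a score in [0, 1]
      speed_factor = average_time / time_spent    # > 1 if faster than average
      quality_factor = 0.5 + quality              # 0.5 (poor) .. 1.5 (high)
      return base_rate * speed_factor * quality_factor

  base = 10.0   # $/hr
  avg = 30.0    # average minutes spent by everyone who attempted the task
  print(adjusted_rate(base, 20.0, avg, 0.9))   # high quality, less time: best case
  print(adjusted_rate(base, 40.0, avg, 0.2))   # poor quality, more time: worst case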

“TIMER”

Consider a subset of the tasks (which would be 100% of all prototype tasks); we ask the worker how much time he/she actually spent.

Let's analyze this now.

Pros:

  • The worker would feel represented because they get a chance to explain their situation.
  • This prevents speedster behavior, at least to some extent.

Cons:

  • The worker might feel the need to rush because the clock is ticking, and it could start to feel more like a video game.

One workaround for this (at least to some extent) is the hide/un-hide button on the timer.

  • We might need to consider the impact on cognitive load.

However, this is expected to be fairly low.

GENERAL ANALYSIS

Let's say you are a worker x. You belong to some guild, you tried your hand at a task, and the timer recorded a time of y, but you felt it took you z (greater or less than y).

a. How do we verify z? (We could track the worker, but that could be an invasion of privacy.)

b. Do we really need z? (I say this because we already have a built-in workaround: the timer can be paused when the worker is not working on the task, and the task is also disabled during this time.)

c. For the Boomerang prediction model, are we going to take y, z, or an average? If we take an average, how far off is it going to be? How would the various parameters measure up in this case?

d. How exactly would this affect the worker? I understand that the worker's task feed would be optimized to their abilities to help them earn the maximum amount. Suppose we build the task feed from y, then from z, then from the average of y and z. How different are these feeds really going to be? Is there a significant difference, or would the deviation be negligible?
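One way to answer (d) empirically is to build the feed under each assumption and compare the orderings. A hypothetical sketch, assuming the feed is simply ranked by estimated earnings per hour:

  # Build the task feed three times: from the recorded time y, from the
  # worker-reported time z, and from their average, then compare the orderings.
  def feed_order(tasks, hours_estimate):
      # Rank tasks by estimated earnings per hour, using whichever per-task
      # time estimate (in hours) we decide to trust.
      return sorted(tasks, key=lambda t: t["pay"] / hours_estimate[t["id"]],
                    reverse=True)

  tasks = [{"id": "a", "pay": 2.0}, {"id": "b", "pay": 1.0}, {"id": "c", "pay": 3.0}]
  y = {"a": 0.5, "b": 0.2, "c": 1.0}             # timer-recorded hours
  z = {"a": 0.8, "b": 0.2, "c": 0.7}             # worker-reported hours
  avg = {k: (y[k] + z[k]) / 2 for k in y}

  for name, est in [("recorded y", y), ("reported z", z), ("average", avg)]:
      print(name, [t["id"] for t in feed_order(tasks, est)])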

But this doesn't exactly change their actual earnings. Over- or under-reporting would, however, make the whole system less useful to me. If we find (somehow) that this is happening consistently, we could block that worker from prototype tasks or something of the sort.

But we will only know through experimentation on different types of tasks: dark matter, prototype, gold standard, and so on.

OTHER FEEDBACK MECHANISMS

One way would be to display the time we recorded, say y, and ask the worker whether we got it right or whether it was greater or less, and if so by how much; from that we compute z.

Rather than asking the worker for the exact value, we could ask for a range. That would be less precise, but the chance of him/her accidentally getting it wrong would be minimized. We could give them options to choose from (these would be scientifically designed, equally spaced intervals).

We could separately report the average of the unedited time, the average of the edited time, and a combination of the two.
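A small sketch of the range-based variant, using an arbitrary five-minute interval width (the real spacing would be the scientifically designed one mentioned above):

  # Show equally spaced interval options around the recorded time y and turn
  # the worker's pick back into an estimate z (the interval midpoint).
  def interval_options(recorded_minutes, width=5, n_options=5):
      start = max(0, int(recorded_minutes) - (n_options // 2) * width)
      return [(start + i * width, start + (i + 1) * width) for i in range(n_options)]

  def estimate_from_choice(interval):
      low, high = interval
      return (low + high) / 2.0

  options = interval_options(recorded_minutes=22)
  print(options)                           # [(12, 17), (17, 22), (22, 27), (27, 32), (32, 37)]
  print(estimate_from_choice(options[2]))  # 24.5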

PROMOTION

There could be automatic threshold barricades that a worker needs to cross in order to get promoted.



[Illustrative ladder: each promotion level is gated by a points threshold and a tasks threshold, e.g. PROMOTION #i+1 and #i at x points and y tasks, PROMOTION #i-1 at x' points and y' tasks, PROMOTION #i-2 below that, and so on down the ladder.]
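A minimal sketch of such an automatic threshold check, with placeholder point and task values (not real Daemo thresholds):

  # Ladder of promotion levels, each gated by a points threshold and a
  # completed-tasks threshold; a worker is promoted to the highest level
  # whose barricades they have crossed.
  PROMOTION_LADDER = [
      (1, 100, 10),     # (level, min_points, min_tasks)
      (2, 300, 40),
      (3, 800, 120),
  ]

  def promotion_level(points, tasks_completed):
      level = 0
      for lvl, min_points, min_tasks in PROMOTION_LADDER:
          if points >= min_points and tasks_completed >= min_tasks:
              level = lvl
      return level

  print(promotion_level(points=350, tasks_completed=50))    # 2
  print(promotion_level(points=900, tasks_completed=30))    # 1 (tasks barricade not crossed)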

We could also have tasks reviewed and peer-reviewed, with promotion based on the ranking and the points and badges received. Promotion here would be more of a social decision (and they would be paid for it).

Or we could have automatic ranking systems, where promotions are also largely automatic (based on rank) and manifest as reputation within the system.

Or we could have third-party decisions, where a person performing well consistently would be considered for promotion.

Any of these could be considered, but we might need to make a collective decision on which to choose, ideally supported by substantial experimentation and data.

WHAT DOES THE GUILD NEED?

We could build a Discourse-style system with flair levels and a Slack-like organization with management options, and so on.

MILESTONE CONTRIBUTIONS

@gagana_b