WinterMilestone 5 DavidThompson

Some initial framing thoughts

The requester variance we are looking to explore and explain is not limited to task articulation in a crowd working environment. The problem of articulating a task that connects an input and an expected output (e.g. an audio file and a grammatically correct transcription of that file), but that is mediated by an 'other' (i.e. I explain the task at hand, but you do the work), must be crucial to any managerial or organizational context. One wonders (i) whether this has already been explored in the organizational literature and (ii) whether it is grounded in cognitive psychology. Specifically, how are we evaluating the aggregate ability of requesters to derive translatable cognitive models of tasks for subsequent completion by (other) workers?

The remainder of this methods section is grounded in prototype theory, an area of cognitive psychology. This area is completely new to me; I found it while researching the questions above. I think it may be useful for the task at hand because (and I take this directly from -

"An efficient way to represent concepts would be to keep only the critical properties of a concept. This set of critical properties is sometimes called a prototype or schema. The idea of prototypes is that a person has a mental construct that identifies typical characteristics of various categories. When a person encounters a new object, he/she compares it to the prototypes in memory. If it matches the prototype for a chair well enough, the new object will be classified and treated as a chair. This approach allows new objects to be interpreted on the basis of previously learned information. It is a powerful approach because you do not need to store all previously seen chairs in long-term memory. Instead, only the prototype needs to be kept."

Methods (for task authorship write up)

Study introduction

STUDY 1: We explore the effect of providing requesters with categorical prompts that elucidate the central and critical components of well-articulated primitive tasks. The task itself is then evaluated, as is the worker output. As a control condition, a sample of requesters is simply given the task creation activity with no exposure to the critical components.

Study method

Method: Study 1 and all subsequent experiments reported in this paper were conducted using the Amazon Mechanical Turk platform; users upload HTML task files, workers choose from a marketplace listing of tasks, and data is collected in CSV files. We restricted workers to those residing in the United States. Across all studies, 40 unique workers completed 160 tasks. A follow-up survey revealed that approximately Y% were female.

Method specifics and details

Primitive Crowdsourcing Task Types and Critical Components

In a previous study [1] 10 primitive tasks were identified as appearing in most crowdsourcing workflows. For each of the 10 primitive tasks we identified three critical components that elucidated the underlying task. These are presented as questions, for consideration by the requester, as they are designing the task. They appear as prompts within a task design interface (see Figure 1 below). As an example, for the binary choice primitive, the critical components are:

1. Have you clearly and unambiguously articulated the label(s) to be considered?

2. Is it clear which image(s) are to be tagged?

3. Have the choice criteria the worker should apply been clearly articulated for deciding when a tag can meaningfully be associated with an image?

We note that more objective tasks require less complexity in the description of their critical components. This reflects the larger number of edge cases, or the greater underlying cognitive difficulty, of more subjective tasks. The critical components for each primitive can be found in Table 1 (below).

Experimental Design for the study

We presented 20 requesters with a mixed series of four primitive tasks drawn from the ten primitives and manipulated two factors: the primitive, and the number of critical component prompts (0, 1, 2, or 3) that requesters were exposed to during task creation. The primitive tasks were presented in a randomized order, and each requester was given an increasing number of critical component prompts with each successive task.

For example, the 1st requester was assigned primitive task 4 (Tag) and given 0 critical component prompts during task creation. This requester was then assigned primitive task 7 (Transcribe) and given 1 critical component prompt, and so on. Each requester authored 4 primitive tasks, with an increasing amount of guidance in the form of exposure to critical component prompts.
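The assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the primitive numbering 1 to 10, the fixed random seed, and the choice to give each requester four distinct primitives are not specified in the design, and `build_schedule` is a hypothetical helper name.

```python
import random

NUM_REQUESTERS = 20
NUM_PRIMITIVES = 10           # assumed numbering 1..10 for the primitives in [1]
PROMPT_LEVELS = [0, 1, 2, 3]  # critical component prompts shown per task


def build_schedule(num_requesters=NUM_REQUESTERS, seed=0):
    """Return {requester_id: [(primitive_id, prompt_count), ...]}."""
    rng = random.Random(seed)
    schedule = {}
    for requester in range(1, num_requesters + 1):
        # Assumption: each requester gets four distinct primitives, in random order.
        primitives = rng.sample(range(1, NUM_PRIMITIVES + 1), len(PROMPT_LEVELS))
        # Pair them with prompt counts in increasing order: 0, 1, 2, 3.
        schedule[requester] = list(zip(primitives, PROMPT_LEVELS))
    return schedule


schedule = build_schedule()
for primitive, prompts in schedule[1]:
    print(f"requester 1: primitive {primitive}, {prompts} prompt(s)")
```

With 20 requesters and 4 tasks each, this yields 80 authored tasks in total.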

Measures from the study

Once the requesters had finished authoring their tasks, we launched the tasks in batches through Amazon Mechanical Turk in 2 phases.

In phase 1, the individual tasks from each requester were sent as an intact batch to a unique worker, who was rewarded for completing all tasks in the batch. In phase 2, 20 randomized batches of tasks were created, each containing a control task (authored with no exposure to critical components) and the revised tasks (authored with exposure to 1, 2, and 3 of the critical components for the primitive task). The batches were given to a separate group of 20 workers, who were rewarded for completing all tasks in their batch. In all phases, workers were unaware of whether they were attempting a revised or a control task.
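One possible way to build the phase 2 batches is sketched below. It assumes each batch holds exactly one task per prompt level (0 to 3) and that tasks are dealt out by shuffling the requesters independently for each level; the text does not say whether a batch may contain two tasks from the same requester, so this sketch allows it. `build_phase2_batches` is a hypothetical name.

```python
import random

PROMPT_LEVELS = [0, 1, 2, 3]  # 0 = control, 1-3 = revised


def build_phase2_batches(num_requesters=20, seed=0):
    """Return a list of batches; each batch is a list of (requester_id, prompt_count)."""
    rng = random.Random(seed)
    batches = [[] for _ in range(num_requesters)]
    for level in PROMPT_LEVELS:
        # Shuffle requesters independently per level so batches mix authors.
        requesters = list(range(1, num_requesters + 1))
        rng.shuffle(requesters)
        for batch, requester in zip(batches, requesters):
            batch.append((requester, level))
    for batch in batches:
        # Randomize task order so position does not reveal the condition.
        rng.shuffle(batch)
    return batches


batches = build_phase2_batches()
print(len(batches), "batches of", len(batches[0]), "tasks")
```

Under these assumptions each phase covers 20 batches of 4 tasks, i.e. 80 task completions per phase, which would be consistent with 40 unique workers completing 160 tasks if each phase uses 20 workers.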

Subsequent to these 2 phases, two independent groups of requesters saw two groups of submissions for each item in each task from each phase: one from the revised tasks and one from the control tasks, randomized and blind to condition. They were asked which group produced higher-quality results, both according to their own taste and by comparison with the ground truth. Finally, the requesters were asked to complete a survey regarding their impressions of the tasks.

What do we want to analyze?

For each phase, and for each critical component exposure treatment (1, 2, or 3), we performed a chi^2 test comparing the number of tasks for which requesters preferred the results from the treatment with the number for which they preferred the results from the control. Revised (exposed) groups were chosen [ ] than the control group, indicating that they produced [higher/lower]-quality results.

In addition, we created an aggregate class of tasks given any critical component exposure treatment (1, 2, or 3) and performed a chi^2 test comparing the number of tasks for which requesters preferred the results from the treatment with the number for which they preferred the results from the control.
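As a minimal sketch of the comparison described above, the following computes a chi^2 goodness-of-fit statistic against an even split between treatment and control preferences. Treating the analysis as a one-degree-of-freedom goodness-of-fit test is an assumption, since the text does not specify the exact form of the test; the function name and the counts are hypothetical placeholders, not study results.

```python
import math


def chi_square_preference(treatment_wins, control_wins):
    """Goodness-of-fit test against an even split (1 degree of freedom)."""
    total = treatment_wins + control_wins
    expected = total / 2
    statistic = (
        (treatment_wins - expected) ** 2 + (control_wins - expected) ** 2
    ) / expected
    # For 1 degree of freedom, the chi^2 survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Placeholder counts for illustration only; these are not study results.
statistic, p_value = chi_square_preference(treatment_wins=14, control_wins=6)
print(f"chi^2 = {statistic:.2f}, p = {p_value:.4f}")
```

The same function could be applied once per exposure treatment and once to the aggregate class, with a correction for multiple comparisons if desired.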


[1] Cheng, J., Teevan, J. & Bernstein, M.S. (2015). Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015.

Milestone Contributors

Slack usernames of all who helped create this wiki page submission: @dcthompson