Winter Milestone 5 BPHC: Experimental Design and Statistical Analysis of Task Authorship
We're going to borrow the Methods section from this paper as an example: Cheng, J., Teevan, J., & Bernstein, M.S. (2015). Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015. Please note how this section was divided into different parts, and follow the same template.
STUDY 1: ETA VS. OTHER MEASURES OF EFFORT
We begin by comparing ETA with other measures of difficulty (including time and subjective difficulty) across a number of common crowdsourcing tasks. After describing the experimental setup, designed to elicit the necessary data to generate error-time curves and other measures for each task, we show how closely the different measures matched.
Method
Study 1 and all subsequent experiments reported in this paper were conducted using a proprietary microtasking platform that outsources crowd work to workers on the Clickworker microtask market. The platform interface is similar to that of Amazon Mechanical Turk: users upload HTML task files, workers choose from a marketplace listing of tasks, and data is collected in CSV files. We restricted workers to those residing in the United States. Across all studies, 470 unique workers completed over 44,000 tasks. A follow-up survey revealed that approximately 66% were female. We replicated Study 1 on Amazon Mechanical Turk and found empirically similar results, so we only report results using Clickworker in this paper.
Method specifics and details
Primitive Crowdsourcing Task Types
We began by populating our evaluation tasks with common crowdsourcing task types, or primitives, that frequently appear as microtasks or as parts of microtasks. To do this, we looked at the types of tasks with the most available HITs on Amazon Mechanical Turk, at reports on common crowdsourcing task types, and at crowdsourcing systems described in the literature. After several iterations we identified a list of ten primitives that are present in most crowdsourcing workflows (Table 1, Figure 2). For example, the Find-Fix-Verify workflow could be expressed using a combination of the FIND (identify sentences which need shortening), FIX (shorten these sentences), and BINARY primitives (verify the shortening is an improvement). In many cases, the primitives themselves (or repetitions of the same primitive) make up the entire task, and map directly to common Mechanical Turk tasks (e.g., finding facts such as phone numbers about individuals (SEARCH)). We instantiated these primitives using a dataset of images of people performing different actions (e.g., waving, cooking) and a corpus of translated Wikipedia articles selected because they tend to contain errors.
Experimental Design for the study
We presented workers with a mixed series of tasks from the ten primitives and manipulated two factors: the time limit and the primitive. Each primitive had seven different possible time limits, and one untimed condition. The exact time limits were initialized using how long workers took when not under time pressure. The result was a sampled, not fully-crossed, design. For each worker we randomly selected five primitives for them to perform; for each primitive, three questions of that type were shown with each of the specified time limits. The images or text used in these questions were randomly sampled and shuffled for each worker. To minimize practice effects, workers completed three timed practice questions prior to seeing any of these conditions. The tasks were presented in randomized order, and within each primitive the time conditions were presented in randomized order. Workers were compensated $2.00 and repeat participation was disallowed. A single task was presented on each page, allowing us to record how long workers took to submit a response. Under timed conditions, a timer started as soon as the worker advanced to the next page. Input was disabled as soon as the timer expired, regardless of what the worker was doing (e.g., typing, clicking). An example task is shown in Figure 3.
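The sampled, not fully-crossed design described above can be sketched in code. This is a minimal illustration, not the authors' actual implementation: the primitive names beyond FIND, FIX, BINARY, and SEARCH are hypothetical placeholders, and the time-limit values are invented (the paper initialized them from untimed completion times).

```python
import random

# Four names appear in the paper; the remaining six are hypothetical placeholders.
PRIMITIVES = ["FIND", "FIX", "BINARY", "SEARCH", "P5",
              "P6", "P7", "P8", "P9", "P10"]
TIME_LIMITS = [2, 4, 6, 8, 12, 16, 24]  # seconds; illustrative values only


def assign_worker(seed=None):
    """Build one worker's task sequence under the sampled design:
    five randomly chosen primitives; for each, three questions at
    every time limit plus one untimed condition; time conditions
    shuffled within each primitive, and primitives shuffled overall."""
    rng = random.Random(seed)
    chosen = rng.sample(PRIMITIVES, 5)
    blocks = []
    for prim in chosen:
        conditions = TIME_LIMITS + [None]  # None marks the untimed condition
        rng.shuffle(conditions)            # randomize time-condition order
        blocks.append([(prim, limit, q) for limit in conditions
                       for q in range(3)])
    rng.shuffle(blocks)                    # randomize primitive order
    return [task for block in blocks for task in block]
```

Under these assumptions each worker sees 5 primitives x 8 conditions x 3 questions = 120 tasks, before the three timed practice questions.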
Measures from the study
The information we logged allowed us to calculate behavioral measures for each primitive:
– ETA. The ETA is the area under the error-time curve.
– Time@10. We also calculated the time it takes to achieve an error rate at the 10th percentile.
– Error. We measured the error rate against ground truth for each primitive. If there were many possible correct responses, we manually judged responses while blind to condition. Automatically computing distance metrics (e.g., edit distance) yielded empirically similar findings.
– Time. We measured how long workers took to complete the primitive without any time limit.
After each task block was complete, we additionally asked workers to record several subjective reflections:
– Estimated time. We asked workers to report how long they thought they spent on a primitive absent time pressure. Time estimation has previously been used as an implicit signal of task difficulty.
– Relative subjective duration (RSD). RSD, a measure of how much task time is over- or underestimated, is obtained by dividing the difference between estimated and actual time spent by the actual time spent.
– Task load index (TLX). The NASA TLX is a validated metric of mental workload commonly used in human factors research to assess task performance. It consists of a survey that sums six subjective dimensions (e.g., mental demand).
A separate experimental design containing all ten primitives, in which each worker completed three untimed practice questions followed by three untimed questions for each primitive (with the primitives presented in random order), was used to obtain the following:
– Subjective rank. Workers considered all of the primitives they completed and ranked them in order of effort required. As rankings produce sharper distinctions than individual ratings, we consider subjective rank to represent our ground truth ranking of the primitives. However, rank would not be a deployable solution for requesters.
Ranking means that workers would need to test the new task against at least log(n) of the primitives, incurring a large fixed overhead. Further, ranking is ordinal, and cannot quantify small changes in effort. In contrast, ETA is an absolute measure, can capture small changes in effort, and only needs to be computed for the target task to compare it with other tasks.
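Two of the measures above have simple definitions that can be made concrete in code. This sketch assumes the error-time curve is given as discrete (time, error-rate) samples and approximates its area with the trapezoidal rule; the paper does not prescribe a particular numerical integration scheme, so that choice is an assumption. The RSD formula follows the definition given above.

```python
def eta(times, error_rates):
    """Approximate the ETA, the area under an error-time curve given as
    discrete samples, using the trapezoidal rule (an assumption; the
    source does not specify an integration method)."""
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += width * (error_rates[i] + error_rates[i - 1]) / 2
    return area


def rsd(estimated_time, actual_time):
    """Relative subjective duration: how much a worker over- or
    underestimates task time, (estimated - actual) / actual."""
    return (estimated_time - actual_time) / actual_time
```

For example, a worker who estimated 90 seconds for a task that actually took 60 seconds has an RSD of 0.5, i.e., a 50% overestimate.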
What do we want to analyze?
Analysis
60 workers completed Study 1, with 30 performing each primitive. We averaged our dependent measures across all 30 workers, and compared the ranking of primitives induced by each measure to the average subjective ranking (subjective rank was obtained by having 40 other workers rank all ten primitives). We used the Kendall rank correlation coefficient to capture how closely each measure approximated the workers' ranks, with Holm-corrected p-values calculated under the null hypothesis of no association. A rank correlation of 1 indicates perfect correlation; 0 indicates no correlation. Measures that capture the subjective ranking accurately can analyze new task types without comparing them against multiple benchmark tasks.
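The two statistics named in the analysis can be illustrated with a minimal, standard-library sketch. This is not the authors' code: it implements the textbook tau-a variant of Kendall's coefficient (assuming no tied ranks) and the standard Holm step-down adjustment; in practice one would use a statistics library that also computes the p-values.

```python
from itertools import combinations


def kendall_tau(ranks_a, ranks_b):
    """Kendall rank correlation (tau-a, no ties assumed) between two
    rankings of the same items: (concordant - discordant) / total pairs."""
    n = len(ranks_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (ranks_a[i] - ranks_a[j]) * (ranks_b[i] - ranks_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def holm_correct(p_values):
    """Holm step-down correction for multiple comparisons: visit the
    p-values in ascending order, multiply the k-th smallest (0-indexed)
    by (m - k), and enforce monotonicity of the adjusted values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted
```

Identical rankings give a correlation of 1, fully reversed rankings give -1, matching the interpretation stated above.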