Winter Milestone 5 Templates

Please use the following templates to write up your sections this week.

System (for task feed and open gov write up)

We're going to borrow the systems section from this paper as an example: Vaish R, Wyngarden K, Chen J, et al. Twitch crowdsourcing: crowd contributions in short bursts of time. Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems. ACM, 2014: 3645-3654. Please note how this section is divided into different parts. Please follow the same template.

Brief introduction of the system

Twitch is an Android application that appears when the user presses the phone’s power/lock button (Figures 1 and 3). When the user completes the twitch crowdsourcing task, the phone unlocks normally. Each task involves a choice between two and six options, made through a single motion such as a tap or swipe.
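
To make the interaction concrete, here is a minimal sketch of how such an unlock task could be modeled. This is not the paper's Android implementation; the `UnlockTask` class and its fields are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class UnlockTask:
    """One lock-screen task: a question with two to six single-gesture options."""
    question: str
    options: List[str]
    selected: Optional[str] = None

    def answer(self, option_index: int) -> str:
        """Record the user's single tap/swipe choice; the phone then unlocks."""
        if not 0 <= option_index < len(self.options):
            raise ValueError("gesture did not map to a valid option")
        self.selected = self.options[option_index]
        return self.selected

# Example: a Census-style tile shown at unlock time
task = UnlockTask(question="How many people are around you?",
                  options=["0", "1-3", "4-10", "10+"])
print(task.answer(1))  # one swipe or tap selects an option -> "1-3"
```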

How is the system solving critical problems

To motivate continued participation, Twitch provides both instant and aggregated feedback to the user. An instant feedback display shows how many other users agreed, via a fadeout as the lock screen disappears (Figure 4), or how the user’s contributions apply to the whole (Figure 5). Aggregated data is also available via a web application, allowing the user to explore all data that the system has collected; for example, Figure 2 shows a human-generated map from the Census application. To address security concerns, users can either disable or keep their existing Android passcode while using Twitch. If users do not wish to answer a question, they may skip Twitch by selecting ‘Exit’ via the options menu. This design decision was made to encourage the user to give Twitch an answer, which is usually faster than exiting. Future designs could make it easier to skip a task, for example through a swipe-up.
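
As a rough illustration of the instant-feedback idea, the sketch below computes the share of earlier responses that agreed with the user's answer. The function name and data shape are assumptions; the real system scopes agreement to a time bin and location.

```python
from collections import Counter

def agreement_percentage(prior_answers, user_answer):
    """Share of earlier answers to the same question that match the user's answer."""
    if not prior_answers:
        return 0.0
    counts = Counter(prior_answers)
    return 100.0 * counts[user_answer] / len(prior_answers)

# Example: three of the four earlier users also answered "1-3"
print(agreement_percentage(["1-3", "1-3", "4-10", "1-3"], "1-3"))  # 75.0
```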

Introducing modules of the system

Below, we introduce the three main crowdsourcing applications that Twitch supports. The first, Census, attempts to capture local knowledge. The following two, Image Voting (referred to below as Photo Ranking) and Structuring the Web, draw on creative and topical expertise. These three applications are bundled into one Android package, and each can be accessed interchangeably through Twitch's settings menu.

Module 1: Census

Problem/Limitations

Despite progress in producing effective understanding of static elements of our physical world — routes, businesses and points of interest — we lack an understanding of human activity. How busy is the corner cafe at 2pm on Fridays? What time of day do businesspeople clear out of the downtown district and get replaced by socializers? Which neighborhoods keep high-energy activities going until 11pm, and which ones become sleepy by 6pm? Users could take advantage of this information to plan their commutes, their social lives and their work.

Module preview

Existing crowdsourced techniques such as Foursquare are too sparse to answer these kinds of questions: the answers require at-the-moment, distributed human knowledge. We envision that twitch crowdsourcing can help create a human-centered equivalent of Google Street View, where a user could browse typical crowd activity in an area. To do so, we ask users to answer one of several questions about the world around them each time they unlock their phone. Users can then browse the map they are helping create.

System details

Census is the default crowdsourcing task in Twitch. It collects structured information about what people experience around them. Each Census unlock screen consists of four to six tiles (Figures 1 and 3), with each task centered on a question such as:

• How many people are around you?
• What kinds of attire are nearby people wearing?
• What are you currently doing?
• How much energy do you have right now?

While not exhaustive, these questions cover several types of information that a local census might seek to provide. Two of the four questions ask users about the people around them, while the other two ask about the users themselves; both are questions that users are uniquely equipped to answer. Each answer is represented graphically; for example, in the case of activities, users have icons for working, at home, eating, travelling, socializing, or exercising.

To motivate continued engagement, Census provides two modes of feedback. Instant feedback (Figure 4) is a brief Android popup message that appears immediately after the user makes a selection. It reports the percentage of responses in the current time bin and location that agreed with the user, then fades out within two seconds. It is transparent to user input, so the user can begin interacting with the phone even while it is visible. The aggregated report allows Twitch users to see the cumulative effect of all users’ behavior. The data is bucketed and visualized on a map (Figure 2) on the Twitch homepage. Users can filter the data by activity type or time of day.
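
The aggregated-report behavior described above (bucketing responses and filtering by activity or time of day) could look roughly like the following sketch. The grid resolution, record format, and `bucket` helper are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical response records: (timestamp, latitude, longitude, activity)
responses = [
    (datetime(2014, 2, 7, 14, 5),  37.7749, -122.4194, "working"),
    (datetime(2014, 2, 7, 14, 40), 37.7741, -122.4190, "eating"),
    (datetime(2014, 2, 7, 18, 10), 37.7749, -122.4194, "socializing"),
]

def bucket(responses, activity=None, hour=None):
    """Count responses per coarse map cell and hour, after optional filters."""
    counts = defaultdict(int)
    for ts, lat, lon, act in responses:
        if activity is not None and act != activity:
            continue
        if hour is not None and ts.hour != hour:
            continue
        cell = (round(lat, 2), round(lon, 2), ts.hour)  # ~1 km grid, hourly bins
        counts[cell] += 1
    return dict(counts)

print(bucket(responses, hour=14))  # only the two 2pm responses are counted
```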


Module 2: Photo Ranking

Problem/Limitations

Beyond harnessing local observations via Census, we wanted to demonstrate that twitch crowdsourcing could support traditional crowdsourcing tasks such as image ranking (e.g., Matchin [17]). Needfinding interviews and prototyping sessions with ten product design students at Stanford University indicated that product designers not only need photographs for their design mockups, but they also enjoy looking at the photographs. Twitch harnesses this interest to help rank photos and encourage contribution of new photos.

Module details

Photo Ranking crowdsources a ranking of stock photos for themes from a Creative Commons-licensed image library. The Twitch task displays two images related to a theme (e.g., Nature Panorama) per unlock and asks the user to slide to select the one they prefer (Figure 1). Pairwise ranking is considered faster and more accurate than rating [17]. The application regularly updates with new photos. Users can optionally contribute new photos to the database by taking a photo instead of rating one. Contributed photos must be relevant to the day’s photo theme, such as Nature Panorama, Soccer, or Beautiful Trash. Contributing a photo takes longer than the average Twitch task, but provides an opportunity for motivated individuals to enter the competition and get their photos rated. As with Census, users receive instant feedback through a popup message showing how many other users agreed with their selection. We envision a web interface where all uploaded images can be browsed, downloaded and ranked. This data can also feed computer vision research by providing high-quality images of object categories and scenes for building better classifiers.
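
One simple way to turn pairwise votes into a photo ranking is sketched below. The paper does not specify its ranking model, so the win-rate scoring here is purely illustrative; Elo or Bradley-Terry scoring would be natural alternatives.

```python
from collections import Counter

# Hypothetical pairwise votes: each entry is (winner, loser) for one unlock
votes = [("photo_a", "photo_b"), ("photo_a", "photo_c"),
         ("photo_c", "photo_b"), ("photo_a", "photo_b")]

def rank_by_wins(votes):
    """Order photos by the fraction of pairwise comparisons they won."""
    wins, appearances = Counter(), Counter()
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return sorted(appearances, key=lambda p: wins[p] / appearances[p], reverse=True)

print(rank_by_wins(votes))  # ['photo_a', 'photo_c', 'photo_b']
```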

Module 3: Structuring the Web

Problem/Limitations

Search engines no longer only return documents — they now aim to return direct answers [6,9]. However, despite massive undertakings such as the Google Knowledge Graph [36], Bing Satori [37] and Freebase [7], much of the knowledge on the web remains unstructured and unavailable for interactive applications. For example, searching for ‘Weird Al Yankovic born’ in a search engine such as Google returns a direct result ‘1959’ drawn from the knowledge base; however, searching for the equally relevant ‘Weird Al Yankovic first song’, ‘Weird Al Yankovic band members’, or ‘Weird Al Yankovic bestselling album’ returns a long string of documents but no direct answer, even though the answers are readily available on the performer’s Wikipedia page.

Module preview

To enable direct answers, we need structured data that is computer-readable. While crowdsourced undertakings such as Freebase and DBpedia have captured much structured data, they tend to acquire only high-level information and do not have enough contributors to achieve significant depth on any single entity. Likewise, while information extraction systems such as ReVerb [14] automatically draw such information from the text of the Wikipedia page, their error rates are currently too high to trust. Crowdsourcing can help such systems identify errors to improve future accuracy [18]. Therefore, we apply twitch crowdsourcing to produce both structured data for interactive applications and training data for information extraction systems.

Module details

Contributors to online efforts are drawn to goals that allow them to exhibit their unique expertise [2]. Thus, we allow users to help create structured data for topics of interest. The user can specify any topic on Wikipedia that they are interested in or want to learn about, for example HCI, the Godfather films, or their local city. To do so within a one-to-two second time limit, we draw on mixed-initiative information extraction systems (e.g., [18]) and ask users to help vet automatic extractions. When a user unlocks his or her phone, Structuring the Web displays a high-confidence extraction generated using ReVerb, and its source statement from the selected Wikipedia page (Figure 1). The user indicates with one swipe whether the extraction is correct with respect to the statement. ReVerb produces an extraction in Subject-Relationship-Object format: for example, if the source statement is “Stanford University was founded in 1885 by Leland Stanford as a memorial to their son”, ReVerb returns {Stanford University}, {was founded in}, {1885} and Twitch displays this structure. To minimize cognitive load and time requirements, the application filters to include only short source sentences and uses color coding to match extractions with the source text. In Structuring the Web, the instant feedback upon accepting an extraction shows the user their progress growing a knowledge tree of verified facts (Figure 5). Rejecting an extraction instead scrolls the user down the article as far as their most recent extraction source, demonstrating the user’s progress in processing the article. In the future, we envision that search engines can utilize this data to answer a wider range of factual queries.
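
A schematic model of the data flowing through this module is sketched below. The `Extraction` class, its fields, and the verification threshold are assumptions, not the system's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Extraction:
    """A Subject-Relationship-Object triple plus the crowd's verification votes."""
    subject: str
    relation: str
    obj: str
    source: str
    votes: List[bool] = field(default_factory=list)

    def record_vote(self, correct: bool) -> None:
        self.votes.append(correct)

    def is_verified(self, min_votes: int = 3, threshold: float = 0.7) -> bool:
        """Treat the extraction as a verified fact once enough users accept it."""
        if len(self.votes) < min_votes:
            return False
        return sum(self.votes) / len(self.votes) >= threshold

e = Extraction("Stanford University", "was founded in", "1885",
               source="Stanford University was founded in 1885 by Leland Stanford...")
for v in (True, True, True, False):
    e.record_vote(v)
print(e.is_verified())  # True: 3 of 4 swipes accepted the extraction
```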

Methods (for task authorship write up)

We're going to borrow the methods section from this paper as an example: Cheng, J., Teevan, J. & Bernstein, M.S. (2015). Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015. Please note how this section is divided into different parts. Please follow the same template.

Study introduction

Study 1: ETA vs. other measures of effort

We begin by comparing ETA and other measures of difficulty (including time and subjective difficulty) across a number of common crowdsourcing tasks. After describing the experimental setup, designed to elicit the necessary data to generate error-time curves and other measures for each task, we show how closely the different measures matched.

Study method

Study 1 and all subsequent experiments reported in this paper were conducted using a proprietary microtasking platform that outsources crowd work to workers on the Clickworker microtask market. The platform interface is similar to that of Amazon Mechanical Turk: users upload HTML task files, workers choose from a marketplace listing of tasks, and data is collected in CSV files. We restricted workers to those residing in the United States. Across all studies, 470 unique workers completed over 44,000 tasks. A follow-up survey revealed that approximately 66% were female. We replicated Study 1 on Amazon Mechanical Turk and found empirically similar results, so we only report results using Clickworker in this paper.

Method specifics and details

Primitive crowdsourcing task types

We began by populating our evaluation tasks with common crowdsourcing task types, or primitives, that appear as microtasks or parts of microtasks. To do this, we looked at the types of tasks with the most available HITs on Amazon Mechanical Turk, at reports on common crowdsourcing task types [15], and at crowdsourcing systems described in the literature (e.g., [4]). After several iterations we identified a list of ten primitives that are present in most crowdsourcing workflows (Table 1, Figure 2). For example, the Find-Fix-Verify workflow [4] could be expressed using a combination of the FIND (identify sentences which need shortening), FIX (shorten those sentences), and BINARY (verify that the shortening is an improvement) primitives. In many cases, the primitives themselves (or repetitions of the same primitive) make up the entire task and map directly to common Mechanical Turk tasks (e.g., finding facts such as phone numbers about individuals (SEARCH)). We instantiated these primitives using a dataset of images of people performing different actions (e.g., waving, cooking) [34] and a corpus of translated Wikipedia articles selected because they tend to contain errors [1].
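
To illustrate the idea of composing workflows from primitives, the short sketch below spells out Find-Fix-Verify as a sequence of primitive steps. The identifiers and instruction strings are illustrative, not the paper's task wording.

```python
# Hypothetical primitive identifiers; the paper's Table 1 lists ten such types.
FIND, FIX, BINARY = "FIND", "FIX", "BINARY"

# Find-Fix-Verify expressed as a sequence of primitives, as the text describes:
# locate sentences to shorten, shorten them, then verify the shortening.
find_fix_verify = [
    (FIND,   "Identify sentences which need shortening"),
    (FIX,    "Shorten the selected sentences"),
    (BINARY, "Is the shortened version an improvement?"),
]

for primitive, instruction in find_fix_verify:
    print(f"{primitive:6s} -> {instruction}")
```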

Experimental design for the study

We presented workers with a mixed series of tasks from the ten primitives and manipulated two factors: the time limit and the primitive. Each primitive had seven different possible time limits, and one untimed condition. The exact time limits were initialized using how long workers took when not under time pressure. The result was a sampled, not fully-crossed, design. For each worker we randomly selected five primitives for them to perform; for each primitive, three questions of that type were shown with each of the specified time limits. The images or text used in these questions were randomly sampled and shuffled for each worker. To minimize practice effects, workers completed three timed practice questions prior to seeing any of these conditions. The tasks were presented in randomized order, and within each primitive the time conditions were presented in randomized order. Workers were compensated $2.00 and repeat participation was disallowed. A single task was presented on each page, allowing us to record how long workers took to submit a response. Under timed conditions, a timer started as soon as the worker advanced to the next page. Input was disabled as soon as the timer expired, regardless of what the worker was doing (e.g., typing, clicking). An example task is shown in Figure 3.
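
A sketch of how such a sampled, not fully-crossed assignment could be generated is shown below. The concrete time limits, the `assign_tasks` helper, and the per-worker seeding are assumptions made for illustration; the study derived its time limits from untimed completion times.

```python
import random

def assign_tasks(worker_id, primitives, time_limits, per_condition=3, k=5):
    """Build one worker's task sheet in the style of the Study 1 design."""
    rng = random.Random(worker_id)                 # deterministic per worker (assumption)
    sheet = []
    for primitive in rng.sample(primitives, k):    # five randomly chosen primitives
        conditions = list(time_limits) + [None]    # seven limits plus an untimed condition
        rng.shuffle(conditions)                    # time conditions in random order
        for limit in conditions:
            for question in range(per_condition):  # three questions per condition
                sheet.append((primitive, limit, question))
    return sheet

primitives = [f"primitive_{i}" for i in range(1, 11)]
time_limits = [1, 2, 4, 6, 8, 12, 20]              # seconds; illustrative values only
sheet = assign_tasks(worker_id=7, primitives=primitives, time_limits=time_limits)
print(len(sheet))                                  # 5 primitives x 8 conditions x 3 = 120
```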

Measures from the study

The information we logged allowed us to calculate behavioral measures for each primitive:

– ETA. The ETA is the area under the error-time curve.
– Time@10. We also calculated the time it takes to achieve an error rate at the 10th percentile.
– Error. We measured the error rate against ground truth for each primitive. If there were many possible correct responses, we manually judged responses while blind to condition. Automatically computing distance metrics (e.g., edit distance) resulted in empirically similar findings.
– Time. We measured how long workers took to complete the primitive without any time limit.

After each task block was complete, we additionally asked workers to record several subjective reflections:

– Estimated time. We asked workers to report how long they thought they spent on a primitive absent time pressure. Time estimation has previously been used as an implicit signal of task difficulty [5].
– Relative subjective duration (RSD). RSD, a measure of how much task time is over- or underestimated [5], is obtained by dividing the difference between estimated and actual time spent by the actual time spent.
– Task load index (TLX). The NASA TLX [10] is a validated metric of mental workload commonly used in human factors research to assess task performance. It consists of a survey that sums six subjective dimensions (e.g., mental demand).

A separate experimental design, in which each worker completed three untimed practice questions followed by three untimed questions for each of the ten primitives (with the primitives presented in random order), was used to obtain one final measure:

– Subjective rank. Workers considered all of the primitives they completed and ranked them in order of effort required.

As rankings produce sharper distinctions than individual ratings [2], we consider subjective rank to represent our ground truth ranking of the primitives. However, rank would not be a deployable solution for requesters. Ranking means that workers would need to test the new task against at least log(n) of the primitives, incurring a large fixed overhead. Further, ranking is ordinal and cannot quantify small changes in effort. In contrast, ETA is an absolute measure, can capture small changes in effort, and only needs to be measured for the target task to compare it with other tasks.
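
As a worked illustration of the two behavioral measures, the sketch below computes ETA as the area under an error-time curve and one plausible reading of Time@10. The numbers are made up purely to show the computation; the real metric is computed from a curve fitted to observed error rates under each time limit.

```python
import numpy as np

# Made-up points on an error-time curve for one primitive: the error rate
# workers achieve under each time limit (seconds).
time_limits = np.array([1.0, 2.0, 4.0, 6.0, 8.0, 12.0, 20.0])
error_rates = np.array([0.80, 0.55, 0.30, 0.20, 0.15, 0.10, 0.08])

# ETA: the area under the error-time curve.
eta = np.trapz(error_rates, time_limits)

# One reading of Time@10: the time at which the error rate first reaches 10%.
time_at_10 = np.interp(0.10, error_rates[::-1], time_limits[::-1])

print(f"ETA = {eta:.2f}, Time@10 = {time_at_10:.1f}s")
```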

What do we want to analyze?

Analysis

60 workers completed Study 1, with 30 performing each primitive. We averaged our dependent measures across all 30 workers, and compared the ranking of primitives induced by each measure to the average subjective ranking (subjective rank was obtained by having 40 other workers rank all ten primitives). We used the Kendall rank correlation coefficient to capture how closely each measure approximated the workers’ ranks, with Holm-corrected p-values calculated under the null hypothesis of no association. A rank correlation of 1 indicates perfect correlation; 0 indicates no correlation. Measures that capture the subjective ranking accurately can be used to analyze new task types without comparing them against multiple benchmark tasks.
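
A minimal sketch of this analysis step, using Kendall's tau with a Holm correction, is shown below. The rankings are invented for illustration and the measure names are placeholders.

```python
import numpy as np
from scipy.stats import kendalltau
from statsmodels.stats.multitest import multipletests

# Invented rankings of the ten primitives (1 = least effort), for illustration.
subjective_rank = np.arange(1, 11)        # stand-in for the averaged subjective ranking
measure_ranks = {
    "ETA":  [1, 2, 3, 5, 4, 6, 7, 9, 8, 10],
    "Time": [2, 1, 5, 3, 7, 4, 9, 6, 10, 8],
    "TLX":  [1, 3, 2, 4, 6, 5, 8, 7, 9, 10],
}

taus, pvals = {}, []
for name, ranks in measure_ranks.items():
    tau, p = kendalltau(subjective_rank, ranks)   # Kendall rank correlation
    taus[name] = tau
    pvals.append(p)

# Holm correction across measures, under the null hypothesis of no association.
reject, p_holm, _, _ = multipletests(pvals, method="holm")

for (name, tau), p in zip(taus.items(), p_holm):
    print(f"{name}: tau = {tau:.2f}, Holm-corrected p = {p:.3f}")
```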