Winter Milestone 5 @ftw

Latest revision as of 20:13, 14 February 2016

== Brief introduction of the system ==

BirdView is a system that gives the requester a better view of their overall work: it shows the work assigned to workers, monitors the progress of each task, and highlights which tasks' workers need more clarification, so the requester can provide better feedback and input on those particular tasks.

== How the system solves critical problems ==

One problem that has always troubled workers is misunderstood instructions. When a worker misunderstands the instructions, even work done with dedication will not meet the requirements, which leads to rejected work and wasted time. With this system the worker gets immediate feedback: the requester has a bird's-eye view of workers' progress across all tasks, sees which tasks workers are struggling with, and can respond right away to help those tasks get completed properly. This not only helps the worker who is currently stuck but also helps future workers who take up similar tasks, because the instructions are clarified based on the feedback and questions from previous workers.

== Introducing modules of the system ==

The system makes three main contributions:

# BirdView - an overall view of the tasks
# Better feedback and clarified instructions for current and upcoming workers
# Improved worker task acceptance due to clear instructions and feedback

=== Module 1: BirdView - Overall view of the tasks ===

==== Problem/Limitations ====

With multiple tasks posted on the crowdsourcing platform, the requester can lose track of overall completion, outstanding feedback, and how completion compares across tasks. This increases the time the requester has to spend on the platform, which a bird's-eye view of task progress can minimise. Such a view simplifies monitoring and understanding the progress of all the work the requester has posted, leaving the requester more time to concentrate on other tasks.

==== Module preview ====

Existing crowdsourcing platforms present task overviews individually, task by task, and are not visually interactive enough to convey the information at a glance. Addressing this provides time-saving work management for the requester.

==== System details ====

BirdView is the overall view of the tasks that the requester has uploaded to the platform for workers to contribute to. The system creates an interactive overview of all of the requester's tasks, drawn as a donut chart in Daemo colors; when the requester hovers over a task's segment, that task's details pop out. In addition, the number of comments and clarification requests from workers is shown as a number over each task's segment of the donut. This helps the requester see which tasks workers are finding difficult and prioritise feedback on those tasks.
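As a rough illustration of how such a chart could be assembled, the TypeScript sketch below turns a requester's task list into donut segments with per-task comment badges and hover tooltips. The Task fields, toDonutSegments and tooltipFor names are illustrative assumptions, not Daemo's actual code.

<syntaxhighlight lang="typescript">
// Minimal sketch (not Daemo's actual code): turning a requester's task list
// into donut-chart segments with per-task comment badges.

interface Task {
  id: number;
  title: string;
  completed: number;    // sub-tasks finished by workers
  total: number;        // sub-tasks posted
  commentCount: number; // open worker questions/clarifications
}

interface DonutSegment {
  task: Task;
  startAngle: number; // radians
  endAngle: number;   // radians
  badge: number;      // comment count drawn over the segment
}

// Each task gets an arc proportional to its share of all posted sub-tasks.
function toDonutSegments(tasks: Task[]): DonutSegment[] {
  const grandTotal = tasks.reduce((sum, t) => sum + t.total, 0);
  let angle = 0;
  return tasks.map((task) => {
    const sweep = grandTotal > 0 ? (task.total / grandTotal) * 2 * Math.PI : 0;
    const segment = { task, startAngle: angle, endAngle: angle + sweep, badge: task.commentCount };
    angle += sweep;
    return segment;
  });
}

// Hover handler: the details that "pop out" when the requester hovers a segment.
function tooltipFor(segment: DonutSegment): string {
  const { title, completed, total, commentCount } = segment.task;
  return `${title}: ${completed}/${total} done, ${commentCount} worker comments`;
}
</syntaxhighlight>

Rendering each segment as an SVG arc and wiring its hover handler to tooltipFor would complete the interactive view.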


=== Module 2: Better feedback and clarified instructions for current and upcoming workers ===

==== Problem/Limitations ====

When the requester pays little attention to workers' problems, workers either continue working with their own (possibly incorrect) understanding of the task or stop working altogether, fearing that the job will be rejected as improper work. The improved feedback system is built to avoid this.

==== Module details ====

The feedback system is reached by hovering over a task's area in the donut chart. Once the requester clicks the comment count for that particular task, a pop-up lists the workers' questions and suggestions; the requester can reply to the comments and also improve the instructions at the top of the pop-up panel, increasing workers' understanding of the task and their trust in the requester.
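A minimal sketch of the pop-up's data flow, assuming hypothetical REST endpoints (/api/tasks/:id/comments, .../reply and .../instructions) rather than Daemo's real routes, could look like this:

<syntaxhighlight lang="typescript">
// Illustrative sketch only; endpoint paths and payload shapes are assumptions.

interface WorkerComment {
  id: number;
  worker: string;
  text: string;   // question or suggestion from the worker
  reply?: string; // requester's reply, once posted
}

// Load the questions/suggestions shown in the pop-up for one task.
async function loadComments(taskId: number): Promise<WorkerComment[]> {
  const res = await fetch(`/api/tasks/${taskId}/comments`);
  return res.json();
}

// Reply to a single worker comment.
async function replyToComment(taskId: number, commentId: number, reply: string): Promise<void> {
  await fetch(`/api/tasks/${taskId}/comments/${commentId}/reply`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ reply }),
  });
}

// Update the instructions shown at the top of the pop-up panel, so that
// upcoming workers immediately see the clarified version.
async function updateInstructions(taskId: number, instructions: string): Promise<void> {
  await fetch(`/api/tasks/${taskId}/instructions`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ instructions }),
  });
}
</syntaxhighlight>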

=== Module 3: Improved worker task acceptance due to clear instructions and feedback ===

==== Problem/Limitations ====

The problem here is that both newcomers and senior workers decline to accept a requester's tasks because they do not properly understand what is being asked.

==== Module preview ====

The better workers understand a task, the more of them accept it; this raises the task completion rate and yields higher-quality work, which satisfies one of the requester's most important requirements.

=== What do we want to analyze? ===

Analysis to be done: 20 requesters will use the BirdView donut chart module so that we can understand their experiences. Feedback from the requesters will be taken into consideration and incorporated into revised versions of the design.