Winter Milestone 5 Templates

From crowdresearch
Revision as of 11:31, 10 February 2016 by Rajan


Please use the following template to write up your introduction section this week.


We're going to borrow the introduction section from this paper as an example: Vaish R, Wyngarden K, Chen J, et al. Twitch crowdsourcing: crowd contributions in short bursts of time. Proceedings of the 32nd annual ACM conference on Human factors in computing systems. ACM, 2014: 3645-3654. Please note how this section was divided into different parts. Please follow the same template.

Brief introduction of the system

Twitch is an Android application that appears when the user presses the phone’s power/lock button (Figures 1 and 3). When the user completes the twitch crowdsourcing task, the phone unlocks normally. Each task involves a choice between two to six options through a single motion such as a tap or swipe.

How the system solves critical problems

To motivate continued participation, Twitch provides both instant and aggregated feedback to the user. An instant feedback display shows how many other users agreed via a fadeout as the lock screen disappears (Figure 4) or how the user’s contributions apply to the whole (Figure 5). Aggregated data is also available via a web application, allowing the user to explore all data that the system has collected. For example, Figure 2 shows a human generated map from the Census application. To address security concerns, users are allowed to either disable or keep their existing Android passcode while using Twitch. If users do not wish to answer a question, they may skip Twitch by selecting ‘Exit’ via the options menu. This design decision has been made to encourage the user to give Twitch an answer, which is usually faster than exiting. Future designs could make it easier to skip a task, for example through a swipe-up.
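The instant-feedback figure described above is just an agreement percentage over matching responses. A minimal sketch, assuming a simple response log; the `agreement_percent` function and the log's shape are invented for illustration, not Twitch's actual code. Responses are compared only within the same time bin and location, as the Census description later makes explicit.

```python
def agreement_percent(responses, time_bin, location, answer):
    """Percent of responses in the same time bin and location matching `answer`.

    responses: list of (time_bin, location, answer) tuples collected so far.
    Returns None when no one has responded in that bucket yet.
    """
    bucket = [a for (tb, loc, a) in responses if tb == time_bin and loc == location]
    if not bucket:
        return None
    return 100.0 * bucket.count(answer) / len(bucket)

# Invented sample log: two users said the cafe was busy, one said quiet.
log = [("fri-14", "cafe", "busy"), ("fri-14", "cafe", "busy"),
       ("fri-14", "cafe", "quiet"), ("sat-10", "cafe", "quiet")]
```

Bucketing by (time bin, location) is what makes the popup's percentage meaningful: only answers from the same place and time window are compared.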

Introducing modules of the system

Below, we introduce the three main crowdsourcing applications that Twitch supports. The first, Census, attempts to capture local knowledge. The following two, Photo Ranking and Structuring the Web, draw on creative and topical expertise. These three applications are bundled into one Android package, and each can be accessed interchangeably through Twitch's settings menu.

Module 1: Census


Despite progress in producing effective understanding of static elements of our physical world — routes, businesses and points of interest — we lack an understanding of human activity. How busy is the corner cafe at 2pm on Fridays? What time of day do businesspeople clear out of the downtown district and get replaced by socializers? Which neighborhoods keep high-energy activities going until 11pm, and which ones become sleepy by 6pm? Users could take advantage of this information to plan their commutes, their social lives and their work. Existing crowdsourced techniques such as Foursquare are too sparse to answer these kinds of questions: the answers require at-the-moment, distributed human knowledge. We envision that twitch crowdsourcing can help create a human-centered equivalent of Google Street View, where a user could browse typical crowd activity in an area. To do so, we ask users to answer one of several questions about the world around them each time they unlock their phone. Users can then browse the map they are helping create.

Census is the default crowdsourcing task in Twitch. It collects structured information about what people experience around them. Each Census unlock screen consists of four to six tiles (Figures 1 and 3), each centered around a question such as:

• How many people are around you?
• What kinds of attire are nearby people wearing?
• What are you currently doing?
• How much energy do you have right now?

While not exhaustive, these questions cover several types of information that a local census might seek to provide. Two of the four questions ask users about the people around them, while the other two ask about the users themselves; both are topics users are uniquely equipped to answer. Each answer is represented graphically; for example, in the case of activities, users have icons for working, at home, eating, travelling, socializing, or exercising. To motivate continued engagement, Census provides two modes of feedback.
Instant feedback (Figure 4) is a brief Android popup message that appears immediately after the user makes a selection. It reports the percentage of responses in the current time bin and location that agreed with the user, then fades out within two seconds. It is transparent to user input, so the user can begin interacting with the phone even while it is visible.

The aggregated report allows Twitch users to see the cumulative effect of all users' behavior. The data is bucketed and visualized on a map (Figure 2) on the Twitch homepage. Users can filter the data based on activity type or time of day.

Module 2: Photo Ranking

Beyond harnessing local observations via Census, we wanted to demonstrate that twitch crowdsourcing could support traditional crowdsourcing tasks such as image ranking (e.g., Matchin [17]). Needfinding interviews and prototyping sessions with ten product design students at Stanford University indicated that product designers not only need photographs for their design mockups, but also enjoy looking at photographs. Twitch harnesses this interest to help rank photos and encourage contribution of new photos.

Photo Ranking crowdsources a ranking of stock photos for themes from a Creative Commons-licensed image library. The Twitch task displays two images related to a theme (e.g., Nature Panorama) per unlock and asks the user to slide to select the one they prefer (Figure 1). Pairwise ranking is considered faster and more accurate than rating [17]. The application regularly updates with new photos. Users can optionally contribute new photos to the database by taking a photo instead of rating one. Contributed photos must be relevant to the day's photo theme, such as Nature Panorama, Soccer, or Beautiful Trash. Contributing a photo takes longer than the average Twitch task, but provides an opportunity for motivated individuals to enter the competition and get their photos rated.
Like with Census, users receive instant feedback through a popup message displaying how many other users agreed with their selection. We envision a web interface where all uploaded images can be browsed, downloaded and ranked. This data can also connect to computer vision research by providing high-quality images of object categories and scenes to create better classifiers.

Module 3: Structuring the Web

Search engines no longer only return documents — they now aim to return direct answers [6,9]. However, despite massive undertakings such as the Google Knowledge Graph [36], Bing Satori [37] and Freebase [7], much of the knowledge on the web remains unstructured and unavailable for interactive applications. For example, searching for 'Weird Al Yankovic born' in a search engine such as Google returns a direct result '1959' drawn from the knowledge base; however, searching for the equally relevant 'Weird Al Yankovic first song', 'Weird Al Yankovic band members', or 'Weird Al Yankovic bestselling album' returns a long list of documents but no direct answer, even though the answers are readily available on the performer's Wikipedia page.

To enable direct answers, we need structured data that is computer-readable. While crowdsourced undertakings such as Freebase and DBpedia have captured much structured data, they tend to acquire only high-level information and do not have enough contributors to achieve significant depth on any single entity. Likewise, while information extraction systems such as ReVerb [14] automatically draw such information from the text of the Wikipedia page, their error rates are currently too high to trust. Crowdsourcing can help such systems identify errors to improve future accuracy [18]. Therefore, we apply twitch crowdsourcing to produce both structured data for interactive applications and training data for information extraction systems.

Contributors to online efforts are drawn to goals that allow them to exhibit their unique expertise [2]. Thus, we allow users to help create structured data for topics of interest. The user can specify any topic on Wikipedia that they are interested in or want to learn about, for example HCI, the Godfather films, or their local city. To do so within a one-to-two second time limit, we draw on mixed-initiative information extraction systems (e.g., [18]) and ask users to help vet automatic extractions.

When a user unlocks his or her phone, Structuring the Web displays a high-confidence extraction generated using ReVerb, along with its source statement from the selected Wikipedia page (Figure 1). The user indicates with one swipe whether the extraction is correct with respect to the statement. ReVerb produces an extraction in Subject-Relationship-Object format: for example, if the source statement is "Stanford University was founded in 1885 by Leland Stanford as a memorial to their son", ReVerb returns {Stanford University}, {was founded in}, {1885} and Twitch displays this structure. To minimize cognitive load and time requirements, the application filters extractions to include only short source sentences and uses color coding to match extractions with the source text.

In Structuring the Web, the instant feedback upon accepting an extraction shows the user their progress growing a knowledge tree of verified facts (Figure 5). Rejecting an extraction instead scrolls the user down the article as far as their most recent extraction source, demonstrating the user's progress in processing the article. In the future, we envision that search engines can utilize this data to answer a wider range of factual queries.
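The vetting step above can be sketched as follows. This is a hypothetical illustration, not ReVerb's actual API: the `Extraction` class, `vet`, and `short_enough` are invented names. Each unlock shows one Subject-Relationship-Object triple with its source sentence, and a single swipe records a correct/incorrect label that can later serve as training data for the extractor.

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    subject: str
    relation: str
    obj: str
    source: str        # sentence the triple was extracted from
    confidence: float  # extractor's confidence score

def short_enough(extraction, max_words=20):
    """Filter mirroring the app's preference for short source sentences."""
    return len(extraction.source.split()) <= max_words

def vet(extraction, is_correct, labels):
    """Record the user's one-swipe verdict on an extraction."""
    labels.append((extraction, bool(is_correct)))

# The paper's running example, labeled correct with one swipe.
ex = Extraction("Stanford University", "was founded in", "1885",
                "Stanford University was founded in 1885 by Leland Stanford "
                "as a memorial to their son.", 0.92)
labels = []
vet(ex, True, labels)
```

Keeping the source sentence alongside the triple is what lets rejected extractions double as negative training examples.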


We're going to borrow the introduction section from this paper as an example: Cheng, J., Teevan, J. & Bernstein, M.S. (2015). Measuring Crowdsourcing Effort with Error-Time Curves. CHI 2015. Please note how this section was divided into different parts. Please follow the same template.

Phenomenon you're interested in

Imagine that a requester wants to use Amazon Mechanical Turk to label 10,000 images with a fixed set of tags. How much should workers be paid to label each image? Would labeling an image with twice as many tags result in a task that is twice as much effort? Should the tags be provided in a drop-down list or with radio buttons? Answering these questions requires a fine-grained understanding of the amount of effort the task requires. This process today involves trial and error: requesters observe the wait time and quality on test tasks, guess what might have been causing any problems, tweak the task, and repeat. An accurate measure of the effort required to complete a crowdsourced task would enable requesters to compare different approaches to their tasks, iterate toward a better design, and price their tasks objectively. It could also help workers decide whether to accept a task, or even allow systems to offer tasks based on difficulty or time availability.

However, despite its potential value, task effort is challenging to estimate. Workers face cognitive biases in assessing difficulty [21], while requesters cannot easily observe the process and, as experts, categorically underestimate completion times [12]. These limits suggest the need for a behavioral approach to measure effort. One approach might be to let the market identify hard tasks by reacting to the posted price [30].

The puzzle (observations we can't account for yet)

However, prices cannot easily make fine distinctions in an inelastic market such as Mechanical Turk [14]. Another approach might be to use task duration as a signal of difficulty, but this is unreliable because workers regularly accept multiple tasks simultaneously and interleave work [29]. Measures such as reaction time [32] are not easy to apply to typical crowd tasks: reaction time metrics tend to use simplistic tasks (e.g., shape or color recognition), while others may be too involved for crowd work (e.g., [9]).

Experimental design

In this paper, we propose a data-driven behavioral measure of effort that can be easily and cheaply calculated using the crowd. Our metric, the error-time area (ETA), draws on the cognitive psychology literature on speed-accuracy tradeoff curves [32], and represents the effort required for a worker to accurately complete a task. To create it, we first recruit workers to complete the task under different time limits. Next, we fit a curve to the collected data relating the error rate and time limit (Figure 1). Last, we compute ETA by taking the area under this error-time curve. Because ETA is calculated using a data-driven approach, task difficulty can be determined with minimal effort and without analytical modeling. Rather than measuring average duration independent of work quality, ETA computes quality as a function of duration and thus can be used to estimate a wage for a task. ETA also allows requesters to compare multiple task designs; for example, we find that tagging an image with an open textbox is less effort than choosing from a fixed list of 16 options, but more effort than choosing from a fixed list of 8 options.
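A minimal sketch of the metric, not the authors' implementation: the function name and the sample numbers below are invented, and where the paper fits a parametric curve before integrating, this sketch simply applies the trapezoid rule to the observed (time limit, error rate) points.

```python
def error_time_area(time_limits, error_rates):
    """Approximate ETA: area under the error-time curve.

    time_limits: task time limits (in seconds) imposed on workers.
    error_rates: observed error rate (0..1) at each time limit.
    A lower area means workers reach accuracy quickly, i.e. the
    task demands less effort.
    """
    pts = sorted(zip(time_limits, error_rates))  # integrate in time order
    area = 0.0
    for (t0, e0), (t1, e1) in zip(pts, pts[1:]):
        area += (e0 + e1) / 2.0 * (t1 - t0)  # trapezoid rule
    return area

# Two invented task designs: errors fall off faster for the easier one.
easy = error_time_area([1, 2, 4, 8], [0.5, 0.2, 0.05, 0.02])
hard = error_time_area([1, 2, 4, 8], [0.8, 0.6, 0.35, 0.15])
```

Here `easy` comes out smaller than `hard`, so the first design demands less effort; this is exactly the kind of comparison the metric is meant to support.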

Evaluation methods

After describing ETA, we explore the metric via four studies:

– Study 1: ETA vs. other measures of effort. For ten common microtasking primitives (e.g., multiple choice questions, long-form text entry), we show that the ETA metric represents effort better than existing measures.
– Study 2: ETA vs. market price. We then compare ETA as well as other measures to the market prices of these primitives on a crowdsourcing platform.
– Study 3: Modeling perceptual costs. By augmenting ETA with measures of perceptual effort, we find we can better model a worker's perceived difficulty of a task.
– Study 4: Tasks without ground truth. In order to capture how well people do a task, ETA requires ground truth. We extend the metric to also work for subjective tasks.

The Result

We then demonstrate how ETA can be used for rapidly prototyping tasks. ETA makes it possible to characterize tasks in terms of their monetary cost and human effort, and paves the way for better task design, payment, and allocation.