WinterMilestone 1 presrini


Experience the life of a Worker on Mechanical Turk

Reflect on your experience as a worker on Mechanical Turk. What did you like? What did you dislike?

I am pretty new to using MTurk, so it took a while to understand how the website worked. On completing a HIT, I would sometimes be taken straight to the next HIT on the list. I felt like the website had more control than I did in deciding which HIT I had to work on.

Experience the life of a Requester on Mechanical Turk

Reflect on your experience as a requester on Mechanical Turk. What did you like? What did you dislike? Also attach the CSV file generated when you download the HIT results.

My experience as a requester was not great. Publishing a batch of sentiment analysis tasks automatically set the duration to 60 minutes. I was probably doing something wrong, but I felt the sandbox did not offer useful feedback on my actions. I almost always saw a page reading "Your request was not completed successfully", which obviously was not helpful. I was forced to pick a different project type because I wanted to create a HIT that required only 3 minutes to complete.

Explore alternative crowd-labor markets

Compare and contrast the crowd-labor market you just explored (TaskRabbit/oDesk/GalaxyZoo) to Mechanical Turk.

I tried classifying features in images of galaxies on the GalaxyZoo website. I felt this website was better than MTurk: it is more focused on one type of task (classification by responding to questions about an image), provides useful feedback, and sometimes motivated me to keep going. Although I am not an expert in galaxies, I was given useful hints while responding to the questions. Restarting a classification task was, however, not very user friendly. Each image had between 3 and 5 questions to answer, and if I wanted to change my answer to an earlier question, I had to start over from the very first question.

[Screenshot: GalaxyZoo image classification]
[Screenshot: GalaxyZoo descriptions of the terms used in a question, available by clicking the Examples button]
[Screenshot: Motivational feedback on GalaxyZoo]

Readings

MobileWorks

  • What do you like about the system / what are its strengths?

Using a web-based technique to complete micro OCR tasks on a regular cellphone is a very interesting idea. While the approach seems to work very well for OCR, it is not clear how other kinds of tasks, such as image categorization or video transcription, would work. I feel this will also depend on the hardware capabilities of the cellphone.

  • What do you think can be improved about the system?

The MobileWorks system is described as comprising three components, the second of which is where the actual digitization or micro work occurs. It is not clear who processes the image or who cuts it up. Does the application automatically load an image and generate small pieces that are later assigned as micro tasks? Further, who puts the digitized information back together? Does the application compile and arrange all the micro tasks so that they occur sequentially, as the words do in the image?

It is also not clear what would happen if multiple people transcribed the image incorrectly. When does MobileWorks stop checking for correctness of the transcription?

Daemo

  • What do you like about the system / what are its strengths?

I like the prototype task feature of Daemo, where tasks can be refined for clarity based on worker feedback. I feel this is a very interesting feature, especially when the requester is not sure of the outcome. For instance, a requester can create a survey with questions that are not necessarily standardized and obtain worker feedback on their wording. Oftentimes, questions that are framed for the first time can be ambiguous to workers, and in such cases I think the prototype task feature can come in handy for refining the wording. I am, however, not sure how Daemo decides when to stop the refinement iterations. Further, given the scenario above, how will the requester know that a worker's feedback on the survey questions is of high quality? It is also not clear how many workers are enough to provide feedback at each iteration.

  • What do you think can be improved about the system?

In addition to requesters rating workers, I feel it would help if Daemo constantly learned from its workers and categorized them, creating a persona for each worker. For instance, a worker who completes a categorization task in less than 50% of the estimated task duration might be categorized as a "Pro worker", a worker who completes the same task in 50-75% of the estimated duration as an "Avg worker", a worker who completes it in 75-100% of the estimated duration as a "Beginner", and so on; a rough sketch of this idea follows below. Other factors could also be considered when assigning personas, in addition to the time taken to complete a task: the time since the worker joined Daemo, the number of tasks completed so far, the overall average rating given by all the requesters the worker has completed tasks for, the worker's personality traits, and so on.
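To make the thresholds above concrete, here is a minimal Python sketch of the time-based part of the categorization. The function name, the exact cutoffs, and the label used for workers who exceed the estimate are my own illustration, not anything Daemo currently does.

 # Hypothetical sketch: map the ratio of actual to estimated completion
 # time onto the persona labels proposed above.
 def assign_persona(actual_minutes: float, estimated_minutes: float) -> str:
     ratio = actual_minutes / estimated_minutes
     if ratio < 0.50:
         return "Pro worker"
     if ratio < 0.75:
         return "Avg worker"
     if ratio <= 1.00:
         return "Beginner"
     return "Over estimate"  # took longer than the requester's estimate

 # Example: a task estimated at 10 minutes, finished in 4 minutes.
 print(assign_persona(4, 10))  # -> "Pro worker"

In practice, the cutoffs would probably need to vary by task type, since estimated durations differ widely between tasks.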

Boomerang is an interesting feature for bringing back high-quality workers for future tasks through a rating system. A worker will see tasks from requesters she/he rated highly in the past; however, it is not clear what motivates the worker to go back and complete a task for the same requester. Payment can of course be considered an extrinsic motivator, but is there an intrinsic motivator involved that we are missing?

The article mentions how a requester can rate a worker based on work quality. How will the requester judge work quality for tasks that do not have a correct/incorrect response? For instance, assume the requester has posted a Likert-type survey question, "I feel anxious when I have misplaced my smartphone", with responses ranging from 1 to 5 (strongly disagree to strongly agree). In this scenario, the response can be anything, depending on several factors related to the worker's personality, associations with smartphones, and so on. It is not clear how a requester can identify high-quality work here, even if there is an exploratory hypothesis or research question. For such scenarios, if Daemo could provide an overall average or frequency count computed from all the other workers who responded to the same question as a comparison tool, the requester could see how close to or far from the overall average the worker's response is. For instance, if the requester can see that out of 99 other workers who responded to the same question, 55 reported strongly agree, 30 reported agree, and 5 reported neutral, then the worker's response of strongly disagree might be an outlier. Obviously, this doesn't mean the responses of the other workers are correct, but the comparison provides context for the requester when deciding on the quality of a response; a small sketch of this comparison follows below.
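As a rough illustration of the comparison tool suggested above, here is a minimal Python sketch. It assumes Likert responses are stored as integers from 1 (strongly disagree) to 5 (strongly agree); the pool of other responses and the "rare response" threshold are hypothetical values chosen only to make the example run, not figures from the article.

 from collections import Counter

 # Hypothetical sketch: summarize how the rest of the pool answered and
 # flag the worker's response if very few other workers chose it.
 def likert_context(other_responses, worker_response, rare_fraction=0.05):
     counts = Counter(other_responses)
     share = counts[worker_response] / len(other_responses)
     return counts, share < rare_fraction

 # Pool loosely following the scenario above; the remaining 9 responses
 # are not specified there, so they are assumed here.
 pool = [5] * 55 + [4] * 30 + [3] * 5 + [2] * 5 + [1] * 4
 counts, is_rare = likert_context(pool, worker_response=1)
 print(dict(counts))  # {5: 55, 4: 30, 3: 5, 2: 5, 1: 4}
 print(is_rare)       # True: "strongly disagree" stands out from the pool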

A worker's task list is ordered based on her/his ratings of requesters for tasks completed in the past. Why not also order the tasks based on the worker's preferences, abilities, and experience? Worker preferences could be a one-time setting captured while creating a Daemo account and could include questions on personal interest in a subject matter, preferred task length, pay per task, and so on; a sketch of such preference-based ordering is shown below. Going back to my discussion of personas, these factors could also be used to categorize the types of workers.
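Here is a sketch of what preference-aware ordering could look like, assuming Daemo stored a few one-time preference fields per worker. The field names, the equal weighting, and the example tasks are all hypothetical; a real ranking would presumably combine this score with the existing requester ratings.

 # Hypothetical sketch: score each task against the worker's one-time
 # preference settings and sort the task list by that score.
 def preference_score(task: dict, prefs: dict) -> float:
     score = 0.0
     if task["subject"] in prefs["interests"]:
         score += 1.0
     if task["minutes"] <= prefs["max_minutes"]:
         score += 1.0
     if task["pay"] >= prefs["min_pay"]:
         score += 1.0
     return score

 prefs = {"interests": {"images", "surveys"}, "max_minutes": 15, "min_pay": 0.50}
 tasks = [
     {"name": "Tag galaxy images", "subject": "images", "minutes": 5, "pay": 0.40},
     {"name": "Transcribe a lecture", "subject": "audio", "minutes": 30, "pay": 1.00},
 ]
 tasks.sort(key=lambda t: preference_score(t, prefs), reverse=True)
 print([t["name"] for t in tasks])  # the image task ranks first for this worker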

  • Few other questions I have in general on reading the paper

The article mentions that rating a requester allows a worker to continue to work for the same requester. Why is this a requirement? What will happen if the worker does not have the experience to complete certain future tasks posted by the same requester?

It is not clear from the article how the reputation score for a task is computed.

The article states that a worker can rate a requester based on task clarity, payment, and communication. I feel it would help if task difficulty (as perceived and reported as an approximate estimate by the requester while creating the task), in relation to the estimated task duration, could also be factored in.

The article discusses the cascaded release of tasks based on ratings. Why doesn't this take into account the duration for which the task will be open/active? What will happen if a requester posts a task that must be completed by 50 workers within 2 days?

Isn't rating extra work for both workers and requesters? I wonder if there is some way this step could be mediated by Daemo.


Milestone Contributors

Slack usernames of all who helped create this wiki page submission: @presrini