The Problem
We here at BigOven.com have 170,000 recipes online, accessible via the web and iPhone and Android applications. And one of our primary jobs is to help cooks around the world get to the right recipe as quickly as possible.
Often, the best way to do that uses free-form text-search. To that end, we’ve long had recipe tagging, which allows people to assign their own tags (like “Low Carb”, “Spicy”, “Winter”) to recipes, and people can search on them. For instance, to find all “spicy chicken” recipes, just visit BigOven.com, change the search target to “anywhere” and enter “spicy chicken”. You’ll get this page: http://www.bigoven.com/recipes/any/spicy%20chicken, which shows some great options.
But we’ve never really undertaken the project to say “which one category would this recipe belong in?” In other words, “If you had to choose, which folder would you place this recipe in, in a filing cabinet?”
Why is that important? Well, for starters, a rigid hierarchy makes browseability of the data set much easier, less confusing, and more accurate.
While a recipe like Caesar Salad might be legitimately tagged with “Appetizer”, “Main Course” and “Salad”, a new user searching simply on “Main Dishes” might be surprised to see a salad in there.
Strict categorization can also help with Search Engine Optimization, because the internal browse mechanism of the site improves (see the links, highlighted in yellow in the picture to the right, as just one example).
Categorization can also show you, among your Favorite Recipes, which are Desserts and Appetizers. And in this way, it can even help us get to know your food preferences better – e.g., are you a “Pancakes” or “Waffles” household? Which types of recipes are you Favoriting/Trying/Viewing the most, and when?
It can help with very handy Search Options on both the mobile applications and the website.
So, we’ve had this need to go through every single one of the 170,000+ recipes and put them into categories.
This is a fairly easy task for any recipe n, requiring perhaps 10-30 seconds per recipe. The problem is, we’re not a gigantic team, and count(n)=172,000+. If we hired a few college interns for this task, and they took 30 seconds per recipe and worked flat-out without any lunch breaks, we’re looking at half a man-year before this task would be completed.
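As a rough check on that estimate, the arithmetic can be sketched in a few lines of Python. The 2,000-hour work year (50 weeks of 40 hours) is an assumption of this sketch, not a figure from our planning, and the function name is purely illustrative.

```python
# Back-of-the-envelope estimate of the manual categorization effort.
RECIPES = 172_000             # recipe count quoted above
HOURS_PER_WORK_YEAR = 2_000   # assumption: 50 weeks x 40 hours


def person_years(seconds_per_recipe: float) -> float:
    """Total labor, in work-years, to categorize every recipe once."""
    total_hours = RECIPES * seconds_per_recipe / 3600
    return total_hours / HOURS_PER_WORK_YEAR


if __name__ == "__main__":
    for secs in (10, 20, 30):
        print(f"{secs} s per recipe: {person_years(secs):.2f} person-years")
```

Under that assumption the range works out to roughly a quarter to three-quarters of a person-year, so "half a man-year" is a fair midpoint for the 10-30 second range.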
The Solution: Mechanical Turk To The Rescue
Enter the On Demand Workforce, specifically embodied by Amazon.com Mechanical Turk. Workers around the world are logging into Mechanical Turk, checking for discrete tasks, and doing them. Once their work is approved, they get paid by Amazon.com (which was previously paid by the task creator).
We simply created a Human Intelligence Task (HIT) to help us categorize these recipes into a main and subcategory, and BOOM: the work should be completed, and completed fairly well, within 10 days.
Tools like SmartSheet can help you create these tasks quickly, but since we’re geeks over here, we wanted a little more control and rolled our own, using the Amazon.com Mechanical Turk SDK, specifically the .NET tools for Mechanical Turk.
We created a simple web form that displays a recipe and asks the worker to categorize it. We funded our Amazon.com account with some money. Then, we created a command line tool to post these HITs to Amazon’s platform.
Controlling Quality
We got the categorization platform developed fairly quickly (2-3 hours).
But the biggest stumbling block once the first few hundred were published became clear right away: of the first 100 categorizations, about 20% of them were wrong. We checked, and 2 of the 14 workers somewhere in the world were just gaming the system, earning money by just clicking random categorizations and pressing the submit button. (Is “Mom’s Banana Bread” a “Green Salad”? I think not.)
We rejected payment and nulled out those categorizations, but we don’t have time or money to manually review every single one of these. (Put another way, we’d have to pass this cost on to the consumer, and we like being nicely cashflow positive while keeping much of what we offer free.)
How do you remove the bad actors and ensure higher quality? There are a few methods:
- Method 1: Issue the task multiple times, and pay only when they match. This works, and works very well, but in areas where judgment is involved, we think it can be a bit unfair to the worker. One person might think Bourbon-flavored Pecan Pie belongs in “Desserts > Liquor-flavored Desserts” and another might think it belongs in “Desserts > Pies”, and both would have a fair argument. Do you want to penalize both for trying? Also, it’s expensive, because you have to pay for both winning answers.
- Method 2: Insist upon a passing grade on a pre-qualification test. This is a method Amazon.com encourages, and it is a common method for Mechanical Turk Requesters. The problem here is that anyone who wants to game the system still can, by “playing nice” during the test and then randomly categorizing later.
- Method 3: In-line control tests. That is, intermittently supply tests that have known, correct answers (and of course pay for these answers, even though you know them). Keep score. If the worker gets too many of them wrong, then block them from further work. Ask them to contact you for manual review, and do so quickly when they do. Unban them if they look good and it’s a judgement call, but leave their ban in place if they are clearly bad actors. This is the method we’re using, and it’s worked very, very well. We’re now only paying for the good results, and we’re happy to do so. In fact, this new method may allow us to raise our payment rates a bit for future projects, since $$ spent on waste has gone down.
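For the curious, the score-keeping in Method 3 can be sketched roughly as below. This is a minimal illustration in Python, not our production code (which uses the .NET tools). The class name, the thresholds, and the category labels are all hypothetical, and nothing here talks to Mechanical Turk itself.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them to how subjective the task is.
MIN_CONTROLS = 5       # do not judge a worker on fewer control tests than this
MAX_ERROR_RATE = 0.4   # block once more than 40% of control answers are wrong


@dataclass
class WorkerScore:
    seen: int = 0        # control tests answered so far
    wrong: int = 0       # control tests answered incorrectly
    blocked: bool = False


class ControlChecker:
    """Tracks each worker's accuracy on recipes with known categories."""

    def __init__(self, known_answers):
        # Maps recipe id -> (main category, subcategory) for control recipes.
        self.known_answers = dict(known_answers)
        self.scores = {}

    def record(self, worker_id, recipe_id, answer):
        """Score one submission; return True while the worker is in good standing."""
        score = self.scores.setdefault(worker_id, WorkerScore())
        if score.blocked:
            return False
        expected = self.known_answers.get(recipe_id)
        if expected is not None:  # this recipe is a control test
            score.seen += 1
            if answer != expected:
                score.wrong += 1
            too_many_wrong = score.wrong / score.seen > MAX_ERROR_RATE
            if score.seen >= MIN_CONTROLS and too_many_wrong:
                score.blocked = True
                return False
        return True

    def is_blocked(self, worker_id):
        score = self.scores.get(worker_id)
        return score is not None and score.blocked

    def unblock(self, worker_id):
        """Clear the slate after a manual review goes in the worker's favor."""
        self.scores[worker_id] = WorkerScore()
```

Requiring a minimum number of control tests before blocking is a design choice of this sketch: it keeps a single judgment-call disagreement from banning an honest worker, while a random clicker still gets caught within a handful of controls.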
Results
At peak times, we’ve had over 300 “turkers” around the world working on recipe categorization. Results come flying in, sometimes 10 or 20 every second. And the results have been surprisingly good, especially after we adopted the in-line control tests described in Method 3 above.
Thanks to Turkers around the world, BigOven can now be much more easily navigated. Try it yourself – let’s look for a pasta salad on BigOven.com. Simply visit BigOven.com, click on the left hand sidebar for “Salad”. Then choose “[search options]” and change the subcategory to “Pasta”. Boom! You’re looking at pasta salads.
You can even then add ingredient keywords to explicitly include or exclude.
As I write this, more categorization work is being completed, but it’s going very quickly and we should be done in a few days.
Reflection
It is hard to communicate just how revolutionary it feels, when you launch the HITs on Amazon.com from your computer, and then you see, literally just seconds later, very good and correct results come in, as if by magic.
While yes, there is a whole separate political discussion to be had about this, from an employer’s standpoint, this is a task we would not have done under the traditional methods (we would have continued to pass on doing this work), so it is new work-for-hire that would not have existed. And while payment is issued to the workers who do it, we the employer get to skip over the creating and posting of job offers, the interviewing of folks, the arranging for a workspace, getting them connected to the network, and yes, regrettably letting them go when the task is complete.
Instead, we go straight to explaining the task, providing a form, pre-paying for results and launching it.
Not only do we not have to find, hire, train (and eventually let go) temporary workers for a massive data cleansing task, we do not have to wait very long to measure results, tune the process, and make our entire database much better.
It is revolutionary not just because it reduces costs, but because it allows a heretofore serial process to be massively parallelized, which means things can be done faster and less expensively (and therefore at lower cost to our users, if there is such a thing as lower than free) than ever before. Because of these structural (i.e., parallelizing) and significant cost changes, we can now take approaches to all kinds of problems that weren’t possible just 20 years ago.
And for us here at BigOven.com, since we have a ton of data and give away a lot of value for free, it means a lot.
Another verification approach that I have used is to ask the real question and then ask a second turker a question with the answer provided in the first and have them verify. This is similar to method 1 above but has the advantages of 1) allowing for the valid arguments called out in method 1 and 2) that the verification question, being a yes or no type, is cheaper to run.
Posted by: Frank Paterra | January 15, 2010 at 06:54 AM
Fantastic summary on the wonders and challenges associated with the de-facto micro-task oriented paid crowdsourcing solution - MTurk. Both the NYT and the FT heralded Labor-as-a-Service as a top trend benefiting startups in 2010. Another emerging player in the space is CloudCrowd.com. While MTurk takes a hands-off approach, CloudCrowd provides a customized full-service solution. CloudCrowd is not a simple Mechanical Turk overlay. It is its own distinct work distribution platform powered by its own crowd of motivated workers, eager to complete projects both similar and dissimilar to those that Steve completed for BigOven.com.
Posted by: Alec Dinner | January 19, 2010 at 04:11 PM