A Blog by Jonathan Low

 

Jun 4, 2012

Can Data Tame Risk?

'In God we trust. All others bring data.'

That bon mot, now widely quoted, originated with W. Edwards Deming, who was arguably America's greatest statistician.

The irony is that Deming was a prophet without honor in his own country. His penchant for precision and prickly insistence on facts made many executives of his day uncomfortable. Most preferred camaraderie and the judgment with which they believed their experience had blessed them. 'You can never get fired for buying IBM,' was another expression of this mindset.

So Deming went where he was wanted - and needed. In the late 50s and early 60s, that was Japan. A decade or so later, that country's manufacturers began their determined and largely successful assault on the global economy.

Since that time, data has enjoyed an honored place in the pantheon of business decision making. Today, the notion of Big Data dominates discussion of the future for those managing the tech/business interface. And that is, by and large, a good thing.

There is just one issue: successful application of the wisdom embedded in data requires judgment. So the interpretation of data depends upon careful analysis, a keen eye for anomalies, and a skeptical attitude toward anything smacking of certainty or of the notion that 'forever' is a logical concept.

As development of tech products becomes faster, cheaper and more powerful, this approach puts ever greater pressure on those charged with making larger decisions with greater financial implications in less time. This is the story of one such set of decision-making criteria. Its ultimate validity has yet to be determined, but its application can be evaluated in context now. JL

Brian Christian reports in Wired:
Over the past decade, the power of A/B testing has become an open secret of high-stakes web development. It’s now the standard (but seldom advertised) means through which Silicon Valley improves its online products. Using A/B, new ideas can be essentially focus-group tested in real time: Without being told, a fraction of users are diverted to a slightly different version of a given web page and their behavior compared against the mass of users on the standard site. If the new version proves superior—gaining more clicks, longer visits, more purchases—it will displace the original; if the new version is inferior, it’s quietly phased out without most users ever seeing it.
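To make the mechanics concrete, here is a minimal sketch of the kind of traffic split described above: deterministic, hash-based bucketing. The function name, experiment name, and 10 percent fraction are all hypothetical choices for illustration.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, fraction_b: float = 0.10) -> str:
    """Deterministically bucket a user into 'A' (control) or 'B' (variant).

    Hashing the user ID together with the experiment name gives each user
    a stable assignment, so the same visitor sees the same version every time.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000            # uniform value in [0, 9999]
    return "B" if bucket < fraction_b * 10_000 else "A"

# Divert 10 percent of traffic to the experimental page; the rest see the control.
print(assign_variant("user-42", "homepage-redesign"))
```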

A/B allows seemingly subjective questions of design—color, layout, image selection, text—to become incontrovertible matters of data-driven social science.

Dan Siroker helps companies discover tiny truths, but his story begins with a lie. It was November 2007 and Barack Obama, then a Democratic candidate for president, was at Google’s headquarters in Mountain View, California, to speak. Siroker—who today is CEO of the web-testing firm Optimizely, but then was a product manager on Google’s browser team—tried to cut the enormous line by sneaking in a back entrance. “I walked up to the security guard and said, ‘I have to get to a meeting in there,’” Siroker recalls. There was no meeting, but his bluff got him in.

At the talk, Obama fielded a facetious question from then-CEO Eric Schmidt: “What is the most efficient way to sort a million 32-bit integers?” Schmidt was having a bit of fun, but before he could move on to a real question, Obama stopped him. “Well, I think the bubble sort would be the wrong way to go,” he said—correctly. Schmidt put his hand to his forehead in disbelief, and the room erupted in raucous applause. Siroker was instantly smitten. “He had me at ‘bubble sort,’” he says. Two weeks later he had taken a leave of absence from Google, moved to Chicago, and joined up with Obama’s campaign as a digital adviser.
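For readers wondering why bubble sort is the wrong way to go: the answer usually given to Schmidt's puzzle is radix sort, which exploits the fixed 32-bit width to run in linear time, while bubble sort needs on the order of n-squared comparisons (roughly half a trillion for a million integers). A quick sketch for illustration, not part of the original article:

```python
def radix_sort_u32(nums):
    """LSD radix sort for 32-bit unsigned integers: four stable passes,
    one byte at a time, O(n) overall versus bubble sort's O(n^2)."""
    for shift in (0, 8, 16, 24):
        buckets = [[] for _ in range(256)]
        for x in nums:
            buckets[(x >> shift) & 0xFF].append(x)
        nums = [x for bucket in buckets for x in bucket]
    return nums

print(radix_sort_u32([3_000_000_000, 7, 42, 1]))   # [1, 7, 42, 3000000000]
```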

At first he wasn’t sure how he could help. But he recalled something else Obama had said to the Googlers: “I am a big believer in reason and facts and evidence and science and feedback—everything that allows you to do what you do. That’s what we should be doing in our government.” And so Siroker decided he would introduce Obama’s campaign to a crucial technique—almost a governing ethos—that Google relies on in developing and refining its products. He showed them how to A/B test.

After joining the Obama campaign, Siroker used A/B to rethink the basic elements of the campaign website. The new-media team already knew that their greatest challenge was turning the site’s visitors into subscribers—scoring an email address so that a drumbeat of campaign emails might eventually convert them into donors. Their visit would start with a splash page—a luminous turquoise photo of Obama and a bright red “Sign Up” button. But too few people clicked the button. Under Siroker’s tutelage, the team approached the problem with a new precision. They broke the page into its component parts and prepared a handful of alternatives for each. For the button, an A/B test of three new word choices—“Learn More,” “Join Us Now,” and “Sign Up Now”—revealed that “Learn More” garnered 18.6 percent more signups per visitor than the default of “Sign Up.” Similarly, a black-and-white photo of the Obama family outperformed the default turquoise image by 13.1 percent. Using both the family image and “Learn More,” signups increased by a thundering 40 percent.
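The arithmetic behind such a comparison is simple to sketch. The visitor counts below are hypothetical (the article reports only the percentage lifts); the point is how a relative lift and a standard two-proportion z-test fall out of the raw conversion numbers:

```python
from math import sqrt

def ab_lift_and_z(conv_a, n_a, conv_b, n_b):
    """Relative lift of B over A, plus a two-proportion z-score."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    lift = (p_b - p_a) / p_a
    p_pool = (conv_a + conv_b) / (n_a + n_b)       # pooled rate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return lift, (p_b - p_a) / se

# Hypothetical counts: "Sign Up" (A) vs. "Learn More" (B) on equal traffic.
lift, z = ab_lift_and_z(conv_a=8_000, n_a=100_000, conv_b=9_488, n_b=100_000)
print(f"lift = {lift:.1%}, z = {z:.1f}")           # lift = 18.6%, z = 11.8
```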

Most shocking of all to Obama’s team was just how poorly their instincts served them during the test. Almost unanimously, staffers expected that a video of Obama speaking at a rally would handily outperform any still photo. But in fact the video fared 30.3 percent worse than even the turquoise image. Had the team listened to instinct—if it had kept “Sign Up” as the button text and swapped out the photo for the video—the sign-up rate would have slipped to 70 percent of the baseline. (“Assumptions tend to be wrong,” as Siroker succinctly puts it.) And without the rigorous data collection and controls of A/B testing, the team might not even have known why their numbers had fallen, chalking it up perhaps to some decline in enthusiasm for the candidate rather than to the inferior site revamp. Instead, when the rate jumped to 140 percent of baseline, the team knew exactly what, and whom, to thank. By the end of the campaign, it was estimated that a full 4 million of the 13 million addresses in the campaign’s email list, and some $75 million in money raised, resulted from Siroker’s careful experiments.

A/B testing was a new insight in the realm of politics, but its use on the web dates back at least to the turn of the millennium. At Google—whose rise as a Silicon Valley powerhouse has done more than anything else to spread the A/B gospel over the past decade—engineers ran their first A/B test on February 27, 2000. They had often wondered whether the number of results the search engine displayed per page, which then (as now) defaulted to 10, was optimal for users. So they ran an experiment. To 0.1 percent of the search engine’s traffic, they presented 20 results per page; another 0.1 percent saw 25 results, and another, 30.

Due to a technical glitch, the experiment was a disaster. The pages viewed by the experimental groups loaded significantly slower than the control did, causing the relevant metrics to tank. But that in itself yielded a critical insight—tenths of a second could make or break user satisfaction in a precisely quantifiable way. Soon Google tweaked its response times and allowed real A/B testing to blossom. In 2011 the company ran more than 7,000 A/B tests on its search algorithm. Amazon.com, Netflix, and eBay are also A/B addicts, constantly testing potential site changes on live (and unsuspecting) users.

Today, A/B is ubiquitous, and one of the strange consequences of that ubiquity is that the way we think about the web has become increasingly outdated. We talk about the Google homepage or the Amazon checkout screen, but it’s now more accurate to say that you visited a Google homepage, an Amazon checkout screen. What percentage of Google users are getting some kind of “experimental” page or results when they initiate a search? Google employees I spoke with wouldn’t give a precise answer—“decent,” chuckles Scott Huffman, who oversees testing on Google Search. Use of a technique called multivariate testing, in which myriad A/B tests essentially run simultaneously in as many combinations as possible, means that the percentage of users getting some kind of tweak may well approach 100 percent, making “the Google search experience” a sort of Platonic ideal: never encountered directly but glimpsed only through imperfect derivations and variations.
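A sketch of how such multivariate assignment can work: each user is hashed independently into one arm of every concurrent factor, so almost nobody lands on the all-default page. The factors and variants below are illustrative inventions, not Google's actual tests:

```python
import hashlib

# Hypothetical concurrent factors. With 3 x 2 x 3 = 18 combinations,
# only about 1 user in 18 ever sees the all-default page.
FACTORS = {
    "button_text": ["Sign Up", "Learn More", "Join Us Now"],
    "hero_image": ["turquoise", "family_bw"],
    "results_per_page": [10, 20, 30],
}

def assign_all(user_id: str) -> dict:
    """Pick one variant per factor via an independent, stable hash."""
    page = {}
    for factor, variants in FACTORS.items():
        h = int(hashlib.md5(f"{factor}:{user_id}".encode()).hexdigest(), 16)
        page[factor] = variants[h % len(variants)]
    return page

print(assign_all("user-42"))   # e.g. {'button_text': 'Learn More', ...}
```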

Still, despite its widening prevalence, the technique is not simple. It takes some fancy technological footwork to divert user traffic and rearrange a site on the fly; segmenting users and making sense of the results requires deep knowledge of statistics. This is a barrier for any firm that lacks the resources to create and adjudicate its own tests. In 2006 Google released its Website Optimizer, which provided a free tool for anyone who wanted to run A/B tests. But the tool required site designers to create full sets of code for both A and B—meaning that nonprogrammers (marketing, editorial, or product people) couldn’t run tests without first taxing their engineers to write multiple versions of everything. Consequently there was a huge delay in getting results as companies waited for the code to be written and go live.

In 2009 this remained a problem in need of a solution. After the Obama campaign ended, Siroker was left amazed at the efficacy of A/B testing but also at the paucity of tools that would make it easily accessible. “The thought of using the tools we used then made me grimace,” he says. By the end of the year, Siroker joined forces with another ex-Googler, Pete Koomen, and they launched a startup with the goal of bringing A/B tools to the corporate masses, dubbing it Optimizely. They signed up their first customer by accident. “Before we even spent much time working on the product,” Siroker explains, “I called up one of the guys from the Obama campaign, who had started up a digital marketing firm. I told him what I was up to, and about 20 minutes in, he suddenly said, ‘Well, that sounds great. Send me an invoice.’ He thought it was a sales call.”

The pair had made a sale, but they still didn’t have a product. So Siroker and Koomen started coding. Unlike the earlier A/B tools, they designed Optimizely to be usable by nonprogrammers, with a powerful graphical interface that lets clients drag, resize, retype, replace, insert, and delete on the fly. Then it tracks user behavior and delivers results. It’s an intuitive platform that offers the A/B experience, previously the sole province of web giants like Google and Amazon, to small and midsize companies—even ones without a hardcore engineering or testing team.

What this means goes way beyond just a nimbler approach to site design. By subjecting all these decisions to the rule of data, A/B tends to shift the whole operating philosophy—even the power structure—of companies that adopt it. A/B is revolutionizing the way that firms develop websites and, in the process, rewriting some of the fundamental rules of business.

Here are some of these new principles.

Choose everything.
The online payment platform WePay designed its entire homepage through a testing process. “We did it as a contest,” CEO Bill Clerico says. “A few of our engineers built different homepages, and we just put them in rotation.” For two months, every user that came to WePay.com was randomly assigned a homepage, and at the end the numbers made the decision.

In the past, that exercise would have been impossible—and because it was impossible, the design would have emerged in a completely different way. Someone in the company, perhaps Clerico himself, would have wound up choosing a design. But with A/B testing, WePay didn’t have to make a decision. After all, if you can test everything, then simply choose all of the above and let the customers sort it out.

For that same reason, A/B increasingly makes meetings irrelevant. Where editors at a news site, for example, might have sat around a table for 15 minutes trying to decide on the best phrasing for an important headline, they can simply run all the proposed headlines and let the testing decide. Consensus, even democracy, has been replaced by pluralism—resolved by data.

The mantra of “choose everything” also becomes a way for companies to test out relationships with other companies—and in so doing becomes a powerful way for them to win new business and take on larger rivals. In 2011 a fund-raising site called GoFundMe was talking with WePay about the possibility of switching to its service from payment giant PayPal. GoFundMe CEO Brad Damphousse was open about his dissatisfaction with PayPal’s service; WePay responded, as startups usually do, by claiming that its product solved all the problems that plagued its larger competitor. “Of course we were skeptical and didn’t really believe them,” Damphousse recalls with a laugh.

But using A/B, WePay could present Damphousse with an irresistible proposition: Give us 10 percent of your traffic and test the results against PayPal in real time. It was an almost entirely risk-free way for the startup to prove itself, and it paid off. After Damphousse saw the data on the first morning, he switched half his traffic by the afternoon—and all of it by the next day.

Data makes the call.
Google insiders, and A/B enthusiasts more generally, have a derisive term to describe a decision-making system that fails to put data at its heart: HiPPO—“highest-paid person’s opinion.” As Google analytics expert Avinash Kaushik declares, “Most websites suck because HiPPOs create them.”

Tech circles are rife with stories of the clueless boss who almost killed a project because of a “mere opinion.” In Amazon’s early days, developer Greg Linden came up with the idea of giving personalized “impulse buy” recommendations to customers as they checked out, based on what was in their shopping cart. He made a demo for the new feature but was shot down. Linden bristled at the thought that the idea might not even be tested. “I was told I was forbidden to work on this any further. It should have stopped there.”

Instead Linden worked up an A/B test. It showed that Amazon stood to gain so much revenue from the feature that all arguments against it were instantly rendered null by the data. “I do know that in some organizations, challenging an SVP would be a fatal mistake, right or wrong,” Linden wrote in a blog post on the subject. But once he’d done an objective test, putting the idea in front of real customers, the higher-ups had to bend. Amazon’s culture wouldn’t allow otherwise.

Siroker recalls similar shifts during his time with the Obama campaign. “It started as a pretty political environment—where, as you can imagine, HiPPO syndrome reigned supreme. And I think over time people started to see the value in taking a step back and saying, ‘Well, here’s three things we should try. Let’s run an experiment and see what works. We don’t know.’”

This was the culture that he had come from at Google, what you might call a democracy of data. “Very early in Google’s inception,” Siroker explains, “if an engineer had an idea and had the data to back it up, it didn’t matter that they weren’t the VP of some business unit. They could make a case. And that’s the culture that Google believed in from the beginning.” Once adopted, that approach will beat the HiPPOs every time, he says. “A/B will empower a whole class of businesses to say, ‘We want to do it the way Google does it. We want to do it the way Amazon does it.’”

Says WePay’s Bill Clerico: “On Facebook, under the heading of Religious Views, my profile says: ‘In God we trust. All others, bring data.’”

The risk is making only tiny improvements.
One consequence of this data-driven revolution is that the whole attitude toward writing software, or even imagining it, becomes subtly constrained. A number of developers told me that A/B has probably reduced the number of big, dramatic changes to their products. They now think of wholesale revisions as simply too risky—instead, they want to break every idea up into smaller pieces, with each piece tested and then gradually, tentatively phased into the traffic.

But this approach, and the mindset that comes with it, has its own dangers. Companies may protect themselves against major gaffes but risk a kind of plodding incrementalism. They may find themselves chasing “local maxima”—places where the A/B tests might create the best possible outcome within narrow constraints—instead of pursuing real breakthroughs. Google’s Scott Huffman cites this as one of the greatest dangers of a testing-oriented mentality: “One thing we spend a lot of time talking about is how we can guard against incrementalism when bigger changes are needed. It’s tough, because these testing tools can really motivate the engineering team, but they also can wind up giving them huge incentives to try only small changes. We do want those little improvements, but we also want the jumps outside the box.” Paraphrasing a famous Henry Ford maxim—“If I’d asked my customers what they wanted, they’d have said a faster horse”—Huffman adds, “If you rely too much on the data, you never branch out. You just keep making better buggy whips.”

Data can make the very idea of lessons obsolete.
The single biggest evolution in A/B testing over its history is not how pervasive it has become but rather how fast it has become. In the early ’00s, test results were typically delayed 24 hours: You ran a test today, saw the results tomorrow, and learned something—a principle, a rule of thumb—to apply to future designs. This might explain why testing began in marketing teams before it moved to product teams: Ads generally stick around over many days and weeks, making them amenable to revision at that pace. But for many web businesses, the product is too dynamic to sit still for that long.

That’s all different today. “Ten years ago you did not have data. Five years ago the best reporting tools were a day behind,” says Yulie Kim, VP of product at the furniture e-tailer One Kings Lane. “But we’re in a world now where you can’t wait a whole day to get your data.” Kim’s boss, CEO Doug Mack, says the speed of the feedback has become integral to the operation: “Big data is not enough. It has to be real-time data that we can act on during the course of the day. This has been a huge boon for the growth of our business.”

The difference with live testing is not just that there is no time to learn and apply lessons. It’s more radical than that: There are no clear lessons to learn, no rules to extract.

At the gaming network IGN, for example, executives found that crisp, clear prose was outperforming hyped-up buzzwords (like free and exclusive) on certain parts of the homepage. But in previous years, the opposite had been true. Why? They talked and talked about it, but no one could figure it out. Soon they realized that it simply didn’t matter. A/B would guide them at ground level, so there was no need to worry about why users behaved in one way or another.

Similarly, One Kings Lane has a business model that involves swapping out inventory every day, and Optimizely’s A/B tool plays a big role in the on-the-fly improvement that happens within each of these “flash sales.” Why do people like the ottoman better if it appears to the left of the throw rug than if it appears to the right? There’s no time to ask the question, and no reason to answer it. After all, what does it matter if you can get the right result? Keep testing, keep reacting, and save your philosophizing for the off-hours.

If you find that last implication to be somewhat troubling, you’re not alone. Even if we accept that testing is useful in learning how to run a business, it’s hard to take the next step and accept that we won’t learn how to run our businesses at all. Indeed, as A/B becomes more widespread, we might not even know what choices the tests are making: One of the burgeoning trends in A/B is to automate the whole process of adjudicating the test, so that the software, when it finds statistical significance, simply diverts all traffic to the better-performing option—no human oversight necessary.
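The automated adjudication described above might look something like the following sketch, which is illustrative rather than any vendor's actual algorithm. A naive version simply checks a running z-score against a threshold; real systems must also correct for the inflated false-positive rate that comes from repeatedly 'peeking' at an ongoing experiment.

```python
from math import sqrt

def adjudicate(conv_a, n_a, conv_b, n_b, z_threshold=1.96):
    """Return the winning variant once the z-score clears the threshold,
    else None (keep splitting traffic and collecting data)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)
    z = (p_b - p_a) / sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    if abs(z) >= z_threshold:
        return "B" if z > 0 else "A"               # divert all traffic to the winner
    return None

# Hypothetical running totals: B is converting better, and the test calls it.
print(adjudicate(conv_a=480, n_a=10_000, conv_b=560, n_b=10_000))  # prints B
```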

On a more fundamental level, the culture of A/B cuts against our common-sense ideas about how innovation happens. Startups, we imagine, largely succeed or fail by long-term strategic decisions that are impossible to test with such precision. Likewise, it’s hard to imagine a midsize company A/B-ing its way out of obscurity to become a billion-dollar titan. Even among the tech giants, it seems like the most important decisions are immune to focus-grouping, let alone A/B testing.

Yes, Google has built its empire by listening to data, but we reserve our awe for the sort of vision that Steve Jobs brought to Apple, and we nod along at the famous answer he gave when asked how much market testing he did for the iPad: “None,” he said, echoing Henry Ford. “It’s not the consumers’ job to know what they want.” And in fact, it’s impossible to imagine how to arrive at something like the original Macintosh, with its lack of expansion slots and its impregnable chassis, entirely through evolutionary tweaks. How could the no-slots version possibly have won over the slots version? How could a one-button mouse edge out a two-button mouse? Yet somehow a number of ostensibly negative features, when combined in a precise way, result in something serene, elegant, and Zen.

It’s a false dichotomy, of course, to pose vision against data, lofty genius against head-down experimentation, as if companies are forced to choose between the two. Every firm ought to test the small stuff, at least; and no firm should (or does) use A/B for everything. Google doesn’t test things at random but relies on intuition and, yes, vision to narrow down the infinite number of possible changes to a finite group of testable candidates.

But it’s also true that the A/B culture, in part by shaming its HiPPOs into submission, can sometimes lead companies down dead-end paths. Testing allows you to constantly react to user preferences, but that doesn’t necessarily make you agile; 10,000 ongoing tweaks don’t add up to a fundamental change of direction when one is needed. Almost every successful company has to radically alter course at some point, and often such double-down decisions can’t be made in degrees or with a soft launch. And just as a testing culture can make it hard to address the big problems, it can also make it hard to stop sweating the small stuff. “I had a recent debate over whether a border should be three, four, or five pixels wide, and was asked to prove my case,” wrote ex-Google designer Douglas Bowman on his blog the day he left the company. “I can’t operate in an environment like that.”

The elegant minimalism of Apple’s design has trickled out into the world beyond technology. So it’s fair to ask: Could the scientific rigor of Google’s A/B ethos start making waves outside the web? Is it possible to A/B the offline world? With the rise of big data, some major retailers are embracing the experimental method. Chains will test out store floor plans in a few locations and then implement them nationwide if they boost revenues. Some retail software packages will oversee the rollout of individual products, putting them on a few shelves throughout the system and tracking their sales.

But the constraints of physical reality make it hard to experiment nearly as often, or to control one’s experiments so that the outcomes aren’t maddeningly ambiguous—biased, perhaps, by location factors or weather or some other unknown (and unknowable) variable. Faced with those ambiguities, the HiPPOs can still have their say without fear of contradiction. Only in the digital realm is it possible to be two different things at the exact same place and time and thereby to generate data that upends the whole nature of institutional authority.

Many web workers, having tasted of the A/B apple, can no longer imagine operating in any other environment. Indeed, they begin to look with pity on the offline world, a terrifying place where each of us possesses only one life to live rather than two (or more) in parallel. “There’s this grilled cheese place down the street,” says Jim Kingsbury, marketing VP at One Kings Lane. “They can’t test anything. Should they price the sandwich at $6 or $6.50? What should be at the top of the menu? Those are purely intuitive choices that they have to make.” At one Silicon Valley office, I overheard an employee complain that dating can’t be A/B tested; an online profile can, to be sure, but once you’re in a relationship with a specific person, 100 percent of the “traffic” is on the line with every decision.

The testable web is so much safer. No choices are hard, and no introspection is necessary. Why is B better than A? Who can say? At the end of the workday, we can only shrug: We went with B. We don’t know why. It just works.
