The basic idea is simple. Firms with millions of visitors each day choose a small percentage of these to be (unwitting) subjects for the experiment they wish to run. These individuals might be picked at random from all visitors or, more likely, selected on the basis of some predetermined characteristics hypothesized to predict a response to the experimental treatment. In the simplest design, half of the selected subjects, chosen at random (or at random within each stratum), receive the control, the experimental baseline, which is most often simply business as usual. The other half receive the treatment, which could be an alteration of the look and feel of the site, but might also be exposure to promotions, changed shipping terms, special offers of after-sales service, or a host of other possibilities. Online sites rarely experiment with price, at least in this treatment-control way, owing to the bad publicity Amazon suffered in the late 1990s and early 2000s from such experiments.
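The assignment step above can be sketched in a few lines. This is a minimal illustration, not any firm's actual pipeline; the strata (mobile vs. desktop) and the 50/50 split are invented assumptions.

```python
# Sketch: randomly split users into treatment and control within each
# stratum, so each stratum is balanced. Strata and split are illustrative.
import random

random.seed(0)

def assign(users, stratum_of):
    """Return a dict mapping each user to 'treatment' or 'control'."""
    groups = {}
    for u in users:
        groups.setdefault(stratum_of(u), []).append(u)
    assignment = {}
    for stratum, members in groups.items():
        random.shuffle(members)
        half = len(members) // 2
        for u in members[:half]:
            assignment[u] = "treatment"
        for u in members[half:]:
            assignment[u] = "control"
    return assignment

users = list(range(1000))
# Hypothetical stratum rule: odd user ids are "mobile", even are "desktop".
assignment = assign(users, stratum_of=lambda u: "mobile" if u % 2 else "desktop")
```

Stratifying before randomizing guarantees the treatment and control groups look alike on the stratifying characteristic, rather than merely alike in expectation.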
Following this, the data are analyzed by comparing various metrics under treatment and control. These might be things like the duration of engagement during the session in which the treatment occurred, or the frequency or amount of sales during a session, which are fairly easy to measure. They might also be things like long-term loyalty or other time-series aspects of consumer behavior that are a bit more delicate. The crudest tests are nothing more than two-sample t-tests with equal variances, but the analysis can be far more sophisticated, involving complicated regression structures containing many additional correlates besides the experimental treatment.
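The crudest test mentioned above looks like this in practice. The numbers here are simulated and purely illustrative: an assumed baseline session duration of 10 minutes, a hypothetical treatment lift of 0.5, and equal variances in both groups.

```python
# Toy two-sample t-test with pooled (equal) variances on simulated
# session durations. All effect sizes and noise levels are invented.
import math
import random

random.seed(1)
control = [random.gauss(10.0, 3.0) for _ in range(5000)]    # business as usual
treatment = [random.gauss(10.5, 3.0) for _ in range(5000)]  # hypothetical lift

def pooled_t(a, b):
    """Equal-variance two-sample t statistic for mean(b) - mean(a)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)   # pooled variance
    return (mb - ma) / math.sqrt(sp2 * (1 / na + 1 / nb))

t = pooled_t(control, treatment)
```

With samples this large, even a modest lift produces an enormous t statistic, which is why online experiments so easily clear conventional significance thresholds.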
When the experiment indicates that the treatment is successful (or at least more successful than unsuccessful), these innovations are often adopted and incorporated into the user experience. Mostly, such innovations are small UX things like the color of the background or the size of the fonts, but they occasionally concern big things as well, like the amount of space devoted to advertising or even what information the user sees.
After all this time and all the successes that have been obtained, were we to add up the improvements in various metrics from all these experiments, we would conclude that consumers are spending in excess of 24 hours per day engaging with certain sites and that sales at others will exceed global wealth by a substantial amount. Obviously, the experiments, no matter how well done, are missing something important.
The answer, of course, is game theory, or at least the consideration of strategic responses by rivals, in assessing the effect of certain innovations.
At first blush, this answer seems odd and self-serving (the latter part is correct) in that I made no mention of other firms in any of the above. The experiments were purely about the relationship between a firm and its consumers/browsers/users/visitors etc. Since there are zillions of these users, and since they are very unlikely to coordinate on their own, there seems little scope for game theory at all. Indeed, these problems look like classic decision problems. But while rivals are not involved in any of this directly, they are present indirectly and strategically: they affect the next best use of a consumer's time or money, and the changes they make to their own sites to improve engagement, lift, revenue, and so on will be reflected in our own relationship with customers.
To understand this idea, it helps to get inside the mind of the consumer. When visiting site X, a consumer chooses X over some alternative Z. Perhaps the choice is conscious: the consumer has tried both X and Z, knows their features, and has found X to be better. Perhaps the choice is unconscious. The point is simply that the consumer has a choice. Let us now imagine that site X is experimenting between two user experiences, x and x', while firm Z presently offers experience z. The consumer's action, y, then depends not just on x but also on z, or at least on the perception of z. Thus, we predict some relationship
y = a + b x + c z + error
when presented with the control, and a similar relationship, with x' replacing x, under the treatment. If we then regress y on x, we suffer from omitted variable bias: z should have been in the regression but was not. However, so long as z is uncorrelated with the treatment (and, thanks to randomization, there is no reason it should be), the regression coefficient on the treatment dummy will correctly tell us the change in y from the change in x, which is, of course, precisely what we want to know.
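This point is easy to verify by simulation. In the sketch below, z genuinely drives y but is omitted from the analysis; because the treatment is assigned by a coin flip independent of z, the simple difference in means (equivalently, the OLS coefficient on the treatment dummy) still recovers the true effect b. The coefficients a, b, c and the noise levels are all made up for illustration.

```python
# Sketch: omitted variable z does not bias the treatment estimate when
# treatment is randomized. True model: y = a + b*x + c*z + noise.
import random

random.seed(2)
a, b, c = 1.0, 2.0, 1.5
n = 20000
d = [random.random() < 0.5 for _ in range(n)]        # randomized treatment dummy
x = [1.0 if di else 0.0 for di in d]                 # x' = 1 (treatment), x = 0
z = [random.gauss(0.0, 1.0) for _ in range(n)]       # rival's (omitted) offering
y = [a + b * xi + c * zi + random.gauss(0.0, 1.0) for xi, zi in zip(x, z)]

# Difference in means = OLS coefficient on the treatment dummy.
treated = [yi for yi, di in zip(y, d) if di]
ctrl = [yi for yi, di in zip(y, d) if not di]
b_hat = sum(treated) / len(treated) - sum(ctrl) / len(ctrl)
```

The estimate lands near the true b = 2.0 despite z never appearing in the analysis, which is exactly the internal validity that randomization buys.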
Thus, buoyed by our experiment, we confidently implement x' since the statistics tell us it will raise y by 15% (say).
But notice that this analysis is the statistical equivalent of inward thinking. Despite its scientific garb, it is no more valid an analysis than a strategic analysis hypothesizing that the rival will make no changes to its strategy regardless of what we might do. When we think about large decisions, like mergers, such a hypothesis is obviously silly. If Walmart acquired eBay tomorrow, no one would claim that Amazon would have no reaction whatever, that it would keep doing what it had been doing. It would, of course, react, and, were we representing Walmart, we would want to take that reaction into account when deciding how much to pay for eBay.
But it is no less silly to think that a major business innovation undertaken by X will lead to no response from rivals either. To see the problem, imagine we were interested in long run consumer behavior in response to innovation x'. Our experiment tells us the effect of such a change, conditional on the rivals' strategies, but says nothing about the long-term effect once our rivals respond. To follow through with our example, suppose that switching to x' on a real rather than experimental basis will trigger a rival response that changes z to z'. Then the correct measure of the effect of our innovation on y is
Change in y = b(x' - x) + c(z' - z)
The expression divides readily into two terms. The first term represents the inward thinking effect. In a world where others are not strategic, this measures the effect of the change in x. The second term represents the outward thinking strategic effect. This is the rival's reaction to the changed relationship that firm X has with its customer. No experiment can get at this term, no matter how large the dataset. This failure is not a matter of insufficient power or a lack of metrics to measure y or even z; it is the problem of identifying a counterfactual, z', that will only come to pass if X adopts the innovation.
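A back-of-the-envelope version of the decomposition makes the gap concrete. All numbers below are hypothetical: b and c are the consumer-response coefficients from the relationship above, and z' is an assumed rival reaction that no experiment could have observed.

```python
# Sketch of Change in y = b(x' - x) + c(z' - z), with invented numbers.
b, c = 2.0, 1.5
x, x_prime = 0.0, 1.0          # our control and treatment experiences
z, z_prime = 0.0, -0.8         # rival's current offer and its assumed response

inward = b * (x_prime - x)     # what the experiment measures
outward = c * (z_prime - z)    # strategic effect, invisible to the experiment
total = inward + outward
```

Under these illustrative numbers the experiment promises a lift of 2.0, but the realized long-run lift is only 0.8 once the rival responds; the experiment is not wrong about its term, it is simply silent about the other.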
Now, all is not lost. There are many strategies one can use to forecast z', but one needs to be open to things that the data can never tell us, like the effect of a hypothetical rival reaction to a hypothetical innovation when viewed through the lens of consumer choice. This is not a problem that statistics or machine learning can ever solve. Game theory is not simply the best analysis for such situations; it is the only analysis available.