The Perils of Chasing p99 | by Krishna Rao

Hidden correlations can mislead optimization methods

Photo by Chun Kit Soo on Unsplash

p99, or the value below which 99% of observations fall, is widely used to track and optimize worst-case performance across industries. For example, the time taken to load a page, fulfill a shopping order, or deliver a shipment can all be optimized by tracking p99.

While p99 is undoubtedly valuable, it’s crucial to recognize that it ignores the top 1% of observations, which may have an unexpectedly large impact when they are correlated with other critical business metrics. Blindly chasing p99 without checking for such correlations can undermine other business objectives.

In this article, we will analyze the limitations of p99 through an example with dummy data, understand when to rely on p99, and explore alternate metrics.

Consider an e-commerce platform where a team is tasked with optimizing the shopping cart checkout experience. The team has received customer complaints that checking out is rather slow compared to other platforms. So, the team grabs the latest 1,000 checkouts and analyzes the time taken to check out. (I created some dummy data for this; you’re free to use it and tinker with it without restrictions.)

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# Load the dummy checkout data (one row per order)
order_time = pd.read_csv('https://gist.githubusercontent.com/kkraoj/77bd8332e3155ed42a2a031ce63d8903/raw/458a67d3ebe5b649ec030b8cd21a8300d8952b2c/order_time.csv')
# Histogram of checkout times
fig, ax = plt.subplots(figsize=(4,2))
sns.histplot(data = order_time, x = 'fulfillment_time_seconds', bins = 40, color = 'k', ax = ax)
print(f'p99 for fulfillment_time_seconds: {order_time.fulfillment_time_seconds.quantile(0.99):0.2f} s')

Distribution of order checkout times. Image by author.

As expected, most shopping cart checkouts seem to complete within a few seconds, and 99% of the checkouts happen within 12.1 seconds. In other words, the p99 is 12.1 seconds. There are a few long-tail cases that take as long as 30 seconds. Since they are so few, they may be outliers and should be safe to ignore, right?

Now, if we don’t pause and analyze the implication of that last sentence, it could be quite dangerous. Is it really safe to ignore the top 1%? Are we sure checkout times are not correlated with any other business metric?

Let’s say our e-commerce company also cares about gross merchandise value (GMV) and has an overall company-level objective to increase it. We should immediately check whether the time taken to check out is correlated with GMV before we ignore the top 1%.

import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter
# Load order values and join them to checkout times on order_id
order_value = pd.read_csv('https://gist.githubusercontent.com/kkraoj/df53cac7965e340356d6d8c0ce24cd2d/raw/8f4a30db82611a4a38a90098f924300fd56ec6ca/order_value.csv')
df = pd.merge(order_time, order_value, on='order_id')
fig, ax = plt.subplots(figsize=(4,4))
sns.scatterplot(data=df, x="fulfillment_time_seconds", y="order_value_usd", color = 'k', ax = ax)
# Log scale on the y-axis, with plain (non-scientific) tick labels
ax.set_yscale('log')
ax.yaxis.set_major_formatter(ScalarFormatter())

Relationship between order value and fulfillment time. Image by author.
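A scatter plot shows the relationship visually; a rank correlation can put a number on it. Below is a minimal, dependency-free sketch of Spearman's rank correlation, computed as the Pearson correlation of the ranks. It is not part of the original analysis: the function names `average_ranks`, `pearson`, and `spearman` are illustrative, and the toy inputs are made up.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: order value grows exponentially with checkout time,
# so the relationship is perfectly monotonic and the result is close to 1.0
toy_times = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_values = [2.0, 4.0, 8.0, 16.0, 32.0]
print(round(spearman(toy_times, toy_values), 3))
```

With the merged data, the equivalent call would be `spearman(list(df.fulfillment_time_seconds), list(df.order_value_usd))`. A rank correlation is used here because it picks up monotonic but non-linear relationships, which a plain linear correlation can understate.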

Oh boy! Not only is the cart value correlated with checkout times, it increases exponentially for longer checkout times. What’s the penalty of ignoring the top 1% of checkout times?

# Share of total order value held by orders slower than the p99 checkout time
slow = df.fulfillment_time_seconds > df.fulfillment_time_seconds.quantile(0.99)
pct_revenue_ignored = df.loc[slow, 'order_value_usd'].sum() / df.order_value_usd.sum() * 100
print(f"If we only focused on p99, we'd ignore {pct_revenue_ignored:0.0f}% of revenue")
## >>> If we only focused on p99, we'd ignore 27% of revenue

If we only focused on p99, we’d ignore 27% of revenue (27 times greater than the 1% we thought we were ignoring). That is, p99 of checkout times is p73 of revenue. Focusing on p99 in this case inadvertently harms the business. It ignores the needs of our highest-value shoppers.

df.sort_values('fulfillment_time_seconds', inplace = True)
cols = ['fulfillment_time_seconds', 'order_value_usd']
dfc = df[cols].cumsum()/df[cols].sum() # cumulative share of each column's total
fig, ax = plt.subplots(figsize=(4,4))
ax.plot(dfc.fulfillment_time_seconds.values, color = 'k')
ax2 = ax.twinx() # second y-axis for order value
ax2.plot(dfc.order_value_usd.values, color = 'magenta')
ax.set_ylabel('cumulative fulfillment time')
ax.set_xlabel('orders sorted by fulfillment time')
ax2.set_ylabel('cumulative order value', color = 'magenta')
# Mark the 99th percentile of orders and the revenue share it covers
ax.axvline(0.99*len(df), linestyle='--', color = 'k')
ax.annotate('99% of orders', xy = (970,0.05), ha = 'right')
ax.axhline(0.73, linestyle='--', color = 'magenta')
ax.annotate('73% of revenue', xy = (0,0.75), color = 'magenta')

Cumulative distribution function of order fulfillment times and order value. Image by author.

Above, we see why there is a large mismatch between the percentiles of checkout times and GMV. The GMV curve rises sharply near the 99th percentile of orders, resulting in the top 1% of orders having an outsize impact on GMV.

This is not just an artifact of our dummy data. Such extreme correlations are unfortunately not uncommon. For example, the top 1% of Slack’s customers account for 50% of revenue. About 12% of UPS’s revenue comes from just 1 customer (Amazon).

To avoid the pitfalls of optimizing for p99 alone, we can take a more holistic approach.

One solution is to track both p99 and p100 (the maximum value) simultaneously. This way, we won’t be prone to ignoring high-value users.
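As a rough illustration of tracking the two together, the sketch below reports p99, p100, and the gap between them. The helper names `quantile` and `tail_summary` and the sample checkout times are hypothetical. The quantile uses linear interpolation between order statistics, which is also the default in pandas and NumPy.

```python
def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    xs = sorted(values)
    pos = q * (len(xs) - 1)          # fractional index into the sorted data
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def tail_summary(times):
    """Report p99 and p100 side by side, plus the gap between them."""
    p99 = quantile(times, 0.99)
    p100 = max(times)
    return {"p99": p99, "p100": p100, "gap": p100 - p99}


# Hypothetical checkout times in seconds: 99 fast orders and one very slow one
checkout_times = [2.0] * 99 + [30.0]
summary = tail_summary(checkout_times)
print(summary)
```

A large gap between p99 and p100 is a cue to look at who sits in that last 1% before declaring the tail safe to ignore.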

Another solution is to use revenue-weighted p99 (or p99 weighted by gross merchandise value, profit, or any other business metric of interest), which assigns greater importance to observations with higher associated revenue. This metric ensures that optimization efforts prioritize the most valuable transactions or processes, rather than treating all observations equally.
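There is more than one way to define a weighted quantile. The sketch below assumes the simplest one: sort observations by checkout time, accumulate their revenue, and return the first time at which the cumulative revenue share reaches q. The function name `weighted_quantile` and the toy numbers are illustrative, not taken from the dataset above.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal in length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    pairs = sorted(zip(values, weights))      # ascending by value
    total = sum(w for _, w in pairs)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    threshold = q * total
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= threshold:
            return value
    return pairs[-1][0]                       # guard against rounding at q = 1


# Toy data: the slowest order carries most of the revenue
times_s = [2.0, 3.0, 4.0, 30.0]
revenue_usd = [10.0, 10.0, 10.0, 70.0]

print(weighted_quantile(times_s, [1, 1, 1, 1], 0.5))   # order-weighted median: 3.0
print(weighted_quantile(times_s, revenue_usd, 0.5))    # revenue-weighted median: 30.0
```

On the merged data, a revenue-weighted p99 would be `weighted_quantile(list(df.fulfillment_time_seconds), list(df.order_value_usd), 0.99)`. Because the slow orders carry more revenue, this number sits further into the tail than the plain p99.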

Finally, when extreme correlations exist between the performance and business metrics, a more stringent p99.5 or p99.9 can mitigate the risk of ignoring high-value users.
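To get a feel for how much a stricter percentile helps, we can compute the share of revenue that falls above each candidate cutoff. The sketch below is one way to do that, under the assumption that "ignored" means the slowest (1 - q) fraction of orders. The function name `revenue_share_ignored` and the synthetic orders are made up for illustration.

```python
def revenue_share_ignored(times, revenue, q):
    """Share of total revenue held by the slowest (1 - q) fraction of orders."""
    pairs = sorted(zip(times, revenue))            # slowest orders last
    n_ignored = round((1.0 - q) * len(pairs))      # orders above the cutoff
    ignored = sum(r for _, r in pairs[len(pairs) - n_ignored:])
    total = sum(r for _, r in pairs)
    return ignored / total


# Synthetic data: 1,000 orders where the slowest ones are also the largest
toy_times = [float(t) for t in range(1, 1001)]
toy_revenue = [10.0] * 990 + [200.0] * 8 + [1000.0] * 2

for q in (0.99, 0.995, 0.999):
    share = revenue_share_ignored(toy_times, toy_revenue, q)
    print(f"p{q * 100:g}: {share:.1%} of revenue lies above the cutoff")
```

In this toy data the ignored share shrinks as the cutoff tightens, but it stays well above the fraction of orders ignored, so a stricter percentile reduces the problem without removing it.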

It’s tempting to rely solely on metrics like p99 for optimization efforts. However, as we saw, ignoring the top 1% of observations can negatively impact a large percentage of other business outcomes. Tracking both p99 and p100 or using revenue-weighted p99 can provide a more comprehensive view and mitigate the risks of optimizing for p99 alone. At the very least, let’s remember to avoid narrowly focusing on some performance metric while losing sight of overall customer outcomes.