By Johann Robette, Supply Chain Expert @ Vekia

In most supply chains, forecast accuracy rules. Is it your North Star metric too? Then this article may well change your point of view…

***

In a previous article, I made a strong case for a new generation of metrics, developed by Vekia, that focuses on the business impact of forecasts rather than on their accuracy/precision.

For those who missed this introductory article, here’s the executive summary of “Decision Impact”: 10 reasons to implement the new generation of business-oriented metrics [1].

  • The purpose of forecasting is not (and has never been) to provide the best forecast ever! Its purpose is to enable the best decision.
  • Yet, existing metrics only measure the intrinsic accuracy of the forecasts. None of them takes into account their actual use and the value they deliver.
  • By leveraging a “cost-oriented” digital twin, practitioners could benefit from a family of new metrics that focus on the “Decision Impact” of a forecast.
  • This new family of so-called “Decision Impact” (DI) metrics opens up new perspectives that benefit not only demand planners but also the company as a whole.

This introductory article received great feedback, support and constructive criticism from practitioners, academics and software vendors. I would really like to thank them all!

Almost everyone was eagerly interested in examples of application — what I would call a Proof of Concept (POC). Here we go!

***

This new article is the first of a series that aims at demonstrating the value of such an approach through the lens of a real-world use case.

This series is then an opportunity to look at existing practices from a different perspective. So please, feel free to share, comment and criticize! I always see such engagement as a chance to be less wrong and to improve continuously.

What’s in it for me? [Spoiler alert]

  • This article provides plenty of detail about the effective implementation of the “Decision Impact” (DI) metrics
  • This first use case of DI metrics demonstrates that accuracy metrics lead to the selection of forecasting methods that are simply not appropriate from a business perspective. In fact, the “best” methods can trigger the worst decisions!

The “Walmart” M5-competition dataset as a playground

Photo by Fabio Bracht on Unsplash

At Vekia, we leverage the “Decision Impact” metrics for the benefit of our customers in various industries (Telco, Pharma, Retail, Energy, Maintenance, etc.). But as one would easily understand, we do value their privacy and won’t share anything about these contexts (at least not without prior authorization).

Let’s then switch to a dataset that is both relevant and publicly available. I’m pretty sure the one we chose will make this series even more insightful!

Among the potential forecast-oriented datasets, one of the most exciting was published in 2020 to support the 5th edition of the Makridakis competition.

I’m pretty sure you’ve already heard of this worldwide prediction competition organized by the Makridakis Open Forecasting Center (MOFC) and hosted on Kaggle.com. The competition takes its name from Dr Spyros Makridakis, Professor at the University of Nicosia, who has made (and keeps making) impressive contributions to forecasting practices. If you’d like to know more about the M5 competition, here is the place to learn about it: https://www.kaggle.com/c/m5-forecasting-accuracy

Let’s come back to the dataset.

What’s in this dataset?

Photo by Myriam Jessier on Unsplash

This dataset was provided by Walmart, the world’s largest company by revenue, to forecast daily sales for the next 28 days. The data covers ten stores in three US states (California, Texas, and Wisconsin) and includes details about items, departments, categories, and stores. In addition, it also contains explanatory variables such as prices, promotions, day of the week, and special events.

With 30,490 product × store pairs, this dataset is a wonderful playground for our experimentation.
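For readers who’d like to follow along, here’s a minimal sketch of how the dataset can be loaded (using the file names published on Kaggle; this is plain illustrative Python, not the code behind our digital twin):

```python
import pandas as pd

# File names as published on Kaggle (https://www.kaggle.com/c/m5-forecasting-accuracy)
sales = pd.read_csv("sales_train_validation.csv")  # one row per product x store series
calendar = pd.read_csv("calendar.csv")             # dates, events, SNAP days
prices = pd.read_csv("sell_prices.csv")            # weekly sell prices per store/item

print(len(sales))  # -> 30490 product x store series
```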

The cherry on the cake: hundreds of forecasters from all around the world have leveraged this dataset to generate the best possible forecasts. Once the competition was over, the MOFC announced the results and released the “to-be-predicted” sales, as well as 24 “benchmark” forecasts and the top 50 “submissions”.

So we not only have data describing stores, products and sales… but also 74 forecasts we can leverage to bench-test our “Decision Impact” metrics.

In its entirety, this dataset thus contains no fewer than 63 million daily forecasts. Awesome!

10 stores × 3,049 products × 28 days × 74 forecasts = 63,175,280 daily forecasts

What’s not in this dataset?

The Makridakis competition is all about forecasting. Unsurprisingly, then, it provides no clue about the decisions those forecasts are intended to support.

But as you know, “Decision Impact” metrics require considering the decision process that consumes the forecasts. So we need some additional data to model a realistic replenishment process and a cost function.

Let’s then make some assumptions! If you’re not interested in those details, feel free to jump to the next “Let’s start playing with this dataset” section.

Replenishment strategy

Leadtime

  • We don’t have any clue about the lead time required between the creation of a replenishment order and the effective store replenishment. Let’s then assume 3 days are necessary.
  • Given that orders are triggered on day #1, we’ll only consider demand starting from day #4.

Order cycle

  • As you know, retail stores are not replenished only once a month. Order cycles often vary depending on product types, sales velocity, storage capacity, etc. But for the sake of simplicity, let’s opt for a weekly order cycle.
  • Our replenishment process then aims at covering one week of sales. Decisions triggered on day #1 cover sales of days #4 to #10. Decisions triggered on day #8 cover sales of days #11 to #17. Decisions triggered on day #15 cover sales of days #18 to #24.

Replenishment policy

  • Let’s implement a dynamic (T, S) ‘periodic-review, order-up-to-level’ replenishment policy.
  • This policy considers forecasts, initial inventories, safety stocks and pack sizes. As pack sizes allow for different order quantities (depending on the applied rounding function), our policy assesses the economic cost of each scenario (as defined below) and eventually picks the most profitable one. A minimal sketch of this policy follows the safety stock assumption below.

Safety stocks

  • Let’s configure safety stocks to reach a 95% service level (z-Score 1.65, standard deviation of 2 years of weekly sales history).
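To make these assumptions concrete, here is a minimal Python sketch of such a dynamic (T, S) policy with cost-based pack rounding. It’s a simplification for illustration only: the function names and the `cost_of` callback are mine, not the actual implementation of our digital twin.

```python
import numpy as np

Z_SCORE = 1.65  # ~95% service level

def safety_stock(weekly_sales_history: np.ndarray) -> float:
    # Standard deviation computed over 2 years of weekly sales history
    return Z_SCORE * float(np.std(weekly_sales_history, ddof=1))

def order_quantity(weekly_forecast: float, on_hand: float,
                   pack_size: int, ss: float, cost_of) -> int:
    """Dynamic (T, S) 'periodic-review, order-up-to-level' decision
    for one weekly cycle. `cost_of` maps a quantity to its simulated cost."""
    order_up_to = weekly_forecast + ss       # S: one week of demand + safety stock
    needed = max(order_up_to - on_hand, 0.0)
    # Pack sizes allow two candidate quantities (round down vs round up);
    # the policy scores both scenarios and keeps the cheaper one.
    candidates = {int(np.floor(needed / pack_size)) * pack_size,
                  int(np.ceil(needed / pack_size)) * pack_size}
    return min(candidates, key=cost_of)
```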

Cost function

The cost function is used to score each replenishment decision based on its true business impact. Although costs could be expressed in various units, it’s pretty convenient to express them as a monetary value representing the cost of a given decision.

To evaluate these costs, we’ll again have to enrich the existing dataset and make some assumptions:

Ordering, shipping and handling cost

  • Fulfilling an order generates costs for ordering, preparation, shipping, transportation, etc. These are usually referred to as “fixed costs”. Let’s assume these costs amount to $40 per $1,000 bracket of purchase value.

Holding cost

  • Holding costs are associated with the storage of unsold inventories. Let’s assume the annual holding cost is 10% of the inventory value (valued at purchase price), i.e. 10% / 52 ≈ 0.19% per week.

Shortage cost

  • When demand exceeds the available inventory, both the demand and customer goodwill may be lost. This cost is usually called the shortage cost. As retailers offer a wide range of similar products, part of the demand is carried over to other products. Let’s assume that only half of the unmet sales are effectively lost. The shortage cost is then measured as 50% of the gross margin of each lost sale.

The total decision cost is then computed as the sum of those 3 elementary costs.
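As an illustration, here is a minimal sketch of this cost function in Python. Parameter values are the assumptions listed above; the bracket rounding of the fixed costs is my interpretation, and the margin rates come from the next section.

```python
import math

WEEKLY_HOLDING_RATE = 0.10 / 52  # 10% of inventory value per year ~= 0.19% per week
LOST_SALES_SHARE = 0.5           # half of the unmet demand is lost for good

def ordering_cost(purchase_value: float) -> float:
    # $40 per started $1,000 bracket of purchase value (bracket rounding assumed)
    return 40.0 * math.ceil(purchase_value / 1000.0) if purchase_value > 0 else 0.0

def holding_cost(unsold_value: float) -> float:
    # Weekly cost of carrying unsold inventory, valued at purchase price
    return WEEKLY_HOLDING_RATE * unsold_value

def shortage_cost(unmet_units: float, sell_price: float, margin_rate: float) -> float:
    # 50% of the gross margin of each lost sale
    return LOST_SALES_SHARE * unmet_units * sell_price * margin_rate

def decision_cost(purchase_value, unsold_value, unmet_units, sell_price, margin_rate):
    # Total cost of one replenishment decision: the sum of the 3 elementary costs
    return (ordering_cost(purchase_value)
            + holding_cost(unsold_value)
            + shortage_cost(unmet_units, sell_price, margin_rate))
```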

Additional product description

Gross margin

  • The dataset provides the sale price of each product, but the gross margin is unknown. As this data is required, the following values are assumed based on Statista.com benchmarks [2]: Foods: 56.77%, Hobbies: 50.37%, Household: 43.29%.

Pack sizes

  • Obviously, in grocery retail, few products are replenished individually. Yet, pack sizes are not part of this dataset. We’ll then assign a custom pack size to each product, based on a proprietary dataset describing common pack sizes used in grocery retail.

Initial inventory

  • When replenishments occur, shelves are hopefully not empty. Obviously, the remaining inventories are still in place and should be accounted for.
  • As our replenishment policy aims at guaranteeing a safety stock at the end of a coverage period, let’s take this quantity as an initial inventory.

Let’s start playing with this dataset

Photo by Power Lai on Unsplash

Based on the above assumptions, the dataset enables the analysis of 6,860,250 replenishment decisions driven by 74 different forecasting methods. Let’s now see what we can learn from them.

As a first analysis, I propose to compute each method’s costs and compare them to their accuracy ranks.

If classical forecast accuracy metrics natively promote the best methods, then there’s no need to add new metrics on top of them. But if they don’t, “Decision Impact” metrics make sense…

74 forecast methods

This dataset proposes 74 different forecast methods: 24 “benchmark” methods and the top 50 methods “submitted” by the M5 competitors. Those methods range from the simplest “Naive method” to the most advanced AI-based ones.

Evaluating costs

Each of these methods triggers its own decisions. The previously defined digital twin enables us to simulate these replenishment decisions and measure their costs.

The results show a wide range of costs, from $77.2k to $116.5k! The $39.2k difference is not anecdotal: it represents no less than 1.4% of the period’s turnover.

The graph below represents each method and its associated cost. Methods are ordered from the cheapest to the most expensive one.

[Chart: decision cost of each forecasting method, ordered from the cheapest to the most expensive]

Here’s the first insight we can get: the cheapest solutions happen to be mostly competitors’ “submissions”. But… not all submissions outperform basic benchmarks, even though they are in the “top 50” out of 5,507 teams. In fact, more than a third of them (17) perform worse than the “F_ESX” benchmark!

M5-competition leaderboard

Let’s now compare those costs to the M5 official leaderboard [3].

The M5 organization scored each method based on a forecast accuracy metric called Weighted Root Mean Squared Scaled Error (WRMSSE) [4].
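For reference, here is the definition from the M5 competitors’ guide (notation slightly adapted). For each series, the RMSSE compares the squared forecast error over the h = 28-day horizon to the average squared one-step change of the historical sales; the WRMSSE then aggregates the series’ RMSSE using weights w_i proportional to their recent dollar sales:

```latex
\mathrm{RMSSE}
  = \sqrt{\frac{\frac{1}{h}\sum_{t=n+1}^{n+h}\left(Y_t-\hat{Y}_t\right)^2}
               {\frac{1}{n-1}\sum_{t=2}^{n}\left(Y_t-Y_{t-1}\right)^2}},
\qquad
\mathrm{WRMSSE} = \sum_i w_i\,\mathrm{RMSSE}_i
```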

Update 27/05/2021: as pointed out by Nicolas Vandeput (many thanks), the M5 official leaderboard is built on the average WRMSSE of each method, computed at various levels (from L1 “Global” to L12 “Item x Site”).

The graph below adds each method’s average accuracy rank to the previous display.

[Chart: decision cost of each method with its average accuracy rank overlaid]

Submission ranks are, on average, lower (i.e. better) than benchmark ranks. But clearly, the competition ranks correlate poorly with business costs. The most striking example is Matthias’ method, which was ranked second in the competition, whereas it is the second most costly method from a business perspective, after the Naive benchmark.

The competition leaderboard correlates poorly with costs. This is a fact. But what about other accuracy metrics? Would they correlate more strongly? Let’s assess this!

Other metrics leaderboard

To assess how good other metrics are at capturing business costs, we computed MAPE, wMAPE, sMAPE, MSLE, MAE, MSE, RMSE, WRMSSE, BIAS and NFM (the last two being bias metrics), along with the corresponding method ranks.
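For transparency, here is a minimal sketch of how most of these metrics can be computed with NumPy. It’s a simplification: MAPE is only evaluated where actual sales are non-zero, the BIAS normalization below is an assumption, and WRMSSE and NFM (which need sales history and weights) are left out.

```python
import numpy as np

def accuracy_metrics(y: np.ndarray, f: np.ndarray) -> dict:
    """y: actual daily sales, f: forecasts (same shape)."""
    err = f - y
    abs_err = np.abs(err)
    nz = y != 0  # MAPE is undefined when actuals are zero
    return {
        "MAPE":  float(np.mean(abs_err[nz] / y[nz])),
        "wMAPE": float(abs_err.sum() / y.sum()),
        "sMAPE": float(np.mean(2 * abs_err / (np.abs(y) + np.abs(f) + 1e-12))),
        "MSLE":  float(np.mean((np.log1p(y) - np.log1p(f)) ** 2)),
        "MAE":   float(abs_err.mean()),
        "MSE":   float((err ** 2).mean()),
        "RMSE":  float(np.sqrt((err ** 2).mean())),
        "BIAS":  float(err.sum() / y.sum()),  # assumed normalization
    }
```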

The graphs below display those ranks. The methods are always sorted from the cheapest to the most costly one.

[Charts: method ranks per accuracy metric, methods sorted from the cheapest to the most costly]

Although most metrics attribute lower ranks to the first half than to the second one, the correlation is far from good enough from a business perspective. Some poorly performing methods still get surprisingly good (low) ranks!

Decision Impact metrics

Now let’s put our newly introduced metrics on the test bench.

Our new metrics include DIao, DIno and DIna. DIao measures the financial cost of an erroneous forecast; it is therefore directly comparable to accuracy metrics.
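The exact formulation will be detailed later in this series, but the core idea of DIao can be sketched as follows (my shorthand; the actual metric may differ in normalization): simulate the cost of the decisions actually triggered by the forecast, simulate the cost of the decisions a perfect-knowledge “oracle” forecast would have triggered, and take the difference.

```python
def di_ao(actual_decision_cost: float, oracle_decision_cost: float) -> float:
    # Sketch: the extra cost incurred because decisions were based on the
    # forecast instead of on perfect knowledge of demand ("oracle" decisions).
    return actual_decision_cost - oracle_decision_cost
```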

How is it correlated to costs then? The graph below displays the various methods’ costs and their associated DIao ranks.

[Chart: decision cost of each method with its DIao rank overlaid]

By design, Decision Impact metrics consider costs in their computation. Unsurprisingly, then, the DIao metric correlates perfectly with costs.

This metric therefore does a great job of identifying the most appropriate forecasting methods when business matters!

Decision impact leaderboard

Oops… I was about to end this article without sharing the “business” leaderboard! Here it is!

“The last will be first, and the first last”… our congratulations go out to:

  • Nodalpoints (Athens, Greece), previously ranked 21
  • Hiromitsh Kigure (Japan), previously ranked 45
  • leoclement (Paris, France), previously ranked 18

The complete DIao “Business” leaderboard is below, including the new DIao ranks and the original M5-competition ranks:

[Table: complete DIao “Business” leaderboard with DIao ranks and original M5-competition ranks]

***

This being said, one might still wonder: “Why should we focus on DIao (which requires the simulation of actual and oracle decision costs) when the Forecast Decision Cost would be a simpler and more straightforward metric?”

The short answer is: “Because DI metrics enable way more uses than Forecast Decision Cost.”

We’ll go through each of them in the next articles of this series.

***

This article aims to shed light on current practices, their limitations and possible improvements. It is certainly not perfect and has its own limitations.

If you found this to be insightful, please share, comment and clap… But also, feel free to challenge and criticize. Contact me if you want to discuss this further!

In all cases, stay tuned for the next articles! In the meantime, visit our website www.vekia.fr to learn more about our expertise and experience in delivering high value to Supply Chain.

Linkedin: www.linkedin.com/in/johann-robette/

Web: www.vekia.fr