There is an excellent paper by Spyros Makridakis, Evangelos Spiliotis and Vassilios Assimakopoulos, which you can download here, that gives a detailed account of the philosophy and objectives of the M competitions since 1982 and of the organization, implementation and results of the latest M5 competition, which ended on June 30, 2020. The paper also includes many useful references that those interested in forecasting will find invaluable.
According to the organizers, there were more than 7,000 participants in the M5 competition from 101 countries and more than 88,700 submissions. The winner, to the surprise of many, was an undergraduate student from South Korea, YeonJun Im, who used “an equal weighted combination (arithmetic mean) of various LightGBM models” (more details in the paper referenced above).
I will not go into the details of the competition and the results because they are provided in detail in the referenced paper. Instead, I will concentrate on comments and suggestions.
Note that I do not consider myself a forecasting expert by any means, although in the last 25 years I have done extensive work, mainly in financial market forecasting and, before that, in forecasting for shipping and commodity markets. I gave a presentation in New York during the M4 conference award ceremony in December 2018 about the challenges in financial forecasting due to complexity and reflexivity. The presentation can be found here.
I will go right to the point: what I have learned in the last 25 years is that in most cases the best model ceases to be best in the future. Usually it is the model somewhere in the middle that stays robust although it has lower accuracy and higher error. This has been known to developers of investment and trading strategies for a while and the reason in a nutshell is curve-fitting.
My aim here is not to undermine the achievements of the top winners of any competition. Whether a genuine model or an over-fitted one won the top place in some competition, winning still required a lot of effort and skill. But this is the problem on a higher level:
Given a large population of participants in a competition, there is always a high probability that some random model(s) will perform well in the test sample by chance alone, regardless of the evaluation metric used. The probability cannot be estimated and depends on the nature of the price series, the methods used, the number of trials, etc. Therefore, this is a question that arises:
How do we know that the top models are not spurious even if they have a high score in the test sample?
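To make the concern concrete, here is a minimal simulation sketch in Python. It is not based on M5 data or on the M5 evaluation metric. It assumes a hypothetical contest in which every model has no skill at all (each directional call is a coin flip), and the function name, the number of models and the sample sizes are all arbitrary choices for illustration. It picks the model with the best test-sample accuracy and then checks the same model on a fresh forward sample.

```python
import random

def best_of_random_models(n_models, n_test, n_forward, seed=0):
    """Return (test accuracy, forward accuracy) of the test-sample winner
    among n_models skill-less models. All settings are hypothetical."""
    rng = random.Random(seed)
    best_test_acc = -1.0
    best_forward_acc = 0.0
    for _ in range(n_models):
        # Each prediction is correct with probability 0.5: no skill at all.
        test_acc = sum(rng.random() < 0.5 for _ in range(n_test)) / n_test
        # The forward sample is drawn independently of the test sample.
        forward_acc = sum(rng.random() < 0.5 for _ in range(n_forward)) / n_forward
        if test_acc > best_test_acc:
            best_test_acc = test_acc
            best_forward_acc = forward_acc
    return best_test_acc, best_forward_acc

# Assumed sizes: 7000 models, a short test sample, a long forward sample.
test_acc, forward_acc = best_of_random_models(n_models=7000, n_test=28, n_forward=2000)
print(f"winner's test accuracy:    {test_acc:.2f}")
print(f"winner's forward accuracy: {forward_acc:.2f}")
```

Under these assumptions the winner looks impressive in the short test sample only because it is the luckiest of thousands of trials, and its forward accuracy should sit close to 50%. Real competitions involve models with some skill, so this shows only the mechanism and not the size of the effect.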
In finance, one answer is provided by forward testing on actual, not previously recorded data. In this spirit then, these are my suggestions:
Forward testing of models before determining winners
The winners are announced after sufficient forward testing using data that were not available during the time of the contest to anyone, even the companies involved. The forward testing period can be as long as the test period or even longer. For example, in the case of sales forecasts, it could be one year into the future. In this way, one may obtain a better idea of how top models compare to models around median performance into the future.
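As a rough sketch of what such a comparison could look like, the Python snippet below simulates models that do differ in skill but whose measured error in any one period also contains noise. Everything here is an assumption for illustration: the Gaussian error model, the parameter values and the function name `simulate_forward_test` are hypothetical and do not come from any M competition.

```python
import random

def simulate_forward_test(n_models=1000, noise=1.0, seed=1):
    """Compare the test-period winner with the median-ranked model
    on an independent forward period. All numbers are hypothetical."""
    rng = random.Random(seed)
    # Assumed "true" error of each model; lower is better.
    true_err = [rng.gauss(10.0, 0.5) for _ in range(n_models)]
    # Observed error = true error + independent noise in each period.
    test_err = [e + rng.gauss(0.0, noise) for e in true_err]
    forward_err = [e + rng.gauss(0.0, noise) for e in true_err]
    # Rank models by test-period error, as a competition would.
    test_order = sorted(range(n_models), key=lambda i: test_err[i])
    winner = test_order[0]
    median_model = test_order[n_models // 2]
    # Forward-period rank of the test-period winner (1 = best).
    forward_order = sorted(range(n_models), key=lambda i: forward_err[i])
    return {
        "winner_test_err": test_err[winner],
        "winner_forward_err": forward_err[winner],
        "median_forward_err": forward_err[median_model],
        "winner_forward_rank": forward_order.index(winner) + 1,
    }

result = simulate_forward_test()
for name, value in result.items():
    print(name, round(value, 2))
```

When the noise is large relative to the spread in true skill, as assumed here, the winner's test error is flattered by luck, and its forward rank can be far from first. Comparing `winner_forward_err` with `median_forward_err` over many seeds gives a feel for how much of the test-period ranking survives.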
Obviously, this will reduce participant count because people are looking for immediate results.
Performance monitoring of models
It would be of more value to the field of forecasting to monitor the performance of all winning models throughout the years, at least for those series that are available from all competitions. This can provide a good idea of whether the models are specific to the data, among other things. For example, if the top models from M4 break down with M5 data even after proper adjustments and tuning, then an investigation of these findings may help to advance the field of forecasting more than selecting the top model from each competition. Why do models break down when they do? This is an important question looking for answers.
No leaderboard and forums
I suggest this, knowing that many will object, because forecasting is both an art and a science. In competitions there are “opportunistic participants” who basically buy a lottery ticket priced at $0 (apart from the time and cost of resources used) in the hope that they will be top performers and win the high prize. In my opinion, they use the forums to “fish” for good ideas in the hope of “adjusting” them and winning. This creates unnecessary noise, although it may be good for boosting participant count.
For example, during the Virtual Award Ceremony this week, while Dr. Chris Frey was delivering a very interesting talk 33 minutes into the ceremony, the viewer count was only 93. That was a mere 1.3% or less of the announced participant count.
This was not by any means a problem of the organizers; the Virtual Award Ceremony was very well organized and executed. But it shows that many people are interested only in the prize. In fact, after Nassim Taleb announced on Twitter that he would join the ceremony, the viewer count spiked to 173 during the presentation of Addison Howard from Google/Kaggle.
But the viewer count fell again below 100 after Nassim Taleb finished his presentation.
I believe large participation and “advancing the field of forecasting” do not necessarily correlate. In my humble opinion 100 top experts rather than 1000 random ones may have a better impact on advancing forecasting. Obviously, I do not qualify for this category but I know some people I have worked with in financial forecasting who do.
In 2016, I joined the Numerai competition for financial market predictions and submitted an extremely naïve forecast to see how it would fare, just for fun and because I had a 10-minute break from work. Click here for the relevant article. The naïve submission ranked 168 out of 402 submissions, most of which used advanced machine learning algos. Many participants were trying to “fish” a good method from the forum and published articles and then “fit” it to the leaderboard in an effort to win bitcoins. It cost them nothing other than their time.
Summary of suggestions
- Forward testing of models before determining winners
- Performance monitoring of models
- No leaderboard and forums
Although the above suggestions will affect participation, in my opinion there will be benefits to the forecasting field. Even if it is determined that there are no models that can withstand the test of time, this is still a useful result to many.
Finally, a Twitter poll.