<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Aeon Toolkit on Medium]]></title>
        <description><![CDATA[Stories by Aeon Toolkit on Medium]]></description>
        <link>https://medium.com/@aeon.toolkit?source=rss-6ea57a9bd09a------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*N24z6lTSjeVe5aB1</url>
            <title>Stories by Aeon Toolkit on Medium</title>
            <link>https://medium.com/@aeon.toolkit?source=rss-6ea57a9bd09a------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 06 May 2026 16:48:51 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@aeon.toolkit/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[The aeon Theta forecaster]]></title>
            <link>https://medium.com/@aeon.toolkit/the-aeon-theta-forecaster-afaa190d6dbb?source=rss-6ea57a9bd09a------2</link>
            <guid isPermaLink="false">https://medium.com/p/afaa190d6dbb</guid>
            <category><![CDATA[time-series-data]]></category>
            <category><![CDATA[time-series-forecasting]]></category>
            <category><![CDATA[forecasting]]></category>
            <category><![CDATA[timeseries]]></category>
            <category><![CDATA[time-series-analysis]]></category>
            <dc:creator><![CDATA[Aeon Toolkit]]></dc:creator>
            <pubDate>Fri, 15 Aug 2025 12:12:39 GMT</pubDate>
            <atom:updated>2025-08-15T12:12:39.523Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/343/1*rUB7Ukh5JZ87g7BeYytRSA.png" /></figure><p>As a summer side project I decided to implement a <a href="https://github.com/aeon-toolkit/aeon/pull/2978">Theta forecaster</a> in <a href="https://www.aeon-toolkit.org/en/stable/">aeon</a>. I had heard of it before implementing it, but didn’t really understand how it worked, so I will also include some description. It was proposed in 2000 in this paper:</p><p><em>Assimakopoulos, V., &amp; Nikolopoulos, K. (2000). The Theta model: a decomposition approach to forecasting. International Journal of Forecasting, 16(4), 521–530.</em><br><a href="https://doi.org/10.1016/S0169-2070(00)00066-2">https://doi.org/10.1016/S0169-2070(00)00066-2</a></p><p>A later paper states <em>“The original description of the method given by Assimakopoulos and Nikolopoulos involves several pages of algebraic manipulation.”</em> [2]. The actual Theta algorithm is really very simple. Medium doesn’t seem well set up for maths, so I will also direct you to our <a href="https://github.com/aeon-toolkit/aeon/blob/main/examples/forecasting/theta.ipynb">notebook</a> for a fuller description, but you don’t need pages of maths to understand it. It is a very fast forecaster that is surprisingly effective.</p><p>It was the overall winner in the M3 forecasting competition across all 3,003 series, outperforming other statistical models like ARIMA, exponential smoothing, and more complex approaches. It was particularly strong on yearly and quarterly data, and it became widely cited as a benchmark thereafter.</p><p>In the M4 Competition (2018) Theta was still competitive, but pure Theta was outperformed by hybrid and ensemble methods like Theta+ETS+ARIMA combinations, and machine learning–enhanced versions. The best-performing entries in M4 (e.g., Hybrid ES-RNN and Theta combinations) included Theta as a component in an ensemble rather than using it alone.</p><p><em>Makridakis et al. (2020), The M4 Competition: Results, findings, conclusion and way forward, International Journal of Forecasting, 36(1), 54–74.</em></p><p>This makes it surprising that I don’t see it in recent forecasting comparisons, which tend to stick to ETS/ARIMA models.</p><p>But importantly, why is it called Theta? Obviously it has a parameter called Theta, but why that? It seems an arbitrary choice: just a spare Greek letter not used in ARIMA, I think.</p><h3>The Algorithm</h3><p>The core idea behind Theta is to fit a linear trend, then model variation around that trend. The steps of the algorithm are:</p><ol><li>Fit a linear trend of y against time.</li><li>Form the <strong>theta line</strong> by combining the linear trend and the original data.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/301/1*7rsOCAP9ZaptO8EUkkkfBw.png" /></figure><p>3. Smooth the theta line using simple exponential smoothing (SES) to obtain the level.</p><p>4. Combine the level and trend to form a forecast.</p><p>Theta is the crucial parameter: for theta=0 the theta line is just the trend, and for theta=1 it is the original data. Values greater than one inflate the theta line away from the trend line. The original default was a theta of 2.</p>
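<p>To make the theta line concrete, here is a minimal sketch in plain numpy (a toy illustration, not the aeon implementation shown below):</p><pre>import numpy as np<br><br>y = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 5.8])  # toy series<br>t = np.arange(len(y), dtype=np.float64)<br>b, a = np.polyfit(t, y, 1)  # slope and intercept of the linear trend<br>trend = a + b * t<br>theta = 2.0  # the original default<br># theta=0 recovers the trend, theta=1 recovers the data,<br># theta=2 doubles the variation around the trend<br>theta_line = theta * y + (1.0 - theta) * trend</pre>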
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ImC5purL29nga0Yz1mCUXQ.png" /></figure><p>The SES alpha parameter is usually found through a grid search, and the weight used for forecasting is set in the constructor, defaulting to 0.5 (equal weight for level and trend).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/879/1*1UqPgHj9z8k7qL1z1BwfTg.png" /><figcaption>Forecasts for different weights</figcaption></figure><p>In [2] Theta is characterised as SES with drift. I see Theta more as a tweaked linear trend. It would be interesting to assess whether the level makes any actual difference in performance, and how it compares to window-based regression.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/810/1*To47uUwBFSgKwLMXAxUOxg.png" /><figcaption>Breakdown of Theta including training forecasts</figcaption></figure><h3>The Implementation</h3><p>As with <a href="https://medium.com/@aeon.toolkit/blazingly-fast-arima-with-aeon-9796d7015cdd">ARIMA</a>, our objective is a super fast, super simple, stripped down implementation we can use as a component in other forecasters. The Theta implementation is about 150 lines long (including comments) and the core code is just this (all in numba):</p><pre>def _fit_predict_numba(y: np.ndarray, h: int, theta: float, weight: float):<br>    n = len(y)<br><br>    # --- Trend component (theta=0)<br>    t = np.arange(n, dtype=np.float64)<br>    t_mean = t.mean()<br>    y_mean = y.mean()<br>    dt = t - t_mean<br>    dy = y - y_mean<br>    var_t = (dt * dt).sum()<br><br>    if var_t &lt;= 1e-12:<br>        b = 0.0<br>        a = y_mean<br>    else:<br>        b = (dt * dy).sum() / var_t<br>        a = y_mean - b * t_mean<br><br>    trend_in = a + b * t<br>    trend_fut = a + b * np.arange(n, n + h, dtype=np.float64)<br><br>    # --- Theta line<br>    theta_line = theta * y + (1.0 - theta) * trend_in<br><br>    # --- Estimate alpha for SES by SSE minimisation<br>    best_sse = 1e300<br>    alpha = 0.1<br>    for k in range(1, 101):<br>        a_try = k / 100.0<br>        sse = _ses_sse(theta_line, a_try)<br>        if sse &lt; best_sse:<br>            best_sse = sse<br>            alpha = a_try<br><br>    # --- SES forecast<br>    level = _ses_last_level(theta_line, alpha)<br>    ses_fut = np.full(h, level)<br><br>    # --- Combine components<br>    w = min(max(weight, 0.0), 1.0)<br>    forecast = w * trend_fut + (1.0 - w) * ses_fut<br><br>    return forecast, a, b, alpha</pre><p>I’m sure there are lots of variants, but we want super fast, and this fits the bill.</p><h3>The Comparison</h3><p>You can run comparable models in the three toolkits like this:</p><pre>import pandas as pd<br>from aeon.forecasting.stats import Theta<br>from statsmodels.tsa.forecasting.theta import ThetaModel<br>from statsforecast import StatsForecast<br>from statsforecast.models import Theta as SFTheta<br><br>n = len(y_train)<br># statsmodels<br>sm = ThetaModel(y_train, period=None, deseasonalize=False)<br>res = sm.fit()<br>pred = res.forecast(1)[0]<br># statsforecast<br>df = pd.DataFrame({<br>    &quot;unique_id&quot;: [&quot;series_1&quot;] * n,<br>    &quot;ds&quot;: pd.date_range(&quot;2000-01-01&quot;, periods=n, freq=&quot;D&quot;),<br>    &quot;y&quot;: y_train,<br>})<br>sf = StatsForecast(models=[SFTheta(season_length=1)], freq=&quot;D&quot;)<br>pred = sf.forecast(1, df)<br># aeon<br>aeon = Theta().fit(y_train)  # default theta=2.0, weight=0.5<br>pred = aeon.predict(y_train)</pre><p>Versions are statsforecast 2.0.2 and statsmodels 0.14.5.
Experiments can be reproduced with this <a href="https://github.com/aeon-toolkit/aeon-benchmark/blob/master/aeon_benchmark/forecasting/theta.py">script</a>. Until version 1.3 is released you will need to install aeon from main like this:</p><pre>pip install git+https://github.com/aeon-toolkit/aeon.git@main</pre><p>You may have to uninstall 1.2 first if you have it in the environment already. Version 1.3 should be released by the end of August 2025.</p><p>The set up: for each n we generate 50 random series of trended data, train the model on the first n-1 points, measure the squared error on point n, and average. In the figure below, the left panel shows the MSE for increasing n, the right the average run time.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*18sdLlXQFqTAu_p9t1IKXg.png" /></figure><p>statsmodels is slightly different (and worse); maybe I have not configured it identically. statsforecast and aeon are near identical. For run time, surprisingly, statsforecast is slower than statsmodels. I have not included the time to shove things into a data frame, but I suspect data frame interaction is the cause of the slowdown. Perhaps they are using more complex linear model implementations. Maybe the Nixtla folk have a super fast version on the horizon. It’s worth putting the difference into context. The algorithm is insanely fast, and if it takes five seconds to fit a series of length 20,000 that will probably be OK for nearly all use cases. It pleases me as a computer scientist to get orders of magnitude speed up though :)</p><p>Processing series length n=20000 (50 repeats)…<br> aeon MSE=0.261192 Avg time=4.360518e-03s<br> statsmodels MSE=0.263235 Avg time=2.438506e-01s<br> statsforecast MSE=0.260944 Avg time=5.971252e+00s</p><h3>The Future</h3><p>Like ETS, Theta is not designed for forecasting over a horizon. It will always simply project the linear trend weighted with the last level value (see the weights figure above). Despite this, it is regularly assessed for forecasts over a horizon.</p>
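<p>To see why, here is a minimal sketch (plain numpy, mirroring the combine step in the implementation above) of the h-step forecast: every future point is the trend value plus a constant level offset, so the forecast over any horizon is always a straight line:</p><pre>import numpy as np<br><br># a, b: intercept and slope of the fitted trend; level: last SES level; w: weight<br>def theta_horizon_forecast(a, b, level, w, n, h):<br>    future_t = np.arange(n, n + h, dtype=np.float64)<br>    # trend term plus a constant level term: a straight line for any h<br>    return w * (a + b * future_t) + (1.0 - w) * level</pre>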
<p>I’m new to this field. I have been surprised how hard very simple forecasters are to beat for short horizon univariate forecasting. This seems to be widely known and accepted, but it still surprises me and those who work in machine learning who I talk to. Perhaps naively, it seems to me that it should be possible to consistently beat Theta/ETS/windowed linear regression/random forest with a time series based regression approach. Our overarching research question is this:</p><p><strong>Given a reasonably long series and no prior knowledge, what is the best approach for single step ahead forecasting?</strong></p><p>In all things, Occam’s razor guides us.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/405/1*9EGfFrX6rKvmqHFJAM8YIA.png" /><figcaption>From wiki</figcaption></figure><p>Or in research terms: if you cannot beat a really simple technique on average over a large number of problems, you have to make a more detailed argument for the necessity of your new algorithm. Just because you have a hammer, it does not make everything a nail. An algorithm being cleverly designed is not justification without a clearly stated and soundly evaluated use case that shows significant improvement on benchmarks.</p><p>This also then leads to many follow up questions: what is best by data characteristic? What is best by problem domain? What is best at iterative forecasting? All these questions have of course been addressed in one way or another in the past. Everything has been addressed in one way or another in the past. However, I cannot find a definitive answer to these questions. I feel a bake-off coming on…</p><p>For Theta in particular, I would like to assess the actual impact of the smoothing stage. Also, looking a little deeper, there are some interesting variants we could explore. A quick search reveals these. Feel free to point me at other work that might improve Theta. I’m particularly interested in its use in ensembles, but I’ve not gone there yet. Perhaps I should actually read the papers below first :)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S4TZdpywkbmxCe9UIqzd9w.png" /></figure><p><strong>References</strong><br> [1] V. Assimakopoulos &amp; K. Nikolopoulos (2000). <a href="https://www.sciencedirect.com/science/article/abs/pii/S0169207000000662"><em>The Theta model: A decomposition approach to forecasting</em></a>. International Journal of Forecasting, 16(4), 521–530.<br> [2] R.J. Hyndman &amp; S. Billah (2003). <a href="https://www.sciencedirect.com/science/article/abs/pii/S0169207001001431"><em>Unmasking the Theta method</em></a>. International Journal of Forecasting, 19(2), 287–290.<br> [3] J.A. Fiorucci, T.R. Pellegrini, F. Louzada, F. Petropoulos &amp; A.B. Koehler (2016). <a href="https://www.sciencedirect.com/science/article/pii/S0169207016300243"><em>Models for optimising the Theta method and their relationship to state space models</em></a>. International Journal of Forecasting, 32(4), 1151–1161.<br> [4] J.A. Fiorucci et al. (2015). <a href="https://arxiv.org/abs/1503.03529"><em>The Optimised Theta Method</em></a>. arXiv preprint arXiv:1503.03529.<br> [5] E. Spiliotis, K. Nikolopoulos &amp; V. Assimakopoulos (2020). <a href="https://www.sciencedirect.com/science/article/abs/pii/S0377221720300242"><em>Generalizing the Theta method for automatic forecasting</em></a>. European Journal of Operational Research, 284(2), 550–558.<br> [6] G. Sbrana &amp; A. Silvestrini (2025). <a href="https://www.sciencedirect.com/science/article/pii/S0169207024000906"><em>The structural Theta method and its predictive performance in the M4-Competition</em></a>. International Journal of Forecasting, 41(3), 940–952.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=afaa190d6dbb" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Forecasting horizons]]></title>
            <link>https://medium.com/@aeon.toolkit/forecasting-horizons-da426b7a6c24?source=rss-6ea57a9bd09a------2</link>
            <guid isPermaLink="false">https://medium.com/p/da426b7a6c24</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[timeseries-forecasting]]></category>
            <category><![CDATA[forecasting]]></category>
            <category><![CDATA[time-series-forecasting]]></category>
            <category><![CDATA[timeseries]]></category>
            <dc:creator><![CDATA[Aeon Toolkit]]></dc:creator>
            <pubDate>Mon, 11 Aug 2025 11:39:21 GMT</pubDate>
            <atom:updated>2025-08-11T11:39:21.317Z</atom:updated>
<content:encoded><![CDATA[<p>By <a href="https://timeseriesclassification.com/">Tony Bagnall</a> and <a href="https://hadifawaz1999.github.io/">Ali Ismail-Fawaz</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zT6sqTiPIQv9e1aI1vI0HQ.png" /><figcaption>Panoramic view of Ardnave on the island of Islay, my happy place</figcaption></figure><p>Some of us at <a href="https://github.com/aeon-toolkit/aeon">aeon</a> are dabbling with forecasting, and there are several things that really confused us in the field to begin with. We thought it might help others if we spelt out our confusion.</p><p>Let’s start with a time series. Keep it univariate for simplicity. And let’s not use the airline data, it’s a bit boring. Tony has worn a Fitbit all year, so let’s forecast the number of steps he does in a day. Glad to be hitting my 10k a day :)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/888/1*2lQpGYUjKMaBbOCbE8iM9w.png" /><figcaption>Number of steps per day by Tony in 2025</figcaption></figure><p>I’m not looking at the specifics of analysis, seasonality, trend, heteroscedasticity etc. I don’t actually want to forecast my steps, I am using them as an example to demonstrate variants of forecasting horizons.</p><p>Our basic functionality is to fit a model and make a single prediction one step ahead. Let’s take two aeon forecasters that implement BaseForecaster as examples.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*J1SuVv6KbUMtN1lnPYkozQ.png" /></figure><pre>from aeon.forecasting.stats import ARIMA<br>from aeon.forecasting import RegressionForecaster<br>arma = ARIMA(p=1, q=1)<br>p = arma.forecast(y_train)  # One ahead forecast<br>print(f&quot; ARIMA predicts Tony will take {p} steps.&quot;)<br>arma.fit(y_train)<br>p2 = arma.predict(y_train)<br>print(f&quot; This is functionally equivalent to forecast: {p2} steps.&quot;)<br>reg = RegressionForecaster(window=50)  # Default is scikit-learn LinearRegression<br>p = reg.forecast(y_train)  # One ahead forecast<br></pre><p>But in practice, this is a limited use case. We may want to predict further into the future than how many steps Tony does tomorrow. For example, we may want to predict how many steps Tony does on this day in a week’s time, or predict every day for the next seven days. We may even want a long horizon, and predict how many steps he takes for all the days remaining in the year. This is usually called the <strong>forecasting horizon</strong> or just the <strong>horizon</strong>, and often characterised as a short horizon (various definitions of short, we have seen less than 12 used) or a long horizon (often defined as “not short”). Very few papers look at single point forecasts (probabilistic is a whole separate issue). We suspect this is because we all like to draw graphs. The fact is that the use case of making multiple predictions in the future complicates the learning algorithm and the interaction between fit and predict. This is the central point of this blog.</p><p>The term “horizon” is overloaded to mean two different things. There is the horizon an algorithm is trained to predict, then there is the horizon an algorithm is used to make predictions about in the future. Crucially, they are not always the same thing.</p><p>When you train a forecaster, some can only be trained to predict one step ahead. Others can be trained to predict a single value more than one step ahead. Another set of algorithms can be trained to predict several values ahead at once.
Let’s split it into train/fit and predict/forecast.</p><h3>Training a forecaster (method fit(y))</h3><ol><li><strong>Train to predict one step ahead only. </strong>Traditional model based statistical algorithms are always trained to forecast one ahead. ARMA, ETS, Theta etc. are basically designed to predict one step ahead. The only way I can think of sensibly training to predict further ahead would be to down sample, but I don’t think that&#39;s standard (although having just worked on implementing a threshold AR model, perhaps the link to time is not as vital as I thought).</li><li><strong>Train to predict more than one step ahead. </strong>Machine learning based forecasting and some deep learning forecasting (detailed below) are based on transforming forecasting to regression through a sliding window. This model is easily adapted to predict, say, five steps ahead by simply changing the lag of the predictor. It is important to note we are not predicting all the points up to the horizon: we train to predict exactly horizon points ahead, which can be used with a direct strategy (see below).</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CqeIQHF1ho580SnKWk9BRw.png" /></figure><pre>import pandas as pd<br>from aeon.forecasting.stats import ARIMA<br>from aeon.forecasting import RegressionForecaster<br>from aeon.regression.interval_based import DrCIFRegressor<br><br>data = pd.read_csv(&quot;daily_steps.csv&quot;)<br>y = data[&quot;steps&quot;].values<br>print(y)<br>print(y.shape)<br>y_train = y[:-1]  # Use all but the last value for training<br>y_test = y[-1]  # Use the last value for testing<br>arma = ARIMA(p=1, q=1)<br>p = arma.forecast(y_train)  # One ahead forecast<br>print(f&quot; ARIMA predicts Tony will take {p} steps.&quot;)<br>reg = RegressionForecaster(window=50)  # Default is scikit-learn LinearRegression<br>p = reg.forecast(y_train)  # One ahead forecast<br>print(f&quot; Basic regression predicts Tony will take {p} steps tomorrow.&quot;)<br># You can use aeon time series specific regressors like DrCIFRegressor<br>reg = RegressionForecaster(regressor=DrCIFRegressor(n_estimators=10), window=50,<br>                           horizon=7)<br>p = reg.forecast(y_train)  # Seven ahead forecast<br>print(f&quot; DrCIF regression predicts Tony will take {p} steps on this day next week.&quot;)<br>print(&quot; Actual steps taken one ahead:&quot;, y_test)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1018/1*MjnUlWRL6-gr42d_SwNEOA.png" /><figcaption>Forecasts for Tony’s steps</figcaption></figure><p>In aeon we signify the ability to forecast more than one ahead with a boolean capability tag called <em>“capability:horizon”</em>. If set to true, the forecasting horizon can be set in the constructor, as shown in the example above.</p><p><strong>3. Series to series forecasters.</strong></p><p>Now we are in the deep learning world. Historically, with the rise of Sequence2Sequence modelling in Natural Language Processing, deep learning models started to use this modelling technique and train the model in an auto-regressive approach, meaning recursive prediction <strong>during training</strong>, which is slow.</p><p>Nowadays, deep learners can be trained to predict a whole series in one go. They use a window as with the point regressors, but the target is a whole series. Stats people would call this <strong>functional regression</strong>. Deep learning people don’t really have a name to differentiate it and call it all sorts of different things: some just say “forecasting”, some say “multi output” or “direct multi output”, while others explicitly say Series2Series or seq2seq.</p>
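<p>As a minimal sketch (plain numpy, a toy illustration rather than any toolkit’s implementation), a series to series training set pairs each window of past values with the whole window of the next horizon values:</p><pre>import numpy as np<br><br>def make_series_to_series_windows(y, window, horizon):<br>    # Each X row is window past values; each Y row is the next horizon values<br>    X, Y = [], []<br>    for i in range(len(y) - window - horizon + 1):<br>        X.append(y[i : i + window])<br>        Y.append(y[i + window : i + window + horizon])<br>    return np.array(X), np.array(Y)<br><br>X, Y = make_series_to_series_windows(np.arange(100.0), window=20, horizon=7)<br>print(X.shape, Y.shape)  # (74, 20) (74, 7)</pre>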
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lfcCe_5hOnNcdsiy_QYkAQ.png" /><figcaption>Training a model to predict the full horizon using a Series2Series deep learning model</figcaption></figure><p>Those are the three forms of training. But what about prediction?</p><h3>Predicting a series of future values</h3><p>Suppose we want to forecast Tony’s steps for every day next week (rather than for tomorrow, or for this day in a week’s time as in the example above). This is simple with a series to series forecaster: just train to predict a window of seven. But there are two variants for point forecasters, and it is important to make the distinction between training the algorithm and using it to make predictions. In aeon, we specify three MixIns to give a forecaster the capability for iterative, direct or series to series forecasting to produce forecasts over a specified <strong>prediction horizon</strong>.</p><h3><strong>1. Iterative forecast</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bVbthVieaIJrvSr-bdmojg.png" /></figure><p>If a forecaster can only predict one ahead, how do we use it to predict seven ahead? One way is to fit one model, predict one ahead, then pretend the predicted value was the actual value, append it to the original series and feed it back to get the next forecast. This is often called a recursive strategy, but we don’t like that term (it’s not actually recursive).</p><pre>preds = np.zeros(prediction_horizon)<br>self.fit(y)<br>for i in range(0, prediction_horizon):<br>    preds[i] = self.predict(y)<br>    y = np.append(y, preds[i])<br>return preds</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8Jo8LT02YaKETjVecJQzfw.png" /><figcaption>Iterative forecasting pipeline: at every step, the prediction is appended to the end of the input series and used to predict the next step, and so on…</figcaption></figure><p>Some forecasters cannot iteratively forecast in this way. For example, ETS needs to track time forward, and ARIMA cannot work out residuals without knowing the true values. These forecasters override the method iterative_forecast rather than use the default version.</p><h3>2. Direct forecast</h3>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1014/1*p1tJ6gNoO_25BKxhjiUSaw.png" /></figure><p>If a forecaster is capable of learning to forecast more than one ahead, an alternative is to fit a separate model for each horizon.</p><pre>preds = np.zeros(prediction_horizon)<br>for i in range(0, prediction_horizon):<br>    f = _clone_estimator(self)<br>    f.horizon = i + 1<br>    preds[i] = f.forecast(y, exog)<br>return preds</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1007/1*FmvxHSd0r0gq0B19ct1YHQ.png" /><figcaption>Direct forecasting pipeline: at every step i, a new randomly initialized model is trained on the input series with horizon i+1, and the last prediction of each model’s horizon is taken as the prediction for step i</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/862/1*8HQu4NfADTRpLXqV874Ryw.png" /></figure><p>For regression forecasters, this is achieved by making the regression target y lie horizon points ahead of the window, like this:</p><pre># Create windowed data for X<br>X = np.lib.stride_tricks.sliding_window_view(<br>    combined_data, window_shape=(combined_data.shape[0], self.window)<br>)<br>X = X.squeeze(axis=0)<br>X = X[:, :, :].reshape(X.shape[0], -1)<br><br># Ignore the final horizon values for X<br>X_train = X[: -self.horizon]<br><br># Extract y_train from the original series<br>y_train = y.squeeze()[self.window + self.horizon - 1 :]</pre>
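<p>A quick worked check of that alignment (a toy univariate sketch mirroring the snippet above, not the aeon code itself): with window=3 and horizon=2, the first training window is y[0:3] and its target is y[4], exactly horizon steps past the end of the window:</p><pre>import numpy as np<br><br>y = np.arange(10, dtype=float)<br>window, horizon = 3, 2<br>X = np.lib.stride_tricks.sliding_window_view(y, window_shape=window)<br>X_train = X[:-horizon]  # drop the trailing windows that have no target<br>y_train = y[window + horizon - 1:]  # targets lie horizon steps past each window<br>print(X_train[0], y_train[0])  # [0. 1. 2.] 4.0</pre>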
<h3><strong>3. Series to Series forecasting</strong></h3><p>The mixin for series to series is simply this. We are working on bringing some of the deep learners into aeon, and should have some in 1.3. Series to series forecasters could in theory be used with an iterative or direct strategy, but that would be a bit weird.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yLlpNPZcHFwPRh2TuXHkiw.png" /></figure><p>So back to Tony’s steps: these are the direct and iterative forecasts for the same regression forecaster to produce two weeks of forecasts:</p><pre>reg = RegressionForecaster(window=50)  # Default is scikit-learn LinearRegression<br>steps = 14<br>series1 = reg.iterative_forecast(y_train, prediction_horizon=steps)<br>series2 = reg.direct_forecast(y_train, prediction_horizon=steps)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xi7YxOQGGq2p1n7Pk_vw0Q.png" /></figure><p>More on aeon forecasters is in <a href="https://github.com/aeon-toolkit/aeon/tree/main/examples/forecasting">our notebooks</a>.</p><h3>How this can confuse evaluation of forecasters</h3><p>The problem comes when strategies are mixed as forecasters are compared. To illustrate the potential confusion over horizons, I’ll talk about the univariate forecasting experiments published in this paper.</p><p><a href="https://www.vldb.org/pvldb/vol17/p2363-hu.pdf">https://www.vldb.org/pvldb/vol17/p2363-hu.pdf</a></p><p>I think it’s a really good paper and VLDB is an A* conference. I am not trying to do a number on this work, I just want to use it to highlight possible confusion caused by mixing evaluation strategies.</p><p>For each of the 8064 univariate series, the test data is set to the last F points, with F set according to the sampling frequency, as in the M4 competitions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/937/1*yswSiddlPgKqHGkBqlLCDA.png" /></figure><p>They evaluate stats, machine learning and deep learning forecasters. For the machine learning algorithms they use Linear Regression (is that really machine learning?), XGBoost and Random Forest, with windows of 1.25*F. This seems weirdly short to me, but is another legacy of M4. They use a <strong>“direct multi-output”</strong> strategy, which we believe is simply a direct strategy: they fit different models for each horizon, each trained on a series of length N-F. The evaluation of the ML algorithms is really assessing how well LR, XGBoost and RF perform over a range of horizons.</p><p>They also compare the stats models ARIMA, ETS and KF. For these, they do not use direct or iterative. Instead, they fit F models using all the past data, i.e. N-F, N-F+1, … N-1 points. This is really a completely different strategy, and is useful for testing how well a forecaster can predict one ahead. Comparing these results to those for direct forecasting is not comparing like for like.<br>They also compare against 14 deep learning based models: RNN-based models, CNN-based models (MICN, TimesNet, and TCN), MLP-based models (NLinear, DLinear, TiDE, N-HiTS, and N-BEATS), Transformer-based models (PatchTST, Crossformer, and FEDformer), non-stationary Transformers (Informer and Triformer), and model-agnostic models (FiLM). All of these are, we think, treated as series to series forecasters in the paper and in implementations such as Nixtla’s, but they were not necessarily designed to be used in that way.</p><p>Things then get even more confusing, because deep forecasters mix all three possible training regimes with different test mechanisms. The simplest are those that predict a single horizon ahead. These include <strong>DeepAR</strong> and <strong>Temporal Convolutional Network (TCN)</strong>. These could be used with a direct or iterative strategy, but are almost always used with iterative because of the cost of fitting.</p><p>With <strong>Direct Multi-Output (DMO)</strong> forecasting, a deep learning model is trained to predict the full horizon during training. The internal loss function used in optimisation is based on the MSE of the whole predicted window. This is a series to series forecasting model, and is distinct from a direct approach because you fit only a single model.</p><p>DMO often ignores dependencies between the future steps since it treats each output in the forecast horizon as conditionally independent of the others. This can lead to incoherent or inconsistent predictions over time, especially for long horizons where step-to-step relationships matter.<br>DMO models are often used to make very long horizon forecasts. Horizons of 720 steps based on windows of length 96 are common. See this <a href="https://neurips.cc/virtual/2024/108471">excellent talk</a> by Christoph Bergmeir at a NeurIPS workshop on the limitations of this approach.<br>Examples include <strong>LSTNet</strong>, <strong>Temporal Fusion Transformer (TFT)</strong>, <strong>N-BEATS</strong>, <strong>Informer</strong> and <strong>Autoformer</strong>.</p><p>Finally, <strong>Sequence to sequence (Seq2Seq)</strong> forecasting is arguably the most complex of the three DL approaches.
Originating in NLP tasks like machine translation, Seq2Seq has been adapted for time series to learn a mapping between input sequences (past observations) and output sequences (future observations), essentially turning forecasting into a sequence transformation problem.</p><p>At first glance, this may seem similar to DMO/series to series: both output full sequences. But there’s a crucial difference: Seq2Seq uses an autoregressive decoder, meaning each predicted value is conditioned on previous predictions. This allows the model to explicitly learn dependencies between future steps, something DMO does not do.<br>The architecture typically consists of:</p><ul><li><strong>An encoder</strong>: processes the input sequence and compresses it into a latent representation.</li><li><strong>A decoder</strong>: predicts future values step by step, feeding its own output back in at each step.</li></ul><p><strong>Example:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/617/1*kSHruiapVShJ_ZSQCXoQww.png" /><figcaption>Example showing how Seq2Seq models are trained through an encoder-decoder auto-regressive scheme with teacher forcing.</figcaption></figure><p>Suppose we have a time series: <em>{Y_1, Y_2, Y_3, Y_4, Y_5, Y_6, Y_7, Y_8}</em><br>We want to use the first 5 time steps as input to forecast the next 3 steps, so the forecast horizon H = 3.<br>- The encoder takes {Y_1, Y_2, Y_3, Y_4, Y_5} and produces a context vector.<br>- The decoder does the following:<br>- At time step 1: takes the encoder output and the true Y_5 to predict \hat{Y}_6<br>- At time step 2: takes \hat{Y}_6 and predicts \hat{Y}_7<br>- At time step 3: takes \hat{Y}_7 and predicts \hat{Y}_8</p><p><strong>Training (Teacher Forcing)</strong>:</p><p>During training, instead of using the model’s own (imperfect) predictions, the decoder is teacher-forced: it receives the true previous value at each step, so instead we have:<br>- At time step 2: uses the true Y_6 to predict \hat{Y}_7<br>- At time step 3: uses the true Y_7 to predict \hat{Y}_8<br>You might wonder how the loss is computed in this case. Even though we use true values as inputs during training, the loss is computed between {Y_6, Y_7, Y_8} and {\hat{Y}_6, \hat{Y}_7, \hat{Y}_8}.<br>Yes, we are “cheating” a bit by using the ground truth to guide predictions, but this makes training more stable. And since it’s training data, who cares, right?</p><p><strong>Inference/predict</strong>:<br>During inference, teacher forcing is not available, so the decoder must use its own previous predictions to generate each next value. This can lead to error propagation, but the autoregressive structure allows the model to learn coherent, temporally-dependent sequences, something DMO cannot do.<br>Examples: <strong>Seq2Seq LSTM, DARNN</strong></p><p>That said, Seq2Seq models aren’t as widely used for time series forecasting anymore. The main reason is what happens at inference time: since the decoder has to rely on its own previous predictions (rather than the ground truth, as during training), any small error can easily snowball as the forecast progresses, especially over long horizons. This mismatch between training (with teacher forcing) and inference (without it) often leads to degraded performance. On top of that, autoregressive decoding is slower because predictions are made one step at a time. These days, most modern models just predict the entire future horizon in a single shot (aka DMO). It’s faster, avoids error accumulation, and usually performs better, especially for long-term forecasting.</p>
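<p>To make the teacher-forced training loop and the autoregressive inference loop described above concrete, here is a toy, framework-free sketch; decoder stands in for any one-step decoder (a hypothetical callable, not a real aeon or library API):</p><pre>import numpy as np<br><br>def decode_teacher_forced(decoder, context, y_last, y_future):<br>    # Training-time loop: the decoder always receives the TRUE previous value<br>    preds, prev = [], y_last<br>    for t in range(len(y_future)):<br>        preds.append(decoder(context, prev))<br>        prev = y_future[t]  # teacher forcing: feed the ground truth back in<br>    return np.array(preds)  # the loss compares preds with y_future<br><br>def decode_autoregressive(decoder, context, y_last, horizon):<br>    # Inference-time loop: the decoder receives its OWN previous prediction<br>    preds, prev = [], y_last<br>    for _ in range(horizon):<br>        prev = decoder(context, prev)  # errors can snowball step by step<br>        preds.append(prev)<br>    return np.array(preds)</pre>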
<p>So, that’s deep learning: either a point horizon with an iterative approach, or series to series via two different training mechanisms. Back to the paper: the results in Table 6 compare iterative, direct and series to series forecasters all together, in addition to stats forecasters that can see into the future. This is confusing, and conclusions drawn from the results may not hold true. We believe it would be better to compare like with like whenever possible.</p><p>We think that evaluation should control as many factors of variation as possible to detect meaningful differences. Both ML and stats models could adopt an iterative approach, and this is a more natural approach for comparing to DL algorithms. However, our first research question is: given reasonably long series and no other prior knowledge, what can we infer about the relative performance of one ahead forecasters?</p><p>These are the experiments we are working on, and the results may be interesting… The aeon forecasting module is developing rapidly. We have two Google Summer of Code students, Tina and Balgopal, working on expanding the machine learning and deep learning sides. Alex Banwell is working on the stats module for his PhD and presenting his first paper at ECML in September.</p><p>Watch this space…</p><p>By Tony Bagnall and Ali Ismail-Fawaz. Both are aeon core developers. Tony is a professor of Computer Science at the University of Southampton. Ali is a postdoctoral researcher at the Université de Haute-Alsace.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=da426b7a6c24" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
<title><![CDATA[Leveraging aeon for streamlined time series algorithm development]]></title>
            <link>https://medium.com/@aeon.toolkit/leveraging-aeons-for-streamlined-time-series-algorithm-development-d9bbfdb49fd1?source=rss-6ea57a9bd09a------2</link>
            <guid isPermaLink="false">https://medium.com/p/d9bbfdb49fd1</guid>
            <category><![CDATA[open-source]]></category>
            <category><![CDATA[time-series-analysis]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Aeon Toolkit]]></dc:creator>
            <pubDate>Wed, 06 Aug 2025 14:25:57 GMT</pubDate>
            <atom:updated>2025-08-06T14:25:57.928Z</atom:updated>
<content:encoded><![CDATA[<h3>Leveraging aeon for streamlined time series algorithm development</h3><p>Author: <a href="https://github.com/baraline">Antoine Guillaume</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*L283L6x2Jm0vEKQP" /><figcaption>Photo by <a href="https://unsplash.com/@aronvisuals?utm_source=medium&amp;utm_medium=referral">Aron Visuals</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Tired of wasting time on the boring stuff like input type checking and conversion when all you want to do is build awesome time series algorithms? Well, you’re in luck, aeon is here to help!</p><p>In this post, we’re diving into how aeon’s base class structure can make your development process smoother and more robust. Imagine being able to focus purely on innovation while aeon takes care of all the stuff related to input type checking, conversion, etc. Sounds great, right?</p><p>Whether you’re a seasoned time series practitioner or just starting out, aeon’s got something for everyone. Ready to make your coding life a lot better? Let’s get started!</p><h4>Wait, what is aeon?</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tvjqUBbBNUJT46G6Ocnvwg.png" /></figure><p>It’s a Swiss Army knife for time series data. It provides algorithms to solve most time series problems like forecasting, classification and anomaly detection; tools to help you load, process, and evaluate your time series; and much more.</p><p>It is a <a href="https://numfocus.org/sponsored-projects/affiliated-projects">NumFocus affiliated open-source project</a> maintained by a community of developers from both industrial and academic backgrounds working in the field of time series. It aims to be the go-to toolkit for all time series related problems. aeon is designed to work with other popular tools like scikit-learn, so if you’re already familiar with it, you’ll feel right at home!</p><p>Want to know more about the project? Check out the <a href="https://www.aeon-toolkit.org/en/stable/">documentation website</a> or the <a href="https://github.com/aeon-toolkit/aeon">GitHub</a> repository!</p><h3>How aeon helps you to build time series machine learning algorithms</h3><p>aeon uses a core inheritance hierarchy of classes across the toolkit, with specialised subclasses in each module. Your goal will be to identify the base class that suits the type of algorithm you want to develop, and make your new class inherit from it. The basic class hierarchy is shown in the following diagram. To make sense of this, we will break it down from the top.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*o-e-xuIdd1Nswlv8kQsICw.png" /><figcaption>How aeon’s base class hierarchy is organized</figcaption></figure><h4>Scikit-learn and aeon base estimator</h4><p>Everything in aeon inherits from sklearn BaseEstimator, which mainly handles the mechanisms for getting and setting parameters using the set_params and get_params methods. These methods are used when the estimators interact with other classes, such as <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV">GridSearchCV</a>.</p>
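<p>For example, here is a minimal sketch of that mechanism using an estimator that appears again later in this post (get_params and set_params come straight from the sklearn BaseEstimator interface):</p><pre>from aeon.classification.dictionary_based import TemporalDictionaryEnsemble<br><br>tde = TemporalDictionaryEnsemble()<br>params = tde.get_params()  # dict of constructor parameters<br>tde.set_params(**params)  # what tools like GridSearchCV call internally</pre>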
<p>Then we have the aeon BaseAeonEstimator class. This class handles the following for all of aeon&#39;s estimators:</p><ul><li>management of tags: setting, getting, interaction with sklearn’s tags, etc.</li><li>cloning and resetting of the estimator</li><li>creation of test instances using test parameters specified by each estimator. For example, this is used to define fast-running configurations (e.g. a forest classifier with only 2 trees) for CI/CD pipelines.</li></ul><h4>A word on aeon’s tag system</h4><p>Tags in aeon are used for various purposes: to display estimator capabilities in the documentation, to run specific tests based on each estimator’s capabilities, and more. You can check <a href="https://github.com/aeon-toolkit/aeon/blob/main/aeon/utils/tags/_tags.py">all existing tags in aeon</a> and the <a href="https://www.aeon-toolkit.org/en/stable/developer_guide/testing.html#">developer documentation on the testing framework</a> to learn more about how we use tags for testing purposes.</p><pre>class RandomDilatedShapeletTransform(BaseCollectionTransformer):<br><br>    ...<br><br>    # Example of the tags defined by the RDST transformation<br>    _tags = {<br>        &quot;output_data_type&quot;: &quot;Tabular&quot;,<br>        &quot;capability:multivariate&quot;: True,<br>        &quot;capability:unequal_length&quot;: True,<br>        &quot;capability:multithreading&quot;: True,<br>        &quot;X_inner_type&quot;: [&quot;np-list&quot;, &quot;numpy3D&quot;],<br>        &quot;algorithm_type&quot;: &quot;shapelet&quot;,<br>    }</pre><p>One of the <strong>main uses of tags is the input and output formatting and checking</strong> done by the base classes. Some important tags in this regard are:</p><ul><li>X_inner_type, used to specify the expected type of the input data (arrays, DataFrames, lists) used in the implementation. For example, this allows your algorithm to take both numpy arrays and pandas DataFrames as input, while your code uses numpy arrays only.</li><li>output_data_type, used to specify the expected type of the output data (e.g. tabular, series, collections). This is mostly used by transformation estimators.</li><li>capability:multivariate, which indicates whether the estimator can handle multivariate time series data.</li><li>capability:unequal_length, which indicates whether the estimator can handle collections of unequal length time series.</li><li>capability:multithreading, which indicates whether the estimator can use multithreading for parallel processing.</li></ul>
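<p>As a quick sketch (using the RDST transform shown above, and assuming the get_tag accessor provided by BaseAeonEstimator), tags can also be queried at runtime:</p><pre>from aeon.transformations.collection.shapelet_based import RandomDilatedShapeletTransform<br><br>rdst = RandomDilatedShapeletTransform()<br>print(rdst.get_tag(&quot;capability:multivariate&quot;))  # True, as declared in _tags above<br>print(rdst.get_tag(&quot;capability:unequal_length&quot;))  # True</pre>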
<p>Tags can also be used to indicate behaviours and required dependencies that might be optional in the default installation of your package; one such example in aeon is the LITETimeClassifier:</p><pre>class LITETimeClassifier(BaseClassifier):<br><br>    ...<br><br>    # Example of the tags defined by the LITE Classifier<br>    _tags = {<br>        &quot;python_dependencies&quot;: &quot;tensorflow&quot;,<br>        &quot;capability:multivariate&quot;: True,<br>        &quot;non_deterministic&quot;: True,<br>        &quot;cant_pickle&quot;: True,<br>        &quot;algorithm_type&quot;: &quot;deeplearning&quot;,<br>    }</pre><p>We will give more practical examples of how to use these tags in the following sections.</p><h4>Collection estimator and Series estimator</h4><p>We distinguish between two types of inputs for aeon estimators, series and collections:</p><ul><li><strong>Series represent single time series as a 2D format (n_channels, n_timepoints)</strong>; some estimators can also use a 1D format (n_timepoints) when they don&#39;t support multivariate series. Series estimators also have an axis parameter, which allows the input shape to be transposed so that the 2D format becomes (n_timepoints, n_channels) instead.</li><li><strong>Collections represent an ensemble of time series as a 3D format (n_samples, n_channels, n_timepoints)</strong>. Again, this can sometimes be represented as a 2D format, such as (n_samples, n_timepoints) for univariate estimators. Preferably, this should be avoided, to prevent any confusion over the meaning of the axes and with 2D single series.</li><li>For the case of an <strong>unequal length collection</strong>, we use a list of 2D arrays, where n_timepoints can vary for each sample; this is what is referred to as an np-list in the RDST code example.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oTUdOkYeLqit-UpkT--n7w.png" /></figure><p>For example, if we go back to the base class schema, BaseClassifier inherits from BaseCollectionEstimator. This means that during fit and predict, all estimators inheriting from BaseClassifier will expect time series collections as inputs.</p><h3>Collection estimators</h3><p>The BaseCollectionEstimator defines methods to check the shape of the input, extract metadata (e.g. whether the collection is multivariate) and check the compatibility of the input against the tags of the estimator.
For example, when you do the following:</p><pre>from aeon.classification.dictionary_based import TemporalDictionaryEnsemble<br>from aeon.testing.data_generation import make_example_3d_numpy_list<br># TDE does not support unequal length collections<br># as it sets &quot;capability:unequal_length&quot;: False<br>X_unequal, y_unequal = make_example_3d_numpy_list()<br>try:<br>    TemporalDictionaryEnsemble().fit(X_unequal, y_unequal)<br>except ValueError as e:<br>    print(e)</pre><pre>Data seen by instance of TemporalDictionaryEnsemble has unequal length series,<br>but TemporalDictionaryEnsemble cannot handle these characteristics.</pre><p>What happens here is that TemporalDictionaryEnsemble inherits from BaseClassifier, which itself inherits from BaseCollectionEstimator. During fit and predict, BaseClassifier calls _preprocess_collection, a function defined in BaseCollectionEstimator. This function extracts the input metadata (whether it is multivariate, of unequal length, etc.) and compares it against TemporalDictionaryEnsemble’s tags. These state that the estimator does not support unequal length collections, and hence an exception is raised.</p><h4>Classification</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yfRsKNAo4lIHZq7Lind1Mg.png" /><figcaption>Illustration of the time series classification task</figcaption></figure><p>This is the base class for all classifiers. It uses the standard fit, predict and predict_proba structure from sklearn. fit and predict call the abstract methods _fit and _predict, which are implemented in the subclass to define the classification algorithm. All the common format checking and conversion is done in the final methods such as fit and predict, before the abstract methods _fit and _predict are called.</p><p>When implementing a new classifier inheriting from BaseClassifier, you thus only have to implement the __init__, _fit and _predict methods that handle the classification logic of the classifier. You will also need to set the correct tags to allow the checks and conversions to be done for you.
Note that each base class also defines some attributes that are commonly used in the estimators; for example, BaseClassifier exposes classes_, n_classes_ and _class_dictionary, which we can use in our new classifier:</p><pre>from numpy.random import default_rng<br>from aeon.classification import BaseClassifier<br>from aeon.testing.data_generation import (<br>    make_example_3d_numpy,<br>    make_example_dataframe_list,<br>)<br><br>class RandomClassifier(BaseClassifier):<br>    &quot;&quot;&quot;A dummy classifier returning random predictions.&quot;&quot;&quot;<br>    _tags = {<br>        &quot;capability:multivariate&quot;: True,  # allow multivariate collections<br>        &quot;capability:unequal_length&quot;: True,  # allow unequal length collections<br>        &quot;X_inner_type&quot;: [&quot;np-list&quot;, &quot;numpy3D&quot;],  # Specify data format used internally<br>    }<br>    def __init__(self, random_state: int | None = None):<br>        self.random_state = random_state<br>        super().__init__()<br>    def _fit(self, X, y):<br>        self.rng = default_rng(self.random_state)<br>        return self<br>    def _predict(self, X):<br>        # generate a random int between 0 and n_classes-1 and<br>        # use _class_dictionary to convert it to a class label<br>        return [<br>            self._class_dictionary[i]<br>            for i in self.rng.integers(low=0, high=self.n_classes_, size=len(X))<br>        ]<br># A 3D numpy array<br>X, y = make_example_3d_numpy(n_channels=2)<br>print(RandomClassifier().fit_predict(X, y))<br># A list of dataframes, each representing a 2D Series<br>X, y = make_example_dataframe_list()<br>print(RandomClassifier().fit(X, y).predict(X))</pre><pre>[1 0 1 1 0 0 1 0 1 0]<br>[0 1 1 0 0 1 0 1 0 0]</pre><p><strong>Further reading</strong></p><ul><li><a href="https://www.aeon-toolkit.org/en/stable/examples/classification/classification.html">Introduction to time series classification using aeon</a></li></ul><h4>Regression, Clustering and Anomaly detection</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*p6M9o929NfHdKjJHdJwbug.png" /><figcaption>Illustration of the time series regression task</figcaption></figure><p>These base classes are mostly similar to BaseClassifier in how they use the checks and conversion operations from BaseCollectionEstimator.</p><ul><li>BaseRegressor also defines fit and predict methods and requires _fit and _predict methods to be implemented by child classes. The difference is that it has no predict_proba method yet, as we still need to decide how to model probabilistic regression for time series. The tests on y are also different, as we can have floats as values for y.</li><li>BaseClusterer also has fit and predict, but does not take input y, as child classes can be unsupervised estimators.
It does include predict_proba.</li><li>BaseCollectionAnomalyDetector also has fit and predict, but does not take input y, as child classes can be unsupervised estimators.</li></ul><p><strong>Further reading</strong></p><ul><li><a href="https://www.aeon-toolkit.org/en/stable/examples/regression/regression.html">What is time series extrinsic regression?</a></li><li><a href="https://www.aeon-toolkit.org/en/stable/examples/clustering/clustering.html">Time series clustering using aeon</a></li><li><a href="https://www.aeon-toolkit.org/en/stable/examples/anomaly_detection/anomaly_detection.html">Time series anomaly detection using aeon</a></li></ul><h4>Collection transformation</h4><p>Rather than fit and predict, the BaseCollectionTransformer implements fit, transform and fit_transform methods, which are required since this base class also inherits from the BaseTransformer class. It requires child classes to define _fit and _transform methods. The output of the transform method is not fixed and should be specified with the output_data_type tag.</p><p>For example, if the output is another collection of time series (e.g. after using SAX), then output_data_type must take the Collection value (note that this is the default value for all BaseCollectionTransformer child classes). If the output is no longer time series, but rather a 2D array of features extracted from each input time series, such as in Rocket or RandomShapeletTransform, then output_data_type must take the Tabular value.</p><p><strong>Further reading</strong></p><ul><li><a href="https://www.aeon-toolkit.org/en/stable/examples/transformations/transformations.html">Time series transformation with aeon</a></li></ul><h3>Series estimators</h3><p>Series estimators are similar to collection estimators, but they take single time series as input. They inherit from BaseSeriesEstimator, which performs similar operations to BaseCollectionEstimator, such as input checks and conversions, but for single time series.</p><p>One important difference is the use of the axis parameter, which allows each estimator to define whether it works with the (n_channels, n_timepoints) or the (n_timepoints, n_channels) 2D format. We need the axis parameter because, for 2D series, there is no way to infer whether an input 2D time series is in the (n_channels, n_timepoints) or (n_timepoints, n_channels) format.</p><p>To understand its uses, we need to distinguish between the axis parameter set during initialisation of the estimator, and the axis parameter used in fit, predict and other methods:</p><ul><li>axis given during initialisation is used to define the internal format used by the estimator,</li><li>axis given when calling functions is used to transpose the input time series, if needed, to match the format used internally by the estimator.</li></ul><p>Note that the axis value represents the dimension in which the timepoints are stored, so axis=0 means that the timepoints are stored in the first dimension, and axis=1 means that the timepoints are stored in the second dimension (i.e. (n_channels, n_timepoints)).</p><p><strong>Further reading</strong></p><ul><li><a href="https://www.aeon-toolkit.org/en/stable/examples/base/series_estimator.html">aeon series estimators</a></li></ul><h4>Series transformation</h4><p>Let’s take the example of BaseSeriesTransformer, which is the base class for all series transformers. It implements the fit, transform and fit_transform methods, and requires child classes to implement _fit and _transform.
Let us demonstrate the use of the axis parameter with an example:</p><pre>from aeon.testing.data_generation import make_example_dataframe_series<br>from aeon.transformations.series import BaseSeriesTransformer<br>class DummySeriesTransformer(BaseSeriesTransformer):<br>    &quot;&quot;&quot;A dummy series transformer that keeps every second timepoint.&quot;&quot;&quot;<br>    _tags = {<br>        &quot;capability:multivariate&quot;: True,  # allow multivariate series<br>        &quot;X_inner_type&quot;: &quot;pd.DataFrame&quot;,  # Specify data format used internally<br>        &quot;fit_is_empty&quot;: True,  # we don&#39;t need to define _fit<br>    }<br>    def __init__(self):<br>        super().__init__(axis=1)  # Set axis to 1 for (n_channels, n_timepoints) format<br>    def _transform(self, X, y=None):<br>        print(X.shape)<br>        X = X.iloc[:, ::2]  # Example transformation: keep every second timepoint<br>        print(X.shape)<br>        return X<br><br>X = make_example_dataframe_series(n_channels=2, n_timepoints=10).T<br>print(X.shape)  # Is (n_timepoints, n_channels), which is axis=0<br>print(DummySeriesTransformer().fit_transform(X, axis=0).shape)</pre><pre>(10, 2)<br>(2, 10)<br>(2, 5)<br>(5, 2)</pre><p>What we see is that X starts off with shape (n_timepoints, n_channels), which is the default format for the input time series.</p><p>The DummySeriesTransformer is initialised with axis=1, meaning that internally, when calling fit and transform, the input is converted to the (n_channels, n_timepoints) format before being passed to the _fit and _transform functions. The output is then converted back to the user-specified format, axis=0: (n_timepoints, n_channels), before being returned.</p><p>Note that as we specified pd.DataFrame as the X_inner_type in the tags, if the input is not a DataFrame, the estimator will convert it to a DataFrame before applying the transformation. This allows you to define and use a single input shape and type in the functions you implement, while still allowing the estimator to handle different input formats.</p><pre>from aeon.testing.data_generation import make_example_2d_numpy_series<br>X = make_example_2d_numpy_series(n_channels=1, n_timepoints=10)<br>transformer = DummySeriesTransformer()<br>print(transformer.fit_transform(X, axis=1).shape)</pre><pre>(1, 10)<br>(1, 5)<br>(1, 5)</pre><h4>Forecasting</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JdzXRmwmVyM4_o1loo4FDA.png" /><figcaption>Illustration of the task of time series forecasting</figcaption></figure><p>The BaseForecaster class inherits from BaseSeriesEstimator, which provides the checks and conversion functions for single time series inputs. Forecasters predict future values of a time series horizon steps ahead. The horizon parameter is defined during initialisation to specify how many steps ahead the forecaster should predict. The main methods are:</p><ul><li>fit(y, exog=None): trains the forecaster on the input data</li><li>predict(y, exog=None): makes predictions on new data</li><li>forecast(y, exog=None): combines fit and predict into one operation</li></ul><p>For child classes, the _fit and _predict methods must be implemented to define the forecasting logic. The base class also provides two main forecasting strategies, shown in the sketch after this list:</p><ul><li>direct_forecast(): makes predictions by training separate models for each future step.</li><li>iterative_forecast(): makes predictions recursively using a single trained model.</li></ul>
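<p>Here is a minimal usage sketch of forecast and the two strategies (the toy series is made up; the calls mirror those in our forecasting blog posts):</p><pre>import numpy as np<br>from aeon.forecasting import RegressionForecaster<br><br>y = np.random.default_rng(0).random(200)  # toy series<br>f = RegressionForecaster(window=20)<br>print(f.forecast(y))  # fit plus a single one step ahead prediction<br>print(f.direct_forecast(y, prediction_horizon=7))  # one model per future step<br>print(f.iterative_forecast(y, prediction_horizon=7))  # feed predictions back in</pre>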
The base class also provides two main forecasting strategies:</p><ul><li>direct_forecast(): Makes predictions by training a separate model for each future step.</li><li>iterative_forecast(): Makes predictions recursively using a single trained model.</li></ul><p><strong>Further reading</strong></p><ul><li><a href="https://www.aeon-toolkit.org/en/latest/examples/forecasting/forecasting.html">Time Series Forecasting with aeon</a></li></ul><h4>Anomaly detection and Segmentation</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4W0NtIAdJcHSBw8DazVDkA.png" /></figure><p>The BaseSeriesAnomalyDetector and BaseSegmenter base classes both implement fit and predict, but only require child classes to implement _predict, as most anomaly detectors and segmenters are unsupervised estimators. Both also inherit from BaseSeriesEstimator, which provides the checking and conversion functions we have seen before.</p><p><strong>Further reading</strong></p><ul><li><a href="https://www.aeon-toolkit.org/en/stable/examples/segmentation/segmentation.html">Introduction to time series segmentation</a></li></ul><h3>What’s next?</h3><p>Now you can start experimenting and making your own estimators more robust and field-ready with aeon base classes and tags. If you have questions or would like to help us expand aeon, come and have a chat with us on <a href="https://join.slack.com/t/aeon-toolkit/shared_invite/zt-36dlmbouu-vajTShUYAHopSXUUVtHGzw">Slack</a>, or have a look at the open issues on <a href="https://github.com/aeon-toolkit/aeon/issues">GitHub</a>!</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Blazingly fast ARIMA with aeon]]></title>
            <link>https://medium.com/@aeon.toolkit/blazingly-fast-arima-with-aeon-9796d7015cdd?source=rss-6ea57a9bd09a------2</link>
            <guid isPermaLink="false">https://medium.com/p/9796d7015cdd</guid>
            <category><![CDATA[benchmarking]]></category>
            <category><![CDATA[arima]]></category>
            <category><![CDATA[arima-model]]></category>
            <category><![CDATA[time-series-forecasting]]></category>
            <category><![CDATA[forecasting]]></category>
            <dc:creator><![CDATA[Aeon Toolkit]]></dc:creator>
            <pubDate>Sun, 03 Aug 2025 16:21:56 GMT</pubDate>
            <atom:updated>2025-08-04T19:52:52.764Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/721/1*xSgJ59loKobp9i1IGjua7g.png" /></figure><p><strong>tl:dr; we compare the new </strong><a href="https://github.com/aeon-toolkit/aeon">aeon</a><strong>forecaster with the python implementations in </strong>statsmodels<strong>and </strong>statsforecast<strong>, and show that it is two orders and one order of magnitude faster respectively, with no loss in predictive performance. The </strong><a href="https://github.com/aeon-toolkit/aeon">aeon</a><strong>ARIMA forecaster is blazingly fast, but comes with some caveats…</strong></p><p>Some <a href="https://github.com/aeon-toolkit/aeon">aeon</a>core developers have begun dabbling with forecasting. We are all computer scientists, so its a bit outside our comfort zone, which is good. The task that interests us is short horizon forecasting of long series, as we think this is where we may be able to advance the field. Ultimately we are interested in machine learning algorithms, but new algorithms should be benchmarked with the traditional approaches, so we are starting with classical techniques. Broadly speaking, there are three classes of forecasters: statistical models, machine learning algorithms and deep learning approaches. There is obviously overlap, but there is a clear distinction in terms of the communities that work in each area and the specification of problem they address. This is a whole blog post in itself and for another day. In this article we will describe our first steps in developing statistical algorithms for forecasting in python. Specifically, we will give an overview of the current stats forecasting software in python, describe our implementations of some favourites like Auto Regressive Integrated Moving Average (ARIMA) and Exponential Smoothing (ETS), then present benchmarking results against the current most popular approaches. We hope this encourages you to try our forecasters out. We are not going to fill the post with standard descriptors of the algorithm. This information is freely available elsewhere. This post is of interest to someone wanting to use fast, simple implementations of stats based forecasters. This blog focuses on the <a href="https://github.com/aeon-toolkit/aeon">aeon</a> ARIMA implementation, but we are adding more classical algorithms (e.g. Theta and TVP) and will look in depth at ETS in another article.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/349/1*K5VhT-4jWlbgcuISSFMR2Q.png" /><figcaption><a href="https://github.com/aeon-toolkit/aeon">aeon</a> is an open source toolkit for time series machine learning that aims to bring the latest state of the art for time series classification, clustering, regression, segmentation, anomaly detection, similarity search and forecasting directly to practitioners.</figcaption></figure><h3>Forecasting software for ARIMA and ETS</h3><p>The oldest and most widely known package is statsmodels. It seems to have been the basis for all python based stats forecasting for the last decade. Historically, most forecasting software is in R. We are not using R for reasons beyond the scope of this article. 
statsmodels fits ARIMA and ETS like this:</p><pre>from aeon.datasets import load_airline<br>from statsmodels.tsa.holtwinters import ExponentialSmoothing<br>from statsmodels.tsa.arima.model import ARIMA as SM_ARIMA<br>y = load_airline().squeeze()  # make a 1D numpy array<br># ARMA model, trained with the last point held out<br>arma = SM_ARIMA(y[:-1], order=(1, 0, 1)).fit()<br>forecast = arma.forecast(steps=1)[0]<br>print(f&quot;Actual = {y[-1]} Forecast = {forecast}&quot;)<br># ETS model, also trained on y[:-1] so the forecast aligns with y[-1]<br>ets = ExponentialSmoothing(y[:-1], trend=&quot;add&quot;, seasonal=&quot;add&quot;, seasonal_periods=12).fit()<br>forecast = ets.forecast(steps=1)[0]<br>print(f&quot;Actual = {y[-1]} Forecast = {forecast}&quot;)</pre><p>statsmodels has no native automatic hyper-parameter fitting algorithm: it has to be done manually through grid search or bespoke methods. Wrappers like pmdarima, sktime and Darts encapsulate this process, but all use the statsmodels implementations of the core algorithms. These packages offer other features, such as pipelines and transforms, but they do not have bespoke ETS and ARIMA implementations, which is our requirement.</p><p>statsforecast is a more recent package from Nixtla, who apparently have an <em>“industry-leading time series engine”</em>. It contains bespoke implementations of both ARIMA and ETS, and also has auto versions to fit the hyper-parameters. To fit a model, you have to wrap your series in a DataFrame:</p><pre>from statsforecast import StatsForecast<br>from statsforecast.models import ARIMA, AutoARIMA, AutoETS<br>import pandas as pd<br>df = pd.DataFrame({<br>    &quot;ds&quot;: pd.date_range(start=&quot;1900-01-01 00:00:00&quot;, periods=len(y),<br>                        freq=&quot;H&quot;),<br>    &quot;y&quot;: y,<br>    &quot;unique_id&quot;: &quot;ts&quot;<br>})<br>sf = StatsForecast(models=[ARIMA(order=(1, 0, 1))], freq=&quot;H&quot;)<br>forecast_df = sf.forecast(df=df, h=1)<br>f1 = forecast_df.iloc[0][&quot;ARIMA&quot;]<br>sf = StatsForecast(models=[AutoETS()], freq=&quot;H&quot;)<br>forecast_df = sf.forecast(df=df, h=1)<br>f2 = forecast_df.iloc[0][&quot;AutoETS&quot;]<br>sf = StatsForecast(models=[AutoARIMA()], freq=&quot;H&quot;)<br>forecast_df = sf.forecast(df=df, h=1)<br>f3 = forecast_df.iloc[0][&quot;AutoARIMA&quot;]</pre><h3>Why have another implementation?</h3><p>So as far as I can tell, that’s it: either use statsmodels and/or its various wrappers, or use statsforecast and its DataFrame model.</p><p>Our research requirement is fast implementations of classical algorithms in order to benchmark against the new techniques we are developing. We are not interested in all possible variants of algorithms that have ever been proposed. Neither of the existing packages fulfils our need. Whilst comprehensive, statsmodels is designed like R and is quite slow. statsforecast is faster, but is built around DataFrames. We will implement forecasters that are:</p><ol><li>Built around numpy arrays and numba;</li><li>Simple to use and concise in code;</li><li>As fast as we can make them.</li></ol><p>ARIMA and ETS have meta parameters and parameters. For ARIMA, the meta parameters are (p,d,q): the numbers of autoregressive terms, differences and moving average terms.</p><h3>Fitting the model</h3><p>The first source of confusion is that there are two steps to fitting a model: fitting the hyper-parameters (p, d and q for ARIMA) and fitting the parameters (usually called phi and theta).</p><p>An optimiser can be used to fit the parameters and/or meta parameters for ARIMA and ETS.
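To make the two fitting steps concrete, here is a toy sketch of the second step, fitting the parameters; it is not code from any of the toolkits discussed here. With the hyper-parameters fixed at p=1, q=1, the ARMA coefficients are chosen to minimise the sum of squared one-step residuals using a general-purpose optimiser:</p><pre>import numpy as np<br>from scipy.optimize import minimize<br><br>def arma11_sse(params, y):<br>    &quot;&quot;&quot;Sum of squared one-step residuals for an ARMA(1,1) model.&quot;&quot;&quot;<br>    phi, theta = params<br>    e = np.zeros_like(y)<br>    for t in range(1, len(y)):<br>        e[t] = y[t] - phi * y[t - 1] - theta * e[t - 1]<br>    return np.sum(e ** 2)<br><br>y = np.random.randn(200)  # stand-in for a real series<br>res = minimize(arma11_sse, x0=[0.1, 0.1], args=(y,), method=&quot;Nelder-Mead&quot;)<br>phi, theta = res.x  # the fitted parameters</pre><p>The first step, choosing p, d and q, then amounts to repeating such a fit over candidate orders and selecting the best by some criterion such as AIC.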
statsmodels has a wide range of optimisers. ChatGPT tells me it has these:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/394/1*TV-5qF0lelTKEYC-YSNiLQ.png" /></figure><p>Apparently, for ARMA it fits a Kalman filter state-space model to maximise the likelihood, using the “L-BFGS-B” optimiser (via scipy.optimize.minimize). Blimey. ETS in statsmodels does something similar. Personally I was surprised they don’t just use stochastic gradient descent, but old-school numerical analysis techniques are cool, so we will go with that.</p><p>statsforecast takes a different approach (so my LLM tells me). For ARMA, both AR and MA terms are solved via closed-form least squares, and ETS uses a discrete grid search. Now, I think I remember how to fit an AR model without optimisation: you simply find the PACF and use the diagonal terms, stopping with some criterion such as AIC; but I am not sure about ARMA. Further searching and digging suggests ChatGPT is mistaken: ARIMA at least uses BFGS (via optim() from base R), but it’s not at all clear.</p><p>We chose to implement the Nelder-Mead optimiser for <a href="https://github.com/aeon-toolkit/aeon">aeon</a>. Why? First, we read a LinkedIn post recommending it (which I can no longer find, maybe from Ivan?). Secondly, it looked simpler to implement than the alternatives. Thirdly, it seemed less widely used, so it could differentiate us from the others. Finally, it seemed quite easy to generalise to any parameter fitting.</p><p>For the hyper-parameters such as (p,d,q), the space of meta parameters is so small that it seems sensible to just enumerate them: a range of [0…5) for p and q and [0…3) for d is only 75 models, and the fitting is very fast. It seems like statsmodels and statsforecast use an optimiser, possibly some form of forward selection, but we will only do so if it seems necessary.</p><p>So, decisions made: we have implemented everything with numpy arrays, sped up with cached numba, and use a lightweight version of Nelder-Mead to fit parameters and exhaustive search to fit a limited number of meta parameters. Will it be any good? A little about our forecasting base class first.</p><h3>Forecasting in aeon</h3><p>One thing that we had to work through is the different use cases in forecasting. I’m writing a separate post about horizons in forecasting, because if we are confused, others may be too. Our notebooks on forecasting are <a href="https://github.com/aeon-toolkit/aeon/tree/main/examples/forecasting">here</a>.
We do use fit and predict, but our simplest use case is:</p><p><strong>d.forecast(y)</strong>: fit the model and make a single prediction.</p><p><strong>d.iterative_forecast(y, prediction_horizon=12)</strong>: fit a single model, then predict 12 steps ahead, feeding predictions back into the model.</p><p><strong>d.direct_forecast(y, prediction_horizon=12)</strong>: fit 12 different models, each trained to predict a different number of steps ahead.</p><pre>from aeon.datasets import load_airline<br>from aeon.forecasting import NaiveForecaster, RegressionForecaster<br>y = load_airline()<br>d = NaiveForecaster(strategy=&quot;last&quot;)<br>p = d.forecast(y)   # one-ahead forecast<br>d = NaiveForecaster(strategy=&quot;seasonal_last&quot;, seasonal_period=12)<br>p2 = d.iterative_forecast(y, prediction_horizon=12)<br>d2 = RegressionForecaster(window=100)  # defaults to LinearRegression<br>p3 = d2.direct_forecast(y, prediction_horizon=12)</pre><p>With ARIMA and ETS, you cannot simply train models to predict more than one step ahead, so they do not implement direct_forecast.</p><p>For comparison, these are the statsforecast AutoARIMA constructor arguments. It’s great that it is so configurable, but it is a little confusing to the uninitiated, and all these parameters introduce overhead in validation etc.</p><pre>AutoARIMA(d:Optional[int]=None, D:Optional[int]=None, max_p:int=5,<br>            max_q:int=5, max_P:int=2, max_Q:int=2, max_order:int=5,<br>            max_d:int=2, max_D:int=1, start_p:int=2, start_q:int=2,<br>            start_P:int=1, start_Q:int=1, stationary:bool=False,<br>            seasonal:bool=True, ic:str=&#39;aicc&#39;, stepwise:bool=True,<br>            nmodels:int=94, trace:bool=False,<br>            approximation:Optional[bool]=False, method:Optional[str]=None,<br>            truncate:Optional[bool]=None, test:str=&#39;kpss&#39;,<br>            test_kwargs:Optional[str]=None, seasonal_test:str=&#39;seas&#39;,<br>            seasonal_test_kwargs:Optional[Dict]=None,<br>            allowdrift:bool=True, allowmean:bool=True,<br>            blambda:Optional[float]=None, biasadj:bool=False,<br>            season_length:int=1, alias:str=&#39;AutoARIMA&#39;,<br>            prediction_intervals:Optional[statsforecast.utils.ConformalIntervals]=None)</pre><p>By contrast, our ARIMA model constructor looks like this:</p><pre>class ARIMA(BaseForecaster):<br>    def __init__(<br>        self,<br>        p: int = 1,<br>        d: int = 0,<br>        q: int = 1,<br>        use_constant: bool = False,<br>        iterations: int = 200,<br>    ):</pre><p>We ran an experiment to compare our ARMA to the statsmodels and statsforecast versions.</p><p>The set-up is this. We specified 20 ARMA models with p and q between 0 and 5 and parameters set to ensure stationarity. We are not considering differenced data, since all the algorithms do this in the same way.</p><p>We pick a random model and generate a series, then z-normalise it. We train on n-1 points and make a single forecast. For each n, we repeat this process 100 times, and calculate the MSE for the three toolkit implementations. A rough sketch of this protocol is below.</p>
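<p>The following sketch shows the shape of the protocol for a single, hypothetical ARMA(1,1) model; the generate_arma helper and the naive placeholder forecast are ours, standing in for the three toolkit fits:</p><pre>import numpy as np<br><br>def generate_arma(n, phi=0.5, theta=0.3, rng=None):<br>    &quot;&quot;&quot;Simulate a stationary ARMA(1,1) series of length n, z-normalised.&quot;&quot;&quot;<br>    rng = rng or np.random.default_rng()<br>    e = rng.standard_normal(n)<br>    y = np.zeros(n)<br>    for t in range(1, n):<br>        y[t] = phi * y[t - 1] + e[t] + theta * e[t - 1]<br>    return (y - y.mean()) / y.std()<br><br>errors = []<br>for _ in range(100):<br>    y = generate_arma(1000)<br>    train, actual = y[:-1], y[-1]<br>    forecast = train[-1]  # placeholder: swap in each toolkit&#39;s ARMA forecast<br>    errors.append((forecast - actual) ** 2)<br>print(np.mean(errors))  # the MSE over the 100 repeats</pre>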
<p>The graphs below present the results. On the left as you look are the results for fitting a model with the correct p and q values; on the right are the results for a standard default of p=1 and q=1. The baseline results are very similar. There is variation for the true models, with <a href="https://github.com/aeon-toolkit/aeon">aeon</a> showing more variation than the others and seemingly doing worse on longer series. We investigate whether this effect is significant below.</p><p>The timing results are quite impressive. <a href="https://github.com/aeon-toolkit/aeon">aeon</a> ARMA is an order of magnitude faster than statsforecast and two orders faster than statsmodels.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bAJKlS1yyu6okwkvwykBsA.png" /></figure><p>To assess whether there is any significant difference in MSE, we ran a separate experiment using just statsforecast (because we are impatient). We now fix n to 2500 (seemingly the worst value for <a href="https://github.com/aeon-toolkit/aeon">aeon</a>), then create 500 series for our 20 models, fit the <a href="https://github.com/aeon-toolkit/aeon">aeon</a> and statsforecast ARIMA as before (with the correct p and q values), then measure the error of the last forecast point for each. We have included a naive forecaster as a sanity check.</p><p>The short story is that there is no significant difference. The average number of wins per model for <a href="https://github.com/aeon-toolkit/aeon">aeon</a> was 250.25/500. The average pairwise difference over the 20 problems was -0.0000000273. <a href="https://github.com/aeon-toolkit/aeon">aeon</a> was better on 10 problems, statsforecast on 10 problems. The overall MSE for the two approaches is 0.678569 and 0.678584 (the MSE for the naive forecaster is 1.16). I could present graphs etc., but it’s all saying the same thing: to all intents and purposes, they are the same. <a href="https://github.com/aeon-toolkit/aeon">aeon</a> took 50 seconds to run the experiment; statsforecast took over 10 minutes. It is perfectly possible we are not using the super-fast version of statsforecast (I have heard mention of a precompiled C++ version), but we are following their instructions on usage.</p><p>Nelder-Mead is very fast, but not without issues. Sometimes it converges to a very poor model which gives a massive MSE. This happens rarely on the simulated data we used, maybe once every few thousand runs, but it’s worth being aware of. Basically, it is sensitive to the initial conditions. We have altered the default initial conditions to make it more robust: it did not happen in the above experiments. However, given how fast it is, we might fit twice with different initial simplexes and take the best; this would make the chance of a pathological model very small indeed. Further, we will ultimately use it within an AutoARIMA forecaster, in which case a pathological model would simply not be selected.</p><p>All this is very preliminary and will evolve into a paper describing the aeon forecasting module; we have a lot planned for it.</p><p>Our goal was a lightweight, super-fast algorithm, and we have achieved this. Our forecaster is not as configurable as the others, nor does it use pandas, so it won’t be for everyone. However, if you want to benchmark on a large number of series (as we do), it might appeal to you. Just watch out for the odd misfit!</p><p>ARIMA is available on main and will be in release 1.3. We hope to also have ETS, AutoARIMA and AutoETS in the same release. We have put together a <a href="https://github.com/aeon-toolkit/aeon-benchmark/blob/master/examples/forecasting/arima.ipynb">notebook</a> to reproduce this.</p>
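<p>If you want to try it yourself, usage follows the same pattern as the earlier examples. This is a minimal sketch based on the interface described above, assuming ARIMA is importable from aeon.forecasting like the other forecasters; see the linked notebook for the exact, up-to-date API:</p><pre>from aeon.datasets import load_airline<br>from aeon.forecasting import ARIMA<br>y = load_airline()<br>f = ARIMA(p=1, d=1, q=1)<br>print(f.forecast(y))  # fit and make a single one-step forecast<br>print(f.iterative_forecast(y, prediction_horizon=12))  # feed forecasts back in</pre>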
<p>Words and experiments by Tony Bagnall (github @TonyBagnall). Tony is a core developer of aeon and a professor of Computer Science at the University of Southampton.</p><p>Big credit to Alex Banwell (@alexbanwell1), who has led the implementation of Nelder-Mead and the stats forecasters as part of the first year of his PhD.</p><p><a href="https://github.com/aeon-toolkit/aeon">aeon</a> is an international open source project affiliated with NumFOCUS.</p><p><a href="https://github.com/aeon-toolkit/aeon">GitHub - aeon-toolkit/aeon: A toolkit for machine learning from time series</a></p><p>We communicate mostly on Slack, so come say hello (link from GitHub). Tony is happy to answer any questions on Slack or by email. Tell me if I’ve made any mistakes in the above; it has been known to happen.</p><p>We will be at ECML/PKDD in Portugal in September 2025; if you are there, come say hi at the AALTD workshop on Friday, where Alex presents some of his forecasting research.</p><p>aeon is supported by the EPSRC in the UK and has recently secured significant funding outside of the UK…. news to follow.</p><p>Reference us with this paper:</p><iframe src="https://drive.google.com/viewerng/viewer?url=https%3A//www.jmlr.org/papers/volume25/23-1444/23-1444.pdf&amp;embedded=true" width="600" height="780" frameborder="0" scrolling="no"><a href="https://medium.com/media/3a11331ae82ba3ed6944597f8e52c6e2/href">https://medium.com/media/3a11331ae82ba3ed6944597f8e52c6e2/href</a></iframe>]]></content:encoded>
        </item>
    </channel>
</rss>