Prism: Seasonal-Trend Regression

Prism is a Python 3 package for spline-based regression and decomposition of time series with seasonal and trend components.

Installation

Prism requires Python 3.6 or greater. It was developed and tested on x86_64 systems running MacOS and Linux.

Dependencies

The package extra dependencies are listed in the files requirements.txt and requirements-conda.txt. It is recommended to install those extra dependencies using Miniconda or Anaconda. This is not just a pure stylistic choice but comes with some hidden advantages, such as the linking to Intel MKL library (a highly optimized BLAS library created by Intel).

>> conda install --channel=conda-forge --file=requirements-conda.txt

Quick Install

Prism can be installed very simply via the command:

>> pip install git+git://github.com/matthieumeo/prism.git@master#egg=prism

If you have previously activated your conda environment pip will install Prism in said environment. Otherwise it will install it in your base environment together with the various dependencies obtained from the file requirements.txt.

Developper Install

It is also possible to install Prism from the source for developpers:

>> git clone https://github.com/matthieumeo/prism
>> cd <repository_dir>/
>> pip install -e .

The package documentation can be generated with:

>> conda install sphinx=='2.1.*'
>> python3 setup.py build_sphinx -b singlehtml

Methods

Parametric Model

Consider an unknown signal \(f:\mathbb{R}\to \mathbb{R}\) of which we possess noisy samples

\[y_\ell=f(t_\ell)+\epsilon_\ell, \qquad \ell =1,\ldots, L,\]

for some non-uniform sample times \(\{t_1,\ldots, t_L\}\subset \mathbb{R}\) and additive i.i.d. noise perturbations \(\epsilon_\ell\).

The SeasonalTrendRegression class assumes the following parametric form for \(f\):

\[f(t)=\sum_{n=1}^N \alpha_n \rho_{P}(t-\theta_n) \quad + \quad \sum_{m=1}^M \beta_m \psi_{Q}(t-\eta_m) \quad+\quad \sum_{q=1}^{Q}\gamma_q t^{q-1}. \tag{1}\]

where:

  • \(\sum_{n=1}^N \alpha_n=0\) (so as to enforce a zero-mean seasonal component, see note below),

  • \(\rho_{P}:[0, \Theta[\to \mathbb{R}\) is the \(\Theta\)-periodic Green function associated to the \(P\)-iterated derivative operator \(D^P\) [Periodic],

  • \(\psi_{Q}:\mathbb{R}\to \mathbb{R}\) is the causal Green function associated to the \(Q\)-iterated derivative operator \(D^Q\) [Green],

  • \(\theta_n:=(n-1)\Theta/N,\) and \(\eta_m:=R_{min}\,+\,m(R_{max}-R_{min})/(M+1),\) for some \(R_{min}<R_{max}\).

The seasonal component \(f_S(t):=\sum_{n=1}^N \alpha_n \rho_{P}(t-\theta_n)\) is a (zero mean) \(\Theta\)-periodic spline w.r.t. the iterated derivative operator \(D^k\) [Periodic]. It models short-term cyclical temporal variations of the signal \(f\).

The trend component \(f_T(t):=\sum_{m=1}^M \beta_m \psi_{Q}(t-\eta_m)+\sum_{q=1}^{Q}\gamma_q t^{q-1}\) is a spline w.r.t. the iterated derivative operator \(D^Q\). It is the sum of a piecewise polynomial term \(\sum_{m=1}^M \beta_m \psi_{Q}(t-\eta_m)\) and a degree \(Q-1\) polynomial \(\sum_{q=1}^{Q}\gamma_q t^{q-1}\) [Splines]. It models long-term non-cyclical temporal variations of \(f\).

Note

Note that the mean of \(f\) could be assigned to the seasonal or trend component interchangeably, or even split between the two components. Constraining the seasonal component to have zero mean allows us to fix this unidentifiability issue.

Fitting Procedure

The method fit() recovers the coefficients \(\mathbf{a}=[\alpha_1,\ldots,\alpha_N]\in\mathbb{R}^N\), \(\mathbf{b}=[\beta_1,\ldots,\beta_M]\in\mathbb{R}^M\) and \(\mathbf{c}=[\gamma_1,\ldots,\gamma_{Q}]\in\mathbb{R}^{Q}\) from the data \(\mathbf{y}=[y_1,\ldots,y_L]\in\mathbb{R}^L\) as minima at posteriori (MAP) of the negative log-posterior distribution:

\[(\hat{\mathbf{a}}, \hat{\mathbf{b}}, \hat{\mathbf{c}})= \arg\min_{({\mathbf{a}}, {\mathbf{b}},{\mathbf{c}})\in\mathbb{R}^N\times\mathbb{R}^M\times \mathbb{R}^Q } \frac{1}{p}\left\|\mathbf{y}-\mathbf{K} \mathbf{a} -\mathbf{L} \mathbf{b} - \mathbf{V} \mathbf{c}\right\|_p^p \,+\, \lambda \left(\theta \|\mathbf{a}\|_1 + (1-\theta)\|\mathbf{b}\|_1\right) \,+\, \iota(\mathbf{1}^T\mathbf{a}) \tag{2}\]

where:

  • \(p\in \{1,2\}\) is chosen according the the noise distribution (\(p=1\) in the presence of outliers),

  • \(\mathbf{K}\in\mathbb{R}^{L\times N}\) is given by: \(\mathbf{K}_{\ell n}=\rho_{P}(t_\ell-\theta_j)\),

  • \(\mathbf{L}\in\mathbb{R}^{L\times M}\) is given by: \(\mathbf{L}_{\ell m}=\psi_{Q}(t_\ell-\eta_m)\),

  • \(\mathbf{V}\in\mathbb{R}^{L\times Q}\) is a Vandermonde matrix given by: \(\mathbf{V}_{\ell q}=t^{q-1}_\ell\),

  • \(\lambda>0\) and \(\theta\in [0,1]\) are regularisation parameters,

  • \(\iota:\mathbb{R}\to\{0, +\infty\}\) is the indicator function returning \(0\) if the input is zero and \(+\infty\) otherwise.

If requested by the user, the penalty parameter \(\lambda\) can be learnt from the data by introducing a Gamma hyper-prior and finding the MAP of the posterior on \({\mathbf{a}}, {\mathbf{b}},{\mathbf{c}}\) and \(\lambda\) jointly [Tuning].

The \(R^2\)-score of the regression is provided by the method r2score().

Uncertainty Quantification

The method sample_credible_region() returns approximate marginal pointwise credible intervals for the model parameters, the seasonal and trend components and their sum. This is achieved by sampling uniformly (via the hit-and-run algorithm) the approximate highest density posterior credible region [Credible]:

\[C_{\xi}=\left\{({\mathbf{a}}, {\mathbf{b}},{\mathbf{c}})\in\mathbb{R}^N\times\mathbb{R}^M\times \mathbb{R}^Q: J({\mathbf{a}}, {\mathbf{b}},{\mathbf{c}})\leq J(\hat{\mathbf{a}}, \hat{\mathbf{b}}, \hat{\mathbf{c}}) + (N+M+Q) (\nu_\xi+1)\right\} \tag{3}\]

where \(\xi\in [0,1]\) is the confidence level associated to the credible region, \(\nu_\xi:=\sqrt{16\log(3/\xi)/(N+M+Q)}\), and \(J:\mathbb{R}^N\times\mathbb{R}^M\times \mathbb{R}^Q\to \mathbb{R}_+\) is the negative log-posterior in (3).

Approximate marginal pointwise credible intervals are then obtained by evaluating (1) (and the seasonal/trend components) for the various samples of \(C_{\xi}\) gathered and then taking the pointwise minima and maxima of all the sample curves.

Note that (3) can also be used for hypothesis testing on the parameters of the model (see method is_credible() for more on the topic).

Reference Documentation

The seasonal-trend regression is performed by means of the SeasonalTrendRegression class described below:

class SeasonalTrendRegression(sample_times: numpy.ndarray, sample_values: numpy.ndarray, period: numbers.Number, forecast_times: numpy.ndarray, seasonal_forecast_times: numpy.ndarray, nb_of_knots: Tuple[int, int] = (32, 32), spline_orders: Tuple[int, int] = (3, 2), penalty_strength: Optional[float] = None, penalty_tuning: bool = True, test_times: Optional[numpy.ndarray] = None, test_values: Optional[numpy.ndarray] = None, robust: bool = False, theta: float = 0.5, dtype: type = <class 'numpy.float64'>, tol: float = 0.001)

Bases: object

Seasonal-trend regression of a noisy and non-uniform time series.

Parameters
  • sample_times (numpy.ndarray) – Sample times \(\{t_1, \ldots, t_L\}\subset \mathbb{R}\).

  • sample_values (numpy.ndarray) – Observed values of the signal \(\{y_1, \ldots, y_L\}\subset \mathbb{R}\) at the sample times. Can be noisy.

  • period (Number) – Period \(\Theta>0\) for the seasonal component.

  • forecast_times (numpy.ndarray) – Unobserved times for forecasting of the signal and its trend component.

  • seasonal_forecast_times (numpy.ndarray) – Unobserved times in \([0,\Theta)\) for forecasting of the seasonal component.

  • nb_of_knots (Tuple[int, int]) – Number of knots \((N, M)\) for the seasonal and trend component respectively. High values of \(N\) and \(M\) can lead to numerical instability.

  • spline_orders (Tuple[int, int]) – Exponents \((P,Q)\) of the iterated derivative operators defining the splines involved in the seasonal and trend component respectively. Both parameters must be strictly bigger than one.

  • penalty_strength (Optional[Number]) – Value of the penalty strength \(\lambda\in \mathbb{R}_+\).

  • penalty_tuning (bool) – Whether or not the penalty strength \(\lambda\) should be learnt from the data.

  • test_times (Optional[numpy.ndarray]) – Optional test times to assess the regression performances.

  • test_values (Optional[numpy.ndarray]) – Optional test values to assess the regression performances.

  • robust (bool) – If True, \(p=1\) is chosen for the data-fidelity term in (2), otherwise \(p=2\). Should be set to True in the presence of outliers.

  • theta (float) – Value of the balancing parameter \(\theta\in [0,1]\). Setting theta as 0 or 1 respectively removes the penalty on the seasonal or trend component in (2).

  • dtype (type) – Types of the various numpy arrays involved in the estimation procedure.

  • tol (float) – Tolerance for computing Lipschitz constants. For experimented users only!

Raises

ValueError – If self.sample_times.size != self.sample_values.size, len(self.spline_orders) != 2, 1 in self.spline_orders, len(self.nb_of_knots) != 2 or not 0 <= theta <= 1.

fit(max_outer_iterations: int = 10, max_inner_iterations: int = 10000, accuracy_parameter: float = 1e-05, accuracy_hyperparameter: float = 0.001, verbose: Optional[int] = None) → Tuple[numpy.ndarray, float]

Fit the model (1) by solving (2).

Parameters
  • max_outer_iterations (int) – Maximum number of outer iterations for auto-tuning of the penalty strength \(\lambda\).

  • max_inner_iterations (int) – Maximum number of inner iterations when solving (2) with a fixed value of \(\lambda\).

  • accuracy_parameter (float) – Minimum relative improvement in the iterate of the numerical solver for (2).

  • accuracy_hyperparameter (float) – Minimum relative improvement of the penalty strength.

  • verbose (Optional[int]) – Verbosity level of the method.

Returns

Estimates of \((\hat{\mathbf{a}}, \hat{\mathbf{b}}, \hat{\mathbf{c}})\) concatenated as a single array with size \(N+M+Q\) and the auto-tuned penalty parameter \(\lambda\).

Return type

Tuple[np.ndarray, float]

r2score(dataset: str = 'training') → dict

\(R^2-score\) of the regression.

Parameters

dataset (['training', 'test']) – Dataset on which to evaluate the \(R^2-score\).

Returns

Dictionary containing the \(R^2-scores\) of the seasonal and trend components as well as the sum of the two.

Return type

dict

predict() → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Predict the values of the signal and its seasonal/trend components at times specified by the attributes self.forecast_times and self.seasonal_forecast_times.

Returns

Predicted values for the seasonal, trend and sum of the two respectively.

Return type

Tuple[np.ndarray, np.ndarray, np.ndarray]

sample_credible_region(n_samples: float = 100000.0, credible_lvl: float = 0.01, return_samples: bool = False, seed: int = 1, subsample_by: int = 100) → Tuple[dict, dict, Optional[dict]]

Sample approximate credible region and returns pointwise marginal credible intervals.

Parameters
  • n_samples (float) – Number of samples.

  • credible_lvl (float) – Credible level \(\xi\in [0,1]\).

  • return_samples (bool) – If True, return the samples on top of the pointwise marginal credible intervals.

  • seed (int) – Seed for the pseudo-random number generator.

  • subsample_by – Subsampling factor (1 every subsample_by samples are stored if return_samples==True).

Returns

Dictionaries with keys {‘coeffs’, ‘seasonal’, ‘trend’, ‘sum’}. The values associated to the keys of the first two dictionaries are the minimum and maximum of the credible intervals of the corresponding component. The values associated to the keys of the last dictionary are arrays containing credible samples of the corresponding component.

Return type

Tuple[dict, dict, Optional[dict]]

is_credible(coeffs: numpy.ndarray, credible_lvl: Optional[float] = None, gamma: Optional[float] = None) → Tuple[bool, bool, float]

Test whether a set of coefficients are credible for a given confidence level.

Parameters
  • coeffs (np.ndarray) – Coefficients to be tested (array of size (N+M+Q)).

  • credible_lvl (Optional[float]) – Credible level \(\xi\in[0,1]\).

  • gamma (Optional[float]) – Undocumented. For internal use only.

Returns

The set of coefficients are credible if the product of the first two output is 1. The last output is for internal use only.

Return type

Tuple[bool, bool, float]

plot_data(fig: Optional[int] = None, test_set: bool = True)

Plot the training and test datasets.

Parameters
  • fig (Optional[int]) – Figure handle. If None creates a new figure.

  • test_set – Optional test dataset.

plot_green_functions(fig: Optional[int] = None, component: str = 'trend')

Plot the shifted Green functions involved in the parametric model (1).

Parameters
  • fig (Optional[int]) – Figure handle. If None creates a new figure.

  • component (['seasonal', 'trend']) – Component for which the shifted Green functions should be plotted.

summary_plot(fig: Optional[int] = None, min_values: Optional[dict] = None, max_values: Optional[dict] = None)

Plot the result of the regression.

Parameters
  • fig (Optional[int]) – Figure handle. If None creates a new figure.

  • min_values (Optional[dict]) – Minimum of the pointwise marginal credible intervals for the various components. Output of sample_credible_region().

  • max_values (Optional[dict]) – Maximum of the pointwise marginal credible intervals for the various components. Output of sample_credible_region().

plot_seasonal_component(fig: Optional[int] = None, samples_seasonal: Optional[numpy.ndarray] = None, min_values: Optional[numpy.ndarray] = None, max_values: Optional[numpy.ndarray] = None)

Plot the seasonal component.

Parameters
  • fig (Optional[int]) – Figure handle. If None creates a new figure.

  • samples_seasonal (Optional[np.ndarray] = None) – Sample curves for the seasonal component. Output of sample_credible_region().

  • min_values (Optional[np.ndarray]) – Minimum of the pointwise marginal credible intervals for the seasonal component. Output of sample_credible_region().

  • max_values (Optional[np.ndarray]) – Maximum of the pointwise marginal credible intervals for the seasonal component. Output of sample_credible_region().

plot_trend_component(fig: Optional[int] = None, samples_trend: Optional[numpy.ndarray] = None, min_values: Optional[numpy.ndarray] = None, max_values: Optional[numpy.ndarray] = None)

Plot the trend component.

Parameters
  • fig (Optional[int]) – Figure handle. If None creates a new figure.

  • samples_trend (Optional[np.ndarray] = None) – Sample curves for the trend component. Output of sample_credible_region().

  • min_values (Optional[np.ndarray]) – Minimum of the pointwise marginal credible intervals for the trend component. Output of sample_credible_region().

  • max_values (Optional[np.ndarray]) – Maximum of the pointwise marginal credible intervals for the trend component. Output of sample_credible_region().

plot_sum(fig: Optional[int] = None, samples_sum: Optional[numpy.ndarray] = None, min_values: Optional[numpy.ndarray] = None, max_values: Optional[numpy.ndarray] = None)

Plot the sum of the seasonal and trend component.

Parameters
  • fig (Optional[int]) – Figure handle. If None creates a new figure.

  • samples_sum (Optional[np.ndarray] = None) – Sample curves for the sum of the two component. Output of sample_credible_region().

  • min_values (Optional[np.ndarray]) – Minimum of the pointwise marginal credible intervals for the sum of the two components. Output of sample_credible_region().

  • max_values (Optional[np.ndarray]) – Maximum of the pointwise marginal credible intervals for the sum of the two components. Output of sample_credible_region().

Examples

Atmospheric CH4

import numpy as np
import matplotlib.pyplot as plt
from prism import SeasonalTrendRegression
from prism import ch4_mlo_dataset

plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['figure.dpi'] = 100

data, data_test = ch4_mlo_dataset()

period = 1
sample_times = data.time_decimal.values
sample_values = data.value.values
test_times = data_test.time_decimal.values
test_values = data_test.value.values
forecast_times = np.linspace(test_times.min(), test_times.max(), 4096)
seasonal_forecast_times = np.linspace(0, period, 1024)

streg = SeasonalTrendRegression(sample_times=sample_times, sample_values=sample_values, period=period,
                                forecast_times=forecast_times, seasonal_forecast_times=seasonal_forecast_times,
                                nb_of_knots=(32, 32), spline_orders=(3, 2), penalty_strength=1, penalty_tuning=False,
                                test_times=test_times, test_values=test_values, robust=False, theta=0.5)
streg.plot_data()
xx, mu = streg.fit(verbose=1)
r2score = streg.r2score()
min_values, max_values, samples = streg.sample_credible_region(return_samples=True, subsample_by=1000)
streg.summary_plot(min_values=min_values, max_values=max_values)
streg.plot_seasonal_component(min_values=min_values['seasonal'], max_values=max_values['seasonal'],
                      samples_seasonal=samples['seasonal'])
streg.plot_trend_component(min_values=min_values['trend'], max_values=max_values['trend'],
                           samples_trend=samples['trend'])
streg.plot_sum(min_values=min_values['sum'], max_values=max_values['sum'], samples_sum=samples['sum'])

(Source code)

_images/ch4_mlo-1_00.png

(png, hires.png, pdf)

_images/ch4_mlo-1_01.png

(png, hires.png, pdf)

_images/ch4_mlo-1_02.png

(png, hires.png, pdf)

_images/ch4_mlo-1_03.png

(png, hires.png, pdf)

_images/ch4_mlo-1_04.png

(png, hires.png, pdf)

Atmospheric CO2

import numpy as np
import matplotlib.pyplot as plt
from prism import SeasonalTrendRegression
from prism import co2_mlo_dataset

plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['figure.dpi'] = 100

data, data_test = co2_mlo_dataset()

period = 1
sample_times = data.time_decimal.values
sample_values = data.value.values
test_times = data_test.time_decimal.values
test_values = data_test.value.values
forecast_times = np.linspace(test_times.min(), test_times.max(), 4096)
seasonal_forecast_times = np.linspace(0, period, 1024)

streg = SeasonalTrendRegression(sample_times=sample_times, sample_values=sample_values, period=period,
                                forecast_times=forecast_times, seasonal_forecast_times=seasonal_forecast_times,
                                nb_of_knots=(32, 16), spline_orders=(3, 2), penalty_strength=1, penalty_tuning=True,
                                test_times=test_times, test_values=test_values, robust=False, theta=0.5)
streg.plot_data()
xx, mu = streg.fit(verbose=1)
min_values, max_values, samples = streg.sample_credible_region(return_samples=True)
streg.summary_plot(min_values=min_values, max_values=max_values)
streg.plot_seasonal_component(min_values=min_values['seasonal'], max_values=max_values['seasonal'],
                              samples_seasonal=samples['seasonal'])
streg.plot_trend_component(min_values=min_values['trend'], max_values=max_values['trend'],
                           samples_trend=samples['trend'])
streg.plot_sum(min_values=min_values['sum'], max_values=max_values['sum'], samples_sum=samples['sum'])

(Source code)

_images/co2_mlo-1_00.png

(png, hires.png, pdf)

_images/co2_mlo-1_01.png

(png, hires.png, pdf)

_images/co2_mlo-1_02.png

(png, hires.png, pdf)

_images/co2_mlo-1_03.png

(png, hires.png, pdf)

_images/co2_mlo-1_04.png

(png, hires.png, pdf)

US Electronic Equipment

import numpy as np
import matplotlib.pyplot as plt
from prism import SeasonalTrendRegression
from prism import elec_equip_dataset

plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['figure.dpi'] = 100

data = elec_equip_dataset()

period = 1
sample_times = data.time_decimal.values
sample_values = data.value.values
forecast_times = np.linspace(sample_times.min(), sample_times.max(), 4096)
seasonal_forecast_times = np.linspace(0, period, 1024)

streg = SeasonalTrendRegression(sample_times=sample_times, sample_values=sample_values, period=period,
                                forecast_times=forecast_times, seasonal_forecast_times=seasonal_forecast_times,
                                nb_of_knots=(32, 40), spline_orders=(3, 2), penalty_strength=1, penalty_tuning=True,
                                test_times=None, test_values=None, robust=False, theta=0.5)
streg.plot_data()
xx, mu = streg.fit(verbose=1)
min_values, max_values, samples = streg.sample_credible_region(return_samples=True)
streg.summary_plot(min_values=min_values, max_values=max_values)
streg.plot_seasonal_component(min_values=min_values['seasonal'], max_values=max_values['seasonal'],
                              samples_seasonal=samples['seasonal'])
streg.plot_trend_component(min_values=min_values['trend'], max_values=max_values['trend'],
                           samples_trend=samples['trend'])
streg.plot_sum(min_values=min_values['sum'], max_values=max_values['sum'], samples_sum=samples['sum'])

(Source code)

_images/elec_equip-1_00.png

(png, hires.png, pdf)

_images/elec_equip-1_01.png

(png, hires.png, pdf)

_images/elec_equip-1_02.png

(png, hires.png, pdf)

_images/elec_equip-1_03.png

(png, hires.png, pdf)

_images/elec_equip-1_04.png

(png, hires.png, pdf)

References

Periodic

Fageot, Julien, and Matthieu Simeoni. “TV-based reconstruction of periodic functions.” Inverse Problems 36.11 (2020): 115015.

Splines

Gupta, Harshit, Julien Fageot, and Michael Unser. “Continuous-domain solutions of linear inverse problems with Tikhonov versus generalized TV regularization.” IEEE Transactions on Signal Processing 66.17 (2018): 4670-4684.

Green

Unser, Michael, Julien Fageot, and John Paul Ward. “Splines are universal solutions of linear inverse problems with generalized TV regularization.” SIAM Review 59.4 (2017): 769-793.

Tuning

Pereyra, Marcelo, José M. Bioucas-Dias, and Mário AT Figueiredo. “Maximum-a-posteriori estimation with unknown regularisation parameters.” 2015 23rd European Signal Processing Conference (EUSIPCO). IEEE, 2015.

Credible

Pereyra, Marcelo. “Maximum-a-posteriori estimation with Bayesian confidence regions.” SIAM Journal on Imaging Sciences 10.1 (2017): 285-302.