Outlier detection in time series using DBSCAN and SVR

PricingHUB
17 min read · May 29, 2024


PricingHUB is a company specialized in optimizing the day-to-day trading of our clients. Naturally, many of our algorithms work with data organized as time series, and it is important for us to identify and treat outliers that could otherwise mislead our price optimization models. In this article, we explain the reasons that motivated us to design our own outlier detection routine for time series, and how we built it by combining the clustering algorithm DBSCAN with the regression algorithm SVR. The text is split as follows:

  • Brief explanation of what outliers are and why it is important to detect them
  • Why does PricingHUB need a routine different from the existing ones?
  • Intuitive explanation of our outlier detection routine
  • Assessment of the routine and comparison versus traditional routines using simulated outlier-affected data
  • Conclusions
  • Pseudocode

Brief explanation of what outliers are and why it is important to detect them

An outlier is an observation that is distant from the others and breaks the pattern or patterns reflected by the rest of the observations. If an observation deviates significantly from the vast majority of the other observations, it is an “isolated outlier”. Sometimes observations take anomalous values not when compared with the full data set, but only when compared with observations within the same context; in that case we refer to them as “contextual outliers”. If the outliers come as a group, we refer to them as “collective outliers”. These anomalous data have negative effects on models: biased estimates, increased variance, violation of the model’s assumptions, model instability, etc. It is usually easy to recognize them visually, but recognizing them automatically can be tricky.

Why does PricingHUB need a new routine?

To help our clients with their price optimization, at PricingHUB we consume data from our clients to feed our models. Much of it is transactional data, that is, number of sales, revenue, profit, and product information. Our clients send us this transactional information on a daily basis. Unfortunately, ETL/IT issues occur, raising some typical challenges that we need to face from time to time, like:

  • Client not sending the data for several days in a row.
  • Client sending partial data for several days in a row.
  • Client sending duplicated or partially duplicated data for several days in a row.

Typically, this low-quality transfer of data lasts a few days but, sometimes, it can take more than a week. Unfortunately, even if we spot these problems rapidly, we may have to wait days or weeks until the client fixes them and sends the right data. Thus, it is essential for us to detect those cases successfully and remove them so that they do not contaminate our models.

Note that the above ETL/IT issues produce “collective outliers” in the sense that they affect several data points that are consecutive in time, so they constitute a group when the data is ordered by time. We will refer to them as “time period outliers”. Their long duration could easily confuse even the most advanced outlier detection methodologies.

Advanced and robust outlier detection methodologies typically consist of fitting a model to the time series and measuring, in some way, the deviation between the model predictions and reality. Points with a “high” deviation are considered outliers; how to determine what “high” means depends on the methodology. One weak point of this “model + deviation” approach is that the existence of outliers can compromise/contaminate the forecast model itself and thus the outlier detection per se. In particular, the persistent duration of “time period outliers” makes the fitted model more prone to interpret them as outlier-free data, forcing the model to fit and explain them.

Concerned by the sensitivity of these types of outlier detection methods to “time period outliers”, we decided to try building our own outlier detection routine, robust to “time period outliers”. How? We got an idea… FILAMENTS!

Intuitive explanation of our routine

Intuitively, the idea behind our methodology is to think of the time series as a filament (see Figure 1). The underlying assumption is that trading KPIs should not change drastically from one day to another; they should change smoothly, creating a continuous and predictable pattern: the filament. “Isolated outliers” separate from the filament, inserting small holes in the filament of clean, outlier-free data, while “time period outliers” create a small, different filament, breaking the outlier-free data into several filaments and separating them apart. Then, if we manage to automatically separate and identify the different filaments, we have most of the job done.

Figure 1: Time series broken into smaller substructures of filamentary type.

Luckily, there is a well-known algorithm that is very efficient at detecting filaments: DBSCAN!

DBSCAN takes one data point, creates a hypersphere of radius epsilon around it in the multidimensional space defined by the predictor dimensions, and checks how many other points fall within the hypersphere. If a minimum number of points fall inside, a cluster is created consisting of the points within the hypersphere. Then each point in the cluster creates its own hypersphere and identifies new points falling within it; the new points are added to the cluster. This process is repeated for each new element of the cluster. Points in low-density regions that do not have, in their vicinity (within the hypersphere), other points belonging to clusters are automatically considered outliers. The way DBSCAN creates clusters makes it very efficient at detecting filamentary patterns. To learn the basics of DBSCAN, its differences with other popular clustering techniques, and why it is so efficient on filamentary patterns, you can have a look at this article.

Figure 2: Performance comparison of different types of clustering algorithms. DBSCAN is a kind of Density Clustering (3rd column). When data is structured following a filamentary shape DBSCAN performs well in separating the different filaments.
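To make this concrete, here is a minimal sketch of running DBSCAN on a 2D time series, using the dbscan R package (one possible implementation; the article does not name a specific library). eps is the hypersphere (here: circle) radius and minPts the minimum number of points needed to form a cluster; points labeled 0 are DBSCAN’s noise points, i.e. potential outliers.

    # Minimal DBSCAN sketch on a 2D time series (dbscan package assumed)
    library(dbscan)

    xy <- cbind(x = seq_along(y), y = y)   # time index and metric
    db <- dbscan(xy, eps = 3, minPts = 5)  # radius and density parameters
    clusters <- db$cluster                 # one filament id per point, 0 = noise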

By this point you have probably realized how convenient the properties of DBSCAN are for our idea of treating the time series as a combination of filaments. On one side, it can automatically detect “isolated outliers” and “contextual outliers” in our time series. Plus, it will naturally tend to separate into different clusters the outlier-free data and the “time period outliers”, as the latter come in the form of shorter, separated filaments, which we will name “outlier filaments” (see the illustration in Figure 1 for clarification). Again, note that the “outlier-free filament” may be chopped into several pieces by the presence of “outlier filaments”.

Some challenges arise:

  • Challenge 1: We intend to use DBSCAN on a 2D problem, where time is one dimension and the second dimension is the metric the time series tracks, which can be sales, revenue, profit, etc. DBSCAN’s hyperspheres in this 2D case are circles, but our two dimensions have completely different scales. For example, if our time series represents the revenue generated by the sales of a product, then the Y-axis is in € while the X-axis is in time units. This creates a problem when defining the radius of the circles (the hyperspheres) used by DBSCAN. We would need to work with something closer to ellipses than circles to account for the differences in scale.
  • Challenge 2: DBSCAN separates “time period outliers” and outlier-free data into different clusters/filaments, but we are not yet able to tell automatically, without visual inspection, which cluster is an “outlier filament” and which belongs to the outlier-free filaments.
  • Challenge 3: Strong seasonality has the effect of moving consecutive observations away from each other, which in turn prevents consecutive data from falling inside the hyperspheres of their neighbors. This effect of seasonality works against the formation of long filaments.

To solve Challenge 1, we scale the X-axis and the Y-axis in the following way:

  • For the X-axis, we first make sure there are no holes in the time series; if there are, we fill them with values 100 times bigger than the maximum value in the time series. This way they will be identified as outliers (otherwise our outlier detection routine would be doing a terrible job!). Then we consider time as measured in time units, so consecutive points in the time series are at a distance of 1 time unit (which could mean 1 week, 1 day, 1 hour…).
  • Because of the above treatment of the X-axis, we need the Y-axis to have a typical distance of 1 between consecutive points. If we achieve this, the X and Y axes are on the same scale and we can work with circles, no longer needing ellipses. To do so, we calculate the distances between consecutive points,

dᵢ = |Yᵢ₊₁ − Yᵢ|

for all i, with i indicating the observation at time i. Then we calculate the median of the resulting values and substitute Yᵢ by Yᵢ divided by the obtained median. With this transformation of the Y-axis, two consecutive points are at a Y-distance of

|Yᵢ₊₁ − Yᵢ| / median({dⱼ})

Considering all possible values of i, this gives a distribution concentrated around 1.
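In R syntax (matching the pseudocode at the end of the article), this scaling is a minimal three-liner: time becomes 1, 2, …, n, and Y is divided by the median absolute difference between consecutive points, so a typical consecutive distance is roughly 1 on both axes.

    x_scaled <- seq_along(y)    # consecutive points are 1 time unit apart
    d        <- abs(diff(y))    # distances between consecutive Y values
    y_scaled <- y / median(d)   # typical consecutive Y-distance becomes ~1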

The above transformation allows us to run DBSCAN efficiently on the scaled two-dimensional space to automatically identify “isolated” and “contextual” outliers and to find filaments. Now, let’s try to solve Challenge 2: which filaments belong to the outlier-free time series and which filaments represent a “time period outlier”? Our intention is to identify the patterns or trends behind the clean, outlier-free data; filaments separated from this pattern will be identified as outliers. The problem is that we still don’t know which filaments are clean and which are outliers, so we cannot tell the pattern drawn by the clean data alone.

We work by approximation: first, we assume that outliers (of any kind) do not represent a large proportion of the total observations; second, we use regression techniques to capture the general pattern drawn by the majority of the points as an approximation of the pattern described by the clean data. In order to minimize the effect of the outliers when capturing this pattern, we prefer a regression model that is biased and as insensitive to outliers as possible. The fact that errors in SVR models affect the loss function through the absolute error rather than quadratically makes SVR less sensitive to outliers than other regression models, and thus very convenient for our purpose.

Hence, we use an SVR model as the regressor and keep it biased by limiting the predictors to the time variable and the seasonality (for example, for intra-week seasonality we would use the weekday as a predictor). We use a radial kernel to gain a bit of flexibility in case the pattern of the clean data is not fully linear. The cost hyperparameter of the SVR also helps with the sensitivity to outliers and the bias-overfitting balance: the lower the cost, the stiffer the model and the less prone to overfit.

We fit our SVR to the scaled data and calculate the residual for each data point of the time series. Then, we calculate the average absolute residual within each cluster (i.e., filament). The filament with the highest average residual is the one that deviates the most from the pattern described by the regression model, and thus becomes our main suspect of being an “outlier filament” and, consequently, a “time period outlier”.
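Continuing the sketches above (and assuming clusters holds the DBSCAN labels computed on the scaled data, with 0 marking noise, and wd an assumed weekday factor), a minimal version of this step using the e1071 package — one possible SVR implementation; the article does not name a library — looks like this:

    library(e1071)

    df  <- data.frame(x = x_scaled, wd = factor(wd), y = y_scaled)
    fit <- svm(y ~ x + wd, data = df, kernel = "radial", cost = 0.01)

    abs_res <- abs(df$y - predict(fit, df))             # absolute residuals
    keep    <- clusters != 0                            # drop DBSCAN noise
    avg_res <- tapply(abs_res[keep], clusters[keep], mean)
    suspect <- names(which.max(avg_res))                # main "outlier filament" suspect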

I know what you are thinking: “Wait a minute! You told me before that the drawback of the ‘model + deviation’ methodologies was that the fitted model was affected by the outliers. Aren’t you suffering from the same problem? You are fitting the SVR model before finding the outliers, so your fit might be biased, and therefore the outlier identification itself!” Almost got me. To alleviate this problem, we do the fitting and outlier identification in an iterative manner. The process goes like this:

  1. Identify and remove “isolated outliers”, “contextual outliers”, and missing/NA values.
  2. Create filaments with DBSCAN.
  3. Model Y as a function of time and seasons with SVR and calculate residuals.
  4. Identify THE filament with the HIGHEST average absolute residual (note the equivalence to a MAE); if this average absolute residual is above a threshold, consider it an “outlier filament”.
  5. Remove the “outlier filament” from the original time series by replacing its original Y values with NA, and REPEAT the process.

Since we are aware that the models are not perfect due to the presence of outliers, in each iteration we take the safe decision of ONLY removing the most evident or most suspicious “time period outlier” at a time. By proceeding iteratively we improve our SVR in each cycle, as we wipe outliers from the time series in every iteration, gaining confidence in the model and thus in the next outlier identification.

There is (or rather, there was) a big drawback in our routine: the total number of tuning parameters. The routine inherits all tuning parameters of the SVR and DBSCAN, plus the new threshold that tells whether the average absolute residual of a cluster/filament is high enough to be considered an outlier. However, we found a trick that allowed us to substitute all tuning parameters by only one. By modifying our rescaling of the Y-axis to

Yᵢ → Yᵢ / (median_factor × median({dⱼ}))

we introduce a new tuning parameter, median_factor. Note that, by definition, this new parameter is directly or indirectly linked with the effects of tuning the radius of the hypersphere, tuning the number of points that must fall in the hyperspheres to form clusters, and tuning the average absolute residual threshold. For example, by increasing median_factor above 1 we shrink the Y-axis, making it easier for points to enter the hypersphere, as the distance between consecutive points shrinks in the Y-axis direction. At the same time, increasing median_factor makes the scaled Y values smaller, thus reducing the absolute values of the residuals. We have found that, fixing the following parameters as…

  • Tuning parameters in DBSCAN

→ hypersphere radius in DBSCAN: 3

→ minimum number of points to fall in the hypersphere to consider a cluster: 5

  • Tuning parameters in SVR:

→ kernel: radial

→ cost parameter: 0.01

  • Average absolute residual threshold: 2.5

…it is relatively easy to adapt to practically all time series we have worked with at PricingHUB by only playing with median_factor values ranging from 0.8 to 1.6.

We have not explicitly explained how we deal with Challenge 3, the one related to seasonality, but note that we tackle it in two ways. On one side, we include the seasonal pattern in our regression models, so residuals are not influenced by seasonality. On the other, we tackle the fact that seasonality separates consecutive points along the Y-axis by adjusting median_factor: incrementing it shrinks the Y-axis, diminishing the seasonal stretching. Indeed, a proper value of median_factor depends on how strong the seasonal pattern of the time series is.

In the following section we assess our routine using a simulated time series and compare our results with those obtained using other methodologies.

Assessment of the routine and comparison versus traditional routines using simulated outlier-affected data

In order to illustrate the behavior of the different outlier detection routines, including ours, we simulate a time series (see Figure 3) with daily periodicity that suffers from all possible challenges (a sketch of how a comparable series could be simulated follows Figure 3):

  • It has an important trend.
  • It shows strong intra-week seasonality.
  • It contains sporadic periods of extreme values (in orange).
  • It contains 1 period of missing data (in purple).
  • It contains 1 period of partial data; we have assumed that the client only transferred 30% of the total data (in green).
  • It contains 1 period of duplicated data (in red).
  • Plus, white noise with a standard deviation of 100 units is introduced.
Figure 3: Simulated time series with intra-week seasonality (blue points+lines), isolated outliers (orange points), and “time period” outliers (green, purple, and red points+lines).
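As a reference, here is a minimal sketch of how a comparable series could be simulated. The exact values behind Figure 3 are not published, so every number below is an assumption chosen for illustration.

    set.seed(1)
    n <- 365
    t <- 1:n
    y <- 1000 + 3 * t +              # important trend
      300 * sin(2 * pi * t / 7) +    # strong intra-week seasonality
      rnorm(n, sd = 100)             # white noise, sd = 100
    idx <- sample(t, 8)              # sporadic extreme values (orange)
    y[idx] <- y[idx] * 3
    y[150:157] <- NA                 # 1 period of missing data (purple)
    y[200:210] <- 0.3 * y[200:210]   # 1 period of partial data, 30% (green)
    y[250:260] <- 2 * y[250:260]     # 1 period of duplicated data (red)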

We ran different outlier methodologies to try to identify the outliers in the simulated time series. Results are shown in Figure 7 later on but, first, among all tested methods we would like to focus on two of them apart from ours: the error standard deviation method and the Hampel identifier. In our experience, these two methods normally perform very accurately in detecting outliers. We are going to use them as a baseline benchmark for our new outlier identifier.

The “error standard deviation method” (e-SDM for short) is an outlier identifier of the “model + deviation” type and is described by Nicolas Vandeput in this nice article. It works by creating a forecast model, taking the forecast errors, measuring the standard deviation of those errors, and considering as outliers any data points whose real value is X times the standard deviation away from the model prediction (X typically taking values of 3 or 4). We have tested three different underlying forecast models: an ARIMA model, a TBATS model, and a Prophet model. We show in more detail the results using Prophet as the underlying forecast model because of its special resistance to outliers.
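The idea fits in a few lines. Below is a minimal, model-agnostic sketch of the e-SDM logic; a simple linear trend + weekday model stands in for the ARIMA/TBATS/Prophet models used in the article, wd is an assumed weekday factor, and y is assumed to contain no NAs.

    error_sd_outliers <- function(y, wd, x_sd = 3) {
      t   <- seq_along(y)
      fit <- lm(y ~ t + wd)             # stand-in for the forecast model
      res <- residuals(fit)
      which(abs(res) > x_sd * sd(res))  # points X sd away from the fit
    }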

Prophet is a procedure for forecasting time series data developed by Facebook in 2017 and based on an additive model. It can take into account different seasonal patterns and holidays, and is claimed to work well with outliers, missing data, and shifts in trends. Figure 4 shows the outlier identification results for the error standard deviation method using Prophet. Default Prophet parameters were used in the fitting, with weekly seasonality set to TRUE. The error measured by Prophet is used in the identifier. As anticipated, this routine finds it difficult to interpret the “time period outliers” and wrongly identifies part of them as belonging to the clean, outlier-free time series. However, it is important to note that the misidentification does not happen because the Prophet model tries to fit the outlier data, but because the error range increases as a consequence of the presence of the outliers, letting some “bad” data fall within the error range, which validates them and tags them as “no outliers”. The error standard deviation method with underlying forecast models other than Prophet, like ARIMA and TBATS, fails to detect the “time period outliers” because it considers at least part of those points as part of the real trend of the time series (see Figure 7). Note that this strengthens the idea of Prophet being robust to outliers.

Figure 4. The original time series affected by outliers is shown in blue. Outliers identified by the e-SDM routine are shown in orange; outliers not identified by the e-SDM routine, in green.

Figure 5 shows the outlier detection results achieved by the Hampel identifier. This routine is also of the “model + deviation” type. It is conceptually similar to the classical outlier approach of using a moving average and a standard deviation but, instead of a moving average, it uses a moving-window median (the model) and, instead of the standard deviation, it uses a local median absolute deviation, MAD (the deviation). The length of the window used for the moving median is a tuning parameter. We have tested many possible values for the window length; the figure shows the results for a window length of 25, which was found to best identify the outliers. Note that this routine has not been able to identify the days affected by “partial data” as outliers and misses many of the outliers in the period of duplicated data.
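For reference, here is a minimal sketch of a Hampel identifier written directly from its definition (rolling median plus local scaled MAD). Note that the window here is parameterized by its half-width k, which may differ from the window-length convention used in the article.

    hampel_outliers <- function(y, k = 12, t0 = 3) {
      n <- length(y)
      flagged <- logical(n)
      for (i in seq_len(n)) {
        w    <- y[max(1, i - k):min(n, i + k)]  # local window around point i
        med  <- median(w)
        mad0 <- 1.4826 * median(abs(w - med))   # scaled local MAD
        flagged[i] <- mad0 > 0 && abs(y[i] - med) > t0 * mad0
      }
      which(flagged)
    }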

Figure 5. The original time series affected by outliers is shown in blue. Outliers identified by the Hampel identifier are shown in orange. Multiple values of the window-length tuning parameter, k, were tried, the best performer being k=25; results for k=25 are shown. Outliers not identified are shown in green.

Figure 6 shows the outlier detection results achieved by the PricingHUB routine. With the tuning parameter median_factor set to 1.4, the routine correctly identifies all outliers.

Figure 6. Original time series affected by outliers is shown in blue. Outliers identified by the PricingHUB identifier are shown in orange. Best performance was achieved with a tuning parameter median_factor equal to 1.4. All outliers are identified correctly.

In Figure 7 we show an aggregated comparison of the points identified as outliers by the different routines tested. Apart from the e-SDM using Prophet and the Hampel identifier, other e-SDM routines using ARIMA and TBATS are shown, as well as simpler outlier detection techniques such as winsorizing and the standard deviation methodology (for more information about them we refer again to Nicolas Vandeput’s article). The dots in the figure represent the dates identified as outliers by each routine. The real outliers of the simulated time series are also shown for proper comparison; note that we only know the real outliers because we are in a simulated environment, while in the real world we would not have access to this ground truth. As expected, the more simplistic methodologies have difficulties detecting the outliers in a time series as complex as the one under study, performing especially badly on “time period” outliers.

Figure 7. Points represent the location on the X-axis (date) of the outliers identified by each methodology. The locations (dates) of the real outliers are known because we are in a controlled environment, since the time series is simulated and the outliers have been introduced artificially.

In summary… that is great! We succeeded in our target! Our outlier identification routine works accurately and achieves the desired objective of properly identifying the “time period outliers” where other routines struggle. For the proposed time series it produces no FALSE NEGATIVES and no FALSE POSITIVES. Is it a kind of magic? Does it always work well? Of course not. We have identified some corner cases where our routine does not work well. For example, a sudden change of scale in the Y-axis, together with the presence of some isolated outliers in the vicinity of that change of scale, makes our routine vulnerable (see Figure 8). This is because, on one hand, the isolated outliers break the outlier-free filament into smaller filaments (see filaments 1 and 2 in Figure 8) and, on the other hand, the SVR cannot adapt to the abrupt change of scale. As a consequence of the combination of the two, one small filament at the edge of the cliff (filament 2 in Figure 8) gets widely separated from the SVR model, and is thus wrongly considered a “time period outlier”. Further fine tuning of the hyperparameters, beyond median_factor, is needed to overcome these challenges.

Figure 8. Performance of the PricingHUB outlier identifier for the case of a time series with an abrupt change of scale and the presence of outliers in the vicinity of the abrupt change. Original time series is shown in blue. Points identified as outliers are shown in orange. Real outliers are shown in green. Pattern estimated by the regressor SVR is shown in red. Filaments separated from the pattern of the regressor are considered outliers.

Conclusions

We have developed an outlier identifier combining the clustering algorithm DBSCAN and the SVR regressor. This routine performs better than classical outlier identifiers such as winsorizing and the standard deviation methodologies thanks to the fact that it controls for the existence of trends and seasonality through the regression models. It is also robust when facing collective outliers that persist in time for many consecutive periods, which can be challenging even for advanced outlier identifiers based on “model + deviation”. This is thanks to 1) the density clustering, which is able to separate the collective outliers from the clean data, and 2) the iterative process, which improves the outlier identification with each iteration.

In addition, our experience with the routine suggests that its complexity can, in practice, be reduced to fine tuning one single hyperparameter, which has proven enough to work satisfactorily on a wide range of real time series.

Pseudocode (using R syntax)

  1. Set median_factor close to 1.4.
  2. Read the time series TS(X, Y): X being the temporal information.
  3. Complete the time series TS, filling Y = NA for missing X, and store the result in TS’.
  4. In TS’, substitute NA values of Y by 100 × max(Y) so they can be identified as evident outliers later on.
  5. Calculate median(abs(diff(Y))).
  6. Correct the scale of Y: substitute Y by Y / (median_factor × median(abs(diff(Y)))).
  7. Correct the scale of X: substitute X by the sequence 1:length(X).
  8. Run DBSCAN on the bidimensional space {X, Y} with eps = 3 and minPts = 5. Get the clusters and the cluster of each observation (Xᵢ, Yᵢ).
  9. Remove isolated and contextual outliers, automatically identified by DBSCAN (cluster number = 0), from TS’. Store the outlier-free time series in TS’’.
  10. Fit an SVR model on TS’’: the response is Y, the predictors are X and the seasons (in our example, weekdays).
  11. Use the fitted SVR model to predict Y on TS’.
  12. Calculate the residuals on TS’ for each (Xᵢ, Yᵢ) and store them in residual.svr.
  13. Group by cluster and calculate the average absolute residual.svr per cluster.
  14. Identify the cluster with the highest average absolute residual.svr:
     a. if the value is lower than 2.5, end the process;
     b. otherwise:
        i. consider the cluster with the highest average absolute residual.svr an “outlier filament”;
        ii. for all (Xᵢ, Yᵢ) belonging to the “outlier filament”, set Yᵢ to NA in the original time series TS;
        iii. go to step 3 and repeat the process.
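Finally, a minimal end-to-end sketch of the routine above in runnable R. The package choices (dbscan, e1071) and helper names are ours; the article specifies only the algorithms and the parameter values. The sketch assumes the series is already complete (one row per date) and adds a small guard to stop when the flagged filament contains nothing new to remove.

    library(dbscan)
    library(e1071)

    detect_outliers <- function(dates, y, median_factor = 1.4,
                                eps = 3, min_pts = 5,
                                cost = 0.01, threshold = 2.5) {
      repeat {
        # Step 4: evident outlier values where Y is missing
        y_filled <- ifelse(is.na(y), 100 * max(y, na.rm = TRUE), y)
        # Steps 5-7: scale both axes so consecutive distances are ~1
        x_scaled <- seq_along(y_filled)
        y_scaled <- y_filled / (median_factor * median(abs(diff(y_filled))))
        # Step 8: filaments via DBSCAN (cluster 0 = isolated/contextual outliers)
        cl <- dbscan(cbind(x_scaled, y_scaled), eps = eps, minPts = min_pts)$cluster
        keep <- cl != 0
        # Steps 9-10: biased SVR fitted on the outlier-free filaments
        df  <- data.frame(x = x_scaled, wd = factor(weekdays(dates)), y = y_scaled)
        fit <- svm(y ~ x + wd, data = df[keep, ], kernel = "radial", cost = cost)
        # Steps 11-13: average absolute residual per filament
        abs_res <- abs(df$y - predict(fit, df))
        avg_res <- tapply(abs_res[keep], cl[keep], mean)
        # Step 14: stop, or wipe the worst filament from Y and iterate
        if (!length(avg_res) || max(avg_res) < threshold) break
        worst <- as.integer(names(which.max(avg_res)))
        idx_w <- which(cl == worst)
        if (all(is.na(y[idx_w]))) break   # nothing new to remove
        y[idx_w] <- NA
      }
      which(is.na(y) | cl == 0)           # indices flagged as outliers
    }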

Authors:

Juan Manuel Mayén Gijón, Head of Data Science at PricingHUB, and Pablo Romero Gómez, professor at the University of Malaga and a frequent collaborator of PricingHUB.
