Understanding data with time series decomposition

2 June 2023

A lot of marketing data comes as a time series. Website traffic per day, ad spend by month, products sold per quarter – all of these can be classed as time series data. This post will show you how to explore the trend and seasonality of your data quickly and accurately.

Time-based data isn’t always the simplest to interpret. A line chart showing traffic over time can be easy to read, but you’ll have to dig deeper to spot subtle shifts in trend or seasonality. This is an essential skill if you’re doing the type of analysis that requires forecasting and projections.

What is time series decomposition?

Time series decomposition is a statistical technique used to break down a time series into its constituent components: trend, seasonality, and remainder. Splitting data out like this allows us to study the individual components for patterns and variations, which we might otherwise miss.

For an interesting example we’ll use the total page views of Wikipedia from 2019 to late May 2023, shown in the line chart below:

We can see straight away that there looks to have been some sort of anomaly in the first half of 2020, with daily views rocketing up to over 700 million on some days. At this level there’s not enough information to consider why this blip occurred, though anecdotally there’s a chance it could be COVID related.

To figure out what’s normal and what’s noise we need to decompose this time series data, and we can do this with just a few lines of code.

Components of time series decomposition:

Trend

The trend component represents the long-term, systematic movement of a time series. It captures the overall direction and magnitude of change over an extended period, rather than short little windows of time. The trend is what we can rely on to highlight growth or decline, and is the easiest component to interpret.

Seasonality

Seasonality refers to regular, repetitive patterns that occur within a time series at fixed intervals. These patterns can be daily, weekly, monthly, or any other recurring cycle. Extracting seasonality allows us to examine the recurring patterns and understand how the data varies within each season or cycle. (Peaks in November, troughs during Christmas etc.)

Residual

The residual, also known as the remainder / irregular / random component, represents the unpredictable and unexplained fluctuations remaining after removing the trend and seasonality components. It encompasses the noise or random variability in the data that cannot be attributed to any systematic pattern. Analysing the residual component is valuable for identifying unusual events or unexpected behaviour in the time series. (Such as changes in customer behaviour due to COVID lockdowns.)
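To make the three components concrete, here is a minimal R sketch that builds a made-up monthly series by adding a trend, a repeating seasonal pattern and some random noise together. All of the numbers are invented for illustration (this is not the Wikipedia data), and it assumes an additive relationship between the components, which is also the default assumption in the R functions used later in this post.

```r
# Synthetic monthly data, invented purely for illustration (not the Wikipedia data)
set.seed(42)
n_years <- 4
n <- n_years * 12

trend     <- seq(100, 160, length.out = n)                  # slow upward drift
seasonal  <- rep(10 * sin(2 * pi * (1:12) / 12), n_years)   # same pattern every year
remainder <- rnorm(n, mean = 0, sd = 3)                     # unexplained noise

# Additive model: observed = trend + seasonal + remainder
observed <- ts(trend + seasonal + remainder, start = c(2019, 1), frequency = 12)
```

Running plot(observed) draws the combined series. Decomposition is simply the reverse of this process: starting from the observed values and estimating the three ingredients.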

Methods of time series decomposition:

There are several ways to decompose your data. If you had a lot of time to kill, you might try and tackle it manually. If you were especially mad, you might even try this in a spreadsheet, which would likely involve calculating a lot of moving averages, and then some other long-winded process to get the data you actually want.
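For the curious, the manual route looks roughly like the sketch below. It uses a synthetic monthly series (invented for illustration) and follows the classical moving-average recipe: smooth the data to get the trend, average what’s left by calendar month to get the seasonality, and treat the rest as the remainder. The variable names are my own, and this is a sketch of the general idea rather than a recommended workflow.

```r
# Synthetic monthly series, invented for illustration only
set.seed(1)
n <- 48
x <- ts(seq(100, 160, length.out = n) +
          rep(10 * sin(2 * pi * (1:12) / 12), n / 12) +
          rnorm(n, sd = 3),
        start = c(2019, 1), frequency = 12)

f <- frequency(x)

# 1. Trend: centred moving average spanning one full year.
#    With an even period the two end points get half weight.
ma_weights   <- c(0.5, rep(1, f - 1), 0.5) / f
manual_trend <- stats::filter(x, ma_weights, sides = 2)

# 2. Seasonality: average the detrended values for each calendar month,
#    then centre the twelve figures so that they average to zero.
detrended <- as.numeric(x - manual_trend)
month     <- as.integer(cycle(x))
figures   <- tapply(detrended, month, mean, na.rm = TRUE)
figures   <- figures - mean(figures)
manual_seasonal <- ts(as.numeric(figures[month]), start = start(x), frequency = f)

# 3. Remainder: whatever is left over.
manual_remainder <- x - manual_trend - manual_seasonal
```

Even in R this takes a fair few lines, which gives a sense of how painful the spreadsheet equivalent would be.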

Fortunately, there are two very quick ways to decompose time series data in R, using either the “decompose” function, or achieving a more comprehensive result using the “stl” function.

Decomposition using the decompose function in R:

This function performs a classical decomposition using moving averages. To do this you simply take your data in a Time-Series format (a “ts” object whose frequency is set to the number of observations in one seasonal cycle) and run it through the decompose function. It’s so basic that adding any additional arguments besides the name of your data set isn’t normally needed (although the documentation, available by running ?decompose, covers them).

For example, let’s say that my Wikipedia page views data was called “wiki_views”, the function run in R would look like this:

decomp <- decompose(wiki_views)

This runs the function and stores the result in a list called “decomp”. A second command of

plot(decomp)

will generate a rather ugly looking chart like the one below:

This method is quick and dirty, and doesn’t provide us with random or trend values for the start or end of our data – which isn’t helpful if we want to analyse something that is currently happening.
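The missing values at the edges are easy to see for yourself. The sketch below uses synthetic daily data as a stand-in for page views (the numbers are made up, and the names views, wiki_views_demo and decomp_demo are purely illustrative). It also shows one way of getting daily data into a Time-Series format, assuming a yearly seasonal cycle of 365 days.

```r
# Synthetic daily data standing in for page views (made up, not the real figures)
set.seed(123)
n_days <- 365 * 4
day    <- seq_len(n_days)
views  <- 500 + 0.02 * day +             # gentle upward trend
  40 * sin(2 * pi * day / 365) +         # yearly cycle
  rnorm(n_days, sd = 15)                 # noise

# frequency = 365 tells R that one seasonal cycle lasts a year of daily values
wiki_views_demo <- ts(views, start = c(2019, 1), frequency = 365)

decomp_demo <- decompose(wiki_views_demo)   # additive by default

# The centred moving average cannot be computed near the edges,
# so the trend (and therefore the random component) is NA there
sum(is.na(decomp_demo$trend))   # roughly half a year is missing at each end
head(decomp_demo$trend, 3)
```

With a yearly cycle on daily data, that means around six months of trend values are lost at each end of the series.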

Better time series decomposition with LOESS

LOESS is a form of local regression (the name is usually expanded as “locally estimated scatterplot smoothing”), and in R it powers the “stl” function. It might sound like an odd name for a function but this is also an abbreviation: the R documentation describes it as Seasonal Decomposition of Time Series by Loess. This function combines local regression with seasonal decomposition and can handle non-linear trends as well as seasonality that changes gradually over time.

To decompose your Time-Series formatted data all that’s needed is to run the stl function, setting the s.window to periodic. This assumes the seasonal pattern is the same in every cycle; supplying an odd number instead lets the pattern evolve over time.

decomp_loess <- stl(wiki_views, s.window = "periodic")

Once this has run you’re free to play with your data. Below is a more beautiful example of the seasonality, trend and remainder than what we had earlier. We can see that the coverage is much better for the trend and remainder values. It’s also much easier to tell a story using these components:
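If you want to pull the individual components out of the result, a minimal sketch (again on synthetic daily data, with illustrative names) might look like this. The robust = TRUE argument is optional and is my addition here; it reduces the influence of outliers on the fitted trend and seasonality.

```r
# Synthetic daily data, invented for illustration
set.seed(123)
n_days <- 365 * 4
day    <- seq_len(n_days)
views  <- ts(500 + 0.02 * day + 40 * sin(2 * pi * day / 365) + rnorm(n_days, sd = 15),
             start = c(2019, 1), frequency = 365)

# robust = TRUE is optional; it down-weights outliers when fitting
decomp_loess_demo <- stl(views, s.window = "periodic", robust = TRUE)

# The three components are stored as columns of the time.series matrix
components <- decomp_loess_demo$time.series
colnames(components)   # "seasonal" "trend" "remainder"
anyNA(components)      # FALSE: values exist right up to both ends of the series

trend_demo     <- components[, "trend"]
remainder_demo <- components[, "remainder"]
```

Calling plot(decomp_loess_demo) gives the standard four-panel chart, and the extracted columns can be passed to whichever charting tool you prefer.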

Quick insights from the decomposition:

The trend shows us that Wikipedia’s page views were climbing steadily throughout late 2019, and that this growth had flattened off by 2021. A downward trend then continued well into April 2022 – at which point things begin to improve again.

The seasonality shows us that underneath a noisy layer of data are detectable, cyclical patterns. There’s the expected crash in page views on the 24th and 25th of December which is very common to see in traffic trends.

Interestingly, there’s a common pattern of rapid decline from the first week of June to the last week of July. From there usage builds steadily to a short drop in December, which springs right up to the peak January-May period.

The remainder shows us a clear spike in page views in early 2020, which would normally be worth investigating if there was the opportunity to dig deeper.

The data overall shows us a pretty clear picture of a surge in demand throughout early 2020 that was not anticipated. While we can’t prove it without proper analysis, we can afford ourselves a few theories based on anecdotal evidence.

It could well be the case that Wikipedia’s commitment to neutral and evidence-based content made it a trusted resource to depend upon for many people across the globe as various COVID lockdowns came into play. This could explain the sudden and prolonged jump in page views during the height of the COVID crisis.

Perhaps unrelated to COVID, there was a steady decline in traffic throughout much of 2021, leading to a depressed 2022. Traffic to the site appears to be gradually trending upwards again as we progress through 2023.

Despite the overall trend, seasonal patterns are prominent. There are obvious quiet periods during the Christmas break. The drop in June-July periods could potentially be a compounding of various vacation periods (particularly in the Northern Hemisphere), summer breaks for schools and universities, as well as cultural and sporting events across different regions.

During this time, students and educators are on vacation, which can lead to a decrease in overall internet usage for educational purposes. As a result, people may have less need to access Wikipedia during these months.

Seasonally Adjusted Data

Once we know the seasonality of a time series we can remove this element from the data. This is a useful way to try and show overall progress without becoming distracted by seasonal peaks and troughs.

To do this we simply subtract the seasonal element from the observed data. In this example it would be something like:

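wikiSeasonAdj <- wiki_views - decomp_loess$time.series[, "seasonal"]

Here we’ve created a new Time-Series object called wikiSeasonAdj by subtracting the seasonal component from our original time series. Note that stl stores its components as columns of a matrix called time.series, so the seasonal values are accessed with decomp_loess$time.series[, "seasonal"] rather than decomp_loess$seasonal (the shorter form only works on the output of decompose). The chart below shows the original and the seasonally adjusted data – observe the absence of Christmas and summer vacation crashes: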
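As a quick sanity check, the sketch below runs the same adjustment on a small synthetic monthly series (the values and the seasonal pattern are invented) for both decomposition methods. The object names are illustrative only.

```r
# Synthetic monthly series with a made-up seasonal pattern
set.seed(7)
n <- 60
x <- ts(200 + 0.5 * seq_len(n) +
          rep(c(5, 8, 3, 0, -2, -9, -12, -6, 0, 4, 10, -1), n / 12) +
          rnorm(n, sd = 2),
        start = c(2019, 1), frequency = 12)

# STL: components live in the time.series matrix
fit_stl <- stl(x, s.window = "periodic")
adj_stl <- x - fit_stl$time.series[, "seasonal"]

# Classical decomposition: components are list elements
fit_classic <- decompose(x)
adj_classic <- x - fit_classic$seasonal
```

In the additive case, what remains after the subtraction is simply the trend plus the remainder, which is why the adjusted line looks like a noisier version of the trend.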

Benefits and applications of time series decomposition:

As we’ve just demonstrated above, time series decomposition offers several advantages and applications that help us understand our data better.

Trend and Seasonality Analysis:

By isolating the trend and seasonality components, analysts can observe long-term trends, cyclic patterns, and season-specific effects within the data. This understanding can be instrumental in making informed decisions, predicting future behaviour, and identifying recurring patterns for strategic planning.

Projections / Forecasting:

Decomposed time series components can assist in forecasting, and are a big step up from crudely smoothed manual attempts in a spreadsheet. By modelling and predicting the trend and seasonal components individually, we can combine the forecasts to obtain the final prediction, with the remainder giving a sense of how much unexplained variation to expect around it. This approach can improve forecasting accuracy because each component is handled on its own terms.
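A deliberately simple sketch of that idea is shown below, using only base R and a synthetic monthly series. It assumes the recent trend can be extended as a straight line and that the seasonal pattern repeats unchanged; both are simplifying assumptions of mine rather than a recommended forecasting method, and all names are illustrative.

```r
# Synthetic monthly series, invented for illustration only
set.seed(2023)
n <- 60
f <- 12
x <- ts(200 + 0.5 * seq_len(n) +
          rep(c(5, 8, 3, 0, -2, -9, -12, -6, 0, 4, 10, -1), n / f) +
          rnorm(n, sd = 2),
        start = c(2019, 1), frequency = f)

fit      <- stl(x, s.window = "periodic")
trend    <- as.numeric(fit$time.series[, "trend"])
seasonal <- as.numeric(fit$time.series[, "seasonal"])

h <- 12  # forecast horizon: 12 months ahead

# Trend: straight line fitted to the most recent two years, then extended
recent    <- data.frame(t = (n - 23):n, y = trend[(n - 23):n])
trend_fit <- lm(y ~ t, data = recent)
trend_fc  <- predict(trend_fit, newdata = data.frame(t = n + seq_len(h)))

# Seasonality: with s.window = "periodic" the pattern repeats exactly,
# so the last full cycle can simply be carried forward
seasonal_fc <- rep_len(tail(seasonal, f), h)

# Recombine; the remainder is treated as unpredictable noise around zero
forecast_demo <- ts(as.numeric(trend_fc) + seasonal_fc,
                    start = tsp(x)[2] + 1 / f, frequency = f)
```

For real work a dedicated forecasting package would be a better choice, but the sketch shows how the separate pieces slot back together.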

Anomaly Detection:

Analysing the residual/remainder component can help identify unusual behaviour in the time series. Unanticipated events and outliers show up as remainder values that sit much further from zero than the usual random fluctuations, which flags the dates worth investigating.
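One simple way to put this into practice is sketched below: decompose the series, then flag any remainder values that sit unusually far from the median. The data is synthetic with one artificial spike injected on day 200, and the cut-off of three scaled median absolute deviations is an arbitrary choice of mine for illustration, not a standard you have to follow.

```r
# Synthetic daily series with a weekly pattern, invented for illustration
set.seed(99)
n   <- 7 * 52
day <- seq_len(n)
x   <- 100 + 0.05 * day + 10 * sin(2 * pi * day / 7) + rnorm(n, sd = 2)
x[200] <- x[200] + 60          # inject one artificial spike on day 200
x <- ts(x, frequency = 7)

# robust = TRUE limits how much the spike distorts the trend and seasonal fits
fit       <- stl(x, s.window = "periodic", robust = TRUE)
remainder <- as.numeric(fit$time.series[, "remainder"])

# Flag points more than 3 scaled median absolute deviations from the median
# (the threshold of 3 is an arbitrary choice for this sketch)
threshold <- 3 * mad(remainder)
anomalies <- which(abs(remainder - median(remainder)) > threshold)
anomalies
```

The injected spike should appear in the list, possibly alongside a handful of ordinary noisy days. Tightening or loosening the threshold trades missed events against false alarms.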

Conclusion:

Time series decomposition is quick and easy once you know how to do it. It enables us to uncover the underlying components of a time series, which are the trend, seasonality, and residual. By examining these components individually we gain a deeper understanding of the patterns, cyclic behaviour, and random fluctuations present in the data. Knowing this makes it easier to tell a story with data.

This knowledge can aid in forecasting, trend analysis, anomaly detection, and informed decision-making across various domains. By harnessing the power of time series decomposition techniques like STL, analysts can unlock valuable insights and make data-driven decisions based on a comprehensive understanding of their time-varying data.

The data for this example analysis is publicly available: records of page views to different areas of Wikipedia can be found in the bigquery-public-data.wikipedia data set and go back to 2015. This is terabytes of data, with around 52 billion records per year.

If you’re thirsty for the methodology behind how STL works, the 1990 paper in the Journal of Official Statistics titled “STL: A Seasonal-Trend Decomposition Procedure Based on Loess” goes into great detail.