Artificial Intelligence , Machine Learning and Data Science Hubspot

Unlock the Power of Artificial Intelligence, Machine Learning, and Data Science with our Blog Discover the latest insights, trends, and innovations in Artificial Intelligence (AI), Machine Learning (ML), and Data Science through our informative and engaging Hubspot blog. Gain a deep understanding of how these transformative technologies are shaping industries and revolutionizing the way we work. Stay updated with cutting-edge advancements, practical applications, and real-world use.

Wednesday, 31 July 2024

How To Resample and Interpolate Your Time Series Data With Python

You may have observations at the wrong frequency.

Maybe they are too granular or not granular enough. The Pandas library in Python provides the capability to change the frequency of your time series data.

In this tutorial, you will discover how to use Pandas in Python to both increase and decrease the sampling frequency of time series data.

After completing this tutorial, you will know:

About time series resampling, the two types of resampling, and the 2 main reasons why you need to use them.
How to use Pandas to upsample time series data to a higher frequency and interpolate the new observations.
How to use Pandas to downsample time series data to a lower frequency and summarize the higher frequency observations.

Kick-start your project with my new book Time Series Forecasting With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Update Dec/2016: Fixed definitions of upsample and downsample.
Updated Apr/2019: Updated the link to dataset.

How To Resample and Interpolate Your Time Series Data With Python
Photo by sung ming whang, some rights reserved.

Resampling

Resampling involves changing the frequency of your time series observations.

Two types of resampling are:

Upsampling: Where you increase the frequency of the samples, such as from minutes to seconds.
Downsampling: Where you decrease the frequency of the samples, such as from days to months.

In both cases, data must be invented.

In the case of upsampling, care may be needed in determining how the fine-grained observations are calculated using interpolation. In the case of downsampling, care may be needed in selecting the summary statistics used to calculate the new aggregated values.

There are perhaps two main reasons why you may be interested in resampling your time series data:

Problem Framing: Resampling may be required if your data is not available at the same frequency that you want to make predictions.
Feature Engineering: Resampling can also be used to provide additional structure or insight into the learning problem for supervised learning models.

There is a lot of overlap between these two cases.

For example, you may have daily data and want to predict a monthly problem. You could use the daily data directly or you could downsample it to monthly data and develop your model.

A feature engineering perspective may use observations and summaries of observations from both time scales and more in developing a model.

Let’s make resampling more concrete by looking at a real dataset and some examples.

Stop learning Time Series Forecasting the slow way!

Take my free 7-day email course and discover how to get started (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Shampoo Sales Dataset

This dataset describes the monthly number of sales of shampoo over a 3 year period.

The units are a sales count and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright, and Hyndman (1998).

Download the dataset.

Below is a sample of the first 5 rows of data, including the header row.

"Month","Sales"

"1-01",266.0

"1-02",145.9

"1-03",183.1

"1-04",119.3

"1-05",180.3

Below is a plot of the entire dataset.

Shampoo Sales Dataset

The dataset shows an increasing trend and possibly some seasonal components.

Load the Shampoo Sales Dataset

Download the dataset and place it in the current working directory with the filename “shampoo-sales.csv“.

Download the dataset.

The timestamps in the dataset do not have an absolute year, but do have a month. We can write a custom date parsing function to load this dataset and pick an arbitrary year, such as 1900, to baseline the years from.

Below is a snippet of code to load the Shampoo Sales dataset using the custom date parsing function from read_csv().

from pandas import read_csv

from pandas import datetime

from matplotlib import pyplot

def parser(x):

return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)

print(series.head())

series.plot()

pyplot.show()

Running this example loads the dataset and prints the first 5 rows. This shows the correct handling of the dates, baselined from 1900.

Month

1901-01-01 266.0

1901-02-01 145.9

1901-03-01 183.1

1901-04-01 119.3

1901-05-01 180.3

Name: Sales of shampoo over a three year period, dtype: float64

We also get a plot of the dataset, showing the rising trend in sales from month to month.

Plot of the Shampoo Sales Dataset

Upsample Shampoo Sales

The observations in the Shampoo Sales are monthly.

Imagine we wanted daily sales information. We would have to upsample the frequency from monthly to daily and use an interpolation scheme to fill in the new daily frequency.

The Pandas library provides a function called resample() on the Series and DataFrame objects. This can be used to group records when downsampling and making space for new observations when upsampling.

We can use this function to transform our monthly dataset into a daily dataset by calling resampling and specifying the preferred frequency of calendar day frequency or “D”.

Pandas is clever and you could just as easily specify the frequency as “1D” or even something domain specific, such as “5D.” See the further reading section at the end of the tutorial for the list of aliases that you can use.

from pandas import read_csv

from pandas import datetime

def parser(x):

return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)

upsampled = series.resample('D')

print(upsampled.head(32))

Running this example prints the first 32 rows of the upsampled dataset, showing each day of January and the first day of February.

Month

1901-01-01 266.0

1901-01-02 NaN

1901-01-03 NaN

1901-01-04 NaN

1901-01-05 NaN

1901-01-06 NaN

1901-01-07 NaN

1901-01-08 NaN

1901-01-09 NaN

1901-01-10 NaN

1901-01-11 NaN

1901-01-12 NaN

1901-01-13 NaN

1901-01-14 NaN

1901-01-15 NaN

1901-01-16 NaN

1901-01-17 NaN

1901-01-18 NaN

1901-01-19 NaN

1901-01-20 NaN

1901-01-21 NaN

1901-01-22 NaN

1901-01-23 NaN

1901-01-24 NaN

1901-01-25 NaN

1901-01-26 NaN

1901-01-27 NaN

1901-01-28 NaN

1901-01-29 NaN

1901-01-30 NaN

1901-01-31 NaN

1901-02-01 145.9

We can see that the resample() function has created the rows by putting NaN values in the new values. We can see we still have the sales volume on the first of January and February from the original data.

Next, we can interpolate the missing values at this new frequency.

The Series Pandas object provides an interpolate() function to interpolate missing values, and there is a nice selection of simple and more complex interpolation functions. You may have domain knowledge to help choose how values are to be interpolated.

A good starting point is to use a linear interpolation. This draws a straight line between available data, in this case on the first of the month, and fills in values at the chosen frequency from this line.

from pandas import read_csv

from pandas import datetime

def parser(x):

return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)

upsampled = series.resample('D')

interpolated = upsampled.interpolate(method='linear')

print(interpolated.head(32))

Running this example, we can see interpolated values.

Month

1901-01-01 266.000000

1901-01-02 262.125806

1901-01-03 258.251613

1901-01-04 254.377419

1901-01-05 250.503226

1901-01-06 246.629032

1901-01-07 242.754839

1901-01-08 238.880645

1901-01-09 235.006452

1901-01-10 231.132258

1901-01-11 227.258065

1901-01-12 223.383871

1901-01-13 219.509677

1901-01-14 215.635484

1901-01-15 211.761290

1901-01-16 207.887097

1901-01-17 204.012903

1901-01-18 200.138710

1901-01-19 196.264516

1901-01-20 192.390323

1901-01-21 188.516129

1901-01-22 184.641935

1901-01-23 180.767742

1901-01-24 176.893548

1901-01-25 173.019355

1901-01-26 169.145161

1901-01-27 165.270968

1901-01-28 161.396774

1901-01-29 157.522581

1901-01-30 153.648387

1901-01-31 149.774194

1901-02-01 145.900000

Looking at a line plot, we see no difference from plotting the original data as the plot already interpolated the values between points to draw the line.

Shampoo Sales Interpolated Linear

Another common interpolation method is to use a polynomial or a spline to connect the values.

This creates more curves and can look more natural on many datasets. Using a spline interpolation requires you specify the order (number of terms in the polynomial); in this case, an order of 2 is just fine.

from pandas import read_csv

from pandas import datetime

from matplotlib import pyplot

def parser(x):

return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)

upsampled = series.resample('D')

interpolated = upsampled.interpolate(method='spline', order=2)

print(interpolated.head(32))

interpolated.plot()

pyplot.show()

Running the example, we can first review the raw interpolated values.

Month

1901-01-01 266.000000

1901-01-02 258.630160

1901-01-03 251.560886

1901-01-04 244.720748

1901-01-05 238.109746

1901-01-06 231.727880

1901-01-07 225.575149

1901-01-08 219.651553

1901-01-09 213.957094

1901-01-10 208.491770

1901-01-11 203.255582

1901-01-12 198.248529

1901-01-13 193.470612

1901-01-14 188.921831

1901-01-15 184.602185

1901-01-16 180.511676

1901-01-17 176.650301

1901-01-18 173.018063

1901-01-19 169.614960

1901-01-20 166.440993

1901-01-21 163.496161

1901-01-22 160.780465

1901-01-23 158.293905

1901-01-24 156.036481

1901-01-25 154.008192

1901-01-26 152.209039

1901-01-27 150.639021

1901-01-28 149.298139

1901-01-29 148.186393

1901-01-30 147.303783

1901-01-31 146.650308

1901-02-01 145.900000

Reviewing the line plot, we can see more natural curves on the interpolated values.

Shampoo Sales Interpolated Spline

Generally, interpolation is a useful tool when you have missing observations.

Next, we will consider resampling in the other direction and decreasing the frequency of observations.

Downsample Shampoo Sales

The sales data is monthly, but perhaps we would prefer the data to be quarterly.

The year can be divided into 4 business quarters, 3 months a piece.

Instead of creating new rows between existing observations, the resample() function in Pandas will group all observations by the new frequency.

We could use an alias like “3M” to create groups of 3 months, but this might have trouble if our observations did not start in January, April, July, or October. Pandas does have a quarter-aware alias of “Q” that we can use for this purpose.

We must now decide how to create a new quarterly value from each group of 3 records. A good starting point is to calculate the average monthly sales numbers for the quarter. For this, we can use the mean() function.

Putting this all together, we get the following code example.

from pandas import read_csv

from pandas import datetime

from matplotlib import pyplot

def parser(x):

return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)

resample = series.resample('Q')

quarterly_mean_sales = resample.mean()

print(quarterly_mean_sales.head())

quarterly_mean_sales.plot()

pyplot.show()

Running the example prints the first 5 rows of the quarterly data.

Month

1901-03-31 198.333333

1901-06-30 156.033333

1901-09-30 216.366667

1901-12-31 215.100000

1902-03-31 184.633333

Freq: Q-DEC, Name: Sales, dtype: float64

We also plot the quarterly data, showing Q1-Q4 across the 3 years of original observations.

Shampoo Sales Downsampled Quarterly

Perhaps we want to go further and turn the monthly data into yearly data, and perhaps later use that to model the following year.

We can downsample the data using the alias “A” for year-end frequency and this time use sum to calculate the total sales each year.

from pandas import read_csv

from pandas import datetime

from matplotlib import pyplot

def parser(x):

return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)

resample = series.resample('A')

quarterly_mean_sales = resample.sum()

print(quarterly_mean_sales.head())

quarterly_mean_sales.plot()

pyplot.show()

Running the example shows the 3 records for the 3 years of observations.

We also get a plot, correctly showing the year along the x-axis and the total number of sales per year along the y-axis.

Shampoo Sales Downsampled Yearly Sum

Summary

In this tutorial, you discovered how to resample your time series data using Pandas in Python.

Specifically, you learned:

About time series resampling and the difference and reasons between downsampling and upsampling observation frequencies.
How to upsample time series data using Pandas and how to use different interpolation schemes.
How to downsample time series data using Pandas and how to summarize grouped data.

Do you have any questions about resampling or interpolating time series data or about this tutorial?
Ask your questions in the comments and I will do my best to answer them.

Basic Feature Engineering With Time Series Data in Python

Time Series data must be re-framed as a supervised learning dataset before we can start using machine learning algorithms.

There is no concept of input and output features in time series. Instead, we must choose the variable to be predicted and use feature engineering to construct all of the inputs that will be used to make predictions for future time steps.

In this tutorial, you will discover how to perform feature engineering on time series data with Python to model your time series problem with machine learning algorithms.

After completing this tutorial, you will know:

The rationale and goals of feature engineering time series data.
How to develop basic date-time based input features.
How to develop more sophisticated lag and sliding window summary statistics features.

Kick-start your project with my new book Time Series Forecasting With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s dive in.

Updated Jun/2017: Fixed a typo in the expanding window code example.
Updated Apr/2019: Updated the link to dataset.
Updated Aug/2019: Updated data loading to use new API.
Updated Sep/2019: Fixed bug in data loading.

Basic Feature Engineering With Time Series Data in Python
Photo by José Morcillo Valenciano, some rights reserved.

Feature Engineering for Time Series

A time series dataset must be transformed to be modeled as a supervised learning problem.

That is something that looks like:

time 1, value 1

time 2, value 2

time 3, value 3

To something that looks like:

input 1, output 1

input 2, output 2

input 3, output 3

So that we can train a supervised learning algorithm.

Input variables are also called features in the field of machine learning, and the task before us is to create or invent new input features from our time series dataset. Ideally, we only want input features that best help the learning methods model the relationship between the inputs (X) and the outputs (y) that we would like to predict.

In this tutorial, we will look at three classes of features that we can create from our time series dataset:

Date Time Features: these are components of the time step itself for each observation.
Lag Features: these are values at prior time steps.
Window Features: these are a summary of values over a fixed window of prior time steps.

Before we dive into methods for creating input features from our time series data, let’s first review the goal of feature engineering.

Stop learning Time Series Forecasting the slow way!

Take my free 7-day email course and discover how to get started (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Goal of Feature Engineering

The goal of feature engineering is to provide strong and ideally simple relationships between new input features and the output feature for the supervised learning algorithm to model.

In effect, we are are moving complexity.

Complexity exists in the relationships between the input and output data. In the case of time series, there is no concept of input and output variables; we must invent these too and frame the supervised learning problem from scratch.

We may lean on the capability of sophisticated models to decipher the complexity of the problem. We can make the job for these models easier (and even use simpler models) if we can better expose the inherent relationship between inputs and outputs in the data.

The difficulty is that we do not know the underlying inherent functional relationship between inputs and outputs that we’re trying to expose. If we did know, we probably would not need machine learning.

Instead, the only feedback we have is the performance of models developed on the supervised learning datasets or “views” of the problem we create. In effect, the best default strategy is to use all the knowledge available to create many good datasets from your time series dataset and use model performance (and other project requirements) to help determine what good features and good views of your problem happen to be.

For clarity, we will focus on a univariate (one variable) time series dataset in the examples, but these methods are just as applicable to multivariate time series problems. Next, let’s take a look at the dataset we will use in this tutorial.

Minimum Daily Temperatures Dataset

In this post, we will use the Minimum Daily Temperatures dataset.

This dataset describes the minimum daily temperatures over 10 years (1981-1990) in Melbourne, Australia.

The units are in degrees Celsius and there are 3,650 observations. The source of the data is credited as the Australian Bureau of Meteorology.

Download the dataset.

Below is a sample of the first 5 rows of data, including the header row.

"Date","Temperature"

"1981-01-01",20.7

"1981-01-02",17.9

"1981-01-03",18.8

"1981-01-04",14.6

"1981-01-05",15.8

Below is a plot of the entire dataset.

Minimum Daily Temperatures

The dataset shows an increasing trend and possibly some seasonal components.

Download the dataset.

Date Time Features

Let’s start with some of the simplest features that we can use.

These are features from the date/time of each observation. In fact, these can start off simply and head off into quite complex domain-specific areas.

Two features that we can start with are the integer month and day for each observation. We can imagine that supervised learning algorithms may be able to use these inputs to help tease out time-of-year or time-of-month type seasonality information.

The supervised learning problem we are proposing is to predict the daily minimum temperature given the month and day, as follows:

Month, Day, Temperature

We can do this using Pandas. First, the time series is loaded as a Pandas Series. We then create a new Pandas DataFrame for the transformed dataset.

Next, each column is added one at a time where month and day information is extracted from the time-stamp information for each observation in the series.

Below is the Python code to do this.

# create date time features of a dataset

from pandas import read_csv

from pandas import DataFrame

series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True, squeeze=True)

dataframe = DataFrame()

dataframe['month'] = [series.index[i].month for i in range(len(series))]

dataframe['day'] = [series.index[i].day for i in range(len(series))]

dataframe['temperature'] = [series[i] for i in range(len(series))]

print(dataframe.head(5))

Running this example prints the first 5 rows of the transformed dataset.

month day temperature

0 1 1 20.7

1 1 2 17.9

2 1 3 18.8

3 1 4 14.6

4 1 5 15.8

Using just the month and day information alone to predict temperature is not sophisticated and will likely result in a poor model. Nevertheless, this information coupled with additional engineered features may ultimately result in a better model.

You may enumerate all the properties of a time-stamp and consider what might be useful for your problem, such as:

Minutes elapsed for the day.
Hour of day.
Business hours or not.
Weekend or not.
Season of the year.
Business quarter of the year.
Daylight savings or not.
Public holiday or not.
Leap year or not.

From these examples, you can see that you’re not restricted to the raw integer values. You can use binary flag features as well, like whether or not the observation was recorded on a public holiday.

In the case of the minimum temperature dataset, maybe the season would be more relevant. It is creating domain-specific features like this that are more likely to add value to your model.

Date-time based features are a good start, but it is often a lot more useful to include the values at previous time steps. These are called lagged values and we will look at adding these features in the next section.

Lag Features

Lag features are the classical way that time series forecasting problems are transformed into supervised learning problems.

The simplest approach is to predict the value at the next time (t+1) given the value at the previous time (t-1). The supervised learning problem with shifted values looks as follows:

Value(t-1), Value(t+1)

The Pandas library provides the shift() function to help create these shifted or lag features from a time series dataset. Shifting the dataset by 1 creates the t-1 column, adding a NaN (unknown) value for the first row. The time series dataset without a shift represents the t+1.

Let’s make this concrete with an example. The first 3 values of the temperature dataset are 20.7, 17.9, and 18.8. The shifted and unshifted lists of temperatures for the first 3 observations are therefore:

Shifted, Original

NaN, 20.7

20.7, 17.9

17.9, 18.8

We can concatenate the shifted columns together into a new DataFrame using the concat() function along the column axis (axis=1).

Putting this all together, below is an example of creating a lag feature for our daily temperature dataset. The values are extracted from the loaded series and a shifted and unshifted list of these values is created. Each column is also named in the DataFrame for clarity.

from pandas import read_csv

from pandas import DataFrame

from pandas import concat

series = read_csv('daily-min-temperatures.csv', header=0, index_col=0)

temps = DataFrame(series.values)

dataframe = concat([temps.shift(1), temps], axis=1)

dataframe.columns = ['t-1', 't+1']

print(dataframe.head(5))

Running the example prints the first 5 rows of the new dataset with the lagged feature.

t-1 t+1

0 NaN 20.7

1 20.7 17.9

2 17.9 18.8

3 18.8 14.6

4 14.6 15.8

You can see that we would have to discard the first row to use the dataset to train a supervised learning model, as it does not contain enough data to work with.

The addition of lag features is called the sliding window method, in this case with a window width of 1. It is as though we are sliding our focus along the time series for each observation with an interest in only what is within the window width.

We can expand the window width and include more lagged features. For example, below is the above case modified to include the last 3 observed values to predict the value at the next time step.

from pandas import read_csv

from pandas import DataFrame

from pandas import concat

series = read_csv('daily-min-temperatures.csv', header=0, index_col=0)

temps = DataFrame(series.values)

dataframe = concat([temps.shift(3), temps.shift(2), temps.shift(1), temps], axis=1)

dataframe.columns = ['t-3', 't-2', 't-1', 't+1']

print(dataframe.head(5))

Running this example prints the first 5 rows of the new lagged dataset.

t-3 t-2 t-1 t+1

0 NaN NaN NaN 20.7

1 NaN NaN 20.7 17.9

2 NaN 20.7 17.9 18.8

3 20.7 17.9 18.8 14.6

4 17.9 18.8 14.6 15.8

Again, you can see that we must discard the first few rows that do not have enough data to train a supervised model.

A difficulty with the sliding window approach is how large to make the window for your problem.

Perhaps a good starting point is to perform a sensitivity analysis and try a suite of different window widths to in turn create a suite of different “views” of your dataset and see which results in better performing models. There will be a point of diminishing returns.

Additionally, why stop with a linear window? Perhaps you need a lag value from last week, last month, and last year. Again, this comes down to the specific domain.

In the case of the temperature dataset, a lag value from the same day in the previous year or previous few years may be useful.

We can do more with a window than include the raw values. In the next section, we’ll look at including features that summarize statistics across the window.

Rolling Window Statistics

A step beyond adding raw lagged values is to add a summary of the values at previous time steps.

We can calculate summary statistics across the values in the sliding window and include these as features in our dataset. Perhaps the most useful is the mean of the previous few values, also called the rolling mean.

For example, we can calculate the mean of the previous two values and use that to predict the next value. For the temperature data, we would have to wait 3 time steps before we had 2 values to take the average of before we could use that value to predict a 3rd value.

For example:

mean(t-2, t-1), t+1

mean(20.7, 17.9), 18.8

19.3, 18.8

Pandas provides a rolling() function that creates a new data structure with the window of values at each time step. We can then perform statistical functions on the window of values collected for each time step, such as calculating the mean.

First, the series must be shifted. Then the rolling dataset can be created and the mean values calculated on each window of two values.

Here are the values in the first three rolling windows:

#, Window Values

1, NaN

2, NaN, 20.7

3, 20.7, 17.9

This suggests that we will not have usable data until the 3rd row.

Finally, as in the previous section, we can use the concat() function to construct a new dataset with just our new columns.

The example below demonstrates how to do this with Pandas with a window size of 2.

from pandas import read_csv

from pandas import DataFrame

from pandas import concat

series = read_csv('daily-min-temperatures.csv', header=0, index_col=0)

temps = DataFrame(series.values)

shifted = temps.shift(1)

window = shifted.rolling(window=2)

means = window.mean()

dataframe = concat([means, temps], axis=1)

dataframe.columns = ['mean(t-2,t-1)', 't+1']

print(dataframe.head(5))

Running the example prints the first 5 rows of the new dataset. We can see that the first two rows are not useful.

The first NaN was created by the shift of the series.
The second because NaN cannot be used to calculate a mean value.
Finally, the third row shows the expected value of 19.30 (the mean of 20.7 and 17.9) used to predict the 3rd value in the series of 18.8.

mean(t-2,t-1) t+1

0 NaN 20.7

1 NaN 17.9

2 19.30 18.8

3 18.35 14.6

4 16.70 15.8

There are more statistics we can calculate and even different mathematical ways of calculating the definition of the “window.”

Below is another example that shows a window width of 3 and a dataset comprised of more summary statistics, specifically the minimum, mean, and maximum value in the window.

You can see in the code that we are explicitly specifying the sliding window width as a named variable. This lets us use it both in calculating the correct shift of the series and in specifying the width of the window to the rolling() function.

In this case, the window width of 3 means we must shift the series forward by 2 time steps. This makes the first two rows NaN. Next, we need to calculate the window statistics with 3 values per window. It takes 3 rows before we even have enough data from the series in the window to start calculating statistics. The values in the first 5 windows are as follows:

#, Window Values

1, NaN

2, NaN, NaN

3, NaN, NaN, 20.7

4, NaN, 20.7, 17.9

5, 20.7, 17.9, 18.8

This suggests that we would not expect usable data until at least the 5th row (array index 4)

from pandas import read_csv

from pandas import DataFrame

from pandas import concat

series = read_csv('daily-min-temperatures.csv', header=0, index_col=0)

temps = DataFrame(series.values)

width = 3

shifted = temps.shift(width - 1)

window = shifted.rolling(window=width)

dataframe = concat([window.min(), window.mean(), window.max(), temps], axis=1)

dataframe.columns = ['min', 'mean', 'max', 't+1']

print(dataframe.head(5))

Running the code prints the first 5 rows of the new dataset.

We can spot check the correctness of the values on the 5th row (array index 4). We can see that indeed 17.9 is the minimum and 20.7 is the maximum of values in the window of [20.7, 17.9, 18.8].

min mean max t+1

0 NaN NaN NaN 20.7

1 NaN NaN NaN 17.9

2 NaN NaN NaN 18.8

3 NaN NaN NaN 14.6

4 17.9 19.133333 20.7 15.8

Expanding Window Statistics

Another type of window that may be useful includes all previous data in the series.

This is called an expanding window and can help with keeping track of the bounds of observable data. Like the rolling() function on DataFrame, Pandas provides an expanding() function that collects sets of all prior values for each time step.

These lists of prior numbers can be summarized and included as new features. For example, below are the lists of numbers in the expanding window for the first 5 time steps of the series:

#, Window Values

1, 20.7

2, 20.7, 17.9,

3, 20.7, 17.9, 18.8

4, 20.7, 17.9, 18.8, 14.6

5, 20.7, 17.9, 18.8, 14.6, 15.8

Again, you can see that we must shift the series one-time step to ensure that the output value we wish to predict is excluded from these window values. Therefore the input windows look as follows:

#, Window Values

1, NaN

2, NaN, 20.7

3, NaN, 20.7, 17.9,

4, NaN, 20.7, 17.9, 18.8

5, NaN, 20.7, 17.9, 18.8, 14.6

Thankfully, the statistical calculations exclude the NaN values in the expanding window, meaning no further modification is required.

Below is an example of calculating the minimum, mean, and maximum values of the expanding window on the daily temperature dataset.

# create expanding window features

from pandas import read_csv

from pandas import DataFrame

from pandas import concat

series = read_csv('daily-min-temperatures.csv', header=0, index_col=0)

temps = DataFrame(series.values)

window = temps.expanding()

dataframe = concat([window.min(), window.mean(), window.max(), temps.shift(-1)], axis=1)

dataframe.columns = ['min', 'mean', 'max', 't+1']

print(dataframe.head(5))

Running the example prints the first 5 rows of the dataset.

Spot checking the expanding minimum, mean, and maximum values shows the example having the intended effect.

min mean max t+1

0 20.7 20.700000 20.7 17.9

1 17.9 19.300000 20.7 18.8

2 17.9 19.133333 20.7 14.6

3 14.6 18.000000 20.7 15.8

4 14.6 17.560000 20.7 15.8

Summary

In this tutorial, you discovered how to use feature engineering to transform a time series dataset into a supervised learning dataset for machine learning.

Specifically, you learned:

The importance and goals of feature engineering time series data.
How to develop date-time and lag-based features.
How to develop sliding and expanding window summary statistic features.

Do you know of more feature engineering methods for time series?
Let me know in the comments below.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Artificial Intelligence , Machine Learning and Data Science Hubspot

Wednesday, 31 July 2024

How To Resample and Interpolate Your Time Series Data With Python

Resampling

Stop learning Time Series Forecasting the slow way!

Shampoo Sales Dataset

Load the Shampoo Sales Dataset

Upsample Shampoo Sales

Downsample Shampoo Sales

Further Reading

Summary

Basic Feature Engineering With Time Series Data in Python

Feature Engineering for Time Series

Stop learning Time Series Forecasting the slow way!

Goal of Feature Engineering

Minimum Daily Temperatures Dataset

Date Time Features

Lag Features

Rolling Window Statistics

Expanding Window Statistics

Summary

Report Abuse

Labels

"Donate for a Noble Cause