Real-world time series forecasting is challenging for many reasons, not least of which are problem features such as having multiple input variables, the requirement to predict multiple time steps, and the need to perform the same type of prediction for multiple physical sites.
In this post, you will discover a standardized yet complex time series forecasting problem that has these properties, but is small and sufficiently well understood that it can be used to explore and better understand methods for developing forecasting models on challenging datasets.
After reading this post, you will know:
- The competition and motivation for addressing the air-quality dataset.
- An overview of the defined prediction problem and the data challenges it covers.
- A description of the free data files that you can download and start working with immediately.
Kick-start your project with my new book Time Series Forecasting With Python, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
EMC Data Science Global Hackathon
The dataset was used as the basis of a Kaggle competition.
Specifically, a 24-hour hackathon hosted by Data Science London and Data Science Global, two organizations that no longer appear to exist six years later, as part of a Big Data Week event.
The competition involved a multi-thousand-dollar cash prize, and the dataset was provided by the Cook County, Illinois local government, suggesting all locations mentioned in the dataset are in that locality.
The motivation for the challenge was to develop a better model for predicting air quality, as described in the following excerpt from the competition description:
The EPA’s Air Quality Index is used daily by people suffering from asthma and other respiratory diseases to avoid dangerous levels of outdoor air pollutants, which can trigger attacks. According to the World Health Organisation there are now estimated to be 235 million people suffering from asthma. Globally, it is now the most common chronic disease among children, with incidence in the US doubling since 1980.
The competition description suggests that winning models could be used as the basis for a new air-quality prediction system, although it is not clear if any models were ever transitioned for this purpose.
The competition was won by a Kaggle employee, Ben Hamner, who presumably did not collect the prize given the conflict of interest. Ben described his winning approach in the blog post titled “Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon” and provided his code on GitHub.
There is also a good discussion of solutions and related code in the forum post titled “General approaches to partitioning the models?”.
Predictive Modeling Problem
The data describes a multi-step forecasting problem given a multivariate time series across multiple sites or physical locations.
Given multiple weather measurements over time, predict a sequence of air quality measurements at specific future time intervals across multiple physical locations.
It is a challenging time series forecasting problem that has a lot of the qualities of real-world forecasting:
- Incomplete data: Not all weather and air quality measures are available for all locations.
- Missing data: Not all available measures have a complete history.
- Multivariate inputs: The model inputs for each forecast comprise multiple weather observations.
- Multi-step outputs: The model outputs are a discontiguous sequence of forecasted air quality measures.
- Multi-site outputs: The model must output a multi-step forecast for multiple physical sites (a hypothetical sketch of this output structure is given after this list).
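To make the shape of the required output concrete, below is a purely illustrative sketch of how a multi-step, multi-site forecast might be held in memory; the lead times, site count, and target count used here are placeholders, not the values defined by the competition.

```python
import numpy as np

# hypothetical values for illustration only; the competition defines the real ones
lead_times = [1, 2, 3, 4, 5, 10, 24, 48, 72]  # discontiguous forecast hours ahead
n_sites = 3                                    # number of physical sites
n_targets = 2                                  # air quality measures per site

# one prediction per (site, target, lead time) combination
forecast = np.zeros((n_sites, n_targets, len(lead_times)))
print(forecast.shape)
```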
Download the Dataset Files
The dataset is available for free from the Kaggle website.
You must create an account and sign in to Kaggle before you can download the dataset.
The dataset can be downloaded from the competition's data page on Kaggle (see the links in the Further Reading section below).
Description of the Dataset Files
There are 4 files of interest that you must download separately; they are:
File: SiteLocations.csv
This file contains a list of site locations marked by unique identifiers and their precise location on Earth measured by longitude and latitude.
All of the coordinates appear to be relatively close together in North America, consistent with the Cook County, Illinois locality mentioned above.
Below is a small example of loading the file and printing the first few rows.
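This is a minimal sketch that assumes SiteLocations.csv has been downloaded into the current working directory.

```python
from pandas import read_csv

# load the site listing (assumes SiteLocations.csv is in the working directory)
sites = read_csv('SiteLocations.csv', header=0)
# report how many sites are listed and show the first few rows
print(sites.shape)
print(sites.head())
```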
File: SiteLocations_with_more_sites.csv
This file has the same format as SiteLocations.csv and appears to list all of the same locations as that file with some additional locations.
As the filename suggests, it is just an updated version of the list of sites.
Below is a small example of loading both files and comparing the listed sites.
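This sketch assumes both files have been downloaded into the current working directory and uses the first column as the site identifier rather than assuming its name.

```python
from pandas import read_csv

# load both site listings (assumes the files are in the working directory)
sites = read_csv('SiteLocations.csv', header=0)
more_sites = read_csv('SiteLocations_with_more_sites.csv', header=0)
# use the first column as the site identifier rather than assuming its name
id_col = sites.columns[0]
original_ids = set(sites[id_col])
extended_ids = set(more_sites[id_col])
print('Original listing: %d sites' % len(original_ids))
print('Extended listing: %d sites' % len(extended_ids))
print('Additional sites: %s' % sorted(extended_ids - original_ids))
```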
File: TrainingData.csv
This file contains the training data for modeling.
The data is presented in an unnormalized manner. Each row of data contains one set of meteorological measurements for one hour across multiple locations as well as the targets or outcomes for each location for that hour.
The measures include:
- Time information, including the block of time, the index within the contiguous block of time, the average month, day of the week, and hour of the day.
- Wind measurements such as direction and speed.
- Temperature measurements such as minimum and maximum ambient temperature.
- Pressure measurements such as minimum and maximum barometric pressure.
The target variables are a collection of different air quality or pollution measures at different physical locations.
Not all locations have all weather measurements and not all locations are concerned with all target measures. Further, for those recorded variables, there are missing values marked as NA.
Below is a small example of loading the file and summarizing its size and the extent of the missing data.
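This sketch assumes TrainingData.csv is in the current working directory and relies on Pandas parsing the NA markers as missing values, which is its default behavior.

```python
from pandas import read_csv

# load the training data (assumes TrainingData.csv is in the working directory)
data = read_csv('TrainingData.csv', header=0)
# one row per hour of meteorological measurements and targets across locations
print(data.shape)
# summarize how much of the table is missing (NA markers are parsed as NaN)
total = data.size
missing = int(data.isnull().sum().sum())
print('Missing: %d of %d values (%.1f%%)' % (missing, total, 100.0 * missing / total))
# peek at the first few rows
print(data.head())
```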
File: SubmissionZerosExceptNAs.csv
This file contains a sample of the submission for the prediction problem.
Each row specifies the prediction for each target measure across all target locations for a given hour in a chunk of contiguous time.
Below is a small example of loading and reviewing the file.
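This sketch assumes the file is in the current working directory.

```python
from pandas import read_csv

# load the sample submission (assumes the file is in the working directory)
sample = read_csv('SubmissionZerosExceptNAs.csv', header=0)
# one row per forecast hour within each chunk of contiguous time
print(sample.shape)
# the filename suggests predictions are zeros except entries left as NA
print('NA entries: %d' % int(sample.isnull().sum().sum()))
print(sample.head())
```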
Framing the Prediction Problem
A large part of the challenge of this prediction problem is the vast number of ways that the problem can be framed for modeling.
This is challenging because it is not clear which framing may be the best for this specific modeling problem.
For example, below are some questions to provoke thought about how the problem could be framed.
- Is it better to impute or ignore missing observations? (See the sketch after this list.)
- Is it better to feed in a time series of weather observations or only the observations for the current hour?
- Is it better to use weather observations from one or multiple source locations for a forecast?
- Is it better to have one model for each location or one model for all locations?
- Is it better to have one model for each forecast time or one for all forecast times?
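As one hedged example of exploring the first question above, the sketch below contrasts dropping incomplete rows with a naive forward/backward fill; it assumes TrainingData.csv is in the current working directory and is not intended as the recommended treatment of missing data.

```python
from pandas import read_csv

# load the training data (assumes TrainingData.csv is in the working directory)
data = read_csv('TrainingData.csv', header=0)

# option 1: ignore observations with any missing value
complete_rows = data.dropna()
print('Complete rows: %d of %d' % (len(complete_rows), len(data)))

# option 2: impute by carrying observations forward, then backward
# note: a more careful approach would fill within each chunk of contiguous
# time rather than across chunk boundaries
imputed = data.ffill().bfill()
print('Missing values after fill: %d' % int(imputed.isnull().sum().sum()))
```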
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
- EMC Data Science Global Hackathon (Air Quality Prediction)
- Download Dataset
- Chucking everything into a Random Forest: Ben Hamner on Winning The Air Quality Prediction Hackathon
- Winning Code for the EMC Data Science Global Hackathon (Air Quality Prediction)
- General approaches to partitioning the models?
Summary
In this post, you discovered the Kaggle air-quality dataset that provides a standard dataset for complex time series forecasting.
Specifically, you learned:
- The competition and motivation for addressing the air-quality dataset.
- An overview of the defined prediction problem and the data challenges it covers.
- A description of the free data files that you can download and start working with immediately.
Have you worked on this dataset, or do you intend to?
Share your experiences in the comments below.