```python
import pandas as pd
from ts2ml.core import add_missing_slots
from ts2ml.core import transform_ts_data_into_features_and_target
```

# ts2ml

Tools to Transform a Time Series into Features and Target Dataset

## Install

```sh
pip install ts2ml
```

## How to use
```python
df = pd.DataFrame({
    'pickup_hour': ['2022-01-01 00:00:00', '2022-01-01 01:00:00', '2022-01-01 03:00:00', '2022-01-01 01:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00'],
    'pickup_location_id': [1, 1, 1, 2, 2, 2],
    'rides': [2, 3, 1, 1, 2, 1]
})
df
```

|   | pickup_hour | pickup_location_id | rides |
|---|---|---|---|
| 0 | 2022-01-01 00:00:00 | 1 | 2 |
| 1 | 2022-01-01 01:00:00 | 1 | 3 |
| 2 | 2022-01-01 03:00:00 | 1 | 1 |
| 3 | 2022-01-01 01:00:00 | 2 | 1 |
| 4 | 2022-01-01 02:00:00 | 2 | 2 |
| 5 | 2022-01-01 05:00:00 | 2 | 1 |
Let’s fill the missing slots with zeros
```python
df = add_missing_slots(df, datetime_col='pickup_hour', entity_col='pickup_location_id', value_col='rides', freq='H')
df
```

```
100%|██████████| 2/2 [00:00<00:00, 907.86it/s]
```

|   | pickup_hour | pickup_location_id | rides |
|---|---|---|---|
| 0 | 2022-01-01 00:00:00 | 1 | 2 |
| 1 | 2022-01-01 01:00:00 | 1 | 3 |
| 2 | 2022-01-01 02:00:00 | 1 | 0 |
| 3 | 2022-01-01 03:00:00 | 1 | 1 |
| 4 | 2022-01-01 04:00:00 | 1 | 0 |
| 5 | 2022-01-01 05:00:00 | 1 | 0 |
| 6 | 2022-01-01 00:00:00 | 2 | 0 |
| 7 | 2022-01-01 01:00:00 | 2 | 1 |
| 8 | 2022-01-01 02:00:00 | 2 | 2 |
| 9 | 2022-01-01 03:00:00 | 2 | 0 |
| 10 | 2022-01-01 04:00:00 | 2 | 0 |
| 11 | 2022-01-01 05:00:00 | 2 | 1 |
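The fill is conceptually just a reindex of each entity onto the complete range of slots. For intuition, here is a minimal pandas-only sketch of the same idea applied to the original six-row frame (an illustration only, not how `add_missing_slots` is implemented):

```python
import pandas as pd

# Illustration only: reindex each location onto the full hourly range,
# filling the gaps with zero rides.
raw = pd.DataFrame({
    'pickup_hour': pd.to_datetime([
        '2022-01-01 00:00:00', '2022-01-01 01:00:00', '2022-01-01 03:00:00',
        '2022-01-01 01:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00',
    ]),
    'pickup_location_id': [1, 1, 1, 2, 2, 2],
    'rides': [2, 3, 1, 1, 2, 1],
})
full_range = pd.date_range('2022-01-01 00:00:00', '2022-01-01 05:00:00', freq='H')
filled = (
    raw.set_index('pickup_hour')
       .groupby('pickup_location_id')['rides']
       .apply(lambda s: s.reindex(full_range, fill_value=0))
       .rename_axis(['pickup_location_id', 'pickup_hour'])
       .reset_index()
)
```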
Now, let's build features and targets to predict the number of rides in the next hour for each pickup_location_id, using the number of rides in the previous 3 hours:
```python
features, targets = transform_ts_data_into_features_and_target(
    df,
    n_features=3,
    datetime_col='pickup_hour',
    entity_col='pickup_location_id',
    value_col='rides',
    n_targets=1,
    step_size=1,
    step_name='hour'
)
```

```
100%|██████████| 2/2 [00:00<00:00, 597.86it/s]
```

```python
features
```

|   | rides_previous_3_hour | rides_previous_2_hour | rides_previous_1_hour | pickup_hour | pickup_location_id |
|---|---|---|---|---|---|
| 0 | 2.0 | 3.0 | 0.0 | 2022-01-01 03:00:00 | 1 |
| 1 | 3.0 | 0.0 | 1.0 | 2022-01-01 04:00:00 | 1 |
| 2 | 0.0 | 1.0 | 2.0 | 2022-01-01 03:00:00 | 2 |
| 3 | 1.0 | 2.0 | 0.0 | 2022-01-01 04:00:00 | 2 |
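To see how the windows are built: for location 1, whose filled series over hours 00:00–05:00 is [2, 3, 0, 1, 0, 0], the first window takes hours 00:00–02:00 as the three lag features and the rides at 03:00 as the target; sliding forward by step_size=1 gives the next row (lags from 01:00–03:00, target at 04:00). The pickup_hour column records the hour being predicted.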
```python
targets
```

|   | target_rides_next_hour |
|---|---|
| 0 | 1.0 |
| 1 | 0.0 |
| 2 | 0.0 |
| 3 | 0.0 |
```python
Xy_df = pd.concat([features, targets], axis=1)
Xy_df
```

|   | rides_previous_3_hour | rides_previous_2_hour | rides_previous_1_hour | pickup_hour | pickup_location_id | target_rides_next_hour |
|---|---|---|---|---|---|---|
| 0 | 2.0 | 3.0 | 0.0 | 2022-01-01 03:00:00 | 1 | 1.0 |
| 1 | 3.0 | 0.0 | 1.0 | 2022-01-01 04:00:00 | 1 | 0.0 |
| 2 | 0.0 | 1.0 | 2.0 | 2022-01-01 03:00:00 | 2 | 0.0 |
| 3 | 1.0 | 2.0 | 0.0 | 2022-01-01 04:00:00 | 2 | 0.0 |
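Each row of Xy_df is now a self-contained training example: the three lag columns are the model inputs and target_rides_next_hour is the label, while pickup_hour and pickup_location_id identify which slot the example came from.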
## Another Example

A monthly-spaced time series:
```python
import pandas as pd
import numpy as np

# Generate timestamp index with monthly frequency
date_rng = pd.date_range(start='1/1/2020', end='12/1/2022', freq='MS')

# Create list of city codes
cities = ['FOR', 'SP', 'RJ']

# Create dataframe with random sales data for each city on each month
df = pd.DataFrame({
    'date': date_rng,
    'city': np.repeat(cities, len(date_rng)//len(cities)),
    'sales': np.random.randint(1000, 5000, size=len(date_rng))
})
df
```

|   | date | city | sales |
|---|---|---|---|
| 0 | 2020-01-01 | FOR | 4944 |
| 1 | 2020-02-01 | FOR | 3435 |
| 2 | 2020-03-01 | FOR | 4543 |
| 3 | 2020-04-01 | FOR | 3879 |
| 4 | 2020-05-01 | FOR | 2601 |
| 5 | 2020-06-01 | FOR | 2922 |
| 6 | 2020-07-01 | FOR | 4542 |
| 7 | 2020-08-01 | FOR | 1338 |
| 8 | 2020-09-01 | FOR | 2938 |
| 9 | 2020-10-01 | FOR | 2695 |
| 10 | 2020-11-01 | FOR | 4065 |
| 11 | 2020-12-01 | FOR | 3864 |
| 12 | 2021-01-01 | SP | 2652 |
| 13 | 2021-02-01 | SP | 2137 |
| 14 | 2021-03-01 | SP | 2663 |
| 15 | 2021-04-01 | SP | 1168 |
| 16 | 2021-05-01 | SP | 4523 |
| 17 | 2021-06-01 | SP | 4135 |
| 18 | 2021-07-01 | SP | 3566 |
| 19 | 2021-08-01 | SP | 2121 |
| 20 | 2021-09-01 | SP | 1070 |
| 21 | 2021-10-01 | SP | 1624 |
| 22 | 2021-11-01 | SP | 3034 |
| 23 | 2021-12-01 | SP | 4063 |
| 24 | 2022-01-01 | RJ | 2297 |
| 25 | 2022-02-01 | RJ | 3430 |
| 26 | 2022-03-01 | RJ | 2903 |
| 27 | 2022-04-01 | RJ | 4197 |
| 28 | 2022-05-01 | RJ | 4141 |
| 29 | 2022-06-01 | RJ | 2899 |
| 30 | 2022-07-01 | RJ | 4529 |
| 31 | 2022-08-01 | RJ | 3612 |
| 32 | 2022-09-01 | RJ | 1856 |
| 33 | 2022-10-01 | RJ | 4804 |
| 34 | 2022-11-01 | RJ | 1764 |
| 35 | 2022-12-01 | RJ | 4425 |
FOR only has data for 2020, SP only for 2021, and RJ only for 2022. Let's also simulate additional missing slots by randomly dropping some rows.
```python
# Generate random indices to drop
drop_indices = np.random.choice(df.index, size=int(len(df)*0.2), replace=False)

# Drop selected rows from dataframe
df = df.drop(drop_indices)
df.reset_index(drop=True, inplace=True)
df
```

|   | date | city | sales |
|---|---|---|---|
| 0 | 2020-01-01 | FOR | 4944 |
| 1 | 2020-02-01 | FOR | 3435 |
| 2 | 2020-03-01 | FOR | 4543 |
| 3 | 2020-04-01 | FOR | 3879 |
| 4 | 2020-05-01 | FOR | 2601 |
| 5 | 2020-06-01 | FOR | 2922 |
| 6 | 2020-07-01 | FOR | 4542 |
| 7 | 2020-08-01 | FOR | 1338 |
| 8 | 2020-09-01 | FOR | 2938 |
| 9 | 2020-11-01 | FOR | 4065 |
| 10 | 2020-12-01 | FOR | 3864 |
| 11 | 2021-01-01 | SP | 2652 |
| 12 | 2021-02-01 | SP | 2137 |
| 13 | 2021-03-01 | SP | 2663 |
| 14 | 2021-07-01 | SP | 3566 |
| 15 | 2021-08-01 | SP | 2121 |
| 16 | 2021-10-01 | SP | 1624 |
| 17 | 2021-11-01 | SP | 3034 |
| 18 | 2021-12-01 | SP | 4063 |
| 19 | 2022-01-01 | RJ | 2297 |
| 20 | 2022-02-01 | RJ | 3430 |
| 21 | 2022-03-01 | RJ | 2903 |
| 22 | 2022-04-01 | RJ | 4197 |
| 23 | 2022-05-01 | RJ | 4141 |
| 24 | 2022-06-01 | RJ | 2899 |
| 25 | 2022-09-01 | RJ | 1856 |
| 26 | 2022-10-01 | RJ | 4804 |
| 27 | 2022-11-01 | RJ | 1764 |
| 28 | 2022-12-01 | RJ | 4425 |
Now let's fill the missing slots: `add_missing_slots` completes them with zeros.
```python
df_full = add_missing_slots(df, datetime_col='date', entity_col='city', value_col='sales', freq='MS')
df_full
```

```
100%|██████████| 3/3 [00:00<00:00, 843.70it/s]
```

|   | date | city | sales |
|---|---|---|---|
| 0 | 2020-01-01 | FOR | 4944 |
| 1 | 2020-02-01 | FOR | 3435 |
| 2 | 2020-03-01 | FOR | 4543 |
| 3 | 2020-04-01 | FOR | 3879 |
| 4 | 2020-05-01 | FOR | 2601 |
| ... | ... | ... | ... |
| 103 | 2022-08-01 | RJ | 0 |
| 104 | 2022-09-01 | RJ | 1856 |
| 105 | 2022-10-01 | RJ | 4804 |
| 106 | 2022-11-01 | RJ | 1764 |
| 107 | 2022-12-01 | RJ | 4425 |
108 rows × 3 columns
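Each city now has a complete monthly index from 2020-01 to 2022-12, i.e. 3 cities × 36 months = 108 rows.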
Let's build a dataset for training a machine learning model to predict next month's sales for each city, based on the sales of the previous 3 months.
```python
features, targets = transform_ts_data_into_features_and_target(
    df_full,
    n_features=3,
    datetime_col='date',
    entity_col='city',
    value_col='sales',
    n_targets=1,
    step_size=1,
    step_name='month'
)
```

```
100%|██████████| 3/3 [00:00<00:00, 205.58it/s]
```

```python
pd.concat([features, targets], axis=1)
```

|   | sales_previous_3_month | sales_previous_2_month | sales_previous_1_month | date | city | target_sales_next_month |
|---|---|---|---|---|---|---|
| 0 | 4944.0 | 3435.0 | 4543.0 | 2020-04-01 | FOR | 3879.0 |
| 1 | 3435.0 | 4543.0 | 3879.0 | 2020-05-01 | FOR | 2601.0 |
| 2 | 4543.0 | 3879.0 | 2601.0 | 2020-06-01 | FOR | 2922.0 |
| 3 | 3879.0 | 2601.0 | 2922.0 | 2020-07-01 | FOR | 4542.0 |
| 4 | 2601.0 | 2922.0 | 4542.0 | 2020-08-01 | FOR | 1338.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 91 | 4197.0 | 4141.0 | 2899.0 | 2022-07-01 | RJ | 0.0 |
| 92 | 4141.0 | 2899.0 | 0.0 | 2022-08-01 | RJ | 0.0 |
| 93 | 2899.0 | 0.0 | 0.0 | 2022-09-01 | RJ | 1856.0 |
| 94 | 0.0 | 0.0 | 1856.0 | 2022-10-01 | RJ | 4804.0 |
| 95 | 0.0 | 1856.0 | 4804.0 | 2022-11-01 | RJ | 1764.0 |
96 rows × 6 columns
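The 96 rows correspond to 32 feature/target windows per city, carved out of its 36 monthly slots.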
## Embedding in Sklearn Pipelines
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

add_missing_slots_transformer = FunctionTransformer(
    add_missing_slots,
    kw_args={
        'datetime_col': 'date',
        'entity_col': 'city',
        'value_col': 'sales',
        'freq': 'MS'
    }
)

transform_ts_data_into_features_and_target_transformer = FunctionTransformer(
    transform_ts_data_into_features_and_target,
    kw_args={
        'n_features': 3,
        'datetime_col': 'date',
        'entity_col': 'city',
        'value_col': 'sales',
        'n_targets': 1,
        'step_size': 1,
        'step_name': 'month',
        'concat_Xy': True
    }
)

ts_data_to_features_and_target_pipeline = make_pipeline(
    add_missing_slots_transformer,
    transform_ts_data_into_features_and_target_transformer
)
```
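Note the extra concat_Xy=True keyword on the second step: as the name suggests, it makes `transform_ts_data_into_features_and_target` return features and target already concatenated into a single dataframe, so the pipeline's fit_transform below yields the combined Xy_df directly instead of a (features, targets) tuple.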
```python
ts_data_to_features_and_target_pipeline
```

```
Pipeline(steps=[('functiontransformer-1',
                 FunctionTransformer(func=<function add_missing_slots at 0x11f8f49d0>,
                                     kw_args={'datetime_col': 'date',
                                              'entity_col': 'city',
                                              'freq': 'MS',
                                              'value_col': 'sales'})),
                ('functiontransformer-2',
                 FunctionTransformer(func=<function transform_ts_data_into_features_and_target at 0x11f925ca0>,
                                     kw_args={'concat_Xy': True,
                                              'datetime_col': 'date',
                                              'entity_col': 'city',
                                              'n_features': 3, 'n_targets': 1,
                                              'step_name': 'month',
                                              'step_size': 1,
                                              'value_col': 'sales'}))])
```

```python
Xy_df = ts_data_to_features_and_target_pipeline.fit_transform(df)
Xy_df
```

```
100%|██████████| 3/3 [00:00<00:00, 715.47it/s]
100%|██████████| 3/3 [00:00<00:00, 184.12it/s]
```
|   | sales_previous_3_month | sales_previous_2_month | sales_previous_1_month | date | city | target_sales_next_month |
|---|---|---|---|---|---|---|
| 0 | 4944.0 | 3435.0 | 4543.0 | 2020-04-01 | FOR | 3879.0 |
| 1 | 3435.0 | 4543.0 | 3879.0 | 2020-05-01 | FOR | 2601.0 |
| 2 | 4543.0 | 3879.0 | 2601.0 | 2020-06-01 | FOR | 2922.0 |
| 3 | 3879.0 | 2601.0 | 2922.0 | 2020-07-01 | FOR | 4542.0 |
| 4 | 2601.0 | 2922.0 | 4542.0 | 2020-08-01 | FOR | 1338.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 91 | 4197.0 | 4141.0 | 2899.0 | 2022-07-01 | RJ | 0.0 |
| 92 | 4141.0 | 2899.0 | 0.0 | 2022-08-01 | RJ | 0.0 |
| 93 | 2899.0 | 0.0 | 0.0 | 2022-09-01 | RJ | 1856.0 |
| 94 | 0.0 | 0.0 | 1856.0 | 2022-10-01 | RJ | 4804.0 |
| 95 | 0.0 | 1856.0 | 4804.0 | 2022-11-01 | RJ | 1764.0 |
96 rows × 6 columns
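From here, fitting a model is straightforward. A minimal sketch of one possible downstream step, assuming scikit-learn's LinearRegression and a simple date-based split (neither is part of ts2ml):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Hypothetical downstream usage: date-based split, then a simple baseline model.
feature_cols = ['sales_previous_3_month', 'sales_previous_2_month', 'sales_previous_1_month']
train = Xy_df[Xy_df['date'] < '2022-06-01']
test = Xy_df[Xy_df['date'] >= '2022-06-01']

model = LinearRegression()
model.fit(train[feature_cols], train['target_sales_next_month'])

preds = model.predict(test[feature_cols])
print(mean_absolute_error(test['target_sales_next_month'], preds))
```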