import pandas as pd
from ts2ml.core import add_missing_slots
from ts2ml.core import transform_ts_data_into_features_and_target
ts2ml
Tools to Transform a Time Series into Features and Target Dataset
Install
pip install ts2ml
How to use
= pd.DataFrame({
df 'pickup_hour': ['2022-01-01 00:00:00', '2022-01-01 01:00:00', '2022-01-01 03:00:00', '2022-01-01 01:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00'],
'pickup_location_id': [1, 1, 1, 2, 2, 2],
'rides': [2, 3, 1, 1, 2, 1]
}) df
pickup_hour | pickup_location_id | rides | |
---|---|---|---|
0 | 2022-01-01 00:00:00 | 1 | 2 |
1 | 2022-01-01 01:00:00 | 1 | 3 |
2 | 2022-01-01 03:00:00 | 1 | 1 |
3 | 2022-01-01 01:00:00 | 2 | 1 |
4 | 2022-01-01 02:00:00 | 2 | 2 |
5 | 2022-01-01 05:00:00 | 2 | 1 |
Let’s fill the missing slots with zeros
= add_missing_slots(df, datetime_col='pickup_hour', entity_col='pickup_location_id', value_col='rides', freq='H')
df df
100%|██████████| 2/2 [00:00<00:00, 907.86it/s]
pickup_hour | pickup_location_id | rides | |
---|---|---|---|
0 | 2022-01-01 00:00:00 | 1 | 2 |
1 | 2022-01-01 01:00:00 | 1 | 3 |
2 | 2022-01-01 02:00:00 | 1 | 0 |
3 | 2022-01-01 03:00:00 | 1 | 1 |
4 | 2022-01-01 04:00:00 | 1 | 0 |
5 | 2022-01-01 05:00:00 | 1 | 0 |
6 | 2022-01-01 00:00:00 | 2 | 0 |
7 | 2022-01-01 01:00:00 | 2 | 1 |
8 | 2022-01-01 02:00:00 | 2 | 2 |
9 | 2022-01-01 03:00:00 | 2 | 0 |
10 | 2022-01-01 04:00:00 | 2 | 0 |
11 | 2022-01-01 05:00:00 | 2 | 1 |
Now, let’s build features and targets to predict the number of rides for the next hour for each location_id, by using the historical number of rides for the last 3 hours
= transform_ts_data_into_features_and_target(
features, targets
df,=3,
n_features='pickup_hour',
datetime_col='pickup_location_id',
entity_col='rides',
value_col=1,
n_targets=1,
step_size='hour'
step_name )
100%|██████████| 2/2 [00:00<00:00, 597.86it/s]
features
rides_previous_3_hour | rides_previous_2_hour | rides_previous_1_hour | pickup_hour | pickup_location_id | |
---|---|---|---|---|---|
0 | 2.0 | 3.0 | 0.0 | 2022-01-01 03:00:00 | 1 |
1 | 3.0 | 0.0 | 1.0 | 2022-01-01 04:00:00 | 1 |
2 | 0.0 | 1.0 | 2.0 | 2022-01-01 03:00:00 | 2 |
3 | 1.0 | 2.0 | 0.0 | 2022-01-01 04:00:00 | 2 |
targets
target_rides_next_hour | |
---|---|
0 | 1.0 |
1 | 0.0 |
2 | 0.0 |
3 | 0.0 |
= pd.concat([features, targets], axis=1)
Xy_df Xy_df
rides_previous_3_hour | rides_previous_2_hour | rides_previous_1_hour | pickup_hour | pickup_location_id | target_rides_next_hour | |
---|---|---|---|---|---|---|
0 | 2.0 | 3.0 | 0.0 | 2022-01-01 03:00:00 | 1 | 1.0 |
1 | 3.0 | 0.0 | 1.0 | 2022-01-01 04:00:00 | 1 | 0.0 |
2 | 0.0 | 1.0 | 2.0 | 2022-01-01 03:00:00 | 2 | 0.0 |
3 | 1.0 | 2.0 | 0.0 | 2022-01-01 04:00:00 | 2 | 0.0 |
Another Example
Montly spaced time series
import pandas as pd
import numpy as np
# Generate timestamp index with monthly frequency
= pd.date_range(start='1/1/2020', end='12/1/2022', freq='MS')
date_rng
# Create list of city codes
= ['FOR', 'SP', 'RJ']
cities
# Create dataframe with random sales data for each city on each month
= pd.DataFrame({
df 'date': date_rng,
'city': np.repeat(cities, len(date_rng)//len(cities)),
'sales': np.random.randint(1000, 5000, size=len(date_rng))
}) df
date | city | sales | |
---|---|---|---|
0 | 2020-01-01 | FOR | 4944 |
1 | 2020-02-01 | FOR | 3435 |
2 | 2020-03-01 | FOR | 4543 |
3 | 2020-04-01 | FOR | 3879 |
4 | 2020-05-01 | FOR | 2601 |
5 | 2020-06-01 | FOR | 2922 |
6 | 2020-07-01 | FOR | 4542 |
7 | 2020-08-01 | FOR | 1338 |
8 | 2020-09-01 | FOR | 2938 |
9 | 2020-10-01 | FOR | 2695 |
10 | 2020-11-01 | FOR | 4065 |
11 | 2020-12-01 | FOR | 3864 |
12 | 2021-01-01 | SP | 2652 |
13 | 2021-02-01 | SP | 2137 |
14 | 2021-03-01 | SP | 2663 |
15 | 2021-04-01 | SP | 1168 |
16 | 2021-05-01 | SP | 4523 |
17 | 2021-06-01 | SP | 4135 |
18 | 2021-07-01 | SP | 3566 |
19 | 2021-08-01 | SP | 2121 |
20 | 2021-09-01 | SP | 1070 |
21 | 2021-10-01 | SP | 1624 |
22 | 2021-11-01 | SP | 3034 |
23 | 2021-12-01 | SP | 4063 |
24 | 2022-01-01 | RJ | 2297 |
25 | 2022-02-01 | RJ | 3430 |
26 | 2022-03-01 | RJ | 2903 |
27 | 2022-04-01 | RJ | 4197 |
28 | 2022-05-01 | RJ | 4141 |
29 | 2022-06-01 | RJ | 2899 |
30 | 2022-07-01 | RJ | 4529 |
31 | 2022-08-01 | RJ | 3612 |
32 | 2022-09-01 | RJ | 1856 |
33 | 2022-10-01 | RJ | 4804 |
34 | 2022-11-01 | RJ | 1764 |
35 | 2022-12-01 | RJ | 4425 |
FOR city only have data for 2020 year, RJ only for 2022 and SP only for 2021. Let’s also simulate more missing slots between the years.
# Generate random indices to drop
= np.random.choice(df.index, size=int(len(df)*0.2), replace=False)
drop_indices
# Drop selected rows from dataframe
= df.drop(drop_indices)
df =True, inplace=True)
df.reset_index(drop df
date | city | sales | |
---|---|---|---|
0 | 2020-01-01 | FOR | 4944 |
1 | 2020-02-01 | FOR | 3435 |
2 | 2020-03-01 | FOR | 4543 |
3 | 2020-04-01 | FOR | 3879 |
4 | 2020-05-01 | FOR | 2601 |
5 | 2020-06-01 | FOR | 2922 |
6 | 2020-07-01 | FOR | 4542 |
7 | 2020-08-01 | FOR | 1338 |
8 | 2020-09-01 | FOR | 2938 |
9 | 2020-11-01 | FOR | 4065 |
10 | 2020-12-01 | FOR | 3864 |
11 | 2021-01-01 | SP | 2652 |
12 | 2021-02-01 | SP | 2137 |
13 | 2021-03-01 | SP | 2663 |
14 | 2021-07-01 | SP | 3566 |
15 | 2021-08-01 | SP | 2121 |
16 | 2021-10-01 | SP | 1624 |
17 | 2021-11-01 | SP | 3034 |
18 | 2021-12-01 | SP | 4063 |
19 | 2022-01-01 | RJ | 2297 |
20 | 2022-02-01 | RJ | 3430 |
21 | 2022-03-01 | RJ | 2903 |
22 | 2022-04-01 | RJ | 4197 |
23 | 2022-05-01 | RJ | 4141 |
24 | 2022-06-01 | RJ | 2899 |
25 | 2022-09-01 | RJ | 1856 |
26 | 2022-10-01 | RJ | 4804 |
27 | 2022-11-01 | RJ | 1764 |
28 | 2022-12-01 | RJ | 4425 |
Now lets fill the missing slots with zero values. The function will complete the missing slots with zeros:
= add_missing_slots(df, datetime_col='date', entity_col='city', value_col='sales', freq='MS')
df_full df_full
100%|██████████| 3/3 [00:00<00:00, 843.70it/s]
date | city | sales | |
---|---|---|---|
0 | 2020-01-01 | FOR | 4944 |
1 | 2020-02-01 | FOR | 3435 |
2 | 2020-03-01 | FOR | 4543 |
3 | 2020-04-01 | FOR | 3879 |
4 | 2020-05-01 | FOR | 2601 |
... | ... | ... | ... |
103 | 2022-08-01 | RJ | 0 |
104 | 2022-09-01 | RJ | 1856 |
105 | 2022-10-01 | RJ | 4804 |
106 | 2022-11-01 | RJ | 1764 |
107 | 2022-12-01 | RJ | 4425 |
108 rows × 3 columns
Let’s build a dataset for training a machine learning model to predict the sales for the next 3 months, for each city, based on historical data of sales for the previous 6 months.
= transform_ts_data_into_features_and_target(
features, targets
df_full,=3,
n_features='date',
datetime_col='city',
entity_col='sales',
value_col=1,
n_targets=1,
step_size='month'
step_name )
100%|██████████| 3/3 [00:00<00:00, 205.58it/s]
=1) pd.concat([features, targets], axis
sales_previous_3_month | sales_previous_2_month | sales_previous_1_month | date | city | target_sales_next_month | |
---|---|---|---|---|---|---|
0 | 4944.0 | 3435.0 | 4543.0 | 2020-04-01 | FOR | 3879.0 |
1 | 3435.0 | 4543.0 | 3879.0 | 2020-05-01 | FOR | 2601.0 |
2 | 4543.0 | 3879.0 | 2601.0 | 2020-06-01 | FOR | 2922.0 |
3 | 3879.0 | 2601.0 | 2922.0 | 2020-07-01 | FOR | 4542.0 |
4 | 2601.0 | 2922.0 | 4542.0 | 2020-08-01 | FOR | 1338.0 |
... | ... | ... | ... | ... | ... | ... |
91 | 4197.0 | 4141.0 | 2899.0 | 2022-07-01 | RJ | 0.0 |
92 | 4141.0 | 2899.0 | 0.0 | 2022-08-01 | RJ | 0.0 |
93 | 2899.0 | 0.0 | 0.0 | 2022-09-01 | RJ | 1856.0 |
94 | 0.0 | 0.0 | 1856.0 | 2022-10-01 | RJ | 4804.0 |
95 | 0.0 | 1856.0 | 4804.0 | 2022-11-01 | RJ | 1764.0 |
96 rows × 6 columns
Embedding on Sklearn Pipelines
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
= FunctionTransformer(
add_missing_slots_transformer
add_missing_slots, ={
kw_args'datetime_col': 'date',
'entity_col': 'city',
'value_col': 'sales',
'freq': 'MS'
}
)
= FunctionTransformer(
transform_ts_data_into_features_and_target_transformer
transform_ts_data_into_features_and_target, ={
kw_args'n_features': 3,
'datetime_col': 'date',
'entity_col': 'city',
'value_col': 'sales',
'n_targets': 1,
'step_size': 1,
'step_name': 'month',
'concat_Xy': True
} )
= make_pipeline(
ts_data_to_features_and_target_pipeline
add_missing_slots_transformer,
transform_ts_data_into_features_and_target_transformer
) ts_data_to_features_and_target_pipeline
Pipeline(steps=[('functiontransformer-1', FunctionTransformer(func=<function add_missing_slots at 0x11f8f49d0>, kw_args={'datetime_col': 'date', 'entity_col': 'city', 'freq': 'MS', 'value_col': 'sales'})), ('functiontransformer-2', FunctionTransformer(func=<function transform_ts_data_into_features_and_target at 0x11f925ca0>, kw_args={'concat_Xy': True, 'datetime_col': 'date', 'entity_col': 'city', 'n_features': 3, 'n_targets': 1, 'step_name': 'month', 'step_size': 1, 'value_col': 'sales'}))])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('functiontransformer-1', FunctionTransformer(func=<function add_missing_slots at 0x11f8f49d0>, kw_args={'datetime_col': 'date', 'entity_col': 'city', 'freq': 'MS', 'value_col': 'sales'})), ('functiontransformer-2', FunctionTransformer(func=<function transform_ts_data_into_features_and_target at 0x11f925ca0>, kw_args={'concat_Xy': True, 'datetime_col': 'date', 'entity_col': 'city', 'n_features': 3, 'n_targets': 1, 'step_name': 'month', 'step_size': 1, 'value_col': 'sales'}))])
FunctionTransformer(func=<function add_missing_slots at 0x11f8f49d0>, kw_args={'datetime_col': 'date', 'entity_col': 'city', 'freq': 'MS', 'value_col': 'sales'})
FunctionTransformer(func=<function transform_ts_data_into_features_and_target at 0x11f925ca0>, kw_args={'concat_Xy': True, 'datetime_col': 'date', 'entity_col': 'city', 'n_features': 3, 'n_targets': 1, 'step_name': 'month', 'step_size': 1, 'value_col': 'sales'})
= ts_data_to_features_and_target_pipeline.fit_transform(df)
Xy_df Xy_df
100%|██████████| 3/3 [00:00<00:00, 715.47it/s]
100%|██████████| 3/3 [00:00<00:00, 184.12it/s]
sales_previous_3_month | sales_previous_2_month | sales_previous_1_month | date | city | target_sales_next_month | |
---|---|---|---|---|---|---|
0 | 4944.0 | 3435.0 | 4543.0 | 2020-04-01 | FOR | 3879.0 |
1 | 3435.0 | 4543.0 | 3879.0 | 2020-05-01 | FOR | 2601.0 |
2 | 4543.0 | 3879.0 | 2601.0 | 2020-06-01 | FOR | 2922.0 |
3 | 3879.0 | 2601.0 | 2922.0 | 2020-07-01 | FOR | 4542.0 |
4 | 2601.0 | 2922.0 | 4542.0 | 2020-08-01 | FOR | 1338.0 |
... | ... | ... | ... | ... | ... | ... |
91 | 4197.0 | 4141.0 | 2899.0 | 2022-07-01 | RJ | 0.0 |
92 | 4141.0 | 2899.0 | 0.0 | 2022-08-01 | RJ | 0.0 |
93 | 2899.0 | 0.0 | 0.0 | 2022-09-01 | RJ | 1856.0 |
94 | 0.0 | 0.0 | 1856.0 | 2022-10-01 | RJ | 4804.0 |
95 | 0.0 | 1856.0 | 4804.0 | 2022-11-01 | RJ | 1764.0 |
96 rows × 6 columns