ts2ml

Tools to Transform a Time Series into Features and Target Dataset

Install

pip install ts2ml

How to use

import pandas as pd
from ts2ml.core import add_missing_slots
from ts2ml.core import transform_ts_data_into_features_and_target
df = pd.DataFrame({
    'pickup_hour': ['2022-01-01 00:00:00', '2022-01-01 01:00:00', '2022-01-01 03:00:00', '2022-01-01 01:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00'],
    'pickup_location_id': [1, 1, 1, 2, 2, 2],
    'rides': [2, 3, 1, 1, 2, 1]
})
df
pickup_hour pickup_location_id rides
0 2022-01-01 00:00:00 1 2
1 2022-01-01 01:00:00 1 3
2 2022-01-01 03:00:00 1 1
3 2022-01-01 01:00:00 2 1
4 2022-01-01 02:00:00 2 2
5 2022-01-01 05:00:00 2 1

Let’s fill the missing slots with zeros

df = add_missing_slots(df, datetime_col='pickup_hour', entity_col='pickup_location_id', value_col='rides', freq='H')
df
100%|██████████| 2/2 [00:00<00:00, 907.86it/s]
pickup_hour pickup_location_id rides
0 2022-01-01 00:00:00 1 2
1 2022-01-01 01:00:00 1 3
2 2022-01-01 02:00:00 1 0
3 2022-01-01 03:00:00 1 1
4 2022-01-01 04:00:00 1 0
5 2022-01-01 05:00:00 1 0
6 2022-01-01 00:00:00 2 0
7 2022-01-01 01:00:00 2 1
8 2022-01-01 02:00:00 2 2
9 2022-01-01 03:00:00 2 0
10 2022-01-01 04:00:00 2 0
11 2022-01-01 05:00:00 2 1
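
To make the idea behind the output above concrete, here is a minimal pandas sketch of one way such gap filling could be done. It is not ts2ml's implementation: the helper name `fill_missing_slots_sketch` is hypothetical, and it assumes each (entity, timestamp) pair appears at most once. `pd.Timedelta(hours=1)` is used for the hourly frequency so the sketch does not depend on a particular frequency alias.

```python
import pandas as pd


def fill_missing_slots_sketch(df, datetime_col, entity_col, value_col, freq):
    """Illustrative only: one row per (entity, time slot), zero where no data existed."""
    out = df.copy()
    out[datetime_col] = pd.to_datetime(out[datetime_col])
    # All slots between the earliest and latest timestamp across all entities
    full_range = pd.date_range(out[datetime_col].min(), out[datetime_col].max(), freq=freq)
    # Every (entity, slot) combination, entities in order of first appearance
    full_index = pd.MultiIndex.from_product(
        [out[entity_col].unique(), full_range], names=[entity_col, datetime_col]
    )
    filled = (
        out.set_index([entity_col, datetime_col])[value_col]
        .reindex(full_index, fill_value=0)  # slots absent from the data become 0
        .reset_index()
    )
    return filled[[datetime_col, entity_col, value_col]]


raw = pd.DataFrame({
    'pickup_hour': ['2022-01-01 00:00:00', '2022-01-01 01:00:00', '2022-01-01 03:00:00',
                    '2022-01-01 01:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00'],
    'pickup_location_id': [1, 1, 1, 2, 2, 2],
    'rides': [2, 3, 1, 1, 2, 1],
})
filled = fill_missing_slots_sketch(
    raw, 'pickup_hour', 'pickup_location_id', 'rides', freq=pd.Timedelta(hours=1)
)
```

On the sample data this yields 12 rows (6 hourly slots for each of the 2 locations), with the same rides values as the table above.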

Now let’s build features and targets to predict the number of rides in the next hour for each pickup_location_id, using the number of rides in the previous 3 hours.

features, targets = transform_ts_data_into_features_and_target(
    df,
    n_features=3,
    datetime_col='pickup_hour', 
    entity_col='pickup_location_id', 
    value_col='rides',
    n_targets=1,
    step_size=1,
    step_name='hour'
)
100%|██████████| 2/2 [00:00<00:00, 597.86it/s]
features
rides_previous_3_hour rides_previous_2_hour rides_previous_1_hour pickup_hour pickup_location_id
0 2.0 3.0 0.0 2022-01-01 03:00:00 1
1 3.0 0.0 1.0 2022-01-01 04:00:00 1
2 0.0 1.0 2.0 2022-01-01 03:00:00 2
3 1.0 2.0 0.0 2022-01-01 04:00:00 2
targets
target_rides_next_hour
0 1.0
1 0.0
2 0.0
3 0.0
Xy_df = pd.concat([features, targets], axis=1)
Xy_df
rides_previous_3_hour rides_previous_2_hour rides_previous_1_hour pickup_hour pickup_location_id target_rides_next_hour
0 2.0 3.0 0.0 2022-01-01 03:00:00 1 1.0
1 3.0 0.0 1.0 2022-01-01 04:00:00 1 0.0
2 0.0 1.0 2.0 2022-01-01 03:00:00 2 0.0
3 1.0 2.0 0.0 2022-01-01 04:00:00 2 0.0
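
The feature and target layout above follows the usual sliding-window idea: each row holds `n_features` consecutive values, and the target is the value that comes right after them. The NumPy sketch below illustrates that idea for a single series. It is a generic illustration, not ts2ml's code: the function name is hypothetical, and treating `step_size` as the stride between consecutive windows is an assumption. The sketch returns every complete window, whereas the output above contains two rows per location, so the library's exact cut-off differs.

```python
import numpy as np


def sliding_windows_sketch(values, n_features, n_targets=1, step_size=1):
    """Illustrative only: split one series into (features, targets) windows."""
    values = np.asarray(values, dtype=float)
    window = n_features + n_targets
    # Each row is one window of length n_features + n_targets
    windows = np.lib.stride_tricks.sliding_window_view(values, window)[::step_size]
    # First n_features entries are the inputs, the rest are the targets
    return windows[:, :n_features], windows[:, n_features:]


# Rides for pickup_location_id 1 after the missing slots were filled
X, y = sliding_windows_sketch([2, 3, 0, 1, 0, 0], n_features=3, n_targets=1)
```

The first two rows of `X` and `y` correspond to the first two rows shown for location 1 in the table above.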

Another Example

Monthly spaced time series

import pandas as pd
import numpy as np

# Generate timestamp index with monthly frequency
date_rng = pd.date_range(start='1/1/2020', end='12/1/2022', freq='MS')

# Create list of city codes
cities = ['FOR', 'SP', 'RJ']

# Create dataframe with random sales data for each city on each month
df = pd.DataFrame({
    'date': date_rng,
    'city': np.repeat(cities, len(date_rng)//len(cities)),
    'sales': np.random.randint(1000, 5000, size=len(date_rng))
})
df
date city sales
0 2020-01-01 FOR 4944
1 2020-02-01 FOR 3435
2 2020-03-01 FOR 4543
3 2020-04-01 FOR 3879
4 2020-05-01 FOR 2601
5 2020-06-01 FOR 2922
6 2020-07-01 FOR 4542
7 2020-08-01 FOR 1338
8 2020-09-01 FOR 2938
9 2020-10-01 FOR 2695
10 2020-11-01 FOR 4065
11 2020-12-01 FOR 3864
12 2021-01-01 SP 2652
13 2021-02-01 SP 2137
14 2021-03-01 SP 2663
15 2021-04-01 SP 1168
16 2021-05-01 SP 4523
17 2021-06-01 SP 4135
18 2021-07-01 SP 3566
19 2021-08-01 SP 2121
20 2021-09-01 SP 1070
21 2021-10-01 SP 1624
22 2021-11-01 SP 3034
23 2021-12-01 SP 4063
24 2022-01-01 RJ 2297
25 2022-02-01 RJ 3430
26 2022-03-01 RJ 2903
27 2022-04-01 RJ 4197
28 2022-05-01 RJ 4141
29 2022-06-01 RJ 2899
30 2022-07-01 RJ 4529
31 2022-08-01 RJ 3612
32 2022-09-01 RJ 1856
33 2022-10-01 RJ 4804
34 2022-11-01 RJ 1764
35 2022-12-01 RJ 4425

The city FOR only has data for 2020, SP only for 2021, and RJ only for 2022. Let’s also simulate more missing slots by dropping random rows.

# Generate random indices to drop
drop_indices = np.random.choice(df.index, size=int(len(df)*0.2), replace=False)

# Drop selected rows from dataframe
df = df.drop(drop_indices)
df.reset_index(drop=True, inplace=True)
df
date city sales
0 2020-01-01 FOR 4944
1 2020-02-01 FOR 3435
2 2020-03-01 FOR 4543
3 2020-04-01 FOR 3879
4 2020-05-01 FOR 2601
5 2020-06-01 FOR 2922
6 2020-07-01 FOR 4542
7 2020-08-01 FOR 1338
8 2020-09-01 FOR 2938
9 2020-11-01 FOR 4065
10 2020-12-01 FOR 3864
11 2021-01-01 SP 2652
12 2021-02-01 SP 2137
13 2021-03-01 SP 2663
14 2021-07-01 SP 3566
15 2021-08-01 SP 2121
16 2021-10-01 SP 1624
17 2021-11-01 SP 3034
18 2021-12-01 SP 4063
19 2022-01-01 RJ 2297
20 2022-02-01 RJ 3430
21 2022-03-01 RJ 2903
22 2022-04-01 RJ 4197
23 2022-05-01 RJ 4141
24 2022-06-01 RJ 2899
25 2022-09-01 RJ 1856
26 2022-10-01 RJ 4804
27 2022-11-01 RJ 1764
28 2022-12-01 RJ 4425

Now let’s fill the missing slots with zeros:

df_full = add_missing_slots(df, datetime_col='date', entity_col='city', value_col='sales', freq='MS')
df_full
100%|██████████| 3/3 [00:00<00:00, 843.70it/s]
date city sales
0 2020-01-01 FOR 4944
1 2020-02-01 FOR 3435
2 2020-03-01 FOR 4543
3 2020-04-01 FOR 3879
4 2020-05-01 FOR 2601
... ... ... ...
103 2022-08-01 RJ 0
104 2022-09-01 RJ 1856
105 2022-10-01 RJ 4804
106 2022-11-01 RJ 1764
107 2022-12-01 RJ 4425

108 rows × 3 columns

Let’s build a dataset for training a machine learning model that predicts the sales for the next month, for each city, based on the sales of the previous 3 months (n_features=3, n_targets=1).

features, targets = transform_ts_data_into_features_and_target(
    df_full,
    n_features=3,
    datetime_col='date',
    entity_col='city',
    value_col='sales',
    n_targets=1,
    step_size=1,
    step_name='month'
)
100%|██████████| 3/3 [00:00<00:00, 205.58it/s]
pd.concat([features, targets], axis=1)
sales_previous_3_month sales_previous_2_month sales_previous_1_month date city target_sales_next_month
0 4944.0 3435.0 4543.0 2020-04-01 FOR 3879.0
1 3435.0 4543.0 3879.0 2020-05-01 FOR 2601.0
2 4543.0 3879.0 2601.0 2020-06-01 FOR 2922.0
3 3879.0 2601.0 2922.0 2020-07-01 FOR 4542.0
4 2601.0 2922.0 4542.0 2020-08-01 FOR 1338.0
... ... ... ... ... ... ...
91 4197.0 4141.0 2899.0 2022-07-01 RJ 0.0
92 4141.0 2899.0 0.0 2022-08-01 RJ 0.0
93 2899.0 0.0 0.0 2022-09-01 RJ 1856.0
94 0.0 0.0 1856.0 2022-10-01 RJ 4804.0
95 0.0 1856.0 4804.0 2022-11-01 RJ 1764.0

96 rows × 6 columns

Embedding on Sklearn Pipelines

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
add_missing_slots_transformer = FunctionTransformer(
    add_missing_slots, 
    kw_args={
        'datetime_col': 'date', 
        'entity_col': 'city', 
        'value_col': 'sales', 
        'freq': 'MS'
    }
)

transform_ts_data_into_features_and_target_transformer = FunctionTransformer(
    transform_ts_data_into_features_and_target, 
    kw_args={
        'n_features': 3, 
        'datetime_col': 'date', 
        'entity_col': 'city', 
        'value_col': 'sales', 
        'n_targets': 1, 
        'step_size': 1, 
        'step_name': 'month',
        'concat_Xy': True
    }
)
ts_data_to_features_and_target_pipeline = make_pipeline(
    add_missing_slots_transformer,
    transform_ts_data_into_features_and_target_transformer
)
ts_data_to_features_and_target_pipeline
Pipeline(steps=[('functiontransformer-1',
                 FunctionTransformer(func=<function add_missing_slots at 0x11f8f49d0>,
                                     kw_args={'datetime_col': 'date',
                                              'entity_col': 'city',
                                              'freq': 'MS',
                                              'value_col': 'sales'})),
                ('functiontransformer-2',
                 FunctionTransformer(func=<function transform_ts_data_into_features_and_target at 0x11f925ca0>,
                                     kw_args={'concat_Xy': True,
                                              'datetime_col': 'date',
                                              'entity_col': 'city',
                                              'n_features': 3, 'n_targets': 1,
                                              'step_name': 'month',
                                              'step_size': 1,
                                              'value_col': 'sales'}))])
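
As a side note on why this wrapping works: `FunctionTransformer` calls the wrapped function with the input as the first argument and forwards `kw_args` as keyword arguments. The toy sketch below shows that mechanism in isolation. The helper `add_constant` and its parameters are hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def add_constant(df, column, amount):
    # Hypothetical helper: returns a copy with `amount` added to one column
    out = df.copy()
    out[column] = out[column] + amount
    return out


transformer = FunctionTransformer(add_constant, kw_args={'column': 'sales', 'amount': 100})
toy = pd.DataFrame({'sales': [1, 2, 3]})

# Equivalent to add_constant(toy, column='sales', amount=100)
result = transformer.fit_transform(toy)
```

The input frame is left unchanged because the helper works on a copy.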
Xy_df = ts_data_to_features_and_target_pipeline.fit_transform(df)
Xy_df
100%|██████████| 3/3 [00:00<00:00, 715.47it/s]
100%|██████████| 3/3 [00:00<00:00, 184.12it/s]
sales_previous_3_month sales_previous_2_month sales_previous_1_month date city target_sales_next_month
0 4944.0 3435.0 4543.0 2020-04-01 FOR 3879.0
1 3435.0 4543.0 3879.0 2020-05-01 FOR 2601.0
2 4543.0 3879.0 2601.0 2020-06-01 FOR 2922.0
3 3879.0 2601.0 2922.0 2020-07-01 FOR 4542.0
4 2601.0 2922.0 4542.0 2020-08-01 FOR 1338.0
... ... ... ... ... ... ...
91 4197.0 4141.0 2899.0 2022-07-01 RJ 0.0
92 4141.0 2899.0 0.0 2022-08-01 RJ 0.0
93 2899.0 0.0 0.0 2022-09-01 RJ 1856.0
94 0.0 0.0 1856.0 2022-10-01 RJ 4804.0
95 0.0 1856.0 4804.0 2022-11-01 RJ 1764.0

96 rows × 6 columns
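
A frame shaped like this can be split into training and test sets by date, so that the model is evaluated only on periods later than those it was trained on. The sketch below is one possible approach using plain pandas. It rebuilds the first five rows shown above by hand so that it runs on its own, and the cutoff date is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# First five rows of the table above, rebuilt by hand for a self-contained example
xy = pd.DataFrame({
    'sales_previous_3_month': [4944.0, 3435.0, 4543.0, 3879.0, 2601.0],
    'sales_previous_2_month': [3435.0, 4543.0, 3879.0, 2601.0, 2922.0],
    'sales_previous_1_month': [4543.0, 3879.0, 2601.0, 2922.0, 4542.0],
    'date': pd.to_datetime(['2020-04-01', '2020-05-01', '2020-06-01',
                            '2020-07-01', '2020-08-01']),
    'city': ['FOR'] * 5,
    'target_sales_next_month': [3879.0, 2601.0, 2922.0, 4542.0, 1338.0],
})

cutoff = pd.Timestamp('2020-07-01')  # assumed cutoff, not from the original example
train = xy[xy['date'] < cutoff]
test = xy[xy['date'] >= cutoff]

# Use only the lagged sales columns as model inputs
feature_cols = [c for c in xy.columns if c.startswith('sales_previous_')]
X_train, y_train = train[feature_cols], train['target_sales_next_month']
X_test, y_test = test[feature_cols], test['target_sales_next_month']
```

Splitting by date rather than at random keeps future observations out of the training set.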