```python
import pandas as pd
from ts2ml.core import add_missing_slots
from ts2ml.core import transform_ts_data_into_features_and_target
```

# ts2ml

Tools to Transform a Time Series into Features and Target Dataset

## Install

```sh
pip install ts2ml
```

## How to use
```python
df = pd.DataFrame({
    'pickup_hour': ['2022-01-01 00:00:00', '2022-01-01 01:00:00', '2022-01-01 03:00:00', '2022-01-01 01:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00'],
    'pickup_location_id': [1, 1, 1, 2, 2, 2],
    'rides': [2, 3, 1, 1, 2, 1]
})
df
```

|   | pickup_hour | pickup_location_id | rides |
|---|---|---|---|
| 0 | 2022-01-01 00:00:00 | 1 | 2 |
| 1 | 2022-01-01 01:00:00 | 1 | 3 |
| 2 | 2022-01-01 03:00:00 | 1 | 1 |
| 3 | 2022-01-01 01:00:00 | 2 | 1 |
| 4 | 2022-01-01 02:00:00 | 2 | 2 |
| 5 | 2022-01-01 05:00:00 | 2 | 1 |
Let’s fill the missing slots with zeros
```python
df = add_missing_slots(df, datetime_col='pickup_hour', entity_col='pickup_location_id', value_col='rides', freq='H')
df
```

```
100%|██████████| 2/2 [00:00<00:00, 907.86it/s]
```

|   | pickup_hour | pickup_location_id | rides |
|---|---|---|---|
| 0 | 2022-01-01 00:00:00 | 1 | 2 |
| 1 | 2022-01-01 01:00:00 | 1 | 3 |
| 2 | 2022-01-01 02:00:00 | 1 | 0 |
| 3 | 2022-01-01 03:00:00 | 1 | 1 |
| 4 | 2022-01-01 04:00:00 | 1 | 0 |
| 5 | 2022-01-01 05:00:00 | 1 | 0 |
| 6 | 2022-01-01 00:00:00 | 2 | 0 |
| 7 | 2022-01-01 01:00:00 | 2 | 1 |
| 8 | 2022-01-01 02:00:00 | 2 | 2 |
| 9 | 2022-01-01 03:00:00 | 2 | 0 |
| 10 | 2022-01-01 04:00:00 | 2 | 0 |
| 11 | 2022-01-01 05:00:00 | 2 | 1 |
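The fill is conceptually just a reindex of each entity onto the complete range of slots. For intuition, here is a minimal pandas-only sketch of the same idea applied to the original six-row frame (an illustration only, not how `add_missing_slots` is implemented):

```python
import pandas as pd

# Illustration only: reindex each location onto the full hourly range,
# filling the gaps with zero rides.
raw = pd.DataFrame({
    'pickup_hour': pd.to_datetime([
        '2022-01-01 00:00:00', '2022-01-01 01:00:00', '2022-01-01 03:00:00',
        '2022-01-01 01:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00',
    ]),
    'pickup_location_id': [1, 1, 1, 2, 2, 2],
    'rides': [2, 3, 1, 1, 2, 1],
})
full_range = pd.date_range('2022-01-01 00:00:00', '2022-01-01 05:00:00', freq='H')
filled = (
    raw.set_index('pickup_hour')
       .groupby('pickup_location_id')['rides']
       .apply(lambda s: s.reindex(full_range, fill_value=0))
       .rename_axis(['pickup_location_id', 'pickup_hour'])
       .reset_index()
)
```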
Now, let's build features and targets to predict the number of rides in the next hour for each pickup_location_id, using the number of rides in the previous 3 hours:
```python
features, targets = transform_ts_data_into_features_and_target(
    df,
    n_features=3,
    datetime_col='pickup_hour',
    entity_col='pickup_location_id',
    value_col='rides',
    n_targets=1,
    step_size=1,
    step_name='hour'
)
```

```
100%|██████████| 2/2 [00:00<00:00, 597.86it/s]
```

```python
features
```

|   | rides_previous_3_hour | rides_previous_2_hour | rides_previous_1_hour | pickup_hour | pickup_location_id |
|---|---|---|---|---|---|
| 0 | 2.0 | 3.0 | 0.0 | 2022-01-01 03:00:00 | 1 |
| 1 | 3.0 | 0.0 | 1.0 | 2022-01-01 04:00:00 | 1 |
| 2 | 0.0 | 1.0 | 2.0 | 2022-01-01 03:00:00 | 2 |
| 3 | 1.0 | 2.0 | 0.0 | 2022-01-01 04:00:00 | 2 |
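To see how the windows are built: for location 1, whose filled series over hours 00:00–05:00 is [2, 3, 0, 1, 0, 0], the first window takes hours 00:00–02:00 as the three lag features and the rides at 03:00 as the target; sliding forward by step_size=1 gives the next row (lags from 01:00–03:00, target at 04:00). The pickup_hour column records the hour being predicted.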
```python
targets
```

|   | target_rides_next_hour |
|---|---|
| 0 | 1.0 |
| 1 | 0.0 |
| 2 | 0.0 |
| 3 | 0.0 |
```python
Xy_df = pd.concat([features, targets], axis=1)
Xy_df
```

|   | rides_previous_3_hour | rides_previous_2_hour | rides_previous_1_hour | pickup_hour | pickup_location_id | target_rides_next_hour |
|---|---|---|---|---|---|---|
| 0 | 2.0 | 3.0 | 0.0 | 2022-01-01 03:00:00 | 1 | 1.0 |
| 1 | 3.0 | 0.0 | 1.0 | 2022-01-01 04:00:00 | 1 | 0.0 |
| 2 | 0.0 | 1.0 | 2.0 | 2022-01-01 03:00:00 | 2 | 0.0 |
| 3 | 1.0 | 2.0 | 0.0 | 2022-01-01 04:00:00 | 2 | 0.0 |
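Each row of Xy_df is now a self-contained training example: the three lag columns are the model inputs and target_rides_next_hour is the label, while pickup_hour and pickup_location_id identify which slot the example came from.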
## Another Example

A monthly-spaced time series:
```python
import pandas as pd
import numpy as np

# Generate timestamp index with monthly frequency
date_rng = pd.date_range(start='1/1/2020', end='12/1/2022', freq='MS')

# Create list of city codes
cities = ['FOR', 'SP', 'RJ']

# Create dataframe with random sales data for each city on each month
df = pd.DataFrame({
    'date': date_rng,
    'city': np.repeat(cities, len(date_rng)//len(cities)),
    'sales': np.random.randint(1000, 5000, size=len(date_rng))
})
df
```

|   | date | city | sales |
|---|---|---|---|
| 0 | 2020-01-01 | FOR | 4944 |
| 1 | 2020-02-01 | FOR | 3435 |
| 2 | 2020-03-01 | FOR | 4543 |
| 3 | 2020-04-01 | FOR | 3879 |
| 4 | 2020-05-01 | FOR | 2601 |
| 5 | 2020-06-01 | FOR | 2922 |
| 6 | 2020-07-01 | FOR | 4542 |
| 7 | 2020-08-01 | FOR | 1338 |
| 8 | 2020-09-01 | FOR | 2938 |
| 9 | 2020-10-01 | FOR | 2695 |
| 10 | 2020-11-01 | FOR | 4065 |
| 11 | 2020-12-01 | FOR | 3864 |
| 12 | 2021-01-01 | SP | 2652 |
| 13 | 2021-02-01 | SP | 2137 |
| 14 | 2021-03-01 | SP | 2663 |
| 15 | 2021-04-01 | SP | 1168 |
| 16 | 2021-05-01 | SP | 4523 |
| 17 | 2021-06-01 | SP | 4135 |
| 18 | 2021-07-01 | SP | 3566 |
| 19 | 2021-08-01 | SP | 2121 |
| 20 | 2021-09-01 | SP | 1070 |
| 21 | 2021-10-01 | SP | 1624 |
| 22 | 2021-11-01 | SP | 3034 |
| 23 | 2021-12-01 | SP | 4063 |
| 24 | 2022-01-01 | RJ | 2297 |
| 25 | 2022-02-01 | RJ | 3430 |
| 26 | 2022-03-01 | RJ | 2903 |
| 27 | 2022-04-01 | RJ | 4197 |
| 28 | 2022-05-01 | RJ | 4141 |
| 29 | 2022-06-01 | RJ | 2899 |
| 30 | 2022-07-01 | RJ | 4529 |
| 31 | 2022-08-01 | RJ | 3612 |
| 32 | 2022-09-01 | RJ | 1856 |
| 33 | 2022-10-01 | RJ | 4804 |
| 34 | 2022-11-01 | RJ | 1764 |
| 35 | 2022-12-01 | RJ | 4425 |
FOR only has data for 2020, SP only for 2021, and RJ only for 2022. Let's also simulate additional missing slots by randomly dropping some rows.
```python
# Generate random indices to drop
drop_indices = np.random.choice(df.index, size=int(len(df)*0.2), replace=False)

# Drop selected rows from dataframe
df = df.drop(drop_indices)
df.reset_index(drop=True, inplace=True)
df
```

|   | date | city | sales |
|---|---|---|---|
| 0 | 2020-01-01 | FOR | 4944 |
| 1 | 2020-02-01 | FOR | 3435 |
| 2 | 2020-03-01 | FOR | 4543 |
| 3 | 2020-04-01 | FOR | 3879 |
| 4 | 2020-05-01 | FOR | 2601 |
| 5 | 2020-06-01 | FOR | 2922 |
| 6 | 2020-07-01 | FOR | 4542 |
| 7 | 2020-08-01 | FOR | 1338 |
| 8 | 2020-09-01 | FOR | 2938 |
| 9 | 2020-11-01 | FOR | 4065 |
| 10 | 2020-12-01 | FOR | 3864 |
| 11 | 2021-01-01 | SP | 2652 |
| 12 | 2021-02-01 | SP | 2137 |
| 13 | 2021-03-01 | SP | 2663 |
| 14 | 2021-07-01 | SP | 3566 |
| 15 | 2021-08-01 | SP | 2121 |
| 16 | 2021-10-01 | SP | 1624 |
| 17 | 2021-11-01 | SP | 3034 |
| 18 | 2021-12-01 | SP | 4063 |
| 19 | 2022-01-01 | RJ | 2297 |
| 20 | 2022-02-01 | RJ | 3430 |
| 21 | 2022-03-01 | RJ | 2903 |
| 22 | 2022-04-01 | RJ | 4197 |
| 23 | 2022-05-01 | RJ | 4141 |
| 24 | 2022-06-01 | RJ | 2899 |
| 25 | 2022-09-01 | RJ | 1856 |
| 26 | 2022-10-01 | RJ | 4804 |
| 27 | 2022-11-01 | RJ | 1764 |
| 28 | 2022-12-01 | RJ | 4425 |
Now let's fill the missing slots: `add_missing_slots` completes them with zeros.
```python
df_full = add_missing_slots(df, datetime_col='date', entity_col='city', value_col='sales', freq='MS')
df_full
```

```
100%|██████████| 3/3 [00:00<00:00, 843.70it/s]
```

|   | date | city | sales |
|---|---|---|---|
| 0 | 2020-01-01 | FOR | 4944 |
| 1 | 2020-02-01 | FOR | 3435 |
| 2 | 2020-03-01 | FOR | 4543 |
| 3 | 2020-04-01 | FOR | 3879 |
| 4 | 2020-05-01 | FOR | 2601 |
| ... | ... | ... | ... |
| 103 | 2022-08-01 | RJ | 0 |
| 104 | 2022-09-01 | RJ | 1856 |
| 105 | 2022-10-01 | RJ | 4804 |
| 106 | 2022-11-01 | RJ | 1764 |
| 107 | 2022-12-01 | RJ | 4425 |
108 rows × 3 columns
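Each city now has a complete monthly index from 2020-01 to 2022-12, i.e. 3 cities × 36 months = 108 rows.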
Let's build a dataset for training a machine learning model to predict next month's sales for each city, based on the sales of the previous 3 months.
```python
features, targets = transform_ts_data_into_features_and_target(
    df_full,
    n_features=3,
    datetime_col='date',
    entity_col='city',
    value_col='sales',
    n_targets=1,
    step_size=1,
    step_name='month'
)
```

```
100%|██████████| 3/3 [00:00<00:00, 205.58it/s]
```

```python
pd.concat([features, targets], axis=1)
```

|   | sales_previous_3_month | sales_previous_2_month | sales_previous_1_month | date | city | target_sales_next_month |
|---|---|---|---|---|---|---|
| 0 | 4944.0 | 3435.0 | 4543.0 | 2020-04-01 | FOR | 3879.0 |
| 1 | 3435.0 | 4543.0 | 3879.0 | 2020-05-01 | FOR | 2601.0 |
| 2 | 4543.0 | 3879.0 | 2601.0 | 2020-06-01 | FOR | 2922.0 |
| 3 | 3879.0 | 2601.0 | 2922.0 | 2020-07-01 | FOR | 4542.0 |
| 4 | 2601.0 | 2922.0 | 4542.0 | 2020-08-01 | FOR | 1338.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 91 | 4197.0 | 4141.0 | 2899.0 | 2022-07-01 | RJ | 0.0 |
| 92 | 4141.0 | 2899.0 | 0.0 | 2022-08-01 | RJ | 0.0 |
| 93 | 2899.0 | 0.0 | 0.0 | 2022-09-01 | RJ | 1856.0 |
| 94 | 0.0 | 0.0 | 1856.0 | 2022-10-01 | RJ | 4804.0 |
| 95 | 0.0 | 1856.0 | 4804.0 | 2022-11-01 | RJ | 1764.0 |
96 rows × 6 columns
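The 96 rows correspond to 32 feature/target windows per city, carved out of its 36 monthly slots.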
## Embedding in Sklearn Pipelines
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

add_missing_slots_transformer = FunctionTransformer(
    add_missing_slots,
    kw_args={
        'datetime_col': 'date',
        'entity_col': 'city',
        'value_col': 'sales',
        'freq': 'MS'
    }
)

transform_ts_data_into_features_and_target_transformer = FunctionTransformer(
    transform_ts_data_into_features_and_target,
    kw_args={
        'n_features': 3,
        'datetime_col': 'date',
        'entity_col': 'city',
        'value_col': 'sales',
        'n_targets': 1,
        'step_size': 1,
        'step_name': 'month',
        'concat_Xy': True
    }
)

ts_data_to_features_and_target_pipeline = make_pipeline(
    add_missing_slots_transformer,
    transform_ts_data_into_features_and_target_transformer
)
```
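Note the extra concat_Xy=True keyword on the second step: as the name suggests, it makes `transform_ts_data_into_features_and_target` return features and target already concatenated into a single dataframe, so the pipeline's fit_transform below yields the combined Xy_df directly instead of a (features, targets) tuple.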
```python
ts_data_to_features_and_target_pipeline
```

```
Pipeline(steps=[('functiontransformer-1',
                 FunctionTransformer(func=<function add_missing_slots at 0x11f8f49d0>,
                                     kw_args={'datetime_col': 'date',
                                              'entity_col': 'city',
                                              'freq': 'MS',
                                              'value_col': 'sales'})),
                ('functiontransformer-2',
                 FunctionTransformer(func=<function transform_ts_data_into_features_and_target at 0x11f925ca0>,
                                     kw_args={'concat_Xy': True,
                                              'datetime_col': 'date',
                                              'entity_col': 'city',
                                              'n_features': 3, 'n_targets': 1,
                                              'step_name': 'month',
                                              'step_size': 1,
                                              'value_col': 'sales'}))])
```

```python
Xy_df = ts_data_to_features_and_target_pipeline.fit_transform(df)
Xy_df
```

```
100%|██████████| 3/3 [00:00<00:00, 715.47it/s]
100%|██████████| 3/3 [00:00<00:00, 184.12it/s]
```
|   | sales_previous_3_month | sales_previous_2_month | sales_previous_1_month | date | city | target_sales_next_month |
|---|---|---|---|---|---|---|
| 0 | 4944.0 | 3435.0 | 4543.0 | 2020-04-01 | FOR | 3879.0 |
| 1 | 3435.0 | 4543.0 | 3879.0 | 2020-05-01 | FOR | 2601.0 |
| 2 | 4543.0 | 3879.0 | 2601.0 | 2020-06-01 | FOR | 2922.0 |
| 3 | 3879.0 | 2601.0 | 2922.0 | 2020-07-01 | FOR | 4542.0 |
| 4 | 2601.0 | 2922.0 | 4542.0 | 2020-08-01 | FOR | 1338.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 91 | 4197.0 | 4141.0 | 2899.0 | 2022-07-01 | RJ | 0.0 |
| 92 | 4141.0 | 2899.0 | 0.0 | 2022-08-01 | RJ | 0.0 |
| 93 | 2899.0 | 0.0 | 0.0 | 2022-09-01 | RJ | 1856.0 |
| 94 | 0.0 | 0.0 | 1856.0 | 2022-10-01 | RJ | 4804.0 |
| 95 | 0.0 | 1856.0 | 4804.0 | 2022-11-01 | RJ | 1764.0 |
96 rows × 6 columns
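From here, fitting a model is straightforward. A minimal sketch of one possible downstream step, assuming scikit-learn's LinearRegression and a simple date-based split (neither is part of ts2ml):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Hypothetical downstream usage: date-based split, then a simple baseline model.
feature_cols = ['sales_previous_3_month', 'sales_previous_2_month', 'sales_previous_1_month']
train = Xy_df[Xy_df['date'] < '2022-06-01']
test = Xy_df[Xy_df['date'] >= '2022-06-01']

model = LinearRegression()
model.fit(train[feature_cols], train['target_sales_next_month'])

preds = model.predict(test[feature_cols])
print(mean_absolute_error(test['target_sales_next_month'], preds))
```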