core

Core functions

source

add_missing_slots

 add_missing_slots (df:pandas.core.frame.DataFrame, datetime_col:str,
                    entity_col:str, value_col:str, freq:str='H',
                    fill_value:int=0)

Add missing slots to a time series dataframe. For example, if each time series is associated with a location, the function adds the missing slots for every location and fills them with the value specified in the ‘fill_value’ parameter. By default, the frequency of the time series is hourly.

Parameter Type Default Details
df DataFrame input dataframe with datetime, entity and value columns - time series format
datetime_col str name of the datetime column
entity_col str name of the entity column, e.g. ‘location_id’ if each time series is associated with a location
value_col str name of the value column
freq str H frequency of the time series; default is hourly
fill_value int 0 value used to fill missing slots
Returns DataFrame
import pandas as pd

df = pd.DataFrame({
    'pickup_hour': ['2022-01-01 00:00:00', '2022-01-01 01:00:00', '2022-01-01 03:00:00', '2022-01-01 01:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00'],
    'pickup_location_id': [1, 1, 1, 2, 2, 2],
    'rides': [2, 3, 1, 1, 2, 1]
})
df
pickup_hour pickup_location_id rides
0 2022-01-01 00:00:00 1 2
1 2022-01-01 01:00:00 1 3
2 2022-01-01 03:00:00 1 1
3 2022-01-01 01:00:00 2 1
4 2022-01-01 02:00:00 2 2
5 2022-01-01 05:00:00 2 1
add_missing_slots(df, datetime_col='pickup_hour', entity_col='pickup_location_id', value_col='rides', freq='H')
100%|██████████| 2/2 [00:00<00:00, 448.83it/s]
pickup_hour pickup_location_id rides
0 2022-01-01 00:00:00 1 2
1 2022-01-01 01:00:00 1 3
2 2022-01-01 02:00:00 1 0
3 2022-01-01 03:00:00 1 1
4 2022-01-01 04:00:00 1 0
5 2022-01-01 05:00:00 1 0
6 2022-01-01 00:00:00 2 0
7 2022-01-01 01:00:00 2 1
8 2022-01-01 02:00:00 2 2
9 2022-01-01 03:00:00 2 0
10 2022-01-01 04:00:00 2 0
11 2022-01-01 05:00:00 2 1
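The gap-filling itself isn't shown above; the following is a minimal sketch that reproduces the filled frame, assuming each entity is reindexed over the global min-to-max datetime range (the _sketch name is an illustration, not the module's actual implementation).

def add_missing_slots_sketch(df, datetime_col, entity_col, value_col, freq='H', fill_value=0):
    df = df.copy()
    df[datetime_col] = pd.to_datetime(df[datetime_col])
    # one complete slot index spanning the whole frame, at the requested frequency
    full_range = pd.date_range(df[datetime_col].min(), df[datetime_col].max(), freq=freq)
    filled = []
    for entity_id, group in df.groupby(entity_col):
        # reindex this entity's series over the full range, filling gaps with fill_value
        group = (group.set_index(datetime_col)[[value_col]]
                      .reindex(full_range, fill_value=fill_value))
        group[entity_col] = entity_id
        filled.append(group.reset_index().rename(columns={'index': datetime_col}))
    return pd.concat(filled, ignore_index=True)[[datetime_col, entity_col, value_col]]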

source

get_cutoff_indices_features_and_target

 get_cutoff_indices_features_and_target
                                         (ts_data:pandas.core.frame.DataFr
                                         ame, datetime_col:str,
                                         n_features:int, n_targets:int=1,
                                         step_size:int=1)

Function to get the indices for the cutoffs of a Time Series DataFrame. The Time Series DataFrame must be ordered by time.

Parameter Type Default Details
ts_data DataFrame Time Series DataFrame
datetime_col str Name of the datetime column
n_features int Number of features to use for the prediction
n_targets int 1 Number of target values to predict
step_size int 1 Step size used to slide the Time Series DataFrame
Returns typing.List[tuple]
# build a time series dataframe with 10 hours of data in random order
ts_data = pd.DataFrame({
    'pickup_hour': ['2022-01-01 01:00:00', '2022-01-01 00:00:00', '2022-01-01 03:00:00', '2022-01-01 04:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00', '2022-01-01 09:00:00', '2022-01-01 06:00:00', '2022-01-01 07:00:00', '2022-01-01 08:00:00'],
    'rides': [2, 3, 1, 1, 2, 1, 1, 2, 1, 1]
})
ts_data
pickup_hour rides
0 2022-01-01 01:00:00 2
1 2022-01-01 00:00:00 3
2 2022-01-01 03:00:00 1
3 2022-01-01 04:00:00 1
4 2022-01-01 02:00:00 2
5 2022-01-01 05:00:00 1
6 2022-01-01 09:00:00 1
7 2022-01-01 06:00:00 2
8 2022-01-01 07:00:00 1
9 2022-01-01 08:00:00 1
# the time series must be ordered by time; otherwise the function raises a ValueError
ts_data.sort_values(by='pickup_hour', inplace=True, ignore_index=True)
cutoff_idxs = get_cutoff_indices_features_and_target(ts_data, datetime_col='pickup_hour', n_features=3, n_targets=2, step_size=1)
cutoff_idxs
[(0, 3, 5), (1, 4, 6), (2, 5, 7), (3, 6, 8), (4, 7, 9)]
assert cutoff_idxs == [(0, 3, 5), (1, 4, 6), (2, 5, 7), (3, 6, 8), (4, 7, 9)]
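Each tuple is a set of positional cutoffs: for (first, mid, last), rows [first, mid) form the feature window and rows [mid, last) the target window. Below is a minimal sketch of the sliding-window logic that reproduces the indices above (the _sketch name is hypothetical; note the window stops at len(ts_data) - 1, which matches the output shown).

def get_cutoff_indices_sketch(ts_data, n_features, n_targets=1, step_size=1):
    # slide a window of n_features + n_targets rows over the frame, step_size rows at a time
    stop_position = len(ts_data) - 1
    first_idx, mid_idx, last_idx = 0, n_features, n_features + n_targets
    indices = []
    while last_idx <= stop_position:
        indices.append((first_idx, mid_idx, last_idx))
        first_idx += step_size
        mid_idx += step_size
        last_idx += step_size
    return indices

assert get_cutoff_indices_sketch(ts_data, n_features=3, n_targets=2, step_size=1) == cutoff_idxs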

source

transform_ts_data_into_features_and_target

 transform_ts_data_into_features_and_target
                                             (ts_data:pandas.core.frame.Da
                                             taFrame, n_features:int,
                                             datetime_col:str,
                                             entity_col:str,
                                             value_col:str,
                                             n_targets:int=1,
                                             step_size:int=1,
                                             step_name:str=None,
                                             concat_Xy:bool=False)

Slices and transposes data from time-series format into a (features, target) format that we can use to train Supervised ML models.

Parameter Type Default Details
ts_data DataFrame Time Series DataFrame
n_features int Number of features to use for the prediction
datetime_col str Name of the datetime column
entity_col str Name of the entity column, e.g. location_id
value_col str Name of the value column
n_targets int 1 Number of target values to predict
step_size int 1 Step size used to slide the Time Series DataFrame
step_name str None Name of the step column
concat_Xy bool False Whether to concatenate X and y into a single dataframe
Returns DataFrame
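The transformation presumably combines the helpers above: for every entity, the values are sorted by time, cut into windows with get_cutoff_indices_features_and_target, and each window is transposed into one row of lagged columns. A rough sketch under that assumption (the _sketch name and details are hypothetical, not the module's actual code):

import numpy as np

def transform_ts_sketch(ts_data, n_features, datetime_col, entity_col, value_col,
                        n_targets=1, step_size=1):
    features_frames, target_frames = [], []
    for entity_id, group in ts_data.groupby(entity_col):
        group = group.sort_values(datetime_col, ignore_index=True)
        cutoffs = get_cutoff_indices_features_and_target(
            group, datetime_col=datetime_col, n_features=n_features,
            n_targets=n_targets, step_size=step_size)
        values = group[value_col].values
        # transpose each window into one row of lagged feature / future target columns
        x = np.array([values[first:mid] for first, mid, _ in cutoffs])
        y = np.array([values[mid:last] for _, mid, last in cutoffs])
        f = pd.DataFrame(x, columns=[f'{value_col}_previous_{n_features - i}' for i in range(n_features)])
        f[datetime_col] = [group[datetime_col].iloc[mid] for _, mid, _ in cutoffs]
        f[entity_col] = entity_id
        features_frames.append(f)
        target_frames.append(pd.DataFrame(y, columns=[f'target_{value_col}_next_{i + 1}' for i in range(n_targets)]))
    return pd.concat(features_frames, ignore_index=True), pd.concat(target_frames, ignore_index=True)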
# build a time series dataframe with 10 hours of data in random order and a location id column with 1 and 2
ts_data = pd.DataFrame({
    'pickup_hour': ['2022-01-01 01:00:00', '2022-01-01 00:00:00', '2022-01-01 03:00:00', '2022-01-01 04:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00', '2022-01-01 09:00:00', '2022-01-01 06:00:00', '2022-01-01 07:00:00', '2022-01-01 08:00:00'],
    'location_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    'rides': [2, 3, 1, 1, 2, 1, 1, 2, 1, 1]
})
ts_data
pickup_hour location_id rides
0 2022-01-01 01:00:00 1 2
1 2022-01-01 00:00:00 1 3
2 2022-01-01 03:00:00 1 1
3 2022-01-01 04:00:00 1 1
4 2022-01-01 02:00:00 1 2
5 2022-01-01 05:00:00 1 1
6 2022-01-01 09:00:00 2 1
7 2022-01-01 06:00:00 2 2
8 2022-01-01 07:00:00 2 1
9 2022-01-01 08:00:00 2 1
ts_data = add_missing_slots(ts_data, datetime_col='pickup_hour', entity_col='location_id', value_col='rides', freq='1H')
ts_data
100%|██████████| 2/2 [00:00<00:00, 708.92it/s]
pickup_hour location_id rides
0 2022-01-01 00:00:00 1 3
1 2022-01-01 01:00:00 1 2
2 2022-01-01 02:00:00 1 2
3 2022-01-01 03:00:00 1 1
4 2022-01-01 04:00:00 1 1
5 2022-01-01 05:00:00 1 1
6 2022-01-01 06:00:00 1 0
7 2022-01-01 07:00:00 1 0
8 2022-01-01 08:00:00 1 0
9 2022-01-01 09:00:00 1 0
10 2022-01-01 00:00:00 2 0
11 2022-01-01 01:00:00 2 0
12 2022-01-01 02:00:00 2 0
13 2022-01-01 03:00:00 2 0
14 2022-01-01 04:00:00 2 0
15 2022-01-01 05:00:00 2 0
16 2022-01-01 06:00:00 2 2
17 2022-01-01 07:00:00 2 1
18 2022-01-01 08:00:00 2 1
19 2022-01-01 09:00:00 2 1
features, targets = transform_ts_data_into_features_and_target(
    ts_data=ts_data,
    n_features=3,
    datetime_col='pickup_hour',
    entity_col='location_id',
    value_col='rides',
    n_targets=2,
    step_size=1
)
100%|██████████| 2/2 [00:00<00:00, 371.60it/s]
features
rides_previous_3 rides_previous_2 rides_previous_1 pickup_hour location_id
0 3.0 2.0 2.0 2022-01-01 03:00:00 1
1 2.0 2.0 1.0 2022-01-01 04:00:00 1
2 2.0 1.0 1.0 2022-01-01 05:00:00 1
3 1.0 1.0 1.0 2022-01-01 06:00:00 1
4 1.0 1.0 0.0 2022-01-01 07:00:00 1
5 0.0 0.0 0.0 2022-01-01 03:00:00 2
6 0.0 0.0 0.0 2022-01-01 04:00:00 2
7 0.0 0.0 0.0 2022-01-01 05:00:00 2
8 0.0 0.0 0.0 2022-01-01 06:00:00 2
9 0.0 0.0 2.0 2022-01-01 07:00:00 2
targets
target_rides_next_1 target_rides_next_2
0 1.0 1.0
1 1.0 1.0
2 1.0 0.0
3 0.0 0.0
4 0.0 0.0
5 0.0 0.0
6 0.0 0.0
7 0.0 2.0
8 2.0 1.0
9 1.0 1.0
pd.concat([features, targets], axis=1)
rides_previous_3 rides_previous_2 rides_previous_1 pickup_hour location_id target_rides_next_1 target_rides_next_2
0 3.0 2.0 2.0 2022-01-01 03:00:00 1 1.0 1.0
1 2.0 2.0 1.0 2022-01-01 04:00:00 1 1.0 1.0
2 2.0 1.0 1.0 2022-01-01 05:00:00 1 1.0 0.0
3 1.0 1.0 1.0 2022-01-01 06:00:00 1 0.0 0.0
4 1.0 1.0 0.0 2022-01-01 07:00:00 1 0.0 0.0
5 0.0 0.0 0.0 2022-01-01 03:00:00 2 0.0 0.0
6 0.0 0.0 0.0 2022-01-01 04:00:00 2 0.0 0.0
7 0.0 0.0 0.0 2022-01-01 05:00:00 2 0.0 2.0
8 0.0 0.0 0.0 2022-01-01 06:00:00 2 2.0 1.0
9 0.0 0.0 2.0 2022-01-01 07:00:00 2 1.0 1.0

This dataset could be used to predict the rides for the next 2 hours, for each location_id, using the historical rides from the previous 3 hours.
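For instance, the (features, targets) pair could feed a multi-output regressor directly (this snippet assumes scikit-learn, which is not required by this module, and naively keeps location_id as a numeric feature):

from sklearn.linear_model import LinearRegression

X = features.drop(columns=['pickup_hour'])  # keep only the numeric columns
y = targets                                 # two target columns -> multi-output regression
model = LinearRegression().fit(X, y)
model.predict(X[:1])                        # predicted rides for the next 2 hours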