core

Core functions

source

add_missing_slots

 add_missing_slots (df:pandas.core.frame.DataFrame, datetime_col:str,
                    entity_col:str, value_col:str, freq:str='H',
                    fill_value:int=0)

Add missing slots to a time series dataframe. For example, if each time series is associated with a location, the function adds the missing slots for every location and fills them with the value specified in the ‘fill_value’ parameter. By default, the frequency of the time series is hourly.

Parameter Type Default Details
df DataFrame input dataframe with datetime, entity and value columns - time series format
datetime_col str name of the datetime column
entity_col str name of the entity column, e.g. ‘location_id’ if each time series is associated with a location
value_col str name of the value column
freq str H frequency of the time series; default is hourly
fill_value int 0 value used to fill missing slots
Returns DataFrame
import pandas as pd

df = pd.DataFrame({
    'pickup_hour': ['2022-01-01 00:00:00', '2022-01-01 01:00:00', '2022-01-01 03:00:00', '2022-01-01 01:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00'],
    'pickup_location_id': [1, 1, 1, 2, 2, 2],
    'rides': [2, 3, 1, 1, 2, 1]
})
df
pickup_hour pickup_location_id rides
0 2022-01-01 00:00:00 1 2
1 2022-01-01 01:00:00 1 3
2 2022-01-01 03:00:00 1 1
3 2022-01-01 01:00:00 2 1
4 2022-01-01 02:00:00 2 2
5 2022-01-01 05:00:00 2 1
add_missing_slots(df, datetime_col='pickup_hour', entity_col='pickup_location_id', value_col='rides', freq='H')
100%|██████████| 2/2 [00:00<00:00, 448.83it/s]
pickup_hour pickup_location_id rides
0 2022-01-01 00:00:00 1 2
1 2022-01-01 01:00:00 1 3
2 2022-01-01 02:00:00 1 0
3 2022-01-01 03:00:00 1 1
4 2022-01-01 04:00:00 1 0
5 2022-01-01 05:00:00 1 0
6 2022-01-01 00:00:00 2 0
7 2022-01-01 01:00:00 2 1
8 2022-01-01 02:00:00 2 2
9 2022-01-01 03:00:00 2 0
10 2022-01-01 04:00:00 2 0
11 2022-01-01 05:00:00 2 1
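The gap-filling itself isn't shown above; the following is a minimal sketch that reproduces the filled frame, assuming each entity is reindexed over the global min-to-max datetime range (the _sketch name is an illustration, not the module's actual implementation).

def add_missing_slots_sketch(df, datetime_col, entity_col, value_col, freq='H', fill_value=0):
    df = df.copy()
    df[datetime_col] = pd.to_datetime(df[datetime_col])
    # one complete slot index spanning the whole frame, at the requested frequency
    full_range = pd.date_range(df[datetime_col].min(), df[datetime_col].max(), freq=freq)
    filled = []
    for entity_id, group in df.groupby(entity_col):
        # reindex this entity's series over the full range, filling gaps with fill_value
        group = (group.set_index(datetime_col)[[value_col]]
                      .reindex(full_range, fill_value=fill_value))
        group[entity_col] = entity_id
        filled.append(group.reset_index().rename(columns={'index': datetime_col}))
    return pd.concat(filled, ignore_index=True)[[datetime_col, entity_col, value_col]]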

source

get_cutoff_indices_features_and_target

 get_cutoff_indices_features_and_target
                                         (ts_data:pandas.core.frame.DataFr
                                         ame, datetime_col:str,
                                         n_features:int, n_targets:int=1,
                                         step_size:int=1)

Function to get the indices for the cutoffs of a Time Series DataFrame. The Time Series DataFrame must be ordered by time.

Parameter Type Default Details
ts_data DataFrame Time Series DataFrame
datetime_col str Name of the datetime column
n_features int Number of features to use for the prediction
n_targets int 1 Number of target values to predict
step_size int 1 Step size used to slide the Time Series DataFrame
Returns typing.List[tuple]
# build a time series dataframe with 10 hours of data in random order
ts_data = pd.DataFrame({
    'pickup_hour': ['2022-01-01 01:00:00', '2022-01-01 00:00:00', '2022-01-01 03:00:00', '2022-01-01 04:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00', '2022-01-01 09:00:00', '2022-01-01 06:00:00', '2022-01-01 07:00:00', '2022-01-01 08:00:00'],
    'rides': [2, 3, 1, 1, 2, 1, 1, 2, 1, 1]
})
ts_data
pickup_hour rides
0 2022-01-01 01:00:00 2
1 2022-01-01 00:00:00 3
2 2022-01-01 03:00:00 1
3 2022-01-01 04:00:00 1
4 2022-01-01 02:00:00 2
5 2022-01-01 05:00:00 1
6 2022-01-01 09:00:00 1
7 2022-01-01 06:00:00 2
8 2022-01-01 07:00:00 1
9 2022-01-01 08:00:00 1
# the time series must be ordered by time; otherwise the function raises a ValueError
ts_data.sort_values(by='pickup_hour', inplace=True, ignore_index=True)
cutoff_idxs = get_cutoff_indices_features_and_target(ts_data, datetime_col='pickup_hour', n_features=3, n_targets=2, step_size=1)
cutoff_idxs
[(0, 3, 5), (1, 4, 6), (2, 5, 7), (3, 6, 8), (4, 7, 9)]
assert cutoff_idxs == [(0, 3, 5), (1, 4, 6), (2, 5, 7), (3, 6, 8), (4, 7, 9)]
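Each tuple is a set of positional cutoffs: for (first, mid, last), rows [first, mid) form the feature window and rows [mid, last) the target window. Below is a minimal sketch of the sliding-window logic that reproduces the indices above (the _sketch name is hypothetical; note the window stops at len(ts_data) - 1, which matches the output shown).

def get_cutoff_indices_sketch(ts_data, n_features, n_targets=1, step_size=1):
    # slide a window of n_features + n_targets rows over the frame, step_size rows at a time
    stop_position = len(ts_data) - 1
    first_idx, mid_idx, last_idx = 0, n_features, n_features + n_targets
    indices = []
    while last_idx <= stop_position:
        indices.append((first_idx, mid_idx, last_idx))
        first_idx += step_size
        mid_idx += step_size
        last_idx += step_size
    return indices

assert get_cutoff_indices_sketch(ts_data, n_features=3, n_targets=2, step_size=1) == cutoff_idxs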

source

transform_ts_data_into_features_and_target

 transform_ts_data_into_features_and_target
                                             (ts_data:pandas.core.frame.Da
                                             taFrame, n_features:int,
                                             datetime_col:str,
                                             entity_col:str,
                                             value_col:str,
                                             n_targets:int=1,
                                             step_size:int=1,
                                             step_name:str=None,
                                             concat_Xy:bool=False)

Slices and transposes data from time-series format into a (features, target) format that we can use to train Supervised ML models.

Parameter Type Default Details
ts_data DataFrame Time Series DataFrame
n_features int Number of features to use for the prediction
datetime_col str Name of the datetime column
entity_col str Name of the entity column, e.g. location_id
value_col str Name of the value column
n_targets int 1 Number of target values to predict
step_size int 1 Step size used to slide the Time Series DataFrame
step_name str None Name of the step column
concat_Xy bool False Whether to concatenate X and y into a single dataframe
Returns DataFrame
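The transformation presumably combines the helpers above: for every entity, the values are sorted by time, cut into windows with get_cutoff_indices_features_and_target, and each window is transposed into one row of lagged columns. A rough sketch under that assumption (the _sketch name and details are hypothetical, not the module's actual code):

import numpy as np

def transform_ts_sketch(ts_data, n_features, datetime_col, entity_col, value_col,
                        n_targets=1, step_size=1):
    features_frames, target_frames = [], []
    for entity_id, group in ts_data.groupby(entity_col):
        group = group.sort_values(datetime_col, ignore_index=True)
        cutoffs = get_cutoff_indices_features_and_target(
            group, datetime_col=datetime_col, n_features=n_features,
            n_targets=n_targets, step_size=step_size)
        values = group[value_col].values
        # transpose each window into one row of lagged feature / future target columns
        x = np.array([values[first:mid] for first, mid, _ in cutoffs])
        y = np.array([values[mid:last] for _, mid, last in cutoffs])
        f = pd.DataFrame(x, columns=[f'{value_col}_previous_{n_features - i}' for i in range(n_features)])
        f[datetime_col] = [group[datetime_col].iloc[mid] for _, mid, _ in cutoffs]
        f[entity_col] = entity_id
        features_frames.append(f)
        target_frames.append(pd.DataFrame(y, columns=[f'target_{value_col}_next_{i + 1}' for i in range(n_targets)]))
    return pd.concat(features_frames, ignore_index=True), pd.concat(target_frames, ignore_index=True)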
# build a time series dataframe with 10 hours of data in random order and a location id column with 1 and 2
ts_data = pd.DataFrame({
    'pickup_hour': ['2022-01-01 01:00:00', '2022-01-01 00:00:00', '2022-01-01 03:00:00', '2022-01-01 04:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00', '2022-01-01 09:00:00', '2022-01-01 06:00:00', '2022-01-01 07:00:00', '2022-01-01 08:00:00'],
    'location_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    'rides': [2, 3, 1, 1, 2, 1, 1, 2, 1, 1]
})
ts_data
pickup_hour location_id rides
0 2022-01-01 01:00:00 1 2
1 2022-01-01 00:00:00 1 3
2 2022-01-01 03:00:00 1 1
3 2022-01-01 04:00:00 1 1
4 2022-01-01 02:00:00 1 2
5 2022-01-01 05:00:00 1 1
6 2022-01-01 09:00:00 2 1
7 2022-01-01 06:00:00 2 2
8 2022-01-01 07:00:00 2 1
9 2022-01-01 08:00:00 2 1
ts_data = add_missing_slots(ts_data, datetime_col='pickup_hour', entity_col='location_id', value_col='rides', freq='1H')
ts_data
100%|██████████| 2/2 [00:00<00:00, 708.92it/s]
pickup_hour location_id rides
0 2022-01-01 00:00:00 1 3
1 2022-01-01 01:00:00 1 2
2 2022-01-01 02:00:00 1 2
3 2022-01-01 03:00:00 1 1
4 2022-01-01 04:00:00 1 1
5 2022-01-01 05:00:00 1 1
6 2022-01-01 06:00:00 1 0
7 2022-01-01 07:00:00 1 0
8 2022-01-01 08:00:00 1 0
9 2022-01-01 09:00:00 1 0
10 2022-01-01 00:00:00 2 0
11 2022-01-01 01:00:00 2 0
12 2022-01-01 02:00:00 2 0
13 2022-01-01 03:00:00 2 0
14 2022-01-01 04:00:00 2 0
15 2022-01-01 05:00:00 2 0
16 2022-01-01 06:00:00 2 2
17 2022-01-01 07:00:00 2 1
18 2022-01-01 08:00:00 2 1
19 2022-01-01 09:00:00 2 1
features, targets = transform_ts_data_into_features_and_target(
    ts_data=ts_data,
    n_features=3,
    datetime_col='pickup_hour',
    entity_col='location_id',
    value_col='rides',
    n_targets=2,
    step_size=1
)
100%|██████████| 2/2 [00:00<00:00, 371.60it/s]
features
rides_previous_3 rides_previous_2 rides_previous_1 pickup_hour location_id
0 3.0 2.0 2.0 2022-01-01 03:00:00 1
1 2.0 2.0 1.0 2022-01-01 04:00:00 1
2 2.0 1.0 1.0 2022-01-01 05:00:00 1
3 1.0 1.0 1.0 2022-01-01 06:00:00 1
4 1.0 1.0 0.0 2022-01-01 07:00:00 1
5 0.0 0.0 0.0 2022-01-01 03:00:00 2
6 0.0 0.0 0.0 2022-01-01 04:00:00 2
7 0.0 0.0 0.0 2022-01-01 05:00:00 2
8 0.0 0.0 0.0 2022-01-01 06:00:00 2
9 0.0 0.0 2.0 2022-01-01 07:00:00 2
targets
target_rides_next_1 target_rides_next_2
0 1.0 1.0
1 1.0 1.0
2 1.0 0.0
3 0.0 0.0
4 0.0 0.0
5 0.0 0.0
6 0.0 0.0
7 0.0 2.0
8 2.0 1.0
9 1.0 1.0
pd.concat([features, targets], axis=1)
rides_previous_3 rides_previous_2 rides_previous_1 pickup_hour location_id target_rides_next_1 target_rides_next_2
0 3.0 2.0 2.0 2022-01-01 03:00:00 1 1.0 1.0
1 2.0 2.0 1.0 2022-01-01 04:00:00 1 1.0 1.0
2 2.0 1.0 1.0 2022-01-01 05:00:00 1 1.0 0.0
3 1.0 1.0 1.0 2022-01-01 06:00:00 1 0.0 0.0
4 1.0 1.0 0.0 2022-01-01 07:00:00 1 0.0 0.0
5 0.0 0.0 0.0 2022-01-01 03:00:00 2 0.0 0.0
6 0.0 0.0 0.0 2022-01-01 04:00:00 2 0.0 0.0
7 0.0 0.0 0.0 2022-01-01 05:00:00 2 0.0 2.0
8 0.0 0.0 0.0 2022-01-01 06:00:00 2 2.0 1.0
9 0.0 0.0 2.0 2022-01-01 07:00:00 2 1.0 1.0

This dataset could be used to predict the rides for the next 2 hours, for each location_id, using the historical rides from the previous 3 hours.
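For instance, the (features, targets) pair could feed a multi-output regressor directly (this snippet assumes scikit-learn, which is not required by this module, and naively keeps location_id as a numeric feature):

from sklearn.linear_model import LinearRegression

X = features.drop(columns=['pickup_hour'])  # keep only the numeric columns
y = targets                                 # two target columns -> multi-output regression
model = LinearRegression().fit(X, y)
model.predict(X[:1])                        # predicted rides for the next 2 hours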