Alternatives to dataframe.rolling().apply(function), since axis=1 is not supported yet

jcgtanaka · March 28, 2023, 12:54am

Greetings for the day!

I want to ask the following:

I have daily price data for a stock (TSLA in the code below) in a cudf.DataFrame.

I want to run a random forest algorithm together with SHAP-based feature selection.

I want to apply these algorithms daily. I know that the rolling.apply method doesn't support axis=1 yet, so I tried creating a numba CUDA kernel to run the whole backtest over the entire dataset in parallel.

I also tried a list comprehension, but it runs on the CPU, which I don't want. I want to do everything on the GPU.

Could you help me, please?

This is the whole code:

The following code block imports the necessary libraries:

import cudf
import cuml
import cupy
import yfinance as yf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-darkgrid')
## Model Interpretation package
import shap
shap.initjs()

from numba import cuda

from cuml import train_test_split
from cuml.ensemble import RandomForestClassifier as cuRF
from sklearn.metrics import accuracy_score
from cupy import asnumpy
from joblib import dump, load

import warnings
warnings.filterwarnings("ignore")

The following block downloads the data and prepares the features and the prediction target:

df = cudf.DataFrame.from_pandas(yf.download('TSLA', start='2001-01-01', end='2022-12-31', auto_adjust=True))
inputs_names = [f'{column}_ret' for column in df.columns]
for column in df.columns:
    df[f'{column}_ret'] = df[column].pct_change()
df['Close_lead'] = df.Close.shift(-1)
df.dropna(inplace=True)
df['y'] = cupy.where(df['Close_lead']>df['Close'],1,0)
df

The following code block is the function that I want to apply daily to the df dataframe:

def backtesting_algo(df): 
    data = df.copy()
    
    X, y = df[inputs_names], df['y']
    
    X_train, X_test, y_train, y_test = train_test_split (X, y, test_size = 0.02, random_state = 0)

    # random forest depth and size
    n_estimators = 25
    max_depth = 10
    

    model = cuRF( max_depth = max_depth,
                  n_estimators = n_estimators,
                  random_state  = 0 )
    
    # fit on the training split only, so that X_test stays held out for the SHAP step
    trained_RF = model.fit(X_train, y_train)
    
    cu_explainer = cuml.explainer.KernelExplainer(model=trained_RF.predict,
                                   data=X_test,
                                   is_gpu_model=True)
    cu_shap_values = cu_explainer.shap_values(X_test)

    # save the fitted model to disk
    dump(trained_RF, 'RF.model')

    # reload the saved model (not used further in this function)
    loaded_model = load('RF.model')

    selected_features_iloc = \
        np.where(np.abs(cu_shap_values).mean(0)>np.abs(cu_shap_values).mean(0).mean())[0]
    selected_features = [inputs_names[i] for i in selected_features_iloc]

    X, y = data[selected_features], data['y']

    # random forest depth and size
    n_estimators = 25
    max_depth = 10

    model = cuRF( max_depth = max_depth,
                  n_estimators = n_estimators,
                  random_state  = 0 )

    trained_RF = model.fit(X, y)

    # save
    #dump( trained_RF, 'RF.model')

    # to reload the model uncomment the line below
    #loaded_model = load('RF.model')

    return model.predict(X)[-1]
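
To make the feature-selection step inside the function easier to follow, here is a minimal CPU-only sketch of the rule it uses: keep every feature whose mean absolute SHAP value is above the average of those means across all features. The helper name select_features and the toy numbers are made up for illustration; the real function works on the array returned by the cuML explainer, not on plain lists.

```python
from statistics import fmean


def select_features(shap_values, names):
    """Keep features whose mean |SHAP| exceeds the average mean |SHAP|.

    shap_values: list of rows, one row per sample, one value per feature.
    names: feature names, in the same order as the columns.
    (Hypothetical helper, written only to illustrate the rule.)
    """
    # Mean absolute SHAP value per feature (column-wise).
    mean_abs = [fmean(abs(v) for v in column) for column in zip(*shap_values)]
    # Threshold: the average of those per-feature means.
    threshold = fmean(mean_abs)
    return [name for name, m in zip(names, mean_abs) if m > threshold]


# Toy input: 2 samples, 3 features (made-up numbers).
toy_shap = [[0.1, -0.5, 0.0],
            [-0.3, 0.7, 0.1]]
selected = select_features(toy_shap, ["Open_ret", "Close_ret", "Volume_ret"])
```

Note that with this rule at least one feature is always kept unless all features have exactly the same mean absolute SHAP value.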

The following code is what I want to do:

results_df['forecast'] = df.rolling(1000, axis=1).apply(backtesting_algo)

However, as explained above, this is not supported yet. I also tried a list comprehension, which runs correctly, but it uses 100% of the CPU even on a subset of 100 daily observations. That is not workable for me, because I want to repeat this many times with Monte Carlo simulated prices.
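
For reference, the list-comprehension workaround mentioned above looks roughly like the sketch below. The helper name rolling_frame_apply is hypothetical, and it only assumes that the frame supports len() and positional slicing through .iloc, which both pandas and cuDF dataframes provide. The Python loop itself still runs on the host; only the work done inside func can run on the GPU.

```python
def rolling_frame_apply(frame, window, func):
    """Call func on every full block of `window` consecutive rows.

    Returns one result per window, ordered by the window's last row,
    so the list has len(frame) - window + 1 entries (empty if the
    frame has fewer rows than the window).
    (Hypothetical helper; not part of cuDF or pandas.)
    """
    n_rows = len(frame)
    return [
        func(frame.iloc[end - window:end])  # rows [end - window, end)
        for end in range(window, n_rows + 1)
    ]
```

With the names from this post, the call would be something like rolling_frame_apply(df, 1000, backtesting_algo), and the first 999 rows would have no forecast.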

I hope you can help!

Thanks in advance,

Regards,

José Carlos