Greetings for the day!
I want to ask the following:
I have daily Apple stock data in a cudf.DataFrame.
I want to run a random forest together with SHAP-based feature selection on it.
I want to apply these algorithms daily, on a rolling window. I know that rolling.apply doesn't support axis=1 yet, so I tried creating a Numba CUDA kernel to run the whole backtest over the entire dataset in parallel.
I also tried a list comprehension, but that runs the loop on the CPU, which I don't want. I want everything to run on the GPU.
Could you help me, please?
This is the whole code:
The following code block imports the necessary libraries:
import cudf
import cuml
import cupy
import yfinance as yf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-darkgrid')
## Model Interpretation package
import shap
shap.initjs()
from numba import cuda
from cuml import train_test_split
from cuml.ensemble import RandomForestClassifier as cuRF
from sklearn.metrics import accuracy_score
from cupy import asnumpy
from joblib import dump, load
import warnings
warnings.filterwarnings("ignore")
The following block downloads the data and prepares the features and the prediction target:
df = cudf.DataFrame.from_pandas(yf.download('TSLA', start='2001-01-01', end='2022-12-31', auto_adjust=True))
inputs_names = [f'{column}_ret' for column in df.columns]
for column in df.columns:
    df[f'{column}_ret'] = df[column].pct_change()
df['Close_lead'] = df.Close.shift(-1)
df.dropna(inplace=True)
df['y'] = cupy.where(df['Close_lead']>df['Close'],1,0)
df
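To make the labelling step concrete, here is a minimal pure-Python sketch of the same logic on a handful of made-up closing prices. It is not the cuDF code itself: lists stand in for the DataFrame columns, and `None` stands in for the NaN rows that `dropna` removes.

```python
# Toy closing prices (made-up numbers, oldest first).
closes = [100.0, 102.0, 101.0, 105.0, 105.0]

# Percentage change versus the previous day; the first day has no predecessor.
returns = [None] + [closes[i] / closes[i - 1] - 1.0 for i in range(1, len(closes))]

# Next day's close; the last day has no successor.
close_lead = closes[1:] + [None]

# Keep only rows where both the return and the lead are defined,
# mirroring what dropna() does in the cuDF version.
rows = [
    (c, r, lead)
    for c, r, lead in zip(closes, returns, close_lead)
    if r is not None and lead is not None
]

# Label is 1 when tomorrow's close is strictly higher than today's, else 0.
labels = [1 if lead > c else 0 for c, _, lead in rows]
print(labels)  # [0, 1, 0]
```

Note that the first and last days drop out, and a flat day (105.0 followed by 105.0) is labelled 0 because the comparison is strict.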
The following code block is the function I want to apply to each daily rolling window of df:
def backtesting_algo(df):
    data = df.copy()
    X, y = df[inputs_names], df['y']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.02, random_state=0)
    # random forest depth and size
    n_estimators = 25
    max_depth = 10
    model = cuRF(max_depth=max_depth,
                 n_estimators=n_estimators,
                 random_state=0)
    # first fit: all input features, used only to compute the SHAP values
    trained_RF = model.fit(X, y)
    cu_explainer = cuml.explainer.KernelExplainer(model=trained_RF.predict,
                                                  data=X_test,
                                                  is_gpu_model=True)
    cu_shap_values = cu_explainer.shap_values(X_test)
    # save
    dump(trained_RF, 'RF.model')
    # to reload the model uncomment the line below
    #loaded_model = load('RF.model')
    # keep the features whose mean |SHAP| is above the average over all features
    selected_features_iloc = \
        np.where(np.abs(cu_shap_values).mean(0) > np.abs(cu_shap_values).mean(0).mean())[0]
    selected_features = [inputs_names[i] for i in selected_features_iloc]
    X, y = data[selected_features], data['y']
    # random forest depth and size
    n_estimators = 25
    max_depth = 10
    model = cuRF(max_depth=max_depth,
                 n_estimators=n_estimators,
                 random_state=0)
    # second fit: selected features only
    trained_RF = model.fit(X, y)
    # save
    #dump(trained_RF, 'RF.model')
    # to reload the model uncomment the line below
    #loaded_model = load('RF.model')
    # forecast for the last row of the window
    return model.predict(X)[-1]
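The feature-selection rule inside the function (keep a feature when its mean absolute SHAP value exceeds the average of that quantity across all features) can be illustrated in isolation. The sketch below uses plain Python lists and made-up SHAP values; `select_features` is a hypothetical helper name, not part of shap or cuML.

```python
def select_features(shap_values, feature_names):
    """Return the names whose mean |SHAP| exceeds the average over all features.

    shap_values: list of rows, one per sample, each with one value per feature.
    """
    n_samples = len(shap_values)
    # Mean absolute SHAP value per feature (column-wise mean).
    importance = [
        sum(abs(row[j]) for row in shap_values) / n_samples
        for j in range(len(feature_names))
    ]
    # Threshold is the average importance across features.
    threshold = sum(importance) / len(importance)
    return [name for name, imp in zip(feature_names, importance) if imp > threshold]


# Made-up SHAP values for 2 samples and 3 features.
toy_shap = [[0.4, -0.05, 0.0], [-0.6, 0.05, 0.2]]
toy_names = ["Close_ret", "Open_ret", "Volume_ret"]
selected = select_features(toy_shap, toy_names)
print(selected)  # ['Close_ret']
```

One consequence of the strict comparison: if every feature has exactly the same importance, nothing is selected and the second fit would receive no columns.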
The following code is what I want to do:
results_df['forecast'] = df.rolling(1000, axis=1).apply(backtesting_algo)
However, as explained above, this is not supported yet. I also tried a list comprehension, which runs correctly, but it drives the CPU to 100% even on a subset of 100 daily observations. That is not workable for me, because I want to repeat this many times on Monte Carlo-simulated prices.
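For reference, a windowed loop of this kind can be sketched as below. This is only an illustration of the rolling-window bookkeeping, not a GPU solution: `rolling_windows` and `rolling_apply` are hypothetical helper names, and a simple sum stands in for the per-window model.

```python
def rolling_windows(n_rows, window):
    """Yield (start, stop) row positions for every full trailing window."""
    for stop in range(window, n_rows + 1):
        yield stop - window, stop


def rolling_apply(n_rows, window, func):
    """Call func(start, stop) once per window; warm-up rows get None."""
    # Rows before the first full window have no result.
    out = [None] * min(window - 1, n_rows)
    out.extend(func(a, b) for a, b in rolling_windows(n_rows, window))
    return out


# Toy check with a stand-in for the per-window model: the window sum.
values = [1, 2, 3, 4, 5]
sums = rolling_apply(len(values), 3, lambda a, b: sum(values[a:b]))
print(sums)  # [None, None, 6, 9, 12]
```

With the DataFrame above, the callback would presumably be something like `lambda a, b: backtesting_algo(df.iloc[a:b])` with `window=1000`. The loop itself is still driven by the Python interpreter on the host, one window at a time, even though each call trains on the GPU.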
I hope you can help!
Thanks in advance,
Regards,
José Carlos