cuML, Decision Trees and Imbalanced Data

I have two questions about an implementation I’ve been tasked with.
First, I need to train decision trees on GPUs. Naturally, I went to cuML hoping for a CUDA/GPU counterpart to scikit-learn’s decision tree classifier, and alas, no such classifier exists in cuML. I’m trying to understand why. Can someone please help me understand this? (I could simulate a decision tree by creating a Random Forest with a single tree, bootstrap=False, and max_features=n_features, as explained here: python - Why is Random Forest with a single tree much better than a Decision Tree classifier? - Stack Overflow.)
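For anyone who wants to try it, here’s a minimal sketch (untested) of that single-tree workaround using cuML’s RandomForestClassifier; the toy data is only for illustration:

```python
import cupy as cp
from cuml.ensemble import RandomForestClassifier

# Toy data for illustration; cuML prefers float32 features
X = cp.random.rand(1000, 10, dtype=cp.float32)
y = cp.random.randint(0, 2, size=1000).astype(cp.int32)

# n_estimators=1   -> a single tree
# bootstrap=False  -> train on the full dataset, not a bootstrap sample
# max_features=1.0 -> consider every feature at each split, like a plain tree
tree_like = RandomForestClassifier(
    n_estimators=1,
    bootstrap=False,
    max_features=1.0,
)
tree_like.fit(X, y)
preds = tree_like.predict(X)
```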

A little perspective on why I want a decision tree on a GPU: the idea is to create many decision trees, each trained on a different dataset and producing an output that is then used for something else. We don’t want a full Random Forest per dataset because that would be too computationally expensive.
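In case it clarifies the setup, here’s what that per-dataset loop might look like with the single-tree workaround above; `datasets` is a hypothetical stand-in for the real per-task data:

```python
import cupy as cp
from cuml.ensemble import RandomForestClassifier

# Hypothetical stand-in for the real per-task datasets
datasets = [
    (cp.random.rand(500, 10, dtype=cp.float32),
     cp.random.randint(0, 2, size=500).astype(cp.int32))
    for _ in range(5)
]

outputs = []
for X_i, y_i in datasets:
    # One single-tree "forest" per dataset, as in the workaround above
    model = RandomForestClassifier(n_estimators=1, bootstrap=False, max_features=1.0)
    model.fit(X_i, y_i)
    outputs.append(model.predict(X_i))  # these outputs feed the downstream step
```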

Okay, so maybe there’s a workaround to get that decision tree. Next problem: my data is pretty badly imbalanced. I’m trying to avoid SMOTE and undersampling, so I was going to use class_weight. I mean, scikit-learn’s Random Forest classifier has class_weight, as do most of cuML’s other classifiers, so why not cuML’s RF? And it turns out class_weight is not an option for RF in cuML. Womp womp. Has anyone else encountered this and found a workaround?
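One possible workaround sketch, assuming all you need is the *effect* of inverse-frequency class weights: since cuML’s RandomForestClassifier has no class_weight parameter, replicate minority-class rows (plain duplication, not SMOTE, so no synthetic samples) until the classes are balanced, then fit. `oversample_to_majority` is a hypothetical helper, not a cuML API:

```python
import cupy as cp
from cuml.ensemble import RandomForestClassifier

def oversample_to_majority(X, y, seed=0):
    """Resample every class (with replacement) up to the majority-class count."""
    cp.random.seed(seed)
    classes, counts = cp.unique(y, return_counts=True)
    target = int(counts.max())
    parts_X, parts_y = [], []
    for cls in classes.tolist():
        idx = cp.where(y == cls)[0]
        # Duplicate existing rows; no synthetic samples are generated
        picks = cp.random.choice(idx, size=target, replace=True)
        parts_X.append(X[picks])
        parts_y.append(y[picks])
    return cp.concatenate(parts_X), cp.concatenate(parts_y)

# Badly imbalanced toy data: roughly 95% class 0, 5% class 1
X = cp.random.rand(2000, 10, dtype=cp.float32)
y = (cp.random.rand(2000) < 0.05).astype(cp.int32)

X_bal, y_bal = oversample_to_majority(X, y)
clf = RandomForestClassifier(n_estimators=1, bootstrap=False, max_features=1.0)
clf.fit(X_bal, y_bal)
```

The tradeoff versus true class weights is memory: the duplicated rows grow the training set to n_classes × the majority-class count, which may matter on a GPU.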

Thanks for your insight. I’m banging my head against the wall trying to find a way to make this work without having to write my own decision tree from scratch.
