XGBoost on WSL2 - Exception in gpu hist: NCCL failure

bwynnemorris · January 7, 2021, 2:23am

I am trying to train an XGBoost model on my WSL2 Ubuntu 20.04 setup. However, I get an error on my local machine when I set 'tree_method': 'gpu_hist' in params:

“Exception in gpu hist: NCCL failure : unhandled system error”

I can fit a model using 'tree_method': 'hist', although that means fitting on the CPU rather than the GPU, which defeats the purpose of using RAPIDS + CUDA. It was also very slow.
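To make the workaround described here concrete, the following is a minimal sketch of a "try the GPU first, fall back to the CPU" wrapper. The helper name train_with_fallback and its arguments are hypothetical (they are not part of XGBoost); the only things taken from this thread are the 'tree_method' values 'gpu_hist' and 'hist'. The training call is passed in as a function, so the sketch makes no assumption about how the data is loaded.

```python
from typing import Any, Callable, Dict, Tuple, Type


def train_with_fallback(
    train_fn: Callable[[Dict[str, Any]], Any],
    params: Dict[str, Any],
    errors: Tuple[Type[BaseException], ...] = (Exception,),
    fallback_method: str = "hist",
) -> Tuple[Any, str]:
    """Call train_fn(params); if it raises, retry once with a CPU tree method.

    Hypothetical helper, not an XGBoost API.
    Returns (model, tree_method_actually_used).
    """
    try:
        return train_fn(params), params.get("tree_method", "auto")
    except errors as err:
        # e.g. "Exception in gpu hist: NCCL failure : unhandled system error"
        print(f"{params.get('tree_method')} failed ({err}); retrying with {fallback_method}")
        # Copy the params so the caller's dict is left unchanged.
        cpu_params = {**params, "tree_method": fallback_method}
        return train_fn(cpu_params), fallback_method
```

With XGBoost imported as xgb and a DMatrix named dtrain (both assumed here), it could be called as train_with_fallback(lambda p: xgb.train(p, dtrain), params, errors=(xgb.core.XGBoostError,)). This only automates the slow CPU path; it does not make GPU training work.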

Note that as part of your DLI course, Fundamentals of Accelerated Data Science with RAPIDS, I was able to train an XGBoost model on the cloud-based GPU cluster (i.e. not my local notebook) with 'tree_method': 'gpu_hist' in params.

Any suggestions?

bwynnemorris · January 9, 2021, 2:03pm

For anyone else seeing this error, it seems that integration between RAPIDS and XGBoost is not supported on WSL2.

“For now please use Linux natively. We don’t have plan on supporting WSL, at least not on regular CI pipeline.”
Source: https://github.com/dmlc/xgboost/issues/6585

amannmalik · January 21, 2021, 7:31am

See "NCCL tests don't work on WSL" (Issue #442, NVIDIA/nccl on GitHub).
NCCL doesn’t work with WSL at the moment.

bwynnemorris · January 21, 2021, 2:08pm

Thanks @amannmalik - I understand this is why XGBoost doesn’t work on WSL2 in “multi gpu” mode. However, I am only trying to train XGBoost on a single GPU. Should NCCL matter if training on a single GPU?

amannmalik · January 22, 2021, 9:56pm

I haven’t gotten around to actually debugging the library yet (I run into issues with it via PyTorch), but I believe some sort of exception is thrown when the library attempts to discover GPUs during initialization, using some of the NVML APIs that CUDA’s WSL support ostensibly does not fully support. See "CUDA on WSL" in the CUDA Toolkit Documentation.
It might work for you if you compile XGBoost from source and exclude the NCCL support.
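As a rough sketch of what "compile from source and exclude NCCL" could look like, the helper below assembles the clone and CMake commands. USE_CUDA and USE_NCCL are XGBoost's CMake options; the function names, the default directory, and the use of CMake's -S/-B and --parallel flags (which need a reasonably recent CMake) are assumptions for illustration. Whether a build like this actually avoids the error on WSL2 is not confirmed in this thread.

```python
import subprocess
from pathlib import Path
from typing import List


def nccl_free_build_commands(src_dir: str = "xgboost") -> List[List[str]]:
    """Return the commands for a CUDA build of XGBoost with NCCL disabled.

    Hypothetical helper; src_dir is where the repository gets cloned.
    """
    build_dir = str(Path(src_dir) / "build")
    return [
        # --recursive pulls in the git submodules the build depends on.
        ["git", "clone", "--recursive", "https://github.com/dmlc/xgboost", src_dir],
        # Keep GPU support on, but leave NCCL out of the build.
        ["cmake", "-S", src_dir, "-B", build_dir, "-DUSE_CUDA=ON", "-DUSE_NCCL=OFF"],
        ["cmake", "--build", build_dir, "--parallel"],
    ]


def run_build(src_dir: str = "xgboost") -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in nccl_free_build_commands(src_dir):
        subprocess.run(cmd, check=True)
```

Calling run_build() only produces the native library. The Python package still has to be installed from the repository's python-package directory, and the exact install command depends on the XGBoost version.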