cuDF spill-on-demand couldn't find memory device to spill: <SpillManager device_memory_limit= | spilled | unspilled>

Hello,

I ran into this:

[WARNING] RMM allocation of 2GiB failed, spill-on-demand couldn’t find any memory device to spill: <SpillManager device_memory_limit=372.53GiB | 763.09MiB spilled | 0B unspilled (unspillable)>

which occurred at this line of code:

cudf_df = cudf.DataFrame(data_dict)

where data_dict is:

data_dict = {
    'row_id': <list of 15 million row ids>,
    'text': <list of 15 million text records>,
}

which resulted in this error:

MemoryError: std::bad_alloc: out_of_memory: CUDA error at …/lib//python3.10/site-packages/librmm/include/rmm/mr/device/cuda_memory_resource.hpp:62: cudaErrorMemoryAllocation out of memory

I have configured a LocalCUDACluster and enabled spill-on-demand:

from dask_cuda import LocalCUDACluster

LocalCUDACluster(
    CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7',
    rmm_pool_size=0.9,
    enable_cudf_spill=True,
    device_memory_limit=400000000000,
    cudf_spill_statistics=2,
    local_directory='./cudf_spill_storage',
    log_spilling=False,
)

What I notice in my log is that the “RMM allocation ... failed” warning above comes from the function ‘_out_of_memory_handle’, which gives up after trying twice to spill device memory (buffers).
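To make the "tries twice, then gives up" behaviour concrete, here is a simplified model of such a handler. It is a hypothetical sketch and not the cuDF source: the class ToySpillManager, its methods, and the list-of-sizes bookkeeping are all illustrative assumptions. It only shows that a handler of this shape has nothing to do once no spillable buffer is left on the device, which would be in line with the "0B unspilled" figure in the warning.

```python
import gc


class ToySpillManager:
    """Hypothetical stand-in for a spill manager; not the cuDF class."""

    def __init__(self, device_buffer_sizes):
        # Sizes (in bytes) of spillable buffers currently on the device.
        self.on_device = list(device_buffer_sizes)
        # Sizes of buffers that have been moved to host memory.
        self.spilled = []

    def spill_device_memory(self, nbytes):
        """Spill buffers until at least nbytes are freed; return bytes freed."""
        freed = 0
        while self.on_device and freed < nbytes:
            size = self.on_device.pop()
            self.spilled.append(size)
            freed += size
        return freed

    def out_of_memory_handle(self, nbytes, retry_once=True):
        """Return True if the allocation should be retried, False to give up."""
        if self.spill_device_memory(nbytes) > 0:
            return True
        if retry_once:
            # Collect garbage in case dead references keep buffers alive,
            # then make one more attempt.
            gc.collect()
            return self.out_of_memory_handle(nbytes, retry_once=False)
        return False


# Nothing left on the device to spill: both attempts free 0 bytes.
empty = ToySpillManager([])
gave_up = not empty.out_of_memory_handle(2 * 2**30)

# One spillable buffer on the device: the first attempt frees it.
busy = ToySpillManager([1024])
retried = busy.out_of_memory_handle(512)
```

In this model the number of tracked buffers does not act as a cap. The handler fails simply because every tracked buffer has already been spilled, so the failed allocation has to come from memory the manager cannot free.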

I also notice that SpillManager created only 3 buffers. Why only 3 buffers when there is plenty of memory available on the host? Is this number of 3 buffers a default setting somewhere? (I tried reading the source of the SpillManager and SpillableBufferOwner classes but have not found it yet, so any pointers are appreciated. Thanks in advance.)

I also tried disabling spilling by setting ‘device_memory_limit=0’ (because there is ample memory on the host) but got the exact same error. It seems to me that the limit of only 3 buffers is where the failure starts, regardless of the ‘device_memory_limit’ setting.

Am I missing something? Any advice or pointers are greatly appreciated. Thank you.

Hi @pham_hung2,

Are you seeing this error whilst trying to use a NeMo Microservice, or are you just using RAPIDS/cuDF?

Thanks,

Sophie

Hi Sophie,

Thanks for replying. I got that error while using RAPIDS/cuDF, NOT the microservice. My objective was to load a 15-million-record dataset with cuDF and then prep it using NeMo Curator.

Thanks for your help with this.
Hung

Hi Hung,

I’ve chatted with the RAPIDS/cuDF team. Could you please ask your question in the RAPIDS Slack? They can support you there.

Best,

Sophie