Thrust exception: parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal

Hello,

I have looked around but couldn't find any relevant information. I am getting the error shown in the title:

Thrust exception: parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal
CUDA Error detected. cudaErrorInvalidValue invalid argument
louvain_example: /home/yigithan/miniconda3/envs/cugraph_dev/include/rmm/mr/device/cuda_memory_resource.hpp:80: virtual void rmm::mr::cuda_memory_resource::do_deallocate(void*, std::size_t, rmm::cuda_stream_view): Assertion `status__ == cudaSuccess' failed.
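
Since the first message is cudaErrorInvalidDevice ("invalid device ordinal"), one thing worth ruling out is what the runtime actually sees. This is only a minimal diagnostic sketch of mine, not part of the cuGraph example:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  // cudaErrorInvalidDevice usually means the requested ordinal is outside [0, device_count).
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  std::printf("cudaGetDeviceCount: %s, count = %d\n", cudaGetErrorString(err), count);

  int current = -1;
  cudaGetDevice(&current);
  std::printf("current device ordinal: %d\n", current);
  return 0;
}

(Note that CUDA_VISIBLE_DEVICES, if set, changes which ordinals the runtime sees.)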

You may suggest talking to the developers of cugraph or rmm, but my colleagues are not having this problem, even though we have the same CUDA and cugraph versions. This made me think the problem might be in my settings?

CUDA version: 12.4
Compute capability: 8.6
Device: RTX 3090
OS: Ubuntu 22.04 LTS
Driver: 550.127.08

librmm : 24.12.00a33 cuda12_241204_g3b5f6af2_33 rapidsai-nightly
rmm: 24.12.00a33 cuda12_py312_241204_g3b5f6af2_33 rapidsai-nightly
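
For completeness, the device details above can be confirmed with the plain runtime API; this is just a quick sketch, nothing cuGraph-specific:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, 0);  // device 0 = the RTX 3090
  int driver = 0, runtime = 0;
  cudaDriverGetVersion(&driver);
  cudaRuntimeGetVersion(&runtime);
  std::printf("%s, compute capability %d.%d, driver API %d, runtime API %d\n",
              prop.name, prop.major, prop.minor, driver, runtime);
  return 0;
}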

The examples that I tried (these are tested and working on other GPUs and in the GitHub pipeline):

Short description of the problem:

If I work with a small dataset like karate, there is no problem; it starts and finishes successfully. But when I work with big datasets (ca-hollywood-2009, soc-livejournal), it initializes, runs for ~30-40 seconds, and then crashes.
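
Since it only fails on the larger graphs, free device memory while loading them may also be relevant; it can be checked with something minimal like this (a sketch, not part of the cuGraph example):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  // Free vs. total memory on the current device.
  size_t free_bytes = 0, total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);
  std::printf("free: %.2f GiB / total: %.2f GiB\n",
              free_bytes / double(1 << 30), total_bytes / double(1 << 30));
  return 0;
}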

I also ran it with compute-sanitizer and got these results:

Program hit cudaErrorLaunchOutOfResources (error 701) due to "too many resources requested for launch" on CUDA API call to cudaLaunchKernel_ptsz.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4466f5]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cudaLaunchKernel_ptsz [0x547fd]
=========                in /home/yigithan/miniconda3/envs/cugraph_dev/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel in /home/yigithan/miniconda3/envs/cugraph_dev/targets/x86_64-linux/include/cuda_runtime_api.h:14030 [0xe5ecef1]
=========                in /home/yigithan/miniconda3/envs/cugraph_dev/lib/libcugraph.so
=========     Host Frame:_ZL736__device_stub__ZN7cugraph6detail35per_v_transform_reduce_e_mid_degreeILb1ENS_12graph_view_tIiiLb0ELb0EvEENS0_52edge_partition_endpoint_dummy_property_device_view_tIiEES5_NS0_42edge_partition_edge_property_device_view_tIiPKffEENS6_IiPKjbEEPfZNS_71_GLOBAL__N__3530e449_32_graph_weight_utils_sg_v32_e32_cu_4d8abc56_2573119compute_weight_sumsILb1EiifLb0ELb0EEEN3rmm14device_uvectorIT2_EERKN4raft8handle_tERKNS2_IT0_T1_XT3_EXT4_EvEENS_20edge_property_view_tISP_PKSI_N6thrust15iterator_traitsISV_E10value_typeEEEEUnvdl0_PFNSH_IfEESN_RKS3_NST_IiS8_fEEESF_ILb1EiifLb0ELb0EE2_NS_9reduce_op4plusIfEEfEEvNS_28edge_partition_device_view_tINSO_11vertex_typeENSO_9edge_typeEXsrSO_12is_multi_gpuEvEES1C_S1C_SP_SI_T3_NSW_8optionalIT4_EET5_T6_T8_S1L_T7_RN7cugraph28edge_partition_device_view_tIiiLb0EvEEiiRNS_6detail52edge_partition_endpoint_dummy_property_device_view_tIiEES6_RNS3_42edge_partition_edge_property_device_view_tIiPKffEERN6thrust8optionalINS7_IiPKjbEEEEPfR17__nv_dl_wrapper_tI11__nv_dl_tagIPFN3rmm14device_uvectorIfEERKN4raft8handle_tERKNS_12graph_view_tIiiLb0ELb0EvEENS_20edge_property_view_tIiS9_fEEEXadL_ZNS_71_GLOBAL__N__3530e449_32_graph_weight_utils_sg_v32_e32_cu_4d8abc56_2573119compute_weight_sumsILb1EiifLb0ELb0EEENSN_IT2_EESS_RKNST_IT0_T1_XT3_EXT4_EvEENSX_IS16_PKS13_NSC_15iterator_traitsIS1B_E10value_typeEEEEELj2EEJEEffRNS_9reduce_op4plusIfEE in /tmp/tmpxft_00006475_00000000-6_graph_weight_utils_sg_v32_e32.cudafe1.stub.c:233 [0xe5ee3f6]
=========                in /home/yigithan/miniconda3/envs/cugraph_dev/lib/libcugraph.so
… (remaining backtrace omitted)
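
For context on the error 701 message ("too many resources requested for launch"): the limits it refers to are per-kernel (registers per thread, threads per block, shared memory). I obviously can't change the launch configuration of the kernel inside libcugraph, but as an illustration of what those limits are, they can be queried for any kernel like this; dummy_kernel is just a stand-in of mine, not the cuGraph kernel:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float* out)
{
  out[threadIdx.x] = static_cast<float>(threadIdx.x);
}

int main()
{
  // Per-kernel resource limits; error 701 means a launch configuration exceeded limits like these.
  cudaFuncAttributes attr{};
  cudaFuncGetAttributes(&attr, dummy_kernel);
  std::printf("max threads/block: %d, regs/thread: %d, static smem: %zu B\n",
              attr.maxThreadsPerBlock, attr.numRegs, attr.sharedSizeBytes);

  // Occupancy-suggested block size for this kernel on the current device.
  int min_grid = 0, block = 0;
  cudaOccupancyMaxPotentialBlockSize(&min_grid, &block, dummy_kernel);
  std::printf("suggested block size: %d\n", block);
  return 0;
}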

Yes, I am aware that is the issue you filed. I'm posting it here for others who may come across this thread; filing an issue is the mechanism for contacting the RAPIDS development team.