cudaErrorInvalidResourceHandle after dispatching inclusive_scan kernel in cuda12

I updated my CUDA and cuDNN versions to support the H20 GPU.

Environment:

  • CUDA version: from 11.4.2 to 12.2.2
  • cuDNN version: from 8.2.14 to 8.9.6.50
  • GPU: from A100 to H20

And then my code started to fail:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  after dispatching inclusive_scan kernel: cudaErrorInvalidResourceHandle: invalid resource handle
Received signal 6
#0 0x000002678216 base::debug::StackTrace::StackTrace()
#1 0x000002678729 base::debug::(anonymous namespace)::StackDumpSignalHandler()
#2 0x7f2e7e99d980 <unknown>
#3 0x7f2e2238d018 __GI_raise
#4 0x7f2e22377527 __GI_abort
#5 0x7f2e22721919 _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
#6 0x7f2e2272cf3a __cxxabiv1::__terminate()
#7 0x7f2e2272cfa5 std::terminate()
#8 0x7f2e2272d1f7 __cxa_throw
#9 0x00000139c0e2 thrust::cuda_cub::throw_on_error()
#10 0x00000158dbd3 thrust::system::detail::generic::shuffle_copy<>()
......

The code the error points to is here:

ThrustAllocator<cudaStream_t> thrust_allocator(
    GPUGraphTable<KeyType>::s_cuda_allocators[context.gpu_id], stream);
thrust::random::default_random_engine engine(context.shuffle_seed[tensor_pair_idx]);
const auto& exec_policy = thrust::cuda::par(thrust_allocator).on(stream);
thrust::shuffle_copy(
    exec_policy,
    cnt_iter,
    cnt_iter + context.total_row[tensor_pair_idx],
    thrust::device_pointer_cast(d_random_row),
    engine);

Then I added logging right before it:

cudaStreamSynchronize(stream);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    LOG(NOTICE) << "[GPU_ID: " << context.gpu_id << "]CUDA Error before shuffle_copy: " << cudaGetErrorString(err);
} else {
    LOG(NOTICE) << "[GPU_ID: " << context.gpu_id << "]No error before shuffle_copy.";
}
// shuffle_copy calls

But it turned out that only GPU 0 got an error; all the other GPUs were normal.

I suspected a problem with the stream, so I ran an empty kernel before this code:

// define
__global__ void test_run_kernel(int gpu_id) {
    printf("[GPU_ID: %d] running in test_run_kernel!\n", gpu_id);
}
// ....

// before shuffle_copy
test_run_kernel<<<1, 1, 0, stream>>>(gpu_id);

// shuffle_copy calls

But it still failed because of cudaErrorInvalidResourceHandle.

Then I tried using the default (global) stream:

test_run_kernel<<<1, 1>>>(gpu_id);

And it worked. Based on this, I suspected the problem was with the stream, so I used cudaStreamQuery to check it, but it returned cudaSuccess every time.

If any more information is needed, please let me know in the comments section.

A stream has an inherent device (or context) association. The device or context that was active when it was created is the device it is intended to be used on, for stream-ordered work-issuance.

If a kernel launch works when issued into a default stream but fails with cudaErrorInvalidResourceHandle when issued into a particular stream, my guess would be that the stream and the currently active device (i.e. the device the kernel is being launched on) don’t match; the stream in question is associated with another device or context.
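
As a minimal sketch of that failure mode (assuming a machine with at least two GPUs; the kernel and names here are only for illustration, not your actual code):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel() {}

int main() {
    // The stream is created while device 0 is current, so it belongs to device 0.
    cudaSetDevice(0);
    cudaStream_t s;
    cudaStreamCreate(&s);

    // Switch the active device, then launch into the device-0 stream.
    cudaSetDevice(1);
    dummy_kernel<<<1, 1, 0, s>>>();
    // Expected: "invalid resource handle" (cudaErrorInvalidResourceHandle).
    printf("launch with device 1 active: %s\n", cudaGetErrorString(cudaGetLastError()));

    // With device 0 active again, the same launch succeeds.
    cudaSetDevice(0);
    dummy_kernel<<<1, 1, 0, s>>>();
    printf("launch with device 0 active: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(s);
    return 0;
}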

I wouldn’t be able to answer questions about why it fails based on the changes/updates you made. You might go back to the originally working code, and make those changes one at a time. For example, on the A100 machine, update from 11.4.2 to 12.2.2 only. Then on the A100 machine, update the cudnn version. Then switch to the H20 machine. Etc.

Thanks for the reply.

“the stream and the currently active device (i.e. the device the kernel is being launched on) don’t match; the stream in question is associated with another device or context.”
In fact, the kernel is called in a loop. On the first iteration, all of my custom streams succeed; from the second iteration onward, the above error occurs.

If the stream doesn’t match the device as you guessed, shouldn’t cudaStreamQuery detect that? I called cudaStreamQuery on this stream and got cudaSuccess every time; isn’t that result unexpected?

Is it possible that there is some conflict between the H20’s cuda driver and my upgraded cuda/cudnn version?

I don’t think there are any requirements about what device is active when you issue cudaStreamQuery(). I indicated the association is important for stream-ordered work issuance, and cudaStreamQuery() does not constitute stream-ordered work issuance.
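
To illustrate the distinction, continuing the two-GPU sketch from earlier in the thread (the stream s was created on device 0, and device 1 is current):

// Continuing the earlier sketch: `s` belongs to device 0, device 1 is active.
cudaSetDevice(1);

// A query is not stream-ordered work issuance; an idle stream that belongs
// to another device can still report cudaSuccess here.
printf("query:  %s\n", cudaGetErrorString(cudaStreamQuery(s)));

// Issuing work does require the stream's device to be current, so this
// launch fails with cudaErrorInvalidResourceHandle.
dummy_kernel<<<1, 1, 0, s>>>();
printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));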

Yes, anything is possible. I can’t state explicitly what is going on in your case.


Thank you for your reply. I understand your point. I will focus on checking whether the stream matches the active device each time the kernel runs. Are there any relevant APIs I can use to debug this scenario?

I’m not aware of any API to test whether a given stream is associated with a given device. However, using cudaGetDevice() and an array of device ordinals that parallels the array of created streams, it should be easy enough to do some careful checking.

At the point of cudaStreamCreate():

int devs[NUM_STREAMS];
cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
    cudaStreamCreate(streams + i);
    int d;
    cudaGetDevice(&d);
    devs[i] = d;
}

at the point of stream usage:

for (int i = 0; i < ???; ???) {
    int d;
    cudaGetDevice(&d);
    assert(d == devs[i]);
    my_kernel<<<..., ..., ..., streams[i]>>>(...);
}

something like that. I’m not suggesting this is exactly what to do; it’s evident this code execution would all occur on the same device, but hopefully you get the idea. If you have your streams in an array, create a “parallel” array to keep track of the device each stream was created on. Then check which device you are on when you use that stream.


I think CUDA 12.8 added a new runtime API function to query the device of a stream:
__host__ cudaError_t cudaStreamGetDevice ( cudaStream_t hStream, int* device )
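
On a new enough toolkit, that would allow a check along these lines (a sketch, assuming CUDA 12.8+; the helper name is made up):

#include <cassert>
#include <cuda_runtime.h>

// Verify at the point of use that the stream belongs to the currently
// active device (requires CUDA 12.8 or newer).
void check_stream_matches_current_device(cudaStream_t stream) {
    int stream_dev  = -1;
    int current_dev = -1;
    cudaStreamGetDevice(stream, &stream_dev);
    cudaGetDevice(&current_dev);
    assert(stream_dev == current_dev);
}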


I have solved this problem, thanks for your help.

But my CUDA version is 12.2, which does not support this API yet.
I have solved the problem in another way; thanks for your reply!

Please tell us how you solved the problem, so others can learn.

I was using the cudaGetDevice API and found some inconsistencies in the active device. I finally realized it was caused by a dependency package switching the device incorrectly; it had nothing to do with the CUDA/cuDNN version or the GPU type.
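
For anyone who runs into something similar, here is a rough sketch of the pattern and one way to guard against it (the library call is hypothetical, standing in for the real dependency):

#include <cuda_runtime.h>

// Hypothetical stand-in for the real dependency: it switches the active
// device internally and does not switch it back before returning.
void some_library_call() {
    cudaSetDevice(1);
}

// RAII guard: remember the device that is current now and restore it on scope exit.
class CudaDeviceGuard {
 public:
    CudaDeviceGuard()  { cudaGetDevice(&saved_); }
    ~CudaDeviceGuard() { cudaSetDevice(saved_); }
 private:
    int saved_ = -1;
};

void run_step(cudaStream_t stream_created_on_dev0) {
    {
        CudaDeviceGuard guard;   // device 0 is current here
        some_library_call();     // internally switches to device 1
    }                            // guard restores device 0 when it goes out of scope
    // Launches into stream_created_on_dev0 now see the matching active device again.
}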
