cudaGetLastError. Which kernel execution raised it?

Hi all

I have implemented a pipeline where many kernels are launched in a specific stream. The kernels, as most of you already know, are enqueued into the stream and executed when the scheduler decides it’s best.

In my code, after every kernel enqueue I check for errors by calling cudaGetLastError which, according to the documentation, “returns the last error from a runtime call. This call may also return error codes from previous asynchronous launches”. Thus, if the kernel has only been enqueued, not yet executed, I understand that the returned error only tells me whether the kernel was enqueued correctly (parameter checking, grid and block size, shared memory, etc.).
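
For illustration, the per-launch check looks roughly like this (a minimal sketch; MyKernel, grid, block and the arguments are just placeholders, not my real code):

// Sketch only: MyKernel and its arguments are placeholders.
MyKernel <<< grid, block, 0, stream >>> (args);   // enqueue only; execution happens later
cudaError_t c_ret = cudaGetLastError();           // reports enqueue-time problems (bad configuration, etc.)
if (c_ret != cudaSuccess) {
  std::cout << "Launch error: " << cudaGetErrorString(c_ret) << '\n';
}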

My problem is this: I enqueue many different kernels without waiting for each one to finish executing. Now imagine I have a bug in one of my kernels (let’s call it Kernel1) that causes an illegal memory access, for instance. If I check cudaGetLastError right after enqueuing it, the return value is success because it was enqueued correctly. So my CPU thread moves on and keeps enqueuing kernels to the stream. At some point Kernel1 executes and raises the illegal memory access. The next time I check cudaGetLastError I get the CUDA error, but by then the CPU thread is further along in the code. Consequently, I know there has been an error, but I have no idea which kernel raised it.

One option is to synchronize (block the CPU thread) until every kernel has finished executing and then check the error code, but that is not acceptable for performance reasons.

The question is: is there any way to query which kernel raised a given error code returned by cudaGetLastError? If not, what is, in your opinion, the best way to handle this?

Thanks in advance

There is no way to query which kernel gave a particular asynchronous error.

I would do rigorous error checking (unlike the answer you received for your question on Stack Overflow) and one of the following:

  1. For debug purposes, have a debug switch that puts a cudaDeviceSynchronize() after every kernel launch. This will localize/isolate the error reporting. Turn this behavior on only when you need it, via a debug switch in your code (a sketch of such a switch follows this list).

  2. Build your code with the -lineinfo option, and when you have a failure, run the code again under cuda-memcheck and follow the methodology here:

cuda - Unspecified launch failure on Memcpy - Stack Overflow

which may give you enough error localization.
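
As a sketch of option 1 (CHECK_KERNEL and DEBUG_KERNEL_SYNC are names I made up here, not a CUDA API):

// Sketch only: CHECK_KERNEL and DEBUG_KERNEL_SYNC are hypothetical names.
#include <cuda_runtime.h>
#include <iostream>

#ifdef DEBUG_KERNEL_SYNC
// Debug build: block until the kernel has actually executed, so the error
// is reported at the launch site of the kernel that caused it.
#define CHECK_KERNEL(name)                                                    \
  do {                                                                        \
    cudaError_t e = cudaGetLastError();             /* enqueue-time errors */ \
    if (e == cudaSuccess) e = cudaDeviceSynchronize(); /* execution errors */ \
    if (e != cudaSuccess)                                                     \
      std::cout << (name) << ": " << cudaGetErrorString(e) << '\n';           \
  } while (0)
#else
// Release build: only verify that the launch itself was accepted.
#define CHECK_KERNEL(name)                                                    \
  do {                                                                        \
    cudaError_t e = cudaGetLastError();                                       \
    if (e != cudaSuccess)                                                     \
      std::cout << (name) << ": " << cudaGetErrorString(e) << '\n';           \
  } while (0)
#endif

// Usage, right after each launch:
//   Kernel1 <<< grid, block, 0, stream >>> (...);
//   CHECK_KERNEL("Kernel1");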

Thanks for your answer.

The Stack Overflow answer is not a bad compromise for me. Of course the synchronization may mask other problems, but for the purpose of catching the error, I may be able to reproduce it in a production system with practically no changes. I don’t fully dislike the answer.

Regarding your suggestions, I don’t think I can go for option 1. I’m trying to catch (possible) errors in a production system (I didn’t mention that in the question, sorry), and option 1 forces me to recompile.

Regarding option 2: can I build in release with the -lineinfo flag? If I can, does it affect performance? And if it does not, will cuda-memcheck still be able to report the offending line even in a release build?

thanks for your time

yes, you can build in release with -lineinfo (in fact, -G and -lineinfo are mutually exclusive switches)

no it should not affect performance

yes cuda-memcheck will still be able to report the offending line number this way

A thought: insert a callback after each kernel call. In the userData parameter, include a unique id for the kernel call, and possibly some information about the parameters used.

This is a compromise between device-syncs and standard error checking.

There’s still no guarantee the status in the callback will be directly from the kernel, but it should get pretty close. Especially if you’re only using one stream.

Performance impact should be minimal. It depends on how long the callback runs; in the vast majority of cases, where the status is cudaSuccess, the callback can be almost a no-op.
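
For instance, a sketch of the idea (KernelTag, its fields, and onKernelDone are names I made up to illustrate the userData mechanism):

#include <cuda_runtime.h>
#include <iostream>

// Hypothetical per-launch tag passed through userData.
struct KernelTag {
  const char *name;   // which kernel this callback belongs to
  int launchId;       // unique id for this launch
};

void CUDART_CB onKernelDone(cudaStream_t stream, cudaError_t status, void *userData) {
  KernelTag *tag = static_cast<KernelTag *>(userData);
  if (status != cudaSuccess) {
    std::cout << "Error after " << tag->name << " (launch " << tag->launchId
              << "): " << cudaGetErrorString(status) << '\n';
  }
  delete tag;  // allocated at enqueue time, freed once the callback has run
}

// At enqueue time, right after each launch (launchId is whatever counter you keep):
//   Kernel1 <<< grid, block, 0, stream >>> (...);
//   cudaStreamAddCallback(stream, onKernelDone, new KernelTag{"Kernel1", launchId}, 0);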

Thanks, I’ll try this.

This sounds like a great idea. I’ll do some tests.

Why do you say there’s no guarantee that the status from the callback will be directly from the kernel? I thought the callback was called immediately after the kernel finishes… Is there some race condition I’m not seeing?

Additional note: I use 1 stream per thread

I did some tests with the callback approach. As far as I can tell, the callback function satisfies all my needs except performance.

I wrote a simple program to test it:

#include <cuda_runtime.h>

#include <vector>
#include <chrono>
#include <iostream>

#define BLOC_SIZE       1024
#define NUM_ELEMENTS    (BLOC_SIZE * 32)
#define NUM_ITERATIONS  500

__global__ void KernelCopy(const unsigned int *input, unsigned int *result) {
  unsigned int pos = blockIdx.x * BLOC_SIZE + threadIdx.x;
  result[pos] = input[pos];
}

void CUDART_CB myStreamCallback(cudaStream_t stream, cudaError_t status, void *data) {
  if (status) {
    std::cout << "Error: " << cudaGetErrorString(status) << "-->";
  }
}

#define CUDA_CHECK_LAST_ERROR   cudaStreamAddCallback(stream, myStreamCallback, nullptr, 0)

int main() {
  cudaError_t c_ret;
  c_ret = cudaSetDevice(0);
  if (c_ret != cudaSuccess) {
    return -1;
  }

  unsigned int *input;
  c_ret = cudaMalloc((void **)&input, NUM_ELEMENTS * sizeof(unsigned int));
  if (c_ret != cudaSuccess) {
    return -1;
  }

  std::vector<unsigned int> h_input(NUM_ELEMENTS);
  for (unsigned int i = 0; i < NUM_ELEMENTS; i++) {
    h_input[i] = i;
  }

  c_ret = cudaMemcpy(input, h_input.data(), NUM_ELEMENTS * sizeof(unsigned int), cudaMemcpyKind::cudaMemcpyHostToDevice);
  if (c_ret != cudaSuccess) {
    return -1;
  }

  unsigned int *result;
  c_ret = cudaMalloc((void **)&result, NUM_ELEMENTS * sizeof(unsigned int));
  if (c_ret != cudaSuccess) {
    return -1;
  }

  cudaStream_t stream;
  c_ret = cudaStreamCreate(&stream);
  if (c_ret != cudaSuccess) {
    return -1;
  }

  std::chrono::steady_clock::time_point start;
  std::chrono::steady_clock::time_point end;

  start = std::chrono::steady_clock::now();
  for (unsigned int i = 0; i < NUM_ITERATIONS; i++) {
    dim3 grid(NUM_ELEMENTS / BLOC_SIZE);
    KernelCopy <<< grid, BLOC_SIZE, 0, stream >>> (input, result);
    CUDA_CHECK_LAST_ERROR;
  }
  cudaStreamSynchronize(stream);
  end = std::chrono::steady_clock::now();
  std::cout << "With callback took (ms): " << std::chrono::duration<float, std::milli>(end - start).count() << '\n';

  start = std::chrono::steady_clock::now();
  for (unsigned int i = 0; i < NUM_ITERATIONS; i++) {
    dim3 grid(NUM_ELEMENTS / BLOC_SIZE);
    KernelCopy <<< grid, BLOC_SIZE, 0, stream >>> (input, result);
    c_ret = cudaGetLastError();
    if (c_ret) {
      std::cout << "Error: " << cudaGetErrorString(c_ret) << "-->";
    }
  }
  cudaStreamSynchronize(stream);
  end = std::chrono::steady_clock::now();
  std::cout << "Without callback took (ms): " << std::chrono::duration<float, std::milli>(end - start).count() << '\n';

  c_ret = cudaStreamDestroy(stream);
  if (c_ret != cudaSuccess) {
    return -1;
  }
  c_ret = cudaFree(result);
  if (c_ret != cudaSuccess) {
    return -1;
  }
  c_ret = cudaFree(input);
  if (c_ret != cudaSuccess) {
    return -1;
  }

  return 0;
}

Output:
With callback took (ms): 47.8729
Without callback took (ms): 1.9317

CUDA 9.2
Windows 10
Visual Studio 2015
Tesla P4

The performance impact of the callback function is huge. I read that callbacks are processed by a driver thread, which may be the cause of the performance impact.

Are my numbers correct, or am I missing something? Can someone from NVIDIA clarify this?

Thanks in advance

On a Tesla P100 on Linux with CUDA 10, running your code, I see 17.9 ms with the callback and 2.9 ms without.

So the cost of each callback is around 30 us. Given that this is a stream function that interacts with the host, this doesn’t strike me as incredibly large. You have a kernel running on the device; a host thread must poll something for completion of that kernel; once completion is observed, the host thread runs the callback routine; once that routine finishes, the host thread signals that the next kernel may be launched. 30 us doesn’t seem ridiculously large for all of that, given a basic operation overhead of something like 1-5 us for the CUDA runtime.

I assume your GPU is in TCC mode.

30us overhead per kernel launch seems incredibly large when your kernels run for ~5us. If your kernels run for 1 second, 30us overhead is in the noise.

If you don’t like a particular aspect of CUDA behavior, I suggest you file a bug at developer.nvidia.com. The specific instructions are linked from a question at the top of this forum.

I’m just being cautious. The docs don’t explicitly say it works that way, so I’m trying to avoid any assumptions. (My guess is the same as yours: that, barring a device-level error or some such, the status is for the stream. But that’s just a guess.)

That’s unexpected. I use callbacks heavily and see nowhere near that hit.

Are you using the TCC driver or the WDDM driver?

Yes, TCC. I don’t believe Tesla P4 can be set to WDDM.

Absolutely agree. Unfortunately, in our system we enqueue many kernels, and an overhead of ~91 us each (on my system) adds up to a 5-6 ms delay, which is too much under our specifications.

I will. Thanks

In the application I’m working on, the callbacks are required. So I never measured with vs without callbacks.

I knew up front that queuing many small kernels would lead to a big performance hit. So the design built in flexibility in the amount of work assigned to each kernel. Testing with queuing many small kernels confirmed the expected behavior.

I’d assumed the hit was from the many kernel invocations, but it could have been from the callback mechanism, or both. nvvp (the NVIDIA Visual Profiler) doesn’t show callback overhead, so I’d never considered this.

Please let me know whether or not you file a bug report on callback overhead.