Using assert in CUDA code

Hi all

I implemented my own assert in CUDA code, by using the printf function available with the Fermi architecture.

#define cudaAssert(condition) \
  if (!(condition)) printf("Assertion %s failed!\n", #condition)
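For illustration, here is a minimal sketch of how such a macro might be used inside a kernel (the kernel name and arguments are made up for this example):

```cuda
#include <cstdio>

#define cudaAssert(condition) \
  if (!(condition)) printf("Assertion %s failed!\n", #condition)

// Hypothetical kernel: scales an array, asserting the pointer is valid.
__global__ void scaleKernel(float *data, int n, float factor)
{
    cudaAssert(data != NULL);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}
```

Note that device-side printf requires compute capability 2.0 (Fermi) or later, and the output is only flushed to the host at certain points, such as kernel completion.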

As you can see the code just complains if an assertion does not hold but it does not abort the kernel execution. How can I do that? Is there something like exit(-1) for CUDA kernels?

PTX has the trap instruction to abort a kernel. I don't think it's currently exposed in CUDA C, so you need to use inline assembly.

Thanks for the hint, I quickly tried it out.

#define cudaAssert(condition) \
  if (!(condition)) { printf("Assertion %s failed!\n", #condition); asm("trap;"); }

The kernel then immediately terminates if an assertion is not satisfied. However, the error printed on the command line is less helpful:

Cuda error: Error when invoking the kernel.: unspecified launch failure.

Also, it does not print the message anymore. I think this is because printf does not happen instantly but a little bit later (after the kernel successfully terminates?). I wonder if there is a way to do both: print an error message and exit the kernel.

Use a global flag which each thread reads at the beginning of the kernel. If the flag is set, have the threads exit gracefully. The assert can use an atomic operation to set the flag. That way once a thread hits an error, every subsequent newly spawned thread will just exit straight away without computation.
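A minimal sketch of that flag technique might look like this (the variable and kernel names are my own, not from the original post):

```cuda
#include <cstdio>

// Global abort flag; reset to 0 from the host before each launch,
// e.g. with cudaMemcpyToSymbol.
__device__ int g_assertFailed = 0;

#define cudaAssert(condition)                             \
    if (!(condition)) {                                   \
        printf("Assertion %s failed!\n", #condition);     \
        atomicExch(&g_assertFailed, 1);                   \
    }

__global__ void myKernel(float *data, int n)
{
    // Bail out early if another thread has already hit an assert.
    if (g_assertFailed)
        return;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cudaAssert(data != NULL);
    if (i < n)
        data[i] *= 2.0f;
}
```

Since the kernel returns normally instead of trapping, the buffered printf output still reaches the host, and the host can additionally read the flag back to detect the failure.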

I put the flag in shared memory, so it kills all threads of the kernel in the same block. I don't use atomics; I want it to die with low overhead.

Often if one thread hits an error, they all will.

That’s a good technique that I also use in my programs.

I was thinking the other way around though: Instead of using printf, just have a global variable where you save the line number of the failed assertion.

Actually, that is a really good idea. Write out the line number and the thread and block indices to global memory, then execute a trap instruction. Host-side error handlers can then read back the global memory and report the error. On Fermi, using zero-copy memory and executing a __threadfence_system() call before the trap would automagically ensure the assert information is already visible on the host when the kernel dies.
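A sketch of the device side of that idea, assuming the host has mapped a small record with cudaHostAlloc(..., cudaHostAllocMapped) and stored the corresponding device pointer in a global (all names here are illustrative):

```cuda
// Record written by the failing thread just before trapping.
struct AssertInfo {
    int  line;
    uint3 tid;
    uint3 bid;
};

// Assumed to point at zero-copy (mapped) host memory; the host would set
// this up via cudaHostGetDevicePointer and cudaMemcpyToSymbol.
__device__ AssertInfo *g_assertInfo;

#define cudaAssert(condition)                                        \
    if (!(condition)) {                                              \
        g_assertInfo->line = __LINE__;                               \
        g_assertInfo->tid  = threadIdx;                              \
        g_assertInfo->bid  = blockIdx;                               \
        __threadfence_system();  /* flush the writes to the host */  \
        asm("trap;");                                                \
    }
```

After the launch fails with an error, the host can read the AssertInfo struct directly from the mapped host allocation, since the __threadfence_system() call ordered the writes before the trap.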

But if you make every thread check a flag in global memory before it starts, aren't you slowing down the code even when there are no errors?

It won’t cause any warp divergence; every thread reads the same value non-atomically, and on Fermi the combined effects of the L1 and L2 caches mean there won’t actually be reads from global memory very often, just fetches from cache. You could argue the same about any error checking code. After all, any error checking code is superfluous when there are no errors, right?

I guess I was not thinking particularly of Fermi.

Do we have any measurements of how effective the Fermi caches are?

Yep, not too worried about burning a few processing cycles if it makes debugging easier.

My concern was always having unneeded I/O. But again, we agree it is better to have working code than to still be trying to figure out what went wrong.

If you are worried about the extra cycles, use the trap approach. No extra cycles are spent (apart from evaluating the condition inside the assertion, of course) unless something actually goes wrong and the kernel returns an error.