"max threads exceeded" error isn't reported

This isn’t a big deal, but shouldn’t
CUDA_CHECK_ERROR(cudaThreadSynchronize())
report trivial errors such as “max threads per block exceeded”?

There are two kinds of errors that can occur with kernel launches:

(1) configuration issues (like exceeding the thread limit), reported synchronously prior to launching the kernel

(2) runtime errors (in particular, “unspecified launch failure”), reported asynchronously after launching the kernel

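To catch both types, call cudaGetLastError() right after the launch and then check the return value of the synchronization call. You can use code like the following (note that this kind of macro is usually unsuitable for production code, which can’t just exit when things go wrong, but it demonstrates the principle):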

#include <stdio.h>   /* fprintf */
#include <stdlib.h>  /* exit, EXIT_FAILURE */

// Macro to catch CUDA errors in kernel launches; use it right after a launch.
// Note: cudaThreadSynchronize() is deprecated in later CUDA releases in favor
// of cudaDeviceSynchronize().
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaThreadSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
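
For code that can't just exit, the same two-step check can be written as a function that returns the error to the caller. The following is only a minimal sketch, not a definitive implementation: checkLaunch and emptyKernel are hypothetical names made up for this example, and it assumes a CUDA version that provides cudaDeviceSynchronize() and a device whose per-block limit is below 4096 threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, used only so there is something to launch.
__global__ void emptyKernel() {}

// Hypothetical helper: reports and returns the first error found instead of
// exiting, so the caller decides how to recover.
static cudaError_t checkLaunch(const char *file, int line)
{
    // Synchronous (pre-launch) errors, e.g. an invalid launch configuration.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // Asynchronous errors raised while the kernel was running.
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error in file '%s' in line %i : %s.\n",
                file, line, cudaGetErrorString(err));
    }
    return err;
}

int main()
{
    // Assumption: 4096 threads per block exceeds the device's limit, so the
    // launch is rejected before the kernel ever runs.
    emptyKernel<<<1, 4096>>>();

    cudaError_t err = checkLaunch(__FILE__, __LINE__);
    return (err == cudaSuccess) ? 0 : 1;
}
```

With an oversized block like this, the error should come from the cudaGetLastError() call rather than from the synchronization, which is the case described in (1) above.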