Unexpected behavior in CUDA square root operation


I’m an experienced CUDA developer. Over the past week I have run into a bug I have never seen before. First, I’d like to mention that I work with heavily loaded CUDA systems that use most of the GPU’s capability and resources.

I was trying to calculate the normalization of an imaginary number. During the calculation I have 6464450 threads trying to execute a square root operation (each block = 256 threads). Whenever I introduced a significant number of threads (> 4506464) trying to execute the square root at the same time, the calculation failed.
When I decreased the number of threads (< 6464400) it worked. Finally, I solved it using a grid step kernel.
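
For readers unfamiliar with the term, a grid step (grid-stride) kernel of the kind mentioned above could look roughly like the sketch below. It is not the actual code from this post: the names `magnitude_kernel` and `compute_magnitudes`, the `float2` layout for the input values, and the fixed grid of 1024 blocks are all assumptions made for illustration. Only the block size of 256 comes from the description above.

```cuda
#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>

// Hypothetical kernel: writes the magnitude sqrt(re^2 + im^2) of each input.
// Each thread walks the array in steps of the total thread count, so the
// grid size does not have to match the number of elements.
__global__ void magnitude_kernel(const float2 *in, float *out, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        out[i] = sqrtf(in[i].x * in[i].x + in[i].y * in[i].y);
    }
}

// Hypothetical host helper: returns 0 on success, 1 on any CUDA error.
int compute_magnitudes(const float2 *h_in, float *h_out, size_t n)
{
    float2 *d_in = NULL;
    float *d_out = NULL;

    cudaError_t err = cudaMalloc((void **)&d_in, n * sizeof(float2));
    if (err == cudaSuccess)
        err = cudaMalloc((void **)&d_out, n * sizeof(float));
    if (err == cudaSuccess)
        err = cudaMemcpy(d_in, h_in, n * sizeof(float2),
                         cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threads = 256;  // block size from the post
        const int blocks = 1024;  // fixed grid size, an arbitrary assumption
        magnitude_kernel<<<blocks, threads>>>(d_in, d_out, n);
        err = cudaGetLastError();          // launch (synchronous) errors
    }
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();     // execution (asynchronous) errors
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                         cudaMemcpyDeviceToHost);

    // cudaFree accepts a null pointer, so this is safe on every path.
    cudaFree(d_in);
    cudaFree(d_out);

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

With this structure the number of launched threads stays constant no matter how large the input is. That matches the workaround described above, but it does not by itself explain why the original launch failed.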

I’d like to know whether there is any limit on the number of threads that can execute a square root operation (or any similar operation) at the same time.

As far as I know, each CUDA SM has only four square root ALU units, which means that an operation like this should cause a traffic jam, but not a failed operation.

Any suggestions or ideas? Has anyone experienced the same problem?

Help would be very much appreciated.


(1) Check the status of all CUDA API calls and all kernel launches rigorously. Ensure that your kernel launch checks capture all possible errors (e.g. code below).
(2) Run your application under cuda-memcheck. Also try the racecheck tool of cuda-memcheck.

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
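
The macro above covers kernel launches. For point (1), the same pattern can be applied to ordinary runtime API calls. The following is only a sketch: `CUDA_CHECK` and `roundtrip` are hypothetical names chosen for illustration, not part of the CUDA toolkit.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper macro: evaluates a CUDA runtime API call once and
// aborts with file/line information if it does not return cudaSuccess.
#define CUDA_CHECK(call)                                               \
do {                                                                   \
    cudaError_t err_ = (call);                                         \
    if (cudaSuccess != err_) {                                         \
        fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",  \
                __FILE__, __LINE__, cudaGetErrorString(err_));         \
        exit(EXIT_FAILURE);                                            \
    }                                                                  \
} while (0)

// Hypothetical usage: copy n floats to the device and back again,
// checking the status of every API call along the way.
void roundtrip(const float *src, float *dst, size_t n)
{
    float *d_buf = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_buf, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(d_buf, src, n * sizeof(float),
                          cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(dst, d_buf, n * sizeof(float),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_buf));
}
```

Wrapping every call this way means a failed allocation or copy is reported where it happens, rather than surfacing later as a confusing kernel failure.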