Unexpected behavior in CUDA square root operation

Hey,

I’m an experienced CUDA developer. Over the past week I have run into a bug I have never seen before. First, I’d like to mention that I work with heavily loaded CUDA systems that use most of the GPU’s capability and resources.

I was trying to calculate the norm of complex numbers. During the calculation I have 6464450 threads executing a square root operation (each block = 256 threads). Whenever I introduced a significant number of threads (> 4506464) executing square root at the same time, the calculation failed.
When I decreased the number of threads (< 6464400) it worked. In the end I solved it using a grid-stride kernel.
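
For readers unfamiliar with the workaround mentioned above, here is a minimal sketch of a grid-stride kernel. It is not the poster's code: the kernel name `complexNorm`, the host wrapper `computeNorms`, the `float2` data layout, and the fixed grid of 1024 blocks are all assumptions made for illustration. The point is that the launch size no longer depends on the element count.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Grid-stride kernel (hypothetical name): each thread processes elements
// i, i + stride, i + 2*stride, ... so a fixed-size grid covers any n.
__global__ void complexNorm(const float2* in, float* out, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        float2 z = in[i];
        out[i] = sqrtf(z.x * z.x + z.y * z.y);   // |z| = sqrt(re^2 + im^2)
    }
}

// Hypothetical host wrapper: copies data to the device, launches the kernel
// with a fixed grid, and returns false if any CUDA call or the kernel fails.
bool computeNorms(const std::vector<float2>& h_in, std::vector<float>& h_out)
{
    const size_t n = h_in.size();
    h_out.resize(n);
    if (n == 0) return true;

    float2* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float2)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return false; }

    bool ok = cudaMemcpy(d_in, h_in.data(), n * sizeof(float2),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;   // block size from the post
        const int blocks = 1024;   // assumed fixed grid size, independent of n
        complexNorm<<<blocks, threads>>>(d_in, d_out, n);
        // Check the launch itself first, then wait for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaDeviceSynchronize() == cudaSuccess;
    }
    if (ok) {
        ok = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}
```

Calling `computeNorms` with 6464450 input elements launches only 1024 × 256 threads, and each thread loops over roughly 25 elements.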

I’d like to know whether there is any limit on the number of threads that can execute a square root operation (or any other similar operation).

As far as I know, each CUDA SM has only four square root ALU units, which means an operation like this should cause a traffic jam but not a failed operation.

Any suggestions, ideas, or reports from people who have experienced the same problem?

Help would be much appreciated.

S

(1) Check the status of all CUDA API calls and all kernel launches rigorously. Ensure that your kernel launch checks capture all possible errors (e.g. with the code below).
(2) Run your application under cuda-memcheck (superseded by compute-sanitizer in recent CUDA toolkits). Also try the racecheck tool of cuda-memcheck.

// Macro to catch CUDA errors in kernel launches.
// Uses fprintf() and exit(), so it needs <stdio.h> and <stdlib.h>.
#include <stdio.h>
#include <stdlib.h>
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed, such as an   */ \
    /* unspecified launch failure (ULF). cudaDeviceSynchronize()   */ \
    /* replaces the deprecated cudaThreadSynchronize().            */ \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
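
To illustrate what the first check in the macro catches, here is a small sketch that performs the same two checks but returns the error instead of exiting. The names `emptyKernel` and `launchStatus` are hypothetical and used only for this illustration. It assumes a GPU whose per-block thread limit is 1024, so a request for 4096 threads per block is rejected before the kernel ever runs.

```cuda
#include <cuda_runtime.h>

// Trivial kernel (hypothetical), used only to exercise the launch path.
__global__ void emptyKernel() {}

// Hypothetical helper: same two checks as the macro above, but the error
// code is returned to the caller instead of terminating the process.
cudaError_t launchStatus(unsigned blocks, unsigned threads)
{
    emptyKernel<<<blocks, threads>>>();

    // Synchronous check: invalid launch configurations show up here.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // Asynchronous check: errors raised while the kernel was executing.
    return cudaDeviceSynchronize();
}
```

Passing the result to `cudaGetErrorString()` gives a readable message. With the assumed limits, `launchStatus(1, 256)` should return `cudaSuccess`, while `launchStatus(1, 4096)` should return `cudaErrorInvalidConfiguration`. A launch that fails this way never runs, which is easy to miss when the return status is not checked.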