Hey,
I’m an experienced CUDA developer, and over the past week I have run into a bug I have never seen before. First, I should mention that I work with heavily loaded CUDA systems that use most of the GPU’s capability and resources.
I was trying to normalize complex numbers. During the calculation I have 6464450 threads (256 threads per block), each executing a square root operation. Whenever a large number of threads (> 4506464) tried to execute the square root at the same time, the calculation failed.
When I decreased the number of threads (< 6464400) it worked. In the end I solved it using a grid-stride kernel.
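For reference, the grid-stride approach looks roughly like the sketch below. This is not my production code: the names normalizeComplex and normalizeOnDevice, the float2 layout (x = real part, y = imaginary part), and the fixed grid of 1024 blocks are placeholders I chose for illustration. The idea is that the grid size no longer depends on the number of elements; each thread loops over several elements instead.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical kernel: scale each complex value to unit magnitude.
// Grid-stride loop: each thread handles elements i, i + stride, i + 2*stride, ...
__global__ void normalizeComplex(float2* data, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        float2 z = data[i];
        float mag = sqrtf(z.x * z.x + z.y * z.y);
        if (mag > 0.0f) {                 // leave zero values untouched
            data[i] = make_float2(z.x / mag, z.y / mag);
        }
    }
}

// Hypothetical host helper: copies the data to the GPU, runs the kernel on a
// fixed-size grid, copies the result back, and returns the first error seen.
cudaError_t normalizeOnDevice(std::vector<float2>& host)
{
    const size_t n = host.size();
    const size_t bytes = n * sizeof(float2);
    float2* dev = nullptr;

    cudaError_t err = cudaMalloc((void**)&dev, bytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threads = 256;          // block size from my setup
        const int blocks  = 1024;         // assumed fixed grid, independent of n
        normalizeComplex<<<blocks, threads>>>(dev, n);
        err = cudaGetLastError();         // launch-configuration errors
        if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    return err;
}
```

Checking the value returned by cudaGetLastError and cudaDeviceSynchronize, as the helper does, should at least show whether a failure comes from the launch itself or from the kernel execution.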
I’d like to know whether there is any limit on the number of threads that can execute a square root operation (or any similar operation) at the same time.
As far as I know, each CUDA SM has only four square root ALU units, which means an operation like this should cause a traffic jam, but not a failed operation.
Any suggestions or ideas? Has anyone experienced the same problem?
Any help would be much appreciated.
S