32x32 block size problem

Hi,

I am currently writing some simple CUDA Fortran code (I'm just learning CUDA Fortran) and came across some strange behavior with regard to the grid. The card I am using is a GTX580, which should allow a block size of 32x32.

My code works fine for almost any block size (e.g., 8x8, 8x16, 8x32, 15x32, 16x32). However, when I choose something larger than 16x32 (e.g., 17x32), the code returns all zeros. I did some debugging but just can’t seem to figure out why this should happen.

Is there something I am missing with this card, such that I actually cannot go beyond a block size of 512 threads?

Any help is greatly appreciated.

Jan

Ps:
The code just computes some matrix, where I figure out the matrix indexing by

  i = (blockidx%x-1) * blockdim%x + threadidx%x
  j = (blockidx%y-1) * blockdim%y + threadidx%y

And my grid is defined:

dimGrid = dim3( NX/NBLX, NY/NBLY, 1)                                                                     
dimBlock = dim3( NBLX, NBLY, 1 )

NX and NY set the size of the matrix, and NBLX and NBLY set the block size.
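
For reference, a stripped-down sketch of what the kernel does (the kernel name, the array A, and the fill expression are just placeholders, not my real code):

module matkernel
  use cudafor
contains
  attributes(global) subroutine fillmat(A, NX, NY)
    integer, value :: NX, NY
    real :: A(NX, NY)
    integer :: i, j
    i = (blockidx%x-1) * blockdim%x + threadidx%x
    j = (blockidx%y-1) * blockdim%y + threadidx%y
    ! guard threads that fall outside the matrix in case NX or NY
    ! is not a multiple of the block size
    if (i <= NX .and. j <= NY) then
       A(i,j) = real(i + (j-1)*NX)   ! placeholder computation
    endif
  end subroutine fillmat
end module matkernel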

Hi Jan,

What is the error message returned after the kernel launch? My guess is that you're hitting some other limit, such as the number of registers, shared memory, etc.

  • Mat

To check the error status of a kernel:

call somekernel<<<blocks, threads>>>(dA, dB, dC)
ierr = cudaGetLastError()
if (ierr .ne. 0) then
   print *, cudaGetErrorString(ierr)
endif
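
Note that checking cudaGetLastError right after the launch only catches launch failures (like a bad configuration). If you also want to catch errors that happen while the kernel is running, check the status again after a synchronization, something along these lines:

ierr = cudaThreadSynchronize()
if (ierr .ne. 0) then
   print *, cudaGetErrorString(ierr)
endif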

OK thanks. The error I get is

“too many resources requested for launch”

So it appears your guess is correct.
When I add
-Mcuda=maxregcount:32
the code runs fine (no error and correct answer). Does it make sense to set this limit or is it better to limit the block size?
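
For completeness, my compile line now looks like this (the file and executable names are just placeholders):

pgfortran -Mcuda=maxregcount:32 -o matcalc matcalc.cuf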

Jan

Does it make sense to set this limit or is it better to limit the block size?

Increasing the number of active threads (i.e., the occupancy) can lead to better performance simply because you are utilising the GPU more. However, if the cost is that each thread has to make more fetches from global memory, which is what happens when you restrict the number of registers per thread and values spill to local memory, the extra traffic may negate the improvement.
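
To see why the cut-off lands right at 16x32 on your GTX580: a Fermi (compute capability 2.0) multiprocessor has 32768 32-bit registers, and a single thread can use at most 63 of them. I'm guessing your kernel is close to that per-thread maximum (if I remember right, -Mcuda=ptxinfo will show you the actual count), in which case the numbers work out roughly like this:

  registers per multiprocessor (Fermi)             : 32768
  16x32 block: 512 threads x 63 registers = 32256  -> fits
  17x32 block: 544 threads x 63 registers = 34272  -> over the limit, launch fails
  with maxregcount:32: 544 x 32 = 17408            -> fits again, but spills to memory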

Tuning the launch configuration is a bit of a black art, so it's best to try several configurations and see what works for the particular kernel. Be sure to profile, either via pgcollect/PGDBG or by setting the environment variable CUDA_PROFILE=1, to gauge what works best.
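
For example, with the command-line profiler you just set the variable before running your program (the exact log file name can vary with the CUDA version):

export CUDA_PROFILE=1
./a.out
cat cuda_profile_0.log    # per-kernel GPU times, occupancy, etc.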

  • Mat