Can't I launch kernel with the maxThreadsPerBlock of GTX260?

Hi…

The following code showed me the correct results while launching the kernel with thread number 256 per block on the GTX260:

CALL gpu_bilinear_interpolation<<<Ndem/256+1, 256>>>(Ndem)

But it failed with the max. threads per block 512. All the d_v(k) were zero.

CALL gpu_bilinear_interpolation<<<Ndem/512+1, 512>>>(Ndem)

Did I do anything wrong?
I use pvf 10.8 on the Win 7.

Thank you in advance.

module dev_bilinear
use cudafor
implicit none
	integer::done
	 real*8,allocatable,device :: d_bx(:,:), d_by(:,:), d_x(:), d_y(:)
	 real*8,allocatable,device :: d_bv(:,:), d_v(:)

contains
!============================================================================
!      The subroutine is to do the interpolation of station o using bilinear method
!============================================================================== 
	Attributes(global) SUBROUTINE gpu_bilinear_interpolation(Ndem)
	use cudadevice
	implicit none
	integer,value :: Ndem
	real*8,device ::  p1, p2, p3, p4, q1, q2, q3, q4
	real*8,device ::  wl1, wl3, d14, wx1, wx4, v1, v4
	logical,device  ::  ck, getv
	integer,device  ::  i, k
				
	k=threadidx%x+ (blockidx%x-1)*blockdim%x
	if( k <= Ndem )then
		getv=.false.
		do i=1, 4
			if (d_x(k) == d_bx(i,k) .AND. d_y(k) == d_by(i,k)) then
				d_v(k) = d_bv(i,k)
				getv=.true.
			end if
		end do

		IF( .not. getv )then	
			call gpu_get_intersection_of_two_lines(d_bx(1,k), d_by(1,k), d_bx(2,k), d_by(2,k), d_x(k), d_y(k), d_x(k), d_y(k)-1.d0, p1, q1, ck)
			call gpu_get_intersection_of_two_lines(d_bx(3,k), d_by(3,k), d_bx(4,k), d_by(4,k), d_x(k), d_y(k), d_x(k), d_y(k)+1.d0, p4, q4, ck)

			wl1 = dsqrt( (p1-d_bx(1,k))**2.d0 +(q1-d_by(1,k))**2.d0 ) /  dsqrt( (d_bx(2,k)-d_bx(1,k))**2.d0 + (d_by(2,k)-d_by(1,k))**2.d0 )
			wl3 = dsqrt( (p4-d_bx(3,k))**2.d0 +(q4-d_by(3,k))**2.d0 ) / dsqrt( (d_bx(4,k)-d_bx(3,k))**2.d0 + (d_by(4,k)-d_by(3,k))**2.d0 )
			d14 = dsqrt( (p1-p4)*(p1-p4) +(q1-q4)*(q1-q4))
			wx1 = dsqrt( (p1-d_x(k))**2.d0 + (q1-d_y(k))**2.d0)/d14
			wx4 = dsqrt( (p4-d_x(k))**2.d0 + (q4-d_y(k))**2.d0 )/d14

			v1 = wl1*(d_bv(2,k)-d_bv(1,k))+d_bv(1,k)
			v4 = wl3*(d_bv(4,k)-d_bv(3,k))+d_bv(3,k)
			d_v(k) = v1*wx4 + v4*wx1
		ENDIF
	else
		call syncthreads()
	end if
	END SUBROUTINE gpu_bilinear_interpolation
!======================================================================================
!      The subroutine is to get the coordinates of the intersection point of two lines
!========================================================================================
	Attributes(device) subroutine gpu_get_intersection_of_two_lines(x1, y1, x2, y2, x3, y3, x4, y4, x, y, ck)
	implicit none
		real*8,device  ::  x1, y1, x2, y2, x3, y3, x4, y4
		real*8,device ::  x, y
		logical,device  ::  ck
		real*8,device ::  m1, m3

		m1=1.0d+10
		m3=1.0d+10
		if (x1 /= x2)  m1 = (y1-y2)/(x1-x2)
		if (x3 /= x4)  m3 = (y3-y4)/(x3-x4)

		if (m1 == m3) then           !  Two lines are parallel
		   ck = .false.
		else
			ck = .true.
			if (x1 == x2) then           ! Line1 is vertical
			   x = x1
			   y = m3*(x-x4)+y4
			else if (x3 == x4) then      ! Line2 is vertical
					x = x3
					y = m1*(x-x2)+y2
			else
				x = (m1*x2-m3*x4+y4-y2)/(m1-m3)
				y = m1*(x-x2)+y2
			end if
		end if
	end subroutine gpu_get_intersection_of_two_lines

end module dev_bilinear

Hi cyfengMIT,

Can you add an error check after the kernel launch to see what the error is?

CALL gpu_bilinear_interpolation<<<Ndem/512+1, 512>>>(Ndem) 
ierr = cudaGetLastError()
if (ierr .ne. 0) then
   print *, cudaGetErrorString(ierr)
endif

Thanks,
Mat

Hi Mat,

I totally forgot the Error Handling APIs.
The error message is “too many resources requested for launch” and the parameter Ndem is 40,992.
How can I modify my code…?
Thank you in advance.

Feng

Hi Feng,

I’m guessing that you’re hitting another limit such as registers or shared memory. Can you try compiling with “-Mcuda=ptxinfo” to see what your usage is?

It looks like a GTX260 has 16,384 registers available per block or 32 per thread when there are 512 threads per block.

  • Mat

Hi Mat,

Yap… you’re right. I get the following message :

"ptxas info : Compiling entry function ‘gpu_bilinear_interpolation’ "
"ptxas info : Used 51 registers, 4+16 bytes smem, 504 bytes cmem[0], 8 bytes cmem[1] "

Does that mean that each thread uses 51 registers, larger than 32, when I launched 512 threads per block ?
Would you tell me how to calculate the amount of registers used in device subprogram?
Dose a single precision real variable use a register? And would you please give me some advice about modifying the code?

Thank you very much.

Feng


Hi Feng,

Does that mean that each thread uses 51 registers, larger than 32, when I launched 512 threads per block ?

This means that each thread will use 51 registers and thus limit you to 321 threads per block (i.e. 16,384/51), though I’d round this to a multiple of 32 (the warp size) or 320 threads.

The other thing you can do is use the flag “-Mcuda=maxregcount:32” to limit the number of registers used by each thread so you can run more threads. The caveat is that each thread will spill the extra registers to global memory making each thread run a bit slower.

I would try multiple configurations to see which one works best.

Would you tell me how to calculate the amount of registers used in device subprogram?

You can’t know exactly except from the ptxas information. Though, things like local scalars and temp variables used to store address calculations are often stored in registers. (The back-end CUDA tools do the actual register allocation). Basically, to use fewer registers, write smaller kernels.

Dose a single precision real variable use a register?

Possibly and double precision would use two.

And would you please give me some advice about modifying the code?

In your case, most of the register usage is coming from your local variables. If you can use single precision instead of double, you will probably get below the 32 registers per thread. You might be save a few if you manually inline “gpu_get_intersection_of_two_lines” since it’s local variables will also be put into registers and inlining will mean that you no longer need them. Finally, if you can reuse some of the local variables and eliminate others, then you can save a few more.

Though, do not make any of these changes if they compromise your algorithm. While performance is important, it is secondary to producing correct and reliable results.

  • Mat

Hi Mat,

I do really appreciate your advice.
I’ll try and keep them all in mind.
Thank you very much. :)

Feng.