Help with register/memory management

Hi,

I’ve been working with CUDA to develop a library of functions that can be called either directly from C++ or from Matlab via a simple MEX wrapper. The setup only matters because of the problem it exposed: one kernel ran fine from C++ but crashed horribly when run from Matlab. I narrowed the problem down to a single multiplication in the kernel: it crashed if I multiplied two numbers stored in different memory locations, but not if I multiplied by a literal constant of the same value. The ptxas output for the two builds showed:

VS2008:

ptxas info	: Used 32 registers, 100+40 bytes lmem, 64+64 bytes smem, 144 bytes cmem[0], 44 bytes cmem[1]

Matlab:

ptxas info	: Used 36 registers, 80+40 bytes lmem, 64+64 bytes smem, 144 bytes cmem[0], 44 bytes cmem[1]

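As soon as I set -maxrregcount=32 as a Matlab compile option, everything worked. I suppose this makes sense: I launch 512 threads/block and my GTX285 has 16K (16,384) registers per multiprocessor, so 36 registers/thread would need 18,432 registers, while 32 registers/thread needs exactly 16,384. I thought that this would be taken care of automatically.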
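To make the register budget concrete, here is a minimal sketch that checks registers per thread times threads per block against the register file, and queries the real numbers at run time. The names blockFitsRegisterFile, kerDummy and reportRegisterBudget are hypothetical and not part of my library. The check ignores register allocation granularity, so it is only a rough guide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if one block's registers fit in the register file.
// Ignores allocation granularity, so treat the result as a rough check only.
inline bool blockFitsRegisterFile(int regsPerThread, int threadsPerBlock, int regsAvailable)
{
    return regsPerThread * threadsPerBlock <= regsAvailable;
}

// Hypothetical stand-in kernel; substitute the kernel you want to inspect.
__global__ void kerDummy(double* out)
{
    if (out) out[threadIdx.x] = 0.0;
}

// Prints what the compiler actually allocated for kerDummy and compares it
// with the register limit reported for device 0.
inline bool reportRegisterBudget(int threadsPerBlock)
{
    cudaFuncAttributes attr;
    cudaDeviceProp prop;
    if (cudaFuncGetAttributes(&attr, kerDummy) != cudaSuccess) return false;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;

    printf("regs/thread: %d, lmem bytes/thread: %lu, regs/block limit: %d\n",
           attr.numRegs, (unsigned long)attr.localSizeBytes, prop.regsPerBlock);

    return blockFitsRegisterFile(attr.numRegs, threadsPerBlock, prop.regsPerBlock);
}
```

With the numbers from the two builds, blockFitsRegisterFile(36, 512, 16384) is false and blockFitsRegisterFile(32, 512, 16384) is true. Checking cudaGetLastError() right after the launch might also have reported a launch failure instead of a crash. On toolkits that support it, putting __launch_bounds__(512) on the kernel is a per-kernel alternative to the global -maxrregcount flag.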

In any case, my experience made me realize that I don’t really know too much about efficient use of registers and the like, so my questions are:

  1. In general, I would appreciate any advice/tips on how to minimize register use or to use shared memory.

  2. In particular, can anyone help optimize my coordinate conversion kernel? Ideally, I would like to eliminate local memory access. As shown, the kernel takes an array of lat/lon/alt coordinates and converts them to earth-centred, earth-fixed (ECEF) coordinates. Note that I am using doubles (compute 1.3) because the coordinate transform needs the precision.

static __global__ void kerLla2Ecef(double a, double e2, double3* lla, double3* ecef, int nCoords)
{
	unsigned int xi = blockIdx.x * blockDim.x + threadIdx.x;
	if(xi < nCoords)
	{
		double W2 = 1.0 - e2 * sin(lla[xi].x)*sin(lla[xi].x);
		double N = a / sqrt(W2);

		ecef[xi].x = ( N + lla[xi].z ) * cos( lla[xi].x ) * cos( lla[xi].y );
		ecef[xi].y = ( N + lla[xi].z ) * cos( lla[xi].x ) * sin( lla[xi].y );
		ecef[xi].z = ( (1-e2)*N + lla[xi].z ) * sin( lla[xi].x );
	}
}
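One direction I can see is sketched below. It assumes the repeated lla[xi] reads and repeated sin/cos calls are part of the problem. Each input element is read once into a local copy, each trig value is computed once, and the result is written once. The names lla2ecefPoint and kerLla2EcefCached are hypothetical. I have not confirmed that this changes the ptxas numbers.

```cuda
#include <math.h>
#include <cuda_runtime.h>

// Hypothetical helper (not in the kernel above): converts a single point.
// p.x = latitude, p.y = longitude (radians), p.z = altitude, as in kerLla2Ecef.
__host__ __device__ inline double3 lla2ecefPoint(double a, double e2, double3 p)
{
    // Each trig value is computed exactly once and reused.
    const double sinLat = sin(p.x);
    const double cosLat = cos(p.x);
    const double sinLon = sin(p.y);
    const double cosLon = cos(p.y);

    const double N = a / sqrt(1.0 - e2 * sinLat * sinLat);
    const double r = (N + p.z) * cosLat;   // factor shared by x and y

    return make_double3(r * cosLon,
                        r * sinLon,
                        ((1.0 - e2) * N + p.z) * sinLat);
}

// Hypothetical variant of kerLla2Ecef: one read and one write per thread.
static __global__ void kerLla2EcefCached(double a, double e2,
                                         const double3* lla, double3* ecef,
                                         int nCoords)
{
    const int xi = blockIdx.x * blockDim.x + threadIdx.x;
    if (xi < nCoords)
    {
        const double3 p = lla[xi];            // single read of the input element
        ecef[xi] = lla2ecefPoint(a, e2, p);   // single write of the result
    }
}
```

This cuts the trig calls from seven to four per point. Whether it removes the lmem usage is a separate question, since the double-precision sin/cos routines may themselves use local memory in their argument-reduction slow path. The device-side sincos() call might reduce the work further, but I have not tried it here.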

Thank you.