Hi,
I’ve been working with CUDA to develop a library of functions that can be called either directly from C++ or from Matlab via a simple MEX wrapper. The background isn’t all that important, except that it is how I found my problem: a kernel that ran fine from C++ crashed horribly when run from Matlab. I narrowed the problem down to a single multiplication in the kernel: it crashed if I multiplied two numbers stored in different memory locations, but not if I multiplied by a constant of the same value. The ptxas output for the two builds showed:
VS2008:
ptxas info : Used 32 registers, 100+40 bytes lmem, 64+64 bytes smem, 144 bytes cmem[0], 44 bytes cmem[1]
Matlab:
ptxas info : Used 36 registers, 80+40 bytes lmem, 64+64 bytes smem, 144 bytes cmem[0], 44 bytes cmem[1]
As soon as I set maxrregcount=32 as a Matlab compile option, everything worked. I suppose this makes sense because I have 512 threads/block and my GTX285 has 16K registers: 512 x 36 = 18432 registers is more than the 16384 available, while 512 x 32 = 16384 just fits. I had thought that this would be taken care of automatically.
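To spell out the register budget as I understand it, here is a small host-side sketch. The per-thread counts come from the ptxas output above; the helper names and the constant are my own, and the check ignores any register allocation granularity, so treat it as a rough estimate only.

```cpp
// Host-side arithmetic only; nothing here calls the CUDA runtime.
// Assumption: 16K 32-bit registers are available to one block (GTX 285).
constexpr int kRegistersAvailable = 16 * 1024;

// Registers one block needs if every thread uses regsPerThread registers.
constexpr int registersPerBlock(int threadsPerBlock, int regsPerThread)
{
    return threadsPerBlock * regsPerThread;
}

// True if a single block of this size fits in the register file.
constexpr bool blockFits(int threadsPerBlock, int regsPerThread)
{
    return registersPerBlock(threadsPerBlock, regsPerThread) <= kRegistersAvailable;
}
```

With my numbers, blockFits(512, 36) is false (the Matlab build) and blockFits(512, 32) is true (the VS2008 build, or the Matlab build with maxrregcount=32).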
In any case, my experience made me realize that I don’t really know too much about efficient use of registers and the like, so my questions are:

In general, I would appreciate any advice/tips on how to minimize register use or to use shared memory.

In particular, can anyone help optimize my coordinate conversion kernel? Ideally, I would like to eliminate local memory access. As shown, the kernel takes an array of lat/lon/alt coordinates and converts them to earth-centred, earth-fixed (ECEF) coordinates. Note that I am using doubles (compute 1.3) because the coordinate transform needs the precision.
// a = semi-major axis, e2 = first eccentricity squared
// lla[i] = (lat, lon, alt), ecef[i] = (x, y, z)
static __global__ void kerLla2Ecef(double a, double e2, double3* lla, double3* ecef, int nCoords)
{
    unsigned int xi = blockIdx.x * blockDim.x + threadIdx.x;
    if(xi < nCoords)
    {
        double W2 = 1.0 - e2 * sin(lla[xi].x)*sin(lla[xi].x);
        double N = a / sqrt(W2);
        ecef[xi].x = ( N + lla[xi].z ) * cos( lla[xi].x ) * cos( lla[xi].y );
        ecef[xi].y = ( N + lla[xi].z ) * cos( lla[xi].x ) * sin( lla[xi].y );
        ecef[xi].z = ( (1.0 - e2)*N + lla[xi].z ) * sin( lla[xi].x );
    }
}
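One variant I am considering, as a sketch only: read each coordinate once into locals and compute each sin/cos once instead of repeating them. I have not measured whether this changes the register or lmem counts reported by ptxas. Vec3d is a hypothetical stand-in for double3 so that the function also compiles as plain C++ for checking on the CPU, and the HD macro and the function name lla2ecef are mine.

```cpp
#include <math.h>

// Expands to __host__ __device__ under nvcc, to nothing in a plain C++ build.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical stand-in for CUDA's double3 (same x, y, z layout).
struct Vec3d { double x, y, z; };

// lla = (lat, lon, alt) with angles assumed to be in radians.
// a = semi-major axis, e2 = first eccentricity squared.
HD inline Vec3d lla2ecef(double a, double e2, Vec3d lla)
{
    // Each trig function is evaluated once and reused.
    const double sinLat = sin(lla.x);
    const double cosLat = cos(lla.x);
    const double sinLon = sin(lla.y);
    const double cosLon = cos(lla.y);

    const double N = a / sqrt(1.0 - e2 * sinLat * sinLat);
    const double r = (N + lla.z) * cosLat;  // distance from the rotation axis

    Vec3d ecef;
    ecef.x = r * cosLon;
    ecef.y = r * sinLon;
    ecef.z = ((1.0 - e2) * N + lla.z) * sinLat;
    return ecef;
}
```

Inside the kernel, the body of the if block would then copy lla[xi] into a local once, call the function, and write the result to ecef[xi] once.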
Thank you.