OK, I’ve got a GeForce GTX 275 now. The code doesn’t work, though (the results are garbage), but CUDA signals no errors.
Driver v186.18
CUDA v2.2
Is there anything obviously wrong about this:
#define BLOCKSIZE 512
__global__ static
void pj_inv_pre_kernel (double* lam, double* phi, double* x, double* y, double to_meter, double x0, double y0, double ra)
{
    const unsigned int xIdx = blockIdx.x * blockDim.x + threadIdx.x;

    x [xIdx] = (lam [xIdx] * to_meter - x0) * ra;
    y [xIdx] = (phi [xIdx] * to_meter - y0) * ra;
}
void pj_inv_pre_gpu (double* lam, double* phi, double* x, double* y, double to_meter, double x0, double y0, double ra, unsigned int nPoints)
{
    pj_inv_pre_kernel<<<(nPoints + BLOCKSIZE - 1) / BLOCKSIZE, BLOCKSIZE>>> (lam, phi, x, y, to_meter, x0, y0, ra);
}
void pj_inv_stream (CPJParams* P)
{
    pj_inv_pre_gpu (coords.xIn (), coords.yIn (), coords.xOut (), coords.yOut (), P->to_meter, P->x0, P->y0, P->ra, coords.Length ());
}
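When I say CUDA signals no errors, I mean that a check along these lines after the launch comes back clean (just the gist of it, the helper name is made up):

#include <cstdio>
#include <cuda_runtime.h>

// Made-up helper name; this is just the gist of the check I do.
static void check_cuda (const char* where)
{
    // Errors from the launch itself show up here ...
    cudaError_t err = cudaGetLastError ();
    if (err != cudaSuccess)
        fprintf (stderr, "%s: launch error: %s\n", where, cudaGetErrorString (err));

    // ... and errors from the kernel execution show up here (CUDA 2.2,
    // hence cudaThreadSynchronize rather than a later API).
    err = cudaThreadSynchronize ();
    if (err != cudaSuccess)
        fprintf (stderr, "%s: execution error: %s\n", where, cudaGetErrorString (err));
}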
I have implemented a helper class that does most of the work of managing the CPU- and GPU-side buffers. Please ask if you need more information or code.
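Roughly, the helper boils down to something like this (heavily simplified sketch with made-up internals, not the real code; error checks omitted). The accessors hand out device pointers, which is what the kernel wrapper gets:

#include <cuda_runtime.h>

// Simplified sketch of the buffer helper: host data is copied into device
// buffers before the kernel runs, and the output buffers are copied back
// afterwards.
class CoordBuffers
{
public:
    CoordBuffers (unsigned int n) : nPoints (n)
    {
        cudaMalloc ((void**) &d_xIn,  n * sizeof (double));
        cudaMalloc ((void**) &d_yIn,  n * sizeof (double));
        cudaMalloc ((void**) &d_xOut, n * sizeof (double));
        cudaMalloc ((void**) &d_yOut, n * sizeof (double));
    }

    ~CoordBuffers ()
    {
        cudaFree (d_xIn);  cudaFree (d_yIn);
        cudaFree (d_xOut); cudaFree (d_yOut);
    }

    void Upload (const double* xHost, const double* yHost)
    {
        cudaMemcpy (d_xIn, xHost, nPoints * sizeof (double), cudaMemcpyHostToDevice);
        cudaMemcpy (d_yIn, yHost, nPoints * sizeof (double), cudaMemcpyHostToDevice);
    }

    void Download (double* xHost, double* yHost)
    {
        cudaMemcpy (xHost, d_xOut, nPoints * sizeof (double), cudaMemcpyDeviceToHost);
        cudaMemcpy (yHost, d_yOut, nPoints * sizeof (double), cudaMemcpyDeviceToHost);
    }

    double* xIn ()  { return d_xIn;  }   // device pointers
    double* yIn ()  { return d_yIn;  }
    double* xOut () { return d_xOut; }
    double* yOut () { return d_yOut; }

    unsigned int Length () const { return nPoints; }

private:
    unsigned int nPoints;
    double* d_xIn;
    double* d_yIn;
    double* d_xOut;
    double* d_yOut;
};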
I have to admit that I haven’t really understood how to determine the optimal grid and block parameters for processing an array of arbitrary length.
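What I’ve taken from the programming guide and the SDK examples is roughly this pattern for a 1D array of arbitrary length (generic sketch, not my actual code): round the number of blocks up and let the surplus threads of the last block do nothing.

__global__ static
void scale_kernel (double* data, double factor, unsigned int n)
{
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)                // guard against the rounded-up excess threads
        data [i] *= factor;
}

void scale_gpu (double* d_data, double factor, unsigned int n)
{
    const unsigned int block = 256;                      // threads per block
    const unsigned int grid  = (n + block - 1) / block;  // ceil (n / block)

    // Note: gridDim.x is limited to 65535 on this hardware, so a very large
    // n would need a 2D grid or a loop inside the kernel.
    scale_kernel<<<grid, block>>> (d_data, factor, n);
}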
Edit:
Changing the code as follows doesn’t help:
#define BLOCKDIM 16
void pj_inv_pre_gpu (double* lam, double* phi, double* x, double* y, double to_meter, double x0, double y0, double ra, unsigned int nPoints)
{
    dim3 dimBlock (BLOCKDIM, BLOCKDIM);
    dim3 dimGrid ((nPoints + BLOCKDIM - 1) / BLOCKDIM, (nPoints + BLOCKDIM - 1) / BLOCKDIM);

    pj_inv_pre_kernel<<<dimGrid, dimBlock>>> (lam, phi, x, y, to_meter, x0, y0, ra);
}
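(I suspect that with a 2D grid of 2D blocks the index computation inside the kernel would have to change as well, maybe to something like the helper below, but I’m guessing here.)

// Guess at how the kernel would compute a linear index when launched with a
// 2D grid of 2D blocks (not sure this is right):
__device__ unsigned int linear_index ()
{
    const unsigned int blockId = blockIdx.y * gridDim.x + blockIdx.x;

    return blockId * (blockDim.x * blockDim.y)
         + threadIdx.y * blockDim.x + threadIdx.x;
}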
It looks like no results are being written to the output buffers.
Actually, this stuff looks pretty straightforward, judging from the programming guide and the SDK examples. What the heck am I doing wrong?