Hi everybody,
I have been working on a kernel that is giving me an unusual amount of trouble, and I wonder if someone can help me with it. Each time it runs, either the entire system hangs (requiring a reboot) or my C1060 returns a “unspecified launch failure” right after the kernel call. Here is the kernel (prec is typedef as double):
[codebox]extern shared prec shared;
global void stage(prec* rpx, prec* rpy, prec* rpz, prec* rfx, prec* rfy, prec* rfz)
{
// ***** VARIABLES INITIALIZATION *****
const unsigned long loop_size = blockDim.x;
const unsigned long loop_pos = blockIdx.x;
const unsigned long loc_pos = threadIdx.x;
const unsigned long rem_pos = loop_size * loop_pos + loc_pos;
// ***** SHARED MEMORY INITIALIZATION *****
// allocate shared memory
prec* lpx = shared;
prec* lpy = shared + loop_size;
prec* lpz = shared + 2*loop_size;
prec* lfx = shared + 3*loop_size;
prec* lfy = shared + 4*loop_size;
prec* lfz = shared + 5*loop_size;
// copy values from device memory to shared memory
// each thread reads the data for one item
lpx[loc_pos] = rpx[rem_pos];
lpy[loc_pos] = rpy[rem_pos];
lpz[loc_pos] = rpz[rem_pos];
// initialize interactions results
lfx[loc_pos] = 0;
lfy[loc_pos] = 0;
lfz[loc_pos] = 0;
// synchronize threads
__syncthreads();
// ***** COMPUTE INTERACTIONS AMONG LOOP ELEMENTS *****
unsigned long it = loc_pos + 1;
unsigned long interactions = loop_size >> 1;
for(int i=0; i < interactions; ++i)
{
// if the end is reached, loop back to the beginning
if(it >= loop_size) it = 0;
// compute interactions
prec dist_x = lpx[it] - lpx[loc_pos];
prec dist_y = lpy[it] - lpy[loc_pos];
prec dist_z = lpz[it] - lpz[loc_pos];
prec dist_n = hypot(dist_z, hypot(dist_x, dist_y));
prec force = 1/dist_n/dist_n;
// store the results to shared memory
__syncthreads();
lfx[loc_pos] = force*dist_x/dist_n;
lfx[it] += force*dist_x/dist_n;
lfy[loc_pos] -= force*dist_y/dist_n;
lfy[it] += force*dist_y/dist_n;
lfz[loc_pos] -= force*dist_z/dist_n;
lfz[it] += force*dist_z/dist_n;
// move on to next item
++it;
}
// ***** SAVE RESULTS TO DEVICE MEMORY *****
// synchronize threads before writing
__syncthreads();
// copy results from shared memory to device memory
// each thread saves the data for one item
rfx[rem_pos] = lfx[loc_pos];
rfy[rem_pos] = lfy[loc_pos];
rfz[rem_pos] = lfz[loc_pos];
}[/codebox]
And that is how it is called:
stage<<<7, 128, 128*6*sizeof(prec)>>>(PX, PY, PZ, FX, FY, FZ);
I am sure that the device memory is correctly initialized (I tested it with other kernels), and the kernel works correctly in device emulation mode. The problem apparently happens when I am storing the results to shared memory (
lfx[loc_pos] = force*dist_x/dist_n;
). If I compile commenting the entire block, the cubin file for the kernel reports some 10 less registers used per thread, and the kernel runs successfully. This make me guess that the problem is related to the number of registers per thread, but I can’t understand why because I don’t think that I am in any way going past the registers-per-processor limit.
I am running CUDA 2.1 on a Fedora 10 box with the 180.22 driver. Any suggestion would be greatly appreciated ;)
Thanks for your time,
Claudio A.