"unspecified launch failure" runtime failure

Hi everybody,

I have been working on a kernel that is giving me an unusual amount of trouble, and I wonder if someone can help me with it. Each time it runs, either the entire system hangs (requiring a reboot) or my C1060 returns an “unspecified launch failure” right after the kernel call. Here is the kernel (prec is typedef’d as double):

[codebox]extern __shared__ prec shared[];

__global__ void stage(prec* rpx, prec* rpy, prec* rpz, prec* rfx, prec* rfy, prec* rfz)
{
	// ***** VARIABLES INITIALIZATION *****
	const unsigned long loop_size = blockDim.x;
	const unsigned long loop_pos = blockIdx.x;
	const unsigned long loc_pos = threadIdx.x;
	const unsigned long rem_pos = loop_size * loop_pos + loc_pos;

	// ***** SHARED MEMORY INITIALIZATION *****
	// partition the dynamically allocated shared memory into six arrays
	prec* lpx = shared;
	prec* lpy = shared + loop_size;
	prec* lpz = shared + 2*loop_size;
	prec* lfx = shared + 3*loop_size;
	prec* lfy = shared + 4*loop_size;
	prec* lfz = shared + 5*loop_size;
	// copy values from device memory to shared memory;
	// each thread reads the data for one item
	lpx[loc_pos] = rpx[rem_pos];
	lpy[loc_pos] = rpy[rem_pos];
	lpz[loc_pos] = rpz[rem_pos];
	// initialize interaction results
	lfx[loc_pos] = 0;
	lfy[loc_pos] = 0;
	lfz[loc_pos] = 0;
	// synchronize threads
	__syncthreads();

	// ***** COMPUTE INTERACTIONS AMONG LOOP ELEMENTS *****
	unsigned long it = loc_pos + 1;
	unsigned long interactions = loop_size >> 1;
	for(int i = 0; i < interactions; ++i)
	{
		// if the end is reached, wrap around to the beginning
		if(it >= loop_size) it = 0;

		// compute the interaction between items loc_pos and it
		prec dist_x = lpx[it] - lpx[loc_pos];
		prec dist_y = lpy[it] - lpy[loc_pos];
		prec dist_z = lpz[it] - lpz[loc_pos];
		prec dist_n = hypot(dist_z, hypot(dist_x, dist_y));
		prec force = 1/dist_n/dist_n;

		// store the results to shared memory
		__syncthreads();
		lfx[loc_pos] -= force*dist_x/dist_n;
		lfx[it] += force*dist_x/dist_n;
		lfy[loc_pos] -= force*dist_y/dist_n;
		lfy[it] += force*dist_y/dist_n;
		lfz[loc_pos] -= force*dist_z/dist_n;
		lfz[it] += force*dist_z/dist_n;

		// move on to the next item
		++it;
	}

	// ***** SAVE RESULTS TO DEVICE MEMORY *****
	// synchronize threads before writing
	__syncthreads();
	// copy results from shared memory to device memory;
	// each thread saves the data for one item
	rfx[rem_pos] = lfx[loc_pos];
	rfy[rem_pos] = lfy[loc_pos];
	rfz[rem_pos] = lfz[loc_pos];
}[/codebox]

And that is how it is called:

stage<<<7, 128, 128*6*sizeof(prec)>>>(PX, PY, PZ, FX, FY, FZ);
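This is roughly how the failure shows up on my side; a minimal sketch of the error check around the launch, using the runtime API:

[codebox]stage<<<7, 128, 128*6*sizeof(prec)>>>(PX, PY, PZ, FX, FY, FZ);

// launch-configuration problems are reported immediately...
cudaError_t err = cudaGetLastError();
// ...but execution errors such as "unspecified launch failure" only
// surface once the kernel has finished (or died), hence the sync
if (err == cudaSuccess) err = cudaThreadSynchronize();
if (err != cudaSuccess)
	printf("stage: %s\n", cudaGetErrorString(err));[/codebox]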

I am sure that the device memory is correctly initialized (I tested it with other kernels), and the kernel works correctly in device emulation mode. The problem apparently happens where I store the results to shared memory (

lfx[loc_pos] -= force*dist_x/dist_n;

). If I comment out the entire block before compiling, the cubin file for the kernel reports about 10 fewer registers used per thread, and the kernel runs successfully. This makes me guess that the problem is related to the number of registers per thread, but I can’t understand why, because I don’t think I am anywhere near the registers-per-multiprocessor limit.

I am running CUDA 2.1 on a Fedora 10 box with the 180.22 driver. Any suggestions would be greatly appreciated ;)

Thanks for your time,

Claudio A.

Compile for -deviceemu, run through valgrind.

Compile with the flag --ptxas-options=-v to see the number of registers.

How many registers are you using?

thanks guys for replying!

here’s the output of nvcc --ptxas-options=-v -o parallel parallel.cu:

ptxas info : Compiling entry function ‘_Z8stageIImmPdS_S_S_S_S_’

ptxas info : Used 39 registers, 80+80 bytes smem, 24 bytes cmem[1]

ptxas info : Compiling entry function ‘_Z7stageIPdS_S_S_S_S_’

ptxas info : Used 38 registers, 80+80 bytes smem, 24 bytes cmem[1]

ptxas info : Compiling entry function ‘_Z6stagePdS_S_S_S_S_’

ptxas info : Used 26 registers, 64+64 bytes smem, 24 bytes cmem[1]

It says 26 registers for stage (stageI and stageII are other functions, which I am not calling at all right now). Since each block requires 128*6*8B + 128*64B = 14,336B of shared memory, this kernel is shared-memory-bound and at most one block (128 threads) can be resident per SM, which means 128*26 = 3,328 allocated registers out of a total of 16,384.

Please let me know if I am making any error in those calculations… I really don’t understand what I could be doing wrong :(
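For what it’s worth, the relevant limits can also be read straight off the card at runtime; a quick sketch using cudaGetDeviceProperties (device index 0 assumed):

[codebox]#include <cstdio>

int main()
{
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0); // device 0 assumed
	printf("%s: %lu B smem/block, %d regs/block, %d max threads/block\n",
	       prop.name, (unsigned long)prop.sharedMemPerBlock,
	       prop.regsPerBlock, prop.maxThreadsPerBlock);
	return 0;
}[/codebox]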

I have just installed the new toolkit (2.2) and driver (185.18.08), and now I get different errors: ‘The launch timed out and was terminated’, or, if I run my program in a tty that is not controlled by the X server, something like ‘NVRM: Xid (0004:00): 13, 0001 00000000 000050c0 00000368 00000000 00000100’. Often, the application goes on running forever without printing any error at all.

Could it be an X server configuration issue? On the same system I have a GeForce GTX 280 driving Screen0 and a Tesla C1060 that, according to nvidia-settings, has no X screen associated with it (which makes sense). No matter which card I run my program on, I get the same errors, but sometimes when I run on the GeForce the mouse freezes for a while.

Any ideas? Would it be helpful if I posted the nvidia-bug-report output?

Probably a timeout/watchdog issue or you’re running an infinite loop.

From my experience, it is usually fastest to start debugging this kind of thing from an empty kernel and re-enable the kernel code line by line or chunk by chunk. That way you’ll be able to identify the offending line/block.

Make sure that you don’t get your kernel optimized out by the compiler while doing so.

Using valgrind is of course also an option :)

Still, I’ve found the method above to be the fastest.

eyal

Thanks for replying :)

The problem is inside the block:

[codebox]lfx[loc_pos] -= force*dist_x/dist_n;
lfx[it] += force*dist_x/dist_n;
lfy[loc_pos] -= force*dist_y/dist_n;
lfy[it] += force*dist_y/dist_n;
lfz[loc_pos] -= force*dist_z/dist_n;
lfz[it] += force*dist_z/dist_n;[/codebox]

If I replace all the += and -= with just =, the problem disappears (which is what originally made me suspect a register problem). Same if I remove the block entirely.
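If the culprit is a write conflict between different threads’ lfx[loc_pos] and lfx[it] updates, I guess the two halves of the block would need a barrier between them; just a sketch of the idea, untested:

[codebox]// inside the i-loop, replacing the store section:
// each thread's force contribution for this iteration
prec fx = force*dist_x/dist_n;
prec fy = force*dist_y/dist_n;
prec fz = force*dist_z/dist_n;

// first every thread updates its own slot (the loc_pos values are
// distinct, so no two threads touch the same element here)...
__syncthreads();
lfx[loc_pos] -= fx;
lfy[loc_pos] -= fy;
lfz[loc_pos] -= fz;
// ...then, after a barrier, its partner's slot (the it values are
// also distinct across threads within one iteration)
__syncthreads();
lfx[it] += fx;
lfy[it] += fy;
lfz[it] += fz;[/codebox]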

I tried compiling in emulation mode and running Valgrind, but neither reports any errors at all. I also tried cuda-gdb, but it says:

The second line shows the properties of the device obtained with cudaGetDevice (the device used for execution), and it’s printed by the program. The problem is that, as you can see, the application runs on the Tesla, while X should be running on the GeForce only. Also, there are no references to the second card (the Tesla) in my xorg.conf file. Is there anything I can do to make 100% sure that X is not conflicting with the execution of my CUDA kernel?
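In the meantime, one thing I can try is to enumerate the devices and check their watchdog flags before picking one explicitly; a sketch, assuming the 2.2 runtime exposes kernelExecTimeoutEnabled:

[codebox]#include <cstdio>

int main()
{
	int count = 0;
	cudaGetDeviceCount(&count);
	for (int d = 0; d < count; ++d)
	{
		cudaDeviceProp prop;
		cudaGetDeviceProperties(&prop, d);
		// a nonzero kernelExecTimeoutEnabled means the display
		// watchdog applies to kernels on this device
		printf("device %d: %s, watchdog: %d\n",
		       d, prop.name, prop.kernelExecTimeoutEnabled);
	}
	// calling cudaSetDevice(<Tesla index>) before any other CUDA call
	// would then pin the computation to the card without the watchdog
	return 0;
}[/codebox]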