Write to device memory works in emulation only: the write executes correctly on the CPU but not on the GPU

Hi everybody,

I wrote a kernel that:

  1. reads some values from the device memory

  2. copies them into the shared memory

  3. operates on the values in the shared memory

  4. copies the values from the shared memory back to the device memory.

It works perfectly in device emulation mode, but apparently it does nothing when running on a GPU. Here is my kernel:

extern __shared__ char local[];

__global__ void kernel(double* rlx, double* rfx)
{
	const long half_loop = blockDim.x;
	const long full_loop = half_loop << 1;
	const long prec_size = sizeof(double);
	const long this_pos  = threadIdx.x;

	// Partition the dynamic shared memory into two double arrays
	double* llx = (double*)&local[0];
	double* lfx = (double*)&local[full_loop*prec_size];

	// Stage the input into shared memory and clear the accumulator
	llx[this_pos]           = rlx[this_pos];
	llx[half_loop+this_pos] = rlx[half_loop+this_pos];
	lfx[this_pos]           = 0;
	lfx[half_loop+this_pos] = 0;

	long item_pos = this_pos+1;
	for(long crosses=0; crosses < half_loop; ++crosses)
	{
		if(item_pos == full_loop) item_pos = 0;

		// Symmetric pairwise update between this_pos and item_pos
		lfx[this_pos] -= llx[this_pos]-llx[item_pos];
		lfx[item_pos] += llx[this_pos]-llx[item_pos];

		++item_pos;
	}

	// Copy the results from shared memory back to device memory
	rfx[this_pos]           = lfx[this_pos];
	rfx[half_loop+this_pos] = lfx[half_loop+this_pos];
}


and this is how I call it (with items=64, and both dlx and dfx are pointers to device memory):

kernel<<<1, items/2, items*2*sizeof(double)>>>(dlx, dfx);

I assume there might be a problem with the way I am using the shared memory, but I really can’t see what is wrong, especially because the CUDA runtime does NOT return any error.
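For completeness, the host-side context looks roughly like the sketch below (hypothetical variable names beyond dlx/dfx/items from the post; toolkit-2.2-era API). One point worth noting: the <<<>>> launch expression itself returns nothing, so a failed launch is silent unless it is queried explicitly afterwards:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

extern __global__ void kernel(double* rlx, double* rfx);

int main() {
    const int items = 64;
    const size_t bytes = items * sizeof(double);

    // Hypothetical host buffers for illustration
    double hlx[64], hfx[64];
    for (int i = 0; i < items; ++i) hlx[i] = i;

    double *dlx, *dfx;
    cudaMalloc((void**)&dlx, bytes);
    cudaMalloc((void**)&dfx, bytes);
    cudaMemcpy(dlx, hlx, bytes, cudaMemcpyHostToDevice);

    // items/2 threads, items*2*sizeof(double) bytes of dynamic shared memory
    kernel<<<1, items/2, items*2*sizeof(double)>>>(dlx, dfx);

    // A failed launch must be queried explicitly
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) printf("launch: %s\n", cudaGetErrorString(err));
    err = cudaThreadSynchronize();  // cudaDeviceSynchronize in later toolkits
    if (err != cudaSuccess) printf("exec: %s\n", cudaGetErrorString(err));

    cudaMemcpy(hfx, dfx, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dlx);
    cudaFree(dfx);
    return 0;
}
```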

I compiled my program with nvcc -m 64 -arch compute_13 -code sm_13 -o post post.cu. My machine runs Fedora 10 (x86_64) with gcc 4.3.2 on an Intel i7 (2.67GHz) with 6GB of RAM and two GPUs: a GeForce GTX280, which is running the display, and a Tesla C1060. I am using the 2.2 toolkit with the 185.18.08 driver from the NVIDIA website. I attached the complete program file for reference.

I would greatly appreciate any hints. Thanks for your time!
post.zip (1.28 KB)

I ran my application on devices in physically different machines, and it always failed the same way. So, I decided to pick up cuda-gdb and try debugging my kernel. I manually stepped through every line, and when I stepped over line 69 the terminal running cuda-gdb hung beyond any hope of recovery (Ctrl-C would not work).

Am I using any software / hardware configuration (described in my previous post in the thread) known to be unstable? I tried many different CUDA applications, and every one that involves shared memory fails for me. So, I am wondering if there could be a driver configuration error on all my machines (all Fedora 10, driver version 185.18.08) …

Run with -deviceemu and valgrind; you’re probably passing a host pointer to the device.
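A sketch of that workflow (toolkit-2.2-era flags, assuming the source file from the post): build for device emulation, then run under valgrind so that host-pointer misuse shows up as ordinary memory errors.

```shell
nvcc -deviceemu -m 64 -o post_emu post.cu
valgrind ./post_emu
```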

Thanks for answering, I really appreciate your attention to this matter.

If you comment out line 99 (which is useless) and run the program under valgrind / memcheck, no memory errors are found.

cuda-gdb (on the program compiled WITHOUT -deviceemu) shows that the arrays are correctly initialized and accessible before line 69; is that the behavior we would expect if the pointers pointed to nonsense areas? I tried substituting lines 69 and 70 with constant value assignments (as in lfx[threadIdx.x] = 6;). Right after executing line 69, calling the debugger command print lfx[0] (for thread 0) reports an assigned value of 5.3671573228516165e-315. If I change all the declarations (both device and host) from double to float, print lfx[0] has the correct value of 6, but the application still fails to perform operations (sums) on the shared memory.

I am really clueless about what is going on. The SDK examples run without reporting any errors, so apparently only my programs are faulty, yet even the simplest program behaves incorrectly… is there any way to check whether the toolkit / driver install is sane? (But the SDK examples compiled, so I am puzzled.)

Yep, looks like a compiler bug. I’ve filed it and will let you know.

Easy workaround: change const long half_loop to const int half_loop. Seems to work for me…

Thank you very much! Following your tip makes the application run correctly. I really appreciate your support :thumbup: