Embarassingly beginner question

Hi all,

So I have added the following code to a project of mine. This is the first cuda function of many I expect to add. Well, I added it, everything seems great, except when comparing the answer to my standard C code, I find out that the answer has changed! (Significantly - this is not due to a difference in floating point standard…)

I ran the code through cuda-gdb (and I have an 8800GTS, so I couldn’t see what was actually going on the kernel side, just before and after), and it showed me that the answers (modified members of a structure) were NOT changed at all. This finding seems to make sense with what my results show. This lead me to believe that my cudaMemcpy needs work, but that looks fine to me.

Does anyone have any ideas/suggestions? Because this is a large piece of code I will post 1) the kernel function and the data loading and unloading functions, and 3) the piece of my code where it is called.

void loadneutlist(unsigned int listlength, neutron* host, neutron* device)


	//Assumes cudaMalloc already run for the lists

	cudaMemcpy(device, host, listlength*sizeof(neutron), cudaMemcpyHostToDevice);


void getneutlist(unsigned int listlength, neutron* host, neutron* device)


	//Assumes cudaMalloc already run for the lists

	cudaMemcpy(host, device, listlength*sizeof(neutron), cudaMemcpyDeviceToHost);


__global__ void gpuscatter_iso(neutron* elist_d)

//This calcs a new u,v,w, and energy after an isotropic collision


	int A=mat_list_d[elist_d[threadIdx.x].cell][elist_d[threadIdx.x].target_nuclide].zaid/1000;


	float mu_cm=gpurng(&elist_d[threadIdx.x].seed,2.0)-1;

	float new_energy=elist_d[threadIdx.x].energy*(A*A+2*A*mu_cm+1)/((float)(A*A+2*A+1));

	float temp=sqrt(elist_d[threadIdx.x].energy/new_energy);

	float cos_phi=cos(atan(sin(acos(mu_cm))/(1.0/A+mu_cm)));

	float sin_phi=sin(acos(cos_phi));

	float cos_w=gpurng(&elist_d[threadIdx.x].seed,2.0)-1;

	float sin_w=sin(acos(cos_w));

	temp=sin_phi/(sqrt(1-(elist_d[threadIdx.x].oz)*(elist_d[threadIdx.x].oz)));//reused to save space

	float new_u=temp*((elist_d[threadIdx.x].oy)*sin_w-(elist_d[threadIdx.x].oy)*(elist_d[threadIdx.x].ox)*cos_w)+(elist_d[threadIdx.x].ox)*cos_phi;

	float new_v=temp*(-(elist_d[threadIdx.x].ox)*sin_w-(elist_d[threadIdx.x].oz)*(elist_d[threadIdx.x].oy)*cos_w)+(elist_d[threadIdx.x].oy)*cos_phi;

	float new_w=sin_phi*sqrt(1-(elist_d[threadIdx.x].oz)*(elist_d[threadIdx.x].oz))*cos_w+(elist_d[threadIdx.x].oz)*cos_phi;

	temp =new_u*new_u+new_v*new_v+new_w*new_w;

	if (temp>1.0)











loadneutlist(elength, elist, elist_d);


				getneutlist(elength, elist, elist_d);

After this point elist is treated as the result and elist_d is unnecessary.

Any help is GREATLY appreciated,


What is elength?

If you try to launch more than 512 threads in a single block, the kernel will fail to launch at all. You can detect such a condition using cudaGetLastError().

The first thing you should do is check the return codes on your cudaMemcpy() to see if they were successful. You should also put a cudaThreadSynchronize() after your kernel call and check its return code to see if your kernel even ran. (Note: cudaThreadSynchronize() is not required for correct operation, but it provides a convenient way to check if there was a launch failure before you get to the cudaMemcpy() during the debugging stage.)

RIght, at this stage im keeping elength small, <100

You’re right, I’m getting an invalid device pointer message from my very first cudaMalloc:

neutron* elist_d;


In the job I just ran, nbatch was only 100. But even so, that can only get so large it takes up all of my GPUs memory, and then I would receive a message, correct?

Actually, it wasnt at the malloc elist_d, it was before that that the error msg came from (I forgot about the cudamallocs and memcpys i have in an initial function -its early…), but I tracked down that error (it was really just a switch in the arguments of cudamemcpy, i had it as host pointer then device pointer.

But now I find that the error is: Cuda error: kernel invocation: unspecified launch failure.

(I used the cudaSAFEerror from Dr Dobbs, if that looks familiar).

That happens right after i synchro threads, which is right after my kernel call.

OK, unspecified launch failure is sort of the generic “segmentation fault” of CUDA, which means you probably have a memory access violation somewhere. A likely culprit is this 2D array mat_list_d. 2D arrays implemented this way are difficult to get right in CUDA. How do you load data into mat_list_d, and where is it defined?

You’re right, that seems to be my problem. I had an extra * in a line, and I was telling CUDA to look for a pointer when I actually wanted it to be looking for a value. I revised the code, but I don’t have access to a cuda compiler here at work. Ii’ll have to wait and see what happens when I get home.

Thanks again

Hmmm, well I may or may not have been doing the 2-d arrays correctly (probably not). But, I converted the cuda part of my code to working with just a 1-d array (will be a big pain in the butt to convert all of the code), and I found out that after doing that, it STILL wasn’t working. Of course, I was confused. Well, then I found that defining my pointers as globals with device or constant was the final culprit. For as soon as I removed the device/constant flags, the code worked!

And, good news: now I get the correct answer. Does anyone know why I couldnt use the variable flags? I didn’t segregate if the problem was with using device or constant, but the one variable declared constant wasnt even used in the little CUDA part of the code I have written yet, so the constant likely wasnt the problem.

Thanks all