Embarrassingly beginner question

Hi all,

So I have added the following code to a project of mine. This is the first CUDA function of many I expect to add. Well, I added it, and everything seemed great, except that when comparing the answer to my standard C code, I find the answer has changed! (Significantly - this is not due to a difference in floating-point precision…)

I ran the code through cuda-gdb (I have an 8800 GTS, so I couldn’t see what was actually going on on the kernel side, only before and after), and it showed me that the answers (modified members of a structure) were NOT changed at all, which is consistent with my results. This led me to believe that my cudaMemcpy needs work, but it looks fine to me.

Does anyone have any ideas/suggestions? Because this is a large piece of code, I will post only 1) the data loading and unloading functions, 2) the kernel function, and 3) the piece of my code where they are called.

void loadneutlist(unsigned int listlength, neutron* host, neutron* device)
{
	//Copies the neutron list to the GPU; assumes cudaMalloc already run for the lists
	cudaMemcpy(device, host, listlength*sizeof(neutron), cudaMemcpyHostToDevice);
}

void getneutlist(unsigned int listlength, neutron* host, neutron* device)
{
	//Copies the neutron list back from the GPU; assumes cudaMalloc already run for the lists
	cudaMemcpy(host, device, listlength*sizeof(neutron), cudaMemcpyDeviceToHost);
}

__global__ void gpuscatter_iso(neutron* elist_d)
//This calcs a new u,v,w, and energy after an isotropic collision
{
	int i = threadIdx.x;

	//Extract the mass number A from the ZAID (ZAID = 1000*Z + A)
	int A = mat_list_d[elist_d[i].cell][elist_d[i].target_nuclide].zaid/1000;
	A = mat_list_d[elist_d[i].cell][elist_d[i].target_nuclide].zaid - 1000*A;

	//Sample the center-of-mass scattering cosine uniformly on [-1,1]
	float mu_cm = gpurng(&elist_d[i].seed, 2.0) - 1;

	//Elastic-scattering kinematics: E' = E*(A^2 + 2*A*mu_cm + 1)/(A+1)^2
	float new_energy = elist_d[i].energy*(A*A + 2*A*mu_cm + 1)/((float)(A*A + 2*A + 1));
	float temp = sqrtf(elist_d[i].energy/new_energy);	//speed ratio (currently unused; overwritten below)

	//Convert the CM cosine to the lab-frame polar scattering angle
	float cos_phi = cosf(atanf(sinf(acosf(mu_cm))/(1.0f/A + mu_cm)));
	float sin_phi = sinf(acosf(cos_phi));

	//Sample the azimuthal rotation angle uniformly
	float cos_w = gpurng(&elist_d[i].seed, 2.0) - 1;
	float sin_w = sinf(acosf(cos_w));

	//Rotate the old direction (ox,oy,oz) through (phi, w)
	temp = sin_phi/sqrtf(1.0f - elist_d[i].oz*elist_d[i].oz);	//reused to save space
	float new_u = temp*(elist_d[i].oy*sin_w - elist_d[i].oz*elist_d[i].ox*cos_w) + elist_d[i].ox*cos_phi;
	float new_v = temp*(-elist_d[i].ox*sin_w - elist_d[i].oz*elist_d[i].oy*cos_w) + elist_d[i].oy*cos_phi;
	float new_w = sin_phi*sqrtf(1.0f - elist_d[i].oz*elist_d[i].oz)*cos_w + elist_d[i].oz*cos_phi;

	//Renormalize the direction if rounding pushed its length above 1
	temp = new_u*new_u + new_v*new_v + new_w*new_w;
	if (temp > 1.0f)
	{
		temp = rsqrtf(temp);	//scale by 1/|direction|, not 1/|direction|^2
		new_u = new_u*temp;
		new_v = new_v*temp;
		new_w = new_w*temp;
	}

	elist_d[i].ox = new_u;
	elist_d[i].oy = new_v;
	elist_d[i].oz = new_w;
	elist_d[i].energy = new_energy;
}
loadneutlist(elength, elist, elist_d);
gpuscatter_iso<<<1,elength>>>(elist_d);
getneutlist(elength, elist, elist_d);

After this point elist is treated as the result and elist_d is unnecessary.

Any help is GREATLY appreciated,

Thanks!

What is elength?

If you try to launch more than 512 threads in a single block, the kernel will fail to launch at all. You can detect such a condition using cudaGetLastError().

The first thing you should do is check the return codes of your cudaMemcpy() calls to see if they were successful. You should also put a cudaThreadSynchronize() after your kernel call and check its return code to see if your kernel even ran. (Note: cudaThreadSynchronize() is not required for correct operation, but it provides a convenient way to check for a launch failure before you get to the cudaMemcpy() during the debugging stage.)
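For example, a minimal sketch of that kind of checking (the check() helper and its message strings are just illustrative, not from the code above):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static void check(cudaError_t err, const char* what)
{
	//Print the CUDA error string and bail out if the call failed
	if (err != cudaSuccess)
	{
		fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}
}

//...
check(cudaMemcpy(elist_d, elist, elength*sizeof(neutron), cudaMemcpyHostToDevice), "copy to device");
gpuscatter_iso<<<1,elength>>>(elist_d);
check(cudaThreadSynchronize(), "kernel launch");	//catches launch failures, e.g. too many threads per block
check(cudaMemcpy(elist, elist_d, elength*sizeof(neutron), cudaMemcpyDeviceToHost), "copy to host");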

Right, at this stage I’m keeping elength small, < 100.

You’re right, I’m getting an invalid device pointer message from my very first cudaMalloc:

neutron* elist_d;
cudaMalloc((void**)&elist_d, nbatch*sizeof(neutron));

In the job I just ran, nbatch was only 100. But even so, that allocation could only fail once it got large enough to use up all of my GPU’s memory, and then I would receive an out-of-memory message, correct?

Actually, it wasn’t at the cudaMalloc for elist_d; the error message came from before that (I forgot about the cudaMallocs and cudaMemcpys I have in an initialization function - it’s early…). I tracked that error down: it was really just a switch in the arguments of cudaMemcpy; I had it as host pointer then device pointer.
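Just to spell out the argument order (the destination pointer comes first, and the kind flag has to agree with it):

cudaMemcpy(elist_d, elist, nbatch*sizeof(neutron), cudaMemcpyHostToDevice);	//device <- host
cudaMemcpy(elist, elist_d, nbatch*sizeof(neutron), cudaMemcpyDeviceToHost);	//host <- device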

But now I find that the error is: Cuda error: kernel invocation: unspecified launch failure.

(I used the cudaSAFEerror macro from Dr. Dobb’s, if that looks familiar.)

That happens right after I synchronize threads, which is right after my kernel call.

OK, unspecified launch failure is sort of the generic “segmentation fault” of CUDA, which means you probably have a memory access violation somewhere. A likely culprit is this 2D array mat_list_d. 2D arrays implemented this way are difficult to get right in CUDA. How do you load data into mat_list_d, and where is it defined?
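For reference, one correct way to build that kind of pointer-based 2D array looks roughly like this (a sketch only - the nuclide type and the NCELLS/NNUCLIDES sizes are stand-ins for whatever mat_list_d actually holds):

//Host-side staging array holding one device pointer per row
nuclide* rows_h[NCELLS];
for (int c = 0; c < NCELLS; c++)
{
	cudaMalloc((void**)&rows_h[c], NNUCLIDES*sizeof(nuclide));
	cudaMemcpy(rows_h[c], host_rows[c], NNUCLIDES*sizeof(nuclide), cudaMemcpyHostToDevice);
}

//Device-side array of row pointers, filled from the staging array
nuclide** mat_list_d;
cudaMalloc((void**)&mat_list_d, NCELLS*sizeof(nuclide*));
cudaMemcpy(mat_list_d, rows_h, NCELLS*sizeof(nuclide*), cudaMemcpyHostToDevice);

The kernel would then take mat_list_d as a nuclide** argument rather than reading it through a file-scope pointer.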

You’re right, that seems to be my problem. I had an extra * in a line, so I was telling CUDA to look for a pointer when I actually wanted it to be looking for a value. I revised the code, but I don’t have access to a CUDA compiler here at work, so I’ll have to wait and see what happens when I get home.

Thanks again

Hmmm, well I may or may not have been doing the 2D arrays correctly (probably not). But I converted the CUDA part of my code to work with just a 1D array (it will be a big pain in the butt to convert all of the code), and I found that even after doing that, it STILL wasn’t working. Of course, I was confused. Then I found that declaring my pointers as globals with __device__ or __constant__ was the final culprit: as soon as I removed the __device__/__constant__ qualifiers, the code worked!

And, good news: now I get the correct answer. Does anyone know why I couldn’t use those variable qualifiers? I didn’t isolate whether the problem was with __device__ or __constant__, but the one variable declared __constant__ wasn’t even used in the little CUDA part of the code I have written so far, so __constant__ likely wasn’t the problem.
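(For anyone who hits this later, a plausible explanation: a file-scope pointer declared __device__ lives in GPU memory, so passing its address straight to cudaMalloc or cudaMemcpy from the host writes to the wrong place. The value of such a symbol has to be set through the symbol-copy call - a rough sketch, with names that are illustrative only:

__device__ neutron* elist_dev;	//file-scope device pointer (the pattern that failed)

neutron* tmp;
cudaMalloc((void**)&tmp, nbatch*sizeof(neutron));	//allocate through an ordinary host-side pointer
cudaMemcpyToSymbol(elist_dev, &tmp, sizeof(neutron*));	//copy the pointer value into the device symbol

Passing the pointer as a kernel argument, as gpuscatter_iso already does for elist_d, sidesteps the issue entirely.)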

Thanks all