One out of many threads responds with correct answer

Hi all,

I am really perplexed by the behaviour of my kernel:

Only one or two of the 5 threads (the program will run more threads once it works) come back with the same answer as the same function run on the CPU!

In the kernel I have a few lines where the threads access an index of an array that has nothing to do with which thread I am in (i.e., array[threadIdx.x]=otherarray[independent_number];).

If I just set that line equal to a constant instead of otherarray[…], then the CPU and GPU match for all threads, so this makes me think that's where my problem is, but why??

I NEVER write to otherarray, only read it. It is sort of my ‘global’ problem data that the calculations depend on, but it never changes.

I would love to put otherarray into constant space, but at roughly 76 KB it is too large to fit in the 64 KB of constant memory.

Thanks all,


I just want to add that I have tried this kernel in a driver code and it works perfectly! Any idea what's going on?

I don’t think there is enough information to tell what the problem is. My only speculation is that otherarray on the GPU doesn’t contain what you think it does for some reason. Can you reproduce the problem with a simple enough program that you can paste it here?

Maybe some of your pointers point to host memory instead of device memory?
I believe we could help you much more if you could paste the code in question instead of just describing what it is supposed to do.

Ok, sorry about the delay, I’ve been out of commission for a while (percocet is not conducive to good programming!)

I narrowed the problem down further: if my zaid is larger than 9999, that is when the GPU produces the correct answer.

Is there something about integer division in CUDA that I don’t know about??? I understand it’s slow, but it should still produce the correct answer, right?

The mat_list_d array is a 2-D array that I flattened to 1-D. It’s also possible that I am not accessing it the way I think I am, so I’ve included the function that converts the 2-D array to a 1-D array below.

Thanks all!

[codebox]__global__ void gpuscatter_iso(neutron* elist_d, materials* mat_list_d, int offset, int listlength)
{
	//This calcs a new u,v,w, and energy after an isotropic collision
	int index; /* the computation of index was lost in the forum formatting */

	if (index<listlength)
	{
		int A; /* the assignment to A was also lost in the forum formatting */

		int target_nuclide=elist_d[index].target_nuclide;
		int cell=elist_d[index].cell;
		int B=mat_list_d[NUCLIDE_MAX*target_nuclide+cell].zaid;
		int C=B/1000;

		float mu_cm=gpurng(&elist_d[index].seed,2.0f)-1.0f;
		float temp;

		float new_energy=elist_d[index].energy*(A*A+2.0f*A*mu_cm+1.0f)/(A*A+2.0f*A+1.0f);

		float cos_phi=cos(atan(sin(acos(mu_cm))/(1.0f/A+mu_cm)));
		float sin_phi=sin(acos(cos_phi));
		float cos_w=gpurng(&elist_d[index].seed,2.0f)-1.0f;
		float sin_w=sin(acos(cos_w));
		temp=sin_phi/(sqrt(1-(elist_d[index].oz)*(elist_d[index].oz))); //reused to save registers
		float new_u=temp*((elist_d[index].oy)*sin_w-(elist_d[index].oy)*(elist_d[index].ox)*cos_w)+(elist_d[index]. /* rest of this line was cut off in the post */
		float new_v=temp*(-(elist_d[index].ox)*sin_w-(elist_d[index].oz)*(elist_d[index].oy)*cos_w)+(elist_d[index]. /* rest of this line was cut off in the post */
		temp= /* start of this line was cut off */ (elist_d[index].oz)*cos_phi; //used instead of float new_w to save registers
		cos_phi=new_u*new_u+new_v*new_v+temp*temp; //reused to save registers
		if (cos_phi>1.0f)
		/* the remainder of the kernel was lost in the forum formatting */
	}
}

void convert2darray(materials mat2d[NUCLIDE_MAX], materials * mat1d) /* a dimension of mat2d may have been eaten by the forum */
{
	int k=0;

	for (int i=0; i<OBJECT_MAX; i++)
		for (int j=0; j<NUCLIDE_MAX; j++)
		{
			/* loop body lost in the forum formatting */
		}
}[/codebox]

Well, I guess I can’t blame the percocet… but I just figured it out: I had my access to mat_list_d wrong. It works well now. Wow, now it’s time to do my timing studies! Thanks for the help everyone!