Number of threads per block affects the answer: a VERY strange error that should not happen

Hi all

I have been having a very strange error. My kernel gives the “right” answer only if the number of threads per block is <= 8. The number of blocks doesn’t matter.

Generally, this would mean I have some problem with shared memory access (some type of conflict). But in my case there is NO inter-thread data transfer via shared memory; each thread just reuses its own chunk of shared memory in the kernel.
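One thing worth ruling out with this kind of layout is the size of the shared array: if every thread owns a private slice, the total allocation has to grow with the block size. The following is only a minimal sketch of that pattern, not the code from the attachment. The names `private_slices`, `shared_bytes` and `ROWS` are hypothetical, and the three-values-per-thread layout is an assumption based on the `sh[k][ind]` access quoted later in the thread.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

#define ROWS 3  // assumption: each thread keeps three values in shared memory

// Hypothetical kernel: every thread touches only its own column of sh,
// so no __syncthreads() is needed between the write and the read.
__global__ void private_slices(const double *in, double *out, int n)
{
    extern __shared__ double sh[];   // must hold ROWS * blockDim.x doubles
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    if (gid >= n) return;

    for (int k = 0; k < ROWS; ++k)
        sh[k * blockDim.x + tid] = in[gid] * (k + 1);

    double a = 0.0;
    for (int k = 0; k < ROWS; ++k)
        a += sh[k * blockDim.x + tid];
    out[gid] = a;                    // equals 6 * in[gid]
}

// Bytes of dynamic shared memory needed for a given block size.
size_t shared_bytes(int threads)
{
    return (size_t)ROWS * threads * sizeof(double);
}
```

A launch would then pass the size as the third launch parameter, for example `private_slices<<<blocks, threads, shared_bytes(threads)>>>(d_in, d_out, n);`. If the array were instead declared with a fixed size sized for 8 threads, larger blocks would write out of bounds, which is one possible (unconfirmed) way a block-size dependence can appear.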

Hence, as there is no cooperation at either the global memory or the shared memory level, this shouldn’t happen, right?

I do launch multiple kernels, with each kernel using the results from the previous kernels. But kernels cannot run concurrently, I guess?

What I mean is something like this:


	   for (int i = 0; i < 12; i++)

Here I am assuming kernels run one after the another.
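To illustrate that assumption: kernels launched into the same (default) stream do execute in launch order, so each launch sees the results of the previous one. What is easy to miss is that a failed launch is silent unless the error state is checked. Below is a minimal sketch with a hypothetical kernel `step` and hypothetical helpers `blocks_for` and `run_steps`; it is not the code from the attachment. `cudaDeviceSynchronize()` is the current name for the older `cudaThreadSynchronize()` call.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the chained kernels.
__global__ void step(double *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0;
}

// Number of blocks needed to cover n elements.
int blocks_for(int n, int threads)
{
    return (n + threads - 1) / threads;
}

// Launches the kernel 12 times in the default stream and returns the
// first error encountered, or cudaSuccess if every launch completed.
cudaError_t run_steps(double *d_y, int n, int threads)
{
    for (int i = 0; i < 12; i++) {
        step<<<blocks_for(n, threads), threads>>>(d_y, n);

        cudaError_t err = cudaGetLastError();   // invalid launch configuration etc.
        if (err != cudaSuccess) return err;

        err = cudaDeviceSynchronize();          // errors raised while the kernel ran
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}
```

If `run_steps` returns something other than `cudaSuccess` only for larger block sizes, `cudaGetErrorString(err)` on the host would say why, for instance too many resources requested for the launch.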

The code runs fine and gives the right values in device-emulation mode for any number of threads per block.

With <= 8 threads per block I get answers correct up to 13 digits in device mode (I am using double precision), but if my number of threads per block is, say, 64, my answers are TOTALLY wrong.

I don’t understand what is going on :( … I think it should work correctly.

I have attached the kernel here along with the main file.

Please, if someone has time, take a quick look and see if you can find any errors in the code. I am really desperate, as my advisor wants some results in 2 days.

Also, does anyone have any clue why such a thing can happen?

Thank you all… really appreciate any help

Nittin Arora (4.69 KB)

Check your code for races.

I just see this code in your phic kernel…


			 a += ff[k+mul].x[indx]*sh[k][ind];

You are using “k” as FOR loop index here…

In the code above this, you are calculating the value of “k” in a loop.

And that value is completely destroyed by using “k” in the FOR loop…

Is that a bug?

No, it’s not a bug. The previous k value was only there for calculating the right index to copy a column (the last three values) into the sh variable.

The next use of k just loops over the 3 stored values and multiplies them with another matrix in global memory. I am reusing k to minimize register usage.

And I am sorry, but I don’t understand how there can be a race condition if there is no inter-thread cooperation going on…?

… why does it only give the correct answer for <= 8 threads per block…

thanks for looking at my code :)


How come I’m struggling to download the zip file… I feel seriously computer illiterate… it keeps redirecting me. Could you maybe pastebin it, or explain how to download it? Haha.
Aren’t threads per block supposed to be >= 32, seeing that the warp size is 32? (Not that it helps your problem, just a question.)

Hmm that’s strange… anyway…

Here is the paste-bin link to the code:

the kernels: link kernels

the main calling program : link main

and yes my actual block size is 320, I was just testing to find for which value of block size the problem occurs.



Try calling cudaThreadSynchronize() between your kernel calls.

Nope that does not help… the kernels are executed serially anyway

Also, I found one more thing: if I print the results (after the memcpy back to the CPU) that were evaluated by the thread with threadIdx.x == 0, they are right only when the block size is <= 8.

I think thread 0 should behave the same way irrespective of the block size? … this is getting stranger

thanks all…

I found the part of the code which is not getting executed properly

in kernel gpu_for_phib



			case 6:  ypass[i].x[indx] = 1.0 + h*cpy2; break;
			case 13: ypass[i].x[indx] = 1.0 + h*cpy2; break;
			case 20: ypass[i].x[indx] = 1.0 + h*cpy2; break;
			case 27: ypass[i].x[indx] = 1.0 + h*cpy2; break;
			case 34: ypass[i].x[indx] = 1.0 + h*cpy2; break;
			case 41: ypass[i].x[indx] = 1.0 + h*cpy2; break;
			default: ypass[i].x[indx] = h*cpy2; break;


This switch block sits inside an if/else statement, as shown in the file.

Investigating deeper, I found out that neither of the switch statements in my code executes properly. If I convert them to if statements, the code works fine for block sizes > 8. But why is this the case?

I don’t understand what is wrong with this. Maybe the compiler is generating a wrong lookup table or something, or am I making some stupid mistake in these switch statements?
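For reference, the if-based form can be written quite compactly, because the six special case labels (6, 13, 20, 27, 34, 41) are exactly the values in the range 6..41 that leave remainder 6 when divided by 7. The sketch below is one possible rewrite, not the code from the attachment: `diag_term`, `fill` and the selector name `sel` are hypothetical, since the switch expression itself is not shown in the post.

```cuda
#include <cuda_runtime.h>

// Branch-light replacement for the switch: returns 1.0 + h*cpy2 for the
// selectors 6, 13, 20, 27, 34, 41 and h*cpy2 for every other value.
__host__ __device__ inline double diag_term(int sel, double h, double cpy2)
{
    bool special = (sel >= 6) && (sel <= 41) && (sel % 7 == 6);
    return special ? 1.0 + h * cpy2 : h * cpy2;
}

// Hypothetical kernel that uses the helper, one selector per thread.
__global__ void fill(double *out, int count, double h, double cpy2)
{
    int sel = blockIdx.x * blockDim.x + threadIdx.x;
    if (sel < count) out[sel] = diag_term(sel, h, cpy2);
}
```

Because the helper is marked `__host__ __device__`, the same function can be called on the CPU to cross-check what the GPU writes, which helps separate a compiler problem from a logic problem.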