Number of threads per block affects my answer; this should not happen. A VERY strange error

Hi all

I have been having a very strange error: my kernel gives the right answer only if the number of threads per block is <= 8. The number of blocks doesn't matter.

Generally this would suggest a problem with shared memory access (some kind of conflict), but in my case there is NO inter-thread data transfer through shared memory; each thread just reuses its own chunk of shared memory inside the kernel.

So since there is no cooperation at either the global memory or shared memory level, this shouldn't happen, right?

I do launch multiple kernels, with each kernel using the results of the previous ones. But kernels can't run concurrently, I guess?

What I mean is something like this:

gpu_ford_phia<<<dimGrid,dimBlock>>>(gpu_space,ff,tempvar);

for(int i=0;i<12;i++)
{
    gpu_ford_phib<<<dimGrid,dimBlock>>>(gpu_space,ff,ypass,tempvar,i);
    gpu_ford_phic<<<dimGrid2,dimBlock2>>>(gpu_space,ff,ypass,tempvar,i);
}

Here I am assuming the kernels run one after another.
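For what it's worth, that assumption is safe: kernel launches issued to the same stream (including the default stream, as here) are executed in issue order, so each launch sees the previous one's results. The only synchronization the host needs is before reading data back. A minimal sketch of the pattern, with placeholder kernels standing in for the real ones:

```cuda
// Two dummy kernels standing in for gpu_ford_phib / gpu_ford_phic.
__global__ void stepA(double *buf) { buf[threadIdx.x] += 1.0; }
__global__ void stepB(double *buf) { buf[threadIdx.x] *= 2.0; }

int main(void)
{
    double *d_buf;
    cudaMalloc((void **)&d_buf, 64 * sizeof(double));
    cudaMemset(d_buf, 0, 64 * sizeof(double));

    // Launches on the default stream run in issue order:
    // stepB never starts before stepA has finished.
    for (int i = 0; i < 12; i++) {
        stepA<<<1, 64>>>(d_buf);
        stepB<<<1, 64>>>(d_buf);
    }

    // Host-side sync is only needed before reading results back.
    cudaThreadSynchronize();
    // ... cudaMemcpy back to the host and check ...
    cudaFree(d_buf);
    return 0;
}
```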

The code runs fine and gives the right values in device emulation mode for any number of threads per block.

With <= 8 threads per block I get answers correct to 13 digits in device mode (I am using double precision), but if the number of threads per block is, say, 64, my answers are TOTALLY wrong.

I don’t understand what is going on :( … I think it should work correctly.

I have attached the kernel here along with the main file.

If someone has time, please take a quick look and see if you can find any errors in the code… I'm really desperate, as my advisor wants some results in two days.

Also, does anyone have any clues as to why such a thing can happen?

Thank you all… I really appreciate any help.

Nittin Arora
testcode.zip (4.69 KB)

Check your code for races.
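A race doesn't require deliberate cooperation; it's enough that two threads' index arithmetic happens to land on the same shared-memory slot. A hypothetical sketch of the kind of pattern to look for (the array shape and the `ind` computation here are made up to echo the posted code, not taken from it):

```cuda
__global__ void racy(const double *in, double *out)
{
    __shared__ double sh[8][3];      // only 8 rows of scratch space!

    // Hypothetical indexing bug: if ind wraps modulo 8 (or the array
    // is simply sized for 8 threads), threads 0 and 8 write the same
    // row. Harmless with <= 8 threads per block, silently corrupted
    // with more.
    int ind = threadIdx.x % 8;
    for (int k = 0; k < 3; k++)
        sh[ind][k] = in[threadIdx.x * 3 + k];

    double a = 0.0;
    for (int k = 0; k < 3; k++)
        a += sh[ind][k];             // may read another thread's data
    out[threadIdx.x] = a;
}
```

If each thread really does own a private region of shared memory, check that the total region is sized for the full block, not just for the block size you first tested with.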

I just see this code in your phic kernel…

for(k=0;k<3;k++)

			 a += ff[k+mul].x[indx]*sh[k][ind];

You are using “k” as FOR loop index here…

In the code above this, you are calculating the value of “k” in a loop.

And that value is completely destroyed by using “k” in the FOR loop…

Is that a bug?

No, it's not a bug; the previous k value was just there to calculate the right index for copying a column (the last three values) into the sh variable.

The next use of k just loops over the 3 stored values and multiplies them with another matrix in global memory. I am doing this to minimize register usage.

And I am sorry, but I don't understand how there can be a race condition if there is no inter-thread cooperation going on…?

… and why does it only give the correct answer for <= 8 threads per block…

thanks for looking at my code :)

NA

How come I'm struggling to download the zip file… I feel seriously computer illiterate… it keeps redirecting me. Could you maybe pastebin it, or explain how to download it, haha.
Aren't threads per block supposed to be >= 32, seeing that the warp size is 32? (Not that it helps your problem, just a question.)

Hmm that’s strange… anyway…

Here is the paste-bin link to the code:

the kernels: link kernels

the main calling program : link main

And yes, my actual block size is 320; I was just testing to find at which block size the problem occurs.

thanks

NA

Try calling cudaThreadSynchronize() between your kernel calls.
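For example (the kernel name is taken from your snippet; the error check afterwards is optional, but launch failures are otherwise silent):

```cuda
gpu_ford_phia<<<dimGrid, dimBlock>>>(gpu_space, ff, tempvar);
cudaThreadSynchronize();             // block the host until the kernel finishes

// Pick up any launch or execution error that would otherwise go unnoticed.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```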

Nope, that does not help… the kernels are executed serially anyway.

Also, I found one more thing: if I print the results (after the memcpy back to the CPU) that were evaluated by the thread with threadIdx.x == 0, they are right only when the block size is <= 8.

I would think thread 0 should behave the same way irrespective of the block size? :( … this is getting stranger.

thanks all…

I found the part of the code that is not being executed properly.

In kernel gpu_ford_phib:

switch(i)
{
    case 6:  ypass[i].x[indx] = 1.0 + h*cpy2; break;
    case 13: ypass[i].x[indx] = 1.0 + h*cpy2; break;
    case 20: ypass[i].x[indx] = 1.0 + h*cpy2; break;
    case 27: ypass[i].x[indx] = 1.0 + h*cpy2; break;
    case 34: ypass[i].x[indx] = 1.0 + h*cpy2; break;
    case 41: ypass[i].x[indx] = 1.0 + h*cpy2; break;
    default: ypass[i].x[indx] = h*cpy2;       break;
}

This switch block sits inside an if/else statement, as shown in the kernel.cu file.

Investigating deeper, I found that both switch statements in my code are not executing properly. If I convert them to if/else statements, the code works fine for block sizes > 8. But why is this the case?

I don't understand what is wrong with this… maybe the compiler is generating a wrong lookup table, or am I making some stupid mistake in these switch statements?