Hi all
I have been getting a very strange error. My kernel gives the “right” answer only if the number of threads per block is <= 8. The number of blocks doesn’t matter.
Generally, this would mean I have some problem with shared memory access (some type of conflict). But in my case there is NO inter-thread data transfer via shared memory; each thread just reuses its own chunk of shared memory in the kernel.
Hence, as there is no cooperation at either the global memory or the shared memory level, this shouldn’t happen, right?
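To make the access pattern concrete, here is a minimal sketch of what I mean by each thread reusing its own chunk of shared memory. It is not my actual kernel: the kernel name `private_slices`, the constant `PER_THREAD`, and the launch sizes are made up for illustration. It also prints how much shared memory per block such a layout requests next to what the device reports as available, since that request grows with the number of threads per block.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical: number of doubles each thread keeps in shared memory.
#define PER_THREAD 16

// Each thread reads and writes only its own slice of the shared array,
// so no __syncthreads() is needed between the two loops.
__global__ void private_slices(double *out)
{
    extern __shared__ double scratch[];
    double *mine = scratch + threadIdx.x * PER_THREAD;  // this thread's slice

    for (int k = 0; k < PER_THREAD; ++k)
        mine[k] = (double)(threadIdx.x + k);

    double sum = 0.0;
    for (int k = 0; k < PER_THREAD; ++k)
        sum += mine[k];

    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

int main()
{
    const int threads = 64, blocks = 4;  // illustrative launch sizes
    const size_t shmem = threads * PER_THREAD * sizeof(double);

    // Compare the per-block shared memory request with the device limit.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("shared memory requested per block: %zu bytes, available: %zu bytes\n",
           shmem, prop.sharedMemPerBlock);

    double *d_out = NULL;
    cudaMalloc((void **)&d_out, blocks * threads * sizeof(double));

    private_slices<<<blocks, threads, shmem>>>(d_out);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```

With a layout like this, the shared memory needed per block is threads per block times `PER_THREAD` times 8 bytes, so it is worth comparing that number with the device limit for each block size tried.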
I do launch multiple kernels, with each kernel using the results from the previous kernels. But kernels cannot run concurrently, I guess?
What I mean is something like this:
gpu_ford_phia<<<dimGrid, dimBlock>>>(gpu_space, ff, tempvar);
for (int i = 0; i < 12; i++)
{
    gpu_ford_phib<<<dimGrid, dimBlock>>>(gpu_space, ff, ypass, tempvar, i);
    gpu_ford_phic<<<dimGrid2, dimBlock2>>>(gpu_space, ff, ypass, tempvar, i);
}
Here I am assuming the kernels run one after another.
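Kernels launched into the same (default) stream do execute in launch order, but a launch that fails does so silently unless the error is checked. Below is a minimal sketch of the same launch pattern with error checks added. The kernels `step_b` and `step_c` and the helper `check` are placeholders I made up for illustration, not the real gpu_ford_phib / gpu_ford_phic, and the sizes are arbitrary.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA call or launch failed.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Placeholder kernels standing in for the real ones.
__global__ void step_b(double *data, int n, int i)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += (double)i;
}

__global__ void step_c(double *data, int n, int i)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0;  // uses the result written by step_b
}

int main()
{
    const int n = 256;  // arbitrary problem size
    double *d_data = NULL;
    check(cudaMalloc((void **)&d_data, n * sizeof(double)), "cudaMalloc");
    check(cudaMemset(d_data, 0, n * sizeof(double)), "cudaMemset");

    dim3 dimBlock(64);
    dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x);

    for (int i = 0; i < 12; i++) {
        step_b<<<dimGrid, dimBlock>>>(d_data, n, i);
        check(cudaGetLastError(), "step_b launch");   // catches bad launch configs

        step_c<<<dimGrid, dimBlock>>>(d_data, n, i);
        check(cudaGetLastError(), "step_c launch");

        // Not needed for ordering; it only surfaces errors raised during execution.
        check(cudaDeviceSynchronize(), "kernel execution");
    }

    check(cudaFree(d_data), "cudaFree");
    printf("all launches completed without a reported error\n");
    return 0;
}
```

If a launch is rejected (for example because the block asks for more resources than the device has), `cudaGetLastError()` reports it right after the launch, which would tell a failed launch apart from a genuine numerical bug.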
The code runs fine and gives the right values in device-emulation mode for any number of threads per block.
For <= 8 threads per block I get answers correct up to 13 digits in device mode (I am using double precision), but if the number of threads per block is, say, 64, my answers are TOTALLY wrong.
I don’t understand what is going on :( … I think it should work correctly.
I have attached the kernel here along with the main file.
Please, if someone has time, take a quick look and see if you can find any errors in the code. I am really desperate, as my advisor wants some results in 2 days.
Also, does anyone have any clue why such a thing can happen?
Thank you all, I really appreciate any help.
Nittin Arora
testcode.zip (4.69 KB)