I need to port a function to CUDA. The function reads parameters from a file and computes certain other parameters; there are about 13 parameters in total.
Initially, when I wrote the function as a single kernel, all these parameters were hardcoded. I got good performance and my output was also correct.
Now, when I pass these 13 parameters from the host to the device (kernel function), performance degrades substantially and there is no resultant output.
If I execute in EMUDebug mode, I get the desired output.
What else could be the problem? Please advise, this is my first post.
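For context, here is a minimal sketch of one way such a setup might look; the struct, the field names, and the launch configuration are assumptions for illustration, not the actual code:

#include <cuda_runtime.h>

// Hypothetical parameter struct; the field names are placeholders.
struct SimParams {
    float p[13];
};

// Hypothetical kernel: each thread reads the parameters straight from global memory.
__global__ void computeKernel(const SimParams *gParams, float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = gParams->p[0] * idx + gParams->p[1];   // placeholder computation
}

int main()
{
    SimParams hParams = {};   // the 13 values would be read from the file here
    SimParams *dParams = 0;
    float *dOut = 0;

    cudaMalloc((void **)&dParams, sizeof(SimParams));
    cudaMalloc((void **)&dOut, 64 * 256 * sizeof(float));
    cudaMemcpy(dParams, &hParams, sizeof(SimParams), cudaMemcpyHostToDevice);

    computeKernel<<<64, 256>>>(dParams, dOut);
    cudaDeviceSynchronize();

    cudaFree(dParams);
    cudaFree(dOut);
    return 0;
}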
In your kernel, are you copying the whole structure from GPU memory into a shared memory cache, or are you using a local array?
It would be dead slow if you used a local array instead of a shared memory array.
It would also be slow if all threads in the block copy the structure from global memory to shared memory, for example if you wrote "sharedMemStructure = *gMem". Then every thread loads it from global memory, which is slow and redundant, and it gets worse the more threads per block and the more blocks you launch (see the sketch below).
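A minimal sketch of the pattern being described, using the same kind of hypothetical parameter struct: one thread loads the struct into shared memory once and the rest of the block waits at a barrier, instead of every thread doing the full copy from global memory.

// Hypothetical parameter struct; the field names are placeholders.
struct SimParams {
    float p[13];
};

__global__ void computeKernel(const SimParams *gParams, float *out)
{
    __shared__ SimParams sParams;

    // Redundant pattern: every thread in the block copies the whole struct.
    //   sParams = *gParams;   // blockDim.x identical loads from global memory

    // One thread loads the struct; the others wait at the barrier.
    if (threadIdx.x == 0)
        sParams = *gParams;
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = sParams.p[0] * idx + sParams.p[1];   // placeholder computation
}

Alternatively, the first 13 threads could each load one element cooperatively; either way the struct is read from global memory once per block instead of once per thread.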