This is my first post. I have my work cut out for me.
I need to port a function to CUDA. The function reads parameters from a file and computes certain other parameters from them. There are about 13 parameters.
Initially, when I wrote the function as a single kernel, all of these parameters were hardcoded. Performance was good and the output was correct.
Now that I pass these 13 parameters from the host to the device (as kernel function arguments), performance degrades substantially and the kernel produces no output at all.
If I run the same code in EmuDebug (emulation) mode, I get the desired output.
What else could be the problem? Please advise; this is my first post.
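
For reference, here is a minimal sketch of the kind of setup described above. It is not the actual code: the `Params` struct, its two fields, the `compute` kernel and the `CHECK` macro are all hypothetical names made up for illustration, and the real function has about 13 parameters read from a file. The sketch passes the parameters to the kernel by value and checks for errors after the launch, which is one way to see whether the kernel ran at all.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical parameter bundle (assumption): stands in for the ~13 values read from file.
struct Params {
    float scale;
    float offset;
};

// Hypothetical kernel (assumption): the struct is passed by value as a kernel argument.
__global__ void compute(const Params p, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = p.scale * i + p.offset;
    }
}

// Report any CUDA runtime error and exit main with a non-zero status.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main()
{
    const int n = 256;
    Params p = {2.0f, 1.0f};   // would normally be filled in from the file

    float *d_out = NULL;
    CHECK(cudaMalloc((void **)&d_out, n * sizeof(float)));

    compute<<<(n + 127) / 128, 128>>>(p, d_out, n);
    CHECK(cudaGetLastError());        // catches launch failures
    CHECK(cudaDeviceSynchronize());   // catches errors raised while the kernel runs

    float h_out[n];
    CHECK(cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost));
    printf("out[3] = %f\n", h_out[3]);

    CHECK(cudaFree(d_out));
    return 0;
}
```

If the launch or the synchronize call reports an error, the kernel did not complete, which would be consistent with getting no output. Emulation mode runs the kernel code on the host CPU, so a problem that only shows up on the real device (for example, dereferencing a host pointer inside the kernel) may not appear there.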