Parameter Passing to Device

This is my first post. I have my task cut out for me.

I need to port a function to CUDA. The function reads parameters from a file and computes certain other parameters from them. There are about 13 parameters.

Initially, when I wrote the function as a single kernel, all these parameters were hardcoded. Performance was good and my output was correct.

Now, when I pass these 13 parameters from host to device (as kernel function arguments), performance degrades substantially and there is no output at all.
If I run in EMUDebug mode, I get the desired output.

What else could be the problem? Please advise, this is my first post.

There's some limitation on the number of args that you can pass to a CUDA kernel.

Put them in a struct, copy it to a GPU mem location, and pass that GPU location as a pointer to your kernel.
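
A minimal sketch of that suggestion, for anyone reading along. The struct fields, the kernel name `applyParams`, and the sizes are made up for illustration (the thread never shows the real 13 parameters), so treat this as one possible shape rather than the original poster's code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical parameter bundle; the real function has about 13 fields.
struct Params {
    float scale;
    float offset;
    int   n;
};

// The kernel takes a single pointer to the struct in device memory.
__global__ void applyParams(const Params *p, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p->n)
        out[i] = p->scale * i + p->offset;
}

// Returns true if the device results match the host computation.
bool runStructExample()
{
    const int n = 8;
    Params h_params = {2.0f, 1.0f, n};
    float h_out[n];
    Params *d_params = NULL;
    float *d_out = NULL;

    if (cudaMalloc((void **)&d_params, sizeof(Params)) != cudaSuccess)
        return false;
    if (cudaMalloc((void **)&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_params);
        return false;
    }

    // Copy the whole struct to the GPU once, before the launch.
    cudaError_t err = cudaMemcpy(d_params, &h_params, sizeof(Params),
                                 cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        applyParams<<<1, 32>>>(d_params, d_out);
        // Launch errors are reported by cudaGetLastError().
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        // Blocking copy; it also reports errors from the kernel execution.
        err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }

    cudaFree(d_params);
    cudaFree(d_out);
    if (err != cudaSuccess)
        return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * i + 1.0f)
            return false;
    return true;
}

int main()
{
    printf("%s\n", runStructExample() ? "ok" : "failed");
    return 0;
}
```

Checking the return codes around the launch is worth doing in any case: a kernel that fails to launch leaves the output buffer untouched, which can look like "no output".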

In fact, I did put them in a struct and copied it to a GPU mem location.

It was a catastrophe. It takes a hell of a lot of time, more than the CPU equivalent takes.

Any other suggestions?

Can anyone provide me a solution?
A solution is badly needed.

Thanks in advance

Mathew Potter

Hi,
Can you post the code where

  1. the kernel is called with multiple arguments - the device kernel function definition (just the top line) and the host variable initialisation, and …
  2. the kernel is called in the struct manner - again, the function definition and host variable init / malloc / memcpy into device struct?

In your kernel, are you copying the whole structure from GPU mem to a shared mem cache? Or are you using a local array?

It would be dead slow if you used a local array instead of a shared mem array.

It would also be slow if all threads in the block copy the structure from global mem to shared mem, for example if you had written something like “sharedMemStructure = *gMem”. Then every thread loads it from gmem, which is redundant and can get slow if you are using more threads per block and more blocks in your kernel.
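
One common way to avoid that redundant copy is to let a single thread per block load the structure into shared memory and have the rest wait at a barrier. This is only a sketch: the `Params` fields, the kernel name `applyShared`, and the launch configuration are assumptions for illustration, not taken from the original poster's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical parameter bundle standing in for the real 13 parameters.
struct Params {
    float scale;
    float offset;
    int   n;
};

__global__ void applyShared(const Params *gParams, float *out)
{
    __shared__ Params sParams;

    // Only thread 0 of each block reads the struct from global memory.
    if (threadIdx.x == 0)
        sParams = *gParams;
    // Every thread in the block waits until the shared copy is complete.
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < sParams.n)
        out[i] = sParams.scale * i + sParams.offset;
}

// Returns true if the device results match the host computation.
bool runSharedExample()
{
    const int n = 256;
    Params h_params = {0.5f, 3.0f, n};
    float h_out[n];
    Params *d_params = NULL;
    float *d_out = NULL;

    if (cudaMalloc((void **)&d_params, sizeof(Params)) != cudaSuccess)
        return false;
    if (cudaMalloc((void **)&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_params);
        return false;
    }

    cudaError_t err = cudaMemcpy(d_params, &h_params, sizeof(Params),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        applyShared<<<4, 64>>>(d_params, d_out);  // 4 blocks of 64 threads
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                         cudaMemcpyDeviceToHost);

    cudaFree(d_params);
    cudaFree(d_out);
    if (err != cudaSuccess)
        return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != 0.5f * i + 3.0f)
            return false;
    return true;
}

int main()
{
    printf("%s\n", runSharedExample() ? "ok" : "failed");
    return 0;
}
```

The `__syncthreads()` call is the important part: without it, other threads could read `sParams` before thread 0 has finished writing it.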

HTH

OK, I was able to fix it. It was an internal error caused by improper indexing.
Now I am passing the arguments through an array.
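
For later readers, passing parameters through an array might look roughly like the sketch below. The poster's actual code is not shown in the thread, so the enum names, the kernel `applyArray`, and the two-parameter layout are all hypothetical. Named indices are used here because they make indexing mistakes of the kind mentioned above harder to introduce.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical slot names for the parameter array; P_COUNT is its length.
enum ParamIndex { P_SCALE = 0, P_OFFSET = 1, P_COUNT = 2 };

// The parameters arrive as one device array; n is passed by value.
__global__ void applyArray(const float *params, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard against threads past the end of the output
        out[i] = params[P_SCALE] * i + params[P_OFFSET];
}

// Returns true if the device results match the host computation.
bool runArrayExample()
{
    const int n = 100;
    float h_params[P_COUNT];
    h_params[P_SCALE]  = 4.0f;
    h_params[P_OFFSET] = 2.0f;
    float h_out[n];
    float *d_params = NULL;
    float *d_out = NULL;

    if (cudaMalloc((void **)&d_params, P_COUNT * sizeof(float)) != cudaSuccess)
        return false;
    if (cudaMalloc((void **)&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_params);
        return false;
    }

    cudaError_t err = cudaMemcpy(d_params, h_params, P_COUNT * sizeof(float),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 64;
        int blocks = (n + threads - 1) / threads;  // round up to cover n
        applyArray<<<blocks, threads>>>(d_params, d_out, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, n * sizeof(float),
                         cudaMemcpyDeviceToHost);

    cudaFree(d_params);
    cudaFree(d_out);
    if (err != cudaSuccess)
        return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != 4.0f * i + 2.0f)
            return false;
    return true;
}

int main()
{
    printf("%s\n", runArrayExample() ? "ok" : "failed");
    return 0;
}
```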

Thanks a lot for all suggestions.