I just migrated some code to CUDA, and have a big problem: it's extremely slow!
The reason is probably that I have many, many parameters stored in global memory, which I have to use in every thread.
The CUDA programming guide says that a global memory access takes about 400-600 cycles, which is of course not good.
I will give an example:
__global__ void CUDA_CalcBahn(long foo, float bar, float* foobar, float* result)
Neither foo, nor bar, nor foobar is changed inside the kernel. "foobar" and "result" have been allocated on the device with cudaMalloc.
Where and how is the usage of shared memory practical? Can I assign the value of the parameter "foo" or "bar" once per block to a shared memory variable, and then read it from all the other threads of that block?
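To make the question concrete, this is the pattern I have in mind: one thread of the block copies the scalars into shared memory, then after a barrier every thread of the block reads them from there. (The kernel body here is just a placeholder, not my real computation.)

```cuda
__global__ void CUDA_CalcBahn(long foo, float bar, float* foobar, float* result)
{
    __shared__ long  s_foo;   // block-wide copies of the scalar parameters
    __shared__ float s_bar;

    if (threadIdx.x == 0) {   // one thread per block stages the values
        s_foo = foo;
        s_bar = bar;
    }
    __syncthreads();          // make the shared copies visible to all threads

    // Placeholder body: every thread now reads s_foo/s_bar from shared memory.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    result[i] = foobar[i] * s_bar + (float)s_foo;
}
```

Is this the right idea, or is it pointless because scalar kernel arguments are already cheap to read?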
yours sincerely, confused snowball :)