It runs very slowly. Using Nsight, I found that the data are stored in local memory.
Can anyone tell me why the data are in local memory, and how I can avoid placing them there?
Thanks a lot.
It is probably “very slow” because you have mixed single and double precision arithmetic in the kernel, and peak double precision throughput is half the peak single precision throughput on your C2050. The local memory usage probably comes from the exp function, and there is nothing you can do about that. Checking the PTX output from the compiler will confirm this.
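To illustrate the mixed precision point, here is a minimal sketch. The kernels `scale_mixed` and `scale_single` and the helper `max_abs_difference` are hypothetical and are not the kernel from the original post. In the first kernel the double literals and the double argument to `exp` promote the float operands, so the arithmetic is done in double precision. The second keeps everything in single precision by using `f` suffixes and `expf`.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical kernel: 0.5 and -2.0 are double literals, so in[i] is promoted
// and exp() is evaluated in double precision before the result is narrowed.
__global__ void scale_mixed(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 0.5 * exp(-2.0 * in[i]);
}

// Same computation kept in single precision throughout.
__global__ void scale_single(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 0.5f * expf(-2.0f * in[i]);
}

// Runs both kernels on the same input and returns the largest absolute
// difference between their outputs, or -1.0f if a CUDA call fails.
float max_abs_difference(int n)
{
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i)
        h_in[i] = static_cast<float>(i) / n;   // inputs in [0, 1)

    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess ||
        cudaMalloc(&d_a, bytes) != cudaSuccess ||
        cudaMalloc(&d_b, bytes) != cudaSuccess) {
        cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
        return -1.0f;
    }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    scale_mixed<<<blocks, threads>>>(d_in, d_a, n);
    scale_single<<<blocks, threads>>>(d_in, d_b, n);

    // The blocking copies also wait for both kernels to finish.
    cudaError_t e1 = cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
    if (e1 != cudaSuccess || e2 != cudaSuccess)
        return -1.0f;

    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        worst = std::fmax(worst, std::fabs(h_a[i] - h_b[i]));
    return worst;
}
```

Calling `max_abs_difference(1024)` from a host `main` shows how little the results differ for this input range. Whether single precision is accurate enough for the real kernel is something only the original poster can judge.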
I would not expect any local memory usage when this kernel is compiled with the compiler defaults, and I see that Lung Sheng has already confirmed this. Which compiler switches are you using, and what is the compiler output after adding -Xptxas -v to the nvcc command line?
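As a complement to the -Xptxas -v output, the per-thread local memory of a compiled kernel can also be queried at run time with `cudaFuncGetAttributes`. The sketch below is only an illustration: `dynamic_index` and `local_bytes_per_thread` are hypothetical names, and the kernel is not the one from this thread. It uses a per-thread array indexed by a value unknown at compile time, which is one common reason for the compiler to place data in local memory; that may or may not be what happens in the original kernel.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel: scratch[] is indexed with a run-time value, so the
// compiler may not be able to keep it in registers.
__global__ void dynamic_index(const int *idx, float *out, int n)
{
    float scratch[64];
    for (int k = 0; k < 64; ++k)
        scratch[k] = 0.5f * k;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = scratch[idx[i] & 63];   // index known only at run time
}

// Returns the local memory (bytes per thread) assigned to dynamic_index,
// or -1 if the query fails. Also prints the register count for reference.
long local_bytes_per_thread()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, dynamic_index) != cudaSuccess)
        return -1;
    printf("registers/thread: %d, local bytes/thread: %zu\n",
           attr.numRegs, attr.localSizeBytes);
    return static_cast<long>(attr.localSizeBytes);
}
```

The value reported depends on the target architecture and optimization level, so it is worth comparing the result for a default build against the build that showed the problem.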
If memory serves, the only math library functions that use a bit of local memory are the trigonometric functions [this is documented in the Programming Guide], and they use local memory only in a “slow path” that is extremely unlikely to be taken in real-life code so there is no performance impact from this limited use of local memory.
Sorry, I don’t know why the picture is so small. Please double click on it, and you will see the detail.
Here the kernel LLR2q() is identical to the kernel in the example above. The local memory figure shown is 45481984, so I think the data are stored in local memory.
When I use this shared memory version, the runtime is longer than that of the original kernel log2exp(). Can anyone tell me why? Is there something wrong with my shared_memory_version kernel?
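When comparing the runtimes of two kernel versions like this, it helps to time them the same way, for example with CUDA events and a warm-up launch. The following is a rough sketch under assumptions: `placeholder_kernel` and `time_kernel_ms` are hypothetical, and the placeholder would need to be swapped for log2exp() or the shared memory version with their real arguments.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being measured.
__global__ void placeholder_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

// Returns the average time per launch in milliseconds over `reps` launches,
// or -1.0f if a CUDA call fails.
float time_kernel_ms(int n, int reps)
{
    float *d_data = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d_data, bytes) != cudaSuccess)
        return -1.0f;
    cudaMemset(d_data, 0, bytes);

    int threads = 256, blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time initialization cost is not measured.
    placeholder_kernel<<<blocks, threads>>>(d_data, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        placeholder_kernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);          // wait until all launches finish

    float ms = 0.0f;
    cudaError_t timing = cudaEventElapsedTime(&ms, start, stop);
    cudaError_t launch = cudaGetLastError();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);

    if (timing != cudaSuccess || launch != cudaSuccess)
        return -1.0f;
    return ms / reps;
}
```

Averaging over several launches reduces noise from launch overhead, which matters when the two versions differ by only a small amount.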