Suppose I have a kernel whose header is given below. How should I use shared memory effectively to gain further performance? Right now I am accessing the elements through global memory.
func<<<625, 256>>>(float* outptr, float* arguments);

My output array size is 160,000 elements (625 blocks × 256 threads).
In my application each array element is processed by a single thread, but the mapping is random (thread 1 might process element 1, thread 2 might process element 3, and so on). Since I am using global memory, this takes a huge amount of time.
How can I keep my elements in shared memory to gain performance?
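A minimal sketch of the shared-memory pattern, assuming the kernel name and arguments from the launch above and assuming the reuse is confined to elements within the same block (shared memory is per-block, so it only helps when threads in one block reuse each other's data; it cannot speed up truly random accesses across the whole 160,000-element array):

```cuda
// Sketch: each block stages its 256 elements into shared memory with one
// coalesced global read, then all further accesses hit on-chip memory.
__global__ void func(float* outptr, const float* arguments)
{
    __shared__ float tile[256];                 // one element per thread

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;    // global index, 0..159999

    tile[tid] = arguments[gid];                 // single coalesced global read
    __syncthreads();                            // wait until tile is populated

    // Illustrative reuse: combine this element with an in-block neighbour.
    // Both reads now come from shared memory, not DRAM.
    int next = (tid + 1) % blockDim.x;
    outptr[gid] = 0.5f * (tile[tid] + tile[next]);
}
```

If each element is read exactly once and never shared between threads, staging it through shared memory adds work without saving any global traffic.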
As long as you access global memory once to read the value and once to write it, I do not think much optimization can be done. When you access an element several times, it is advisable to store the value in a local variable in the kernel; that variable will then live in a register (or, if you stage it explicitly, in shared memory), both of which are far faster than repeated global loads.
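A sketch of that register-caching advice (kernel and variable names are illustrative, not from the original post): read the global value once into a local variable, reuse it freely, write the result once.

```cuda
// Each thread reads its element ONCE into a register, does all the
// arithmetic there, and writes the result back ONCE.
__global__ void process(float* out, const float* in, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= n) return;                // guard against out-of-range threads

    float v = in[gid];                   // single global read; v sits in a register
    float r = v * v + 1.0f;             // reuse v at register speed
    r += 0.5f * v;                      // no further global traffic for v
    out[gid] = r;                       // single global write
}
```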
Coalescing is a very important concept in CUDA - it is essential to getting any real speed. If you're not familiar with it, check the programming guide and the forums.
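To illustrate the point about coalescing, here is a hedged sketch (kernel names are mine) contrasting a coalesced copy with a strided one. When consecutive threads in a warp touch consecutive addresses, the hardware can combine the warp's loads into a few wide memory transactions; strided or random indexing breaks that and multiplies the number of transactions:

```cuda
// Coalesced: thread i touches element i, so a warp's 32 reads fall in
// contiguous memory and are serviced by a handful of transactions.
__global__ void copy_coalesced(float* dst, const float* src, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        dst[gid] = src[gid];
}

// Strided: thread i touches element i*stride, so the warp's addresses are
// scattered and the same copy costs many more memory transactions.
__global__ void copy_strided(float* dst, const float* src, int n, int stride)
{
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (idx < n)
        dst[idx] = src[idx];
}
```

This is why the random thread-to-element mapping described in the question is expensive: it defeats coalescing regardless of whether shared memory is used.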