Shared memory allocation

Hi experts,

If I declare a variable in shared memory inside my kernel:

    __global__ void MyKernel(..... float *a)
        __shared__ float As[6];

what will happen? Does every thread allocate the memory separately? (For example, if I am executing 256 threads per multiprocessor, will it be allocated 256 times?)

No, it will be one allocation per (executing) block. All threads within one block will see the same memory.
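A minimal sketch of the "one allocation per block" behaviour, in case it helps. The kernel name `sharedPerBlock`, the host helper `checkSharedPerBlock`, and the input values are made up for illustration, error checking is left out, and it assumes at least 6 threads per block. Note that a `__shared__` variable cannot have an initializer in its declaration, so the array is filled by the threads themselves.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One copy of As exists per block, however many threads the block has.
__global__ void sharedPerBlock(const float *a, float *out)
{
    __shared__ float As[6];   // no initializer allowed on a __shared__ declaration

    // The first 6 threads of each block fill that block's copy.
    if (threadIdx.x < 6)
        As[threadIdx.x] = a[threadIdx.x] + blockIdx.x;
    __syncthreads();          // make the writes visible to the whole block

    // Every thread reads element 0 of its own block's copy.
    out[blockIdx.x * blockDim.x + threadIdx.x] = As[0];
}

// Hypothetical host helper: returns true if all threads of block b read a[0] + b.
bool checkSharedPerBlock(int blocks, int threads)
{
    const float hA[6] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
    std::vector<float> hOut(blocks * threads);

    float *dA = nullptr, *dOut = nullptr;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dOut, hOut.size() * sizeof(float));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);

    sharedPerBlock<<<blocks, threads>>>(dA, dOut);
    cudaMemcpy(hOut.data(), dOut, hOut.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dOut);

    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < threads; ++t)
            if (hOut[b * threads + t] != hA[0] + b)
                return false;
    return true;
}
```

Calling `checkSharedPerBlock(4, 256)` launches 4 blocks of 256 threads. Each block gets its own 6-float array, so threads in different blocks read different values while threads in the same block all read the same one.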

OK, thank you, tera.

If I initialize the variable in my kernel like this:

    As[0] = 0.2f;

and then change the value later in the kernel like this:

    As[0] = As[0] + 0.5f;

do I need to call __syncthreads() between the two statements or not?

It depends on which threads access the variable. As long as each array element is only ever accessed by the same thread, you do not need a __syncthreads(). Between accesses to the same element from different threads, you do.

(Actually, you do not need __syncthreads() between accesses from within the same half-warp, but you can ignore this to keep things simple.)
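As a rough sketch of that rule: below, thread 0 both writes and updates `As[0]`, so nothing is needed between those two statements, but a barrier is needed before the other threads of the block read the result. The names `initThenUpdate` and `checkInitThenUpdate` are hypothetical, error checking is omitted, and the sketch does not rely on the half-warp shortcut.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

__global__ void initThenUpdate(float *out)
{
    __shared__ float As[6];

    if (threadIdx.x == 0) {
        As[0] = 0.2f;           // same thread writes...
        As[0] = As[0] + 0.5f;   // ...and updates: no barrier needed in between
    }
    __syncthreads();            // other threads are about to read As[0]

    out[threadIdx.x] = As[0];   // every thread of the block reads the result
}

// Hypothetical host helper: true if every thread read roughly 0.7.
bool checkInitThenUpdate(int threads)
{
    std::vector<float> hOut(threads, 0.0f);
    float *dOut = nullptr;
    cudaMalloc(&dOut, threads * sizeof(float));

    initThenUpdate<<<1, threads>>>(dOut);   // a single block
    cudaMemcpy(hOut.data(), dOut, threads * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    for (int t = 0; t < threads; ++t)
        if (std::fabs(hOut[t] - 0.7f) > 1e-6f)
            return false;
    return true;
}
```

The barrier sits outside the `if`, so every thread of the block reaches it. Placing `__syncthreads()` inside a branch that only some threads take is a common mistake.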

OK, thank you, tera.

I have added a CUDA kernel to my application. Executing the kernel once takes nearly 120 ms, but executing it 10 times in a for loop takes only 230 ms…

Is the first execution simply that much slower, or is there a problem in my program?

The first invocation takes longer because the kernel code needs to get copied to the device, and also compiled unless you’ve instructed nvcc to generate .cubin instead of .ptx. Nothing to worry about.
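If you want to see this in your own numbers, one way is to time the first launch separately from the later ones. This is only a sketch: `scaleKernel`, `timedLaunchMs`, `LaunchTimes` and `measureLaunches` are made-up names, error checking is omitted, and depending on driver and toolkit some of the one-time cost may already be absorbed by earlier calls such as `cudaMalloc`.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Trivial stand-in kernel; the real kernel would go here.
__global__ void scaleKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= 2.0f;
}

// Wall-clock time of one launch in ms, including the wait for completion.
static double timedLaunchMs(float *dX, int n)
{
    auto t0 = std::chrono::steady_clock::now();
    scaleKernel<<<(n + 255) / 256, 256>>>(dX, n);
    cudaDeviceSynchronize();    // launches are asynchronous, so wait here
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

struct LaunchTimes {
    double firstMs;      // first launch, with any one-time setup cost
    double avgLaterMs;   // average over the following launches
};

// Hypothetical helper: time the first launch, then `reps` further launches.
LaunchTimes measureLaunches(int n, int reps)
{
    float *dX = nullptr;
    cudaMalloc(&dX, n * sizeof(float));
    cudaMemset(dX, 0, n * sizeof(float));

    LaunchTimes t;
    t.firstMs = timedLaunchMs(dX, n);

    double total = 0.0;
    for (int i = 0; i < reps; ++i)
        total += timedLaunchMs(dX, n);
    t.avgLaterMs = total / reps;

    cudaFree(dX);
    return t;
}
```

For benchmarking, the usual approach is to do one untimed warm-up launch and report only the later launches, since those reflect the steady-state cost of the kernel.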