If I declare a variable in shared memory inside my kernel, e.g.:
__global__ void MyKernel(.....float * a)
{
.......
.......
__shared__ float As[6] = a;
.......
.......
}
What will happen?
Will every thread allocate the memory separately? (i.e., for example, if I am executing 256 threads per multiprocessor, will it be allocated 256 times?)
:unsure:
It depends on where you access the variable from. As long as each array element is only accessed by the same thread, you do not need a __syncthreads(). But between accesses from different threads, you do.
(Actually you do not need __syncthreads() between accesses from within the same half-warp, but you can ignore this to keep things simple.)
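For what it's worth, a __shared__ array is allocated once per thread block (not once per thread) and is visible to all threads in that block; it also cannot be initialized directly from a kernel argument, so each thread usually copies its own element in. Below is a minimal sketch of that pattern; the array size of 256, the second output parameter, and the mirrored read are just illustrative assumptions, not part of the original code.

__global__ void MyKernel(const float *a, float *out)
{
    // One copy of As exists per block and is shared by all of its threads.
    // Assumes the kernel is launched with 256 threads per block.
    __shared__ float As[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Each thread copies its own element from global to shared memory.
    As[tid] = a[gid];

    // Required before any thread reads an element written by another thread.
    __syncthreads();

    // Cross-thread access: read the element loaded by the "mirror" thread.
    out[gid] = As[blockDim.x - 1 - tid];
}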
I have added a CUDA kernel to my application. Executing the kernel once takes nearly 120 ms, but executing it 10 times in a for loop takes only 230 ms...
I want to know whether it is taking so much time because of the first execution, or whether there is a problem in my program?
The first invocation takes longer because the kernel code needs to get copied to the device, and also compiled unless you’ve instructed nvcc to generate .cubin instead of .ptx. Nothing to worry about.
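If you want to keep that one-time cost out of your measurements, a common approach is to run one untimed warm-up launch and then time the later launches with CUDA events. A rough sketch, where MyKernel, its argument d_a, and the launch configuration are placeholders for your own:

MyKernel<<<grid, block>>>(d_a);         // warm-up launch, not timed
cudaDeviceSynchronize();                // wait for the warm-up to finish

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int i = 0; i < 10; ++i)
    MyKernel<<<grid, block>>>(d_a);     // timed launches
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);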