Is it possible to use the first few threads (e.g. the first 32) just to load data into shared memory, and have the rest of the threads use that data? I ask because, as far as I understand, once a warp starts it does not necessarily finish executing before the next warp starts, and in that case my current model will give incorrect answers.
Is there any way to initialize shared memory before the threads start? I ask because I will be using shared memory to store some sub-images, and I need to pad each sub-image with zeros. Is there any other technique anyone can suggest to achieve this?
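To make the padding idea concrete, here is a minimal sketch of one pattern that might fit: every thread in the block helps zero the padded tile, a barrier follows, the first 32 threads load the sub-image interior, and a second barrier follows before any thread reads. The names `TILE`, `PAD`, `load_padded_tile` and `run_padded_copy` are hypothetical, the sizes are arbitrary, and CUDA error checking is left out for brevity. Both `__syncthreads()` calls sit outside the `if`, so every thread in the block reaches them.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sizes, chosen only for illustration.
constexpr int TILE   = 8;               // sub-image is TILE x TILE pixels
constexpr int PAD    = 1;               // width of the zero border on each side
constexpr int PADDED = TILE + 2 * PAD;  // side length of the padded tile

// One block handles one sub-image and writes its padded copy to `out`.
__global__ void load_padded_tile(const float* image, float* out)
{
    __shared__ float tile[PADDED * PADDED];
    const float* src = image + blockIdx.x * TILE * TILE;

    // Step 1: all threads cooperatively zero the whole padded tile.
    for (int i = threadIdx.x; i < PADDED * PADDED; i += blockDim.x)
        tile[i] = 0.0f;
    __syncthreads();  // reached by every thread, not inside a branch

    // Step 2: only the first (up to) 32 threads load the interior.
    int loaders = blockDim.x < 32 ? blockDim.x : 32;
    if (threadIdx.x < loaders) {
        for (int i = threadIdx.x; i < TILE * TILE; i += loaders) {
            int r = i / TILE, c = i % TILE;
            tile[(r + PAD) * PADDED + (c + PAD)] = src[i];
        }
    }
    __syncthreads();  // again outside the if, so the other threads wait here

    // Step 3: every thread may now safely read the loaded, padded tile.
    float* dst = out + blockIdx.x * PADDED * PADDED;
    for (int i = threadIdx.x; i < PADDED * PADDED; i += blockDim.x)
        dst[i] = tile[i];
}

// Host helper (hypothetical): copies the data over, launches, and copies back.
std::vector<float> run_padded_copy(const std::vector<float>& image,
                                   int num_tiles, int threads)
{
    assert(image.size() == static_cast<size_t>(num_tiles) * TILE * TILE);
    std::vector<float> out(static_cast<size_t>(num_tiles) * PADDED * PADDED, -1.0f);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, image.size() * sizeof(float));
    cudaMalloc(&d_out, out.size() * sizeof(float));
    cudaMemcpy(d_in, image.data(), image.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    load_padded_tile<<<num_tiles, threads>>>(d_in, d_out);
    cudaMemcpy(out.data(), d_out, out.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Calling `run_padded_copy(img, 1, 128)` on a single 8 x 8 sub-image should return a 10 x 10 tile whose border is zero and whose interior matches `img`. The first barrier matters because step 1 and step 2 write to the same interior locations from different threads.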
But the problem here is that I want to call __syncthreads() in only the first 32 threads. Will the rest of the threads detect this dependency and wait?
So if I write the following code:
if (threadIdx.x < 32)
{
    // load data
    __syncthreads();
}
else {
    // code for the rest of the threads, which will use the data loaded by the first 32 threads