After some further debugging, I found that the problem I asked about in the other thread (“__syncthreads() and global memory”) actually comes from the shared memory utilization strategy I’m trying to use. Namely, depending on the problem size, the kernel in question may use either shared or global memory for a given temporary array. I calculate the shared memory usage before launching the kernel; if I decide that the problem is too big to keep the temporary array in shared memory, I allocate an array in global memory and pass its pointer to the kernel. On the other hand, if I decide that the problem is small enough for the temporary array to fit in shared memory, I pass 0 as the value of this kernel argument. The kernel itself then starts with code like this:
extern __shared__ char sharedData[];
float* temp = (tempd == 0) ? (float*) sharedData : tempd; // "tempd" is the kernel argument mentioned above
After looking further through the Programming Guide, I realized that this is a case where nvcc cannot decide which kind of pointer (to shared or to global memory) is being used, and it appears to have settled on the shared memory interpretation (strangely, without emitting any warning). As a result, __syncthreads() seems to be synchronizing shared memory writes only.
Now, I guess my question is: what is the best way to code around this problem? One obvious solution is to have two versions of the kernel that differ only in the initialization of the “temp” pointer above. Also, the pattern I’ve tried to describe (use shared memory for intermediate results when possible, otherwise fall back to global memory) seems like it should be common enough, so I’m wondering whether there is a better approach.
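For what it’s worth, here is a sketch of the “two versions of the kernel” idea expressed with a template parameter instead of two hand-written copies, so the memory space of `temp` is fixed at compile time in each instantiation. The kernel name, the `tempBytes` / `availableSharedBytes` variables, and the launch configuration are all illustrative, not from the original code:

```cuda
extern __shared__ char sharedData[];

// Compile-time flag selects the memory space for the temporary array,
// so each instantiation contains only one of the two pointer origins
// and nvcc never has to guess which space "temp" points into.
template <bool UseShared>
__global__ void myKernel(float* tempd /*, other arguments... */)
{
    float* temp = UseShared ? (float*) sharedData : tempd;

    // ... kernel body working on temp ...

    __syncthreads(); // barrier; temp's memory space is now unambiguous
}

// Host side (sketch): choose the instantiation from the shared memory budget.
// if (tempBytes <= availableSharedBytes)
//     myKernel<true><<<grid, block, tempBytes>>>(0 /*, ... */);
// else
//     myKernel<false><<<grid, block, 0>>>(tempdGlobal /*, ... */);
```

This keeps a single kernel body to maintain while still giving the compiler an unambiguous pointer in each compiled version.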
Thanks.