Hi there!
Here’s a simple kernel that sums a 32-element array filled with ones, using a single warp, on a GTX 960:
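__global__ void addKernel(int *c, const int *a)
{
    // Single-warp kernel: bail out if launched with more than 32 threads.
    if (blockDim.x > 32)
        return;
    // 48 slots so that nIdx + 16 stays in bounds for nIdx up to 31.
    __shared__ /*volatile*/ int pSum[48];
    size_t nIdx = threadIdx.x;
    pSum[nIdx] = a[nIdx];
    // Tree reduction with no explicit synchronization between steps.
    pSum[nIdx] += pSum[nIdx + 16];
    pSum[nIdx] += pSum[nIdx + 8];
    pSum[nIdx] += pSum[nIdx + 4];
    pSum[nIdx] += pSum[nIdx + 2];
    pSum[nIdx] += pSum[nIdx + 1];
    if (nIdx == 0)
        c[0] = pSum[0];
}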
With the latest driver versions (>= 375.86), the kernel produces 12 instead of the 32 it produced with earlier drivers.
Important notes: the kernel is built without GPU debug info (i.e. without the nvcc -G flag), and the shared array “pSum” is declared non-volatile. Reproduced with CUDA 6.0 and 7.5.
So I wonder: is this a driver issue?
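To reproduce this, a host-side harness along the following lines should be enough. It is only a minimal sketch: main, the buffer names hA, dA and dC, and the file name repro.cu are illustrative assumptions rather than part of the original project. It launches the kernel as described above (one block of 32 threads, input filled with ones) and prints whatever the device returns, so the value can be compared across driver versions.

```cuda
// repro.cu (hypothetical file name); build without -G, e.g.: nvcc -o repro repro.cu
#include <cstdio>
#include <cuda_runtime.h>

// Kernel as posted; pSum is intentionally left non-volatile.
__global__ void addKernel(int *c, const int *a)
{
    if (blockDim.x > 32)
        return;
    __shared__ /*volatile*/ int pSum[48];
    size_t nIdx = threadIdx.x;
    pSum[nIdx] = a[nIdx];
    pSum[nIdx] += pSum[nIdx + 16];
    pSum[nIdx] += pSum[nIdx + 8];
    pSum[nIdx] += pSum[nIdx + 4];
    pSum[nIdx] += pSum[nIdx + 2];
    pSum[nIdx] += pSum[nIdx + 1];
    if (nIdx == 0)
        c[0] = pSum[0];
}

// Print the CUDA error (if any) and report whether the call succeeded.
static bool ok(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int N = 32;              // one warp's worth of elements
    int hA[N];
    for (int i = 0; i < N; ++i)
        hA[i] = 1;                 // input array filled with ones
    int hC = 0;

    int *dA = NULL;
    int *dC = NULL;
    if (!ok(cudaMalloc((void **)&dA, N * sizeof(int)), "cudaMalloc(dA)")) return 1;
    if (!ok(cudaMalloc((void **)&dC, sizeof(int)), "cudaMalloc(dC)")) return 1;
    if (!ok(cudaMemcpy(dA, hA, N * sizeof(int), cudaMemcpyHostToDevice), "copy to device")) return 1;

    addKernel<<<1, N>>>(dC, dA);   // a single block of 32 threads
    if (!ok(cudaGetLastError(), "kernel launch")) return 1;
    if (!ok(cudaDeviceSynchronize(), "kernel execution")) return 1;

    if (!ok(cudaMemcpy(&hC, dC, sizeof(int), cudaMemcpyDeviceToHost), "copy to host")) return 1;
    printf("sum = %d\n", hC);      // value to compare between driver versions

    cudaFree(dA);
    cudaFree(dC);
    return 0;
}
```

Uncommenting volatile in the pSum declaration and rebuilding gives the other variant mentioned in the notes, so both can be checked with the same harness.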