Reduce kernel with latest NVIDIA Display Driver (>=375.86)

Hi there!
Here’s a simple kernel that sums 32-element array filled with ones in one warp on GTX 960:

global void addKernel(int *c, const int *a)
if (blockDim.x > 32)

__shared__ /*volatile*/ int pSum[48];
size_t nIdx = threadIdx.x;

pSum[nIdx] = a[nIdx];

pSum[nIdx] += pSum[nIdx + 16];
pSum[nIdx] += pSum[nIdx + 8];
pSum[nIdx] += pSum[nIdx + 4];
pSum[nIdx] += pSum[nIdx + 2];
pSum[nIdx] += pSum[nIdx + 1];

if (nThreadIdx == 0)
	c[0] = pSum[0];


On the latest driver version (>=375.86) the kernel produces the result of 12 instead of 32 with earlier driver.
Important notes: the kernel is built without GPU debug info (nvcc -G flag) and the shared variable “pSum” declared as non-volatile. Reproduced with CUDA 6.0/7.5

So I wonder if there was a driver issue?

For speed you should use warp shuffle reduction anyway.

Good suggestion by our chief financialist, because that would eliminate your problem with missing synchronization.
As it stands, your kernel has a bug.

The use of the volatile keyword for shared memory in a parallel reduction has been advocated at least since CUDA 3.0 came out… So leaving this keyword away could lead to all kinds of problems.

The use of shared memory for a parallel reduction is only necessary if you still have to support the Fermi platform or older. Kepler and newer architectures support warp shuffle.

Thanks for replies.
Yes, warp shuffle functionality is the right thing to go with, but we are still dealing with CC 2.0 at the moment…