Compiler bug in 4.1 & 4.2?

Hi all, before I report this as a bug I just wanted to do a sanity check and make sure others agree as to what the correct output of the following program should be.

#include <stdio.h>

#include <thrust/host_vector.h>

#include <thrust/device_vector.h>

template<int nThreadsPerBlock, int WarpSize>

__global__ void testKernel( float* out )


  const int nWarpsPerBlock = nThreadsPerBlock / WarpSize;

  const int iWarp = threadIdx.y;

  const int iLane = threadIdx.x;

float a = 1.0f;

  float b = 0.0f;

volatile __shared__ float bsum[ nWarpsPerBlock ][ WarpSize + WarpSize/2 ];

  volatile float *sum = bsum[ iWarp ];

sum[ iLane ] = a;

  a = sum[ 0 ];

  sum[ iLane ] = b;

if( iLane == 0 )


    out[ 0 ] = a;



int main()


  thrust::device_vector<float> output(1);

const int nThreadsPerBlock = 32;

  const int WarpSize = 32;

  testKernel<nThreadsPerBlock, WarpSize><<< 1, dim3(32, 1) >>>

    ( thrust::raw_pointer_cast( &output[0] ) );

float out = output[ 0 ];

  printf( "output=%f\n", out );


It seems to me that the output should be 1. Cuda 4.0 agrees. 4.1 & 4.2 output 0. Adding a syncthreads above the if statement make 4.1 and 4.2 give 1 as well, but it doesn’t seem that a syncthreads should be needed. Am I missing something about warp synchronous programming here?



Best I can tell, in the absence of __syncthreads() there is nothing that prevents the compiler from arranging the assignments such that the output is either 0 or 1. Use of “volatile” just forces the affected variables into local memory, but does not affect the ordering of the assignments.

Making ‘a’ volatile also fixes the problem. Which makes more sense to me, the compiler can no longer move the load of ‘a’ around. I don’t quite understand why a syncthreads coming after all three of the assignment statements changes how the compiler re-orders those statements though.

Thanks for the help!

That makes sense to me. Making all the variables volatile should fix the order of assignments sufficiently under C/C++ semantics. The way I look at it, what prevents the compiler from moving the following code up (relative to text order)?

if( iLane == 0 )  {    out[ 0 ] = a;  };

You might want to check whether the version with __syncthreads() is faster or slower than using volatile. Use of the volatile qualifier typically forces a variable into local memory. My own preference is for using __syncthreads() for clarity, rather than “abusing” volatile.


I checked with a couple of compiler engineers, and they indicated my analysis is flawed and it does look like there is a bug here. They agree that the result should be 1 (4.0 result). Can you please file a bug against the compiler? Thank you for your help, and sorry for incorrect analysis.