Unsynchronized shared memory access

In the following example:
CUDA-8.0_Samples/3_Imaging/dct8x8/dct8x8_kernel2.cuh

I used that code in my program and it works fine. However I think it’s incorrect. The threads within the same warp writes to shared memory to exchange data, but there’s no __syncthreads() call and the shared memory isn’t declared volatile.

I found out the hard way that ‘volatile shared’ is required even within the same warp. I didn’t suspect this because I assumed the example code was correct.

volatile __shared__ __half shared_mem[12*32*32];
shared_mem[0] = __float2half(0);

test.cu(149): error: no operator “=” matches these operands
operand types are: volatile __half = __half

Works if I remove volatile. Sigh.

define your shared memory normally

__shared__ __half shared_mem[...];

when you need a volatile version of it, define a volatile pointer and use that:

volatile half *vol_shared_mem = shared_mem;

Only use the volatile version when you need to, because it reduces the compiler’s ability to optimize.

#include <cuda_fp16.h>
__global__ void kernel(){
  __shared__ half shared_mem[12*32*32];
  shared_mem[0] = __float2half(0);
  volatile half *vol_s_m = shared_mem;
  ...
}

That doesn’t work.

__shared__ half shared_mem[12*32*32];
volatile half *vol_mem = shared_mem;
half foo = vol_mem[0];

test.cu(154): error: class “__half” has no suitable copy constructor

It looks like compiler support for fp16 is broken.

Feel free to file a bug. The issue is not with the compiler, but with the specific implementations of half and that particular intrinsic.

Filed #1904069.

Thanks.