In the following example:
CUDA-8.0_Samples/3_Imaging/dct8x8/dct8x8_kernel2.cuh
I used that code in my program and it works fine. However I think it’s incorrect. The threads within the same warp writes to shared memory to exchange data, but there’s no __syncthreads() call and the shared memory isn’t declared volatile.
I found out the hard way that ‘volatile shared’ is required even within the same warp. I didn’t suspect this because I assumed the example code was correct.
volatile __shared__ __half shared_mem[12*32*32];
shared_mem[0] = __float2half(0);
test.cu(149): error: no operator “=” matches these operands
operand types are: volatile __half = __half
Works if I remove volatile. Sigh.
define your shared memory normally
__shared__ __half shared_mem[...];
when you need a volatile version of it, define a volatile pointer and use that:
volatile half *vol_shared_mem = shared_mem;
Only use the volatile version when you need to, because it reduces the compiler’s ability to optimize.
#include <cuda_fp16.h>
__global__ void kernel(){
__shared__ half shared_mem[12*32*32];
shared_mem[0] = __float2half(0);
volatile half *vol_s_m = shared_mem;
...
}
That doesn’t work.
__shared__ half shared_mem[12*32*32];
volatile half *vol_mem = shared_mem;
half foo = vol_mem[0];
test.cu(154): error: class “__half” has no suitable copy constructor
It looks like compiler support for fp16 is broken.
Feel free to file a bug. The issue is not with the compiler, but with the specific implementations of half and that particular intrinsic.