I was stuck on this problem for over two days, but after a lot of googling I finally narrowed it down to local/shared memory byte alignment on NVIDIA GPUs.
The problem I am experiencing is almost identical to the one described in this Stack Overflow thread (which can probably be traced back to a post somewhere in this forum).
Basically, I dynamically allocate a local/shared memory buffer using clSetKernelArg(…,NULL) and use it in a kernel.
The requested shared memory size is only about 1280 bytes.
This kernel works perfectly fine on AMD/Intel CPUs and GPUs, but fails on every NVIDIA GPU I tested (Titan V, 1080 Ti, 1080 …). Running cuda-memcheck on my OpenCL code gave the following error:
========= Invalid __shared__ write of size 4
========= at 0x000002b0 in mcx_main_loop
========= by thread (24,0,0) in block (149,0,0)
========= Address 0x00000241 is misaligned
========= Device Frame:mcx_main_loop (mcx_main_loop : 0x2b0)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1 [0x102d2f]
========= Host Frame:../../bin/mcxcl [0x567b]
========= Host Frame:../../bin/mcxcl [0x208e]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
========= Host Frame:../../bin/mcxcl [0x20f9]
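To spell out what "misaligned" means for the address in that report, here is a minimal sketch in plain C. The helper names is_aligned and misalignment are hypothetical, not part of OpenCL or CUDA, and the sketch assumes that a write of size 4 must land on a 4-byte boundary, which is what the message suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if addr is a multiple of alignment,
   0 otherwise. alignment must be a nonzero power of two
   (4 for a float-sized access, 8 for a double-sized one). */
int is_aligned(uintptr_t addr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr & (alignment - 1)) == 0;
}

/* Hypothetical helper: returns how many bytes addr sits past the
   previous boundary of the given alignment (0 means aligned). */
size_t misalignment(uintptr_t addr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size_t)(addr & (alignment - 1));
}
```

For the reported address, is_aligned(0x241, 4) is 0 and misalignment(0x241, 4) is 1, so the 4-byte write lands one byte past a 4-byte boundary. The 1280-byte request itself is a multiple of 8, so the size alone does not explain the odd address.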
So the entire local/shared memory buffer I pass to the kernel cannot be read or written.
From reading the Stack Overflow reply above, it looks like the compiler requires the shared memory buffer to be 8-byte aligned, but I don't know how to do that. I declared the local pointer using the aligned attribute, but it did not help:
__local float *ppath __attribute__ ((aligned (32)));
It looks like the misalignment happens on the host side.
Can someone tell me how to fix this? I can see a bunch of similar reports for NVIDIA GPUs.