I was stuck on this problem for over two days, but after a lot of googling I finally narrowed it down to local/shared memory byte alignment on NVIDIA GPUs.
The problem I am experiencing is almost identical to what was described in this stackoverflow thread (which can probably be traced back to somewhere in this forum).
Basically, I dynamically allocate a local/shared memory buffer using clSetKernelArg(…,NULL) and use it in a kernel:
https://github.com/fangq/mcxcl/blob/mcx20197/src/mcx_host.cpp#L578
https://github.com/fangq/mcxcl/blob/mcx20197/src/mcx_core.cl#L1095
https://github.com/fangq/mcxcl/blob/mcx20197/src/mcx_core.cl#L1113
The requested shared memory size is only about 1280 bytes.
This kernel works perfectly fine on AMD/Intel CPUs and GPUs, but fails on all tested NVIDIA GPUs (Titan V, 1080 Ti, 1080 …). Running cuda-memcheck on my OpenCL code gave the following error:
========= Invalid __shared__ write of size 4
========= at 0x000002b0 in mcx_main_loop
========= by thread (24,0,0) in block (149,0,0)
========= Address 0x00000241 is misaligned
========= Device Frame:mcx_main_loop (mcx_main_loop : 0x2b0)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1 [0x102d2f]
========= Host Frame:../../bin/mcxcl [0x567b]
========= Host Frame:../../bin/mcxcl [0x208e]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
========= Host Frame:../../bin/mcxcl [0x20f9]
So the entire local/shared memory buffer passed to the kernel cannot be read or written.
From reading the stackoverflow reply mentioned above, it looks like the compiler requires the shared memory buffer to be 8-byte aligned, but I don't know how to do that. I declared the local pointer using the aligned attribute syntax, but that did not help:
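For reference, the arithmetic behind the "misaligned" message can be checked by hand: the report shows a 4-byte write to address 0x241 (577 decimal), and 577 is not a multiple of 4. The small helper below is only an illustrative sketch of that check. The function names are hypothetical and are not part of mcxcl or of any NVIDIA tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes by which addr overshoots the
 * previous multiple of access_size (0 means the access is aligned). */
static inline size_t misalignment(uintptr_t addr, size_t access_size) {
    assert(access_size != 0);
    return (size_t)(addr % access_size);
}

/* Hypothetical helper: 1 if an access of access_size bytes at addr
 * is naturally aligned, 0 otherwise. */
static inline int is_aligned(uintptr_t addr, size_t access_size) {
    return misalignment(addr, access_size) == 0;
}
```

Calling `is_aligned(0x241, 4)` gives 0, and `misalignment(0x241, 4)` gives 1, so the reported address is one byte past a 4-byte boundary. It fails a 4-byte alignment check, not just an 8-byte one.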
__local float *ppath __attribute__ ((aligned (32)));
It looks like the misalignment happens on the host side.
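One thing that could be checked on the host side is whether the byte count passed to clSetKernelArg, and any byte offsets used to carve the local buffer into sub-arrays, are multiples of the element size. The sketch below rounds a byte count up to the next multiple of a power-of-two alignment. The name `round_up` is hypothetical, and this is an assumption about where to look, not a confirmed fix for this kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round nbytes up to the next multiple of align.
 * Assumes align is a nonzero power of two (e.g. 4, 8, 32). */
static inline size_t round_up(size_t nbytes, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (nbytes + align - 1) & ~(align - 1);
}
```

For example, `round_up(0x241, 4)` yields 0x244, while a size that is already a multiple of 8, such as 1280, is returned unchanged by `round_up(1280, 8)`.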
Can someone tell me how to fix this? I can see a number of similar reports for NVIDIA GPUs:
https://devtalk.nvidia.com/default/topic/911395/cl_invalid_command_queue-when-clfinish/
https://community.khronos.org/t/clfinish-is-returning-cl-invalid-command-queue/4012
https://devtalk.nvidia.com/default/topic/501409/cl_invalid_command_queue-error-on-clfinish-command-a-lot-of-operations-in-each-kernel-driver-crash/?offset=2
https://stackoverflow.com/questions/35190126/opencl-clfinish-returning-36