CL_INVALID_COMMAND_QUEUE error due to local memory byte alignment

Got stuck on this problem for over two days, but after lots of googling, I finally narrowed the problem down to the local/shared memory byte alignment on nvidia gpus.

The problem I am experiencing is almost identical to what was described in this stackoverflow thread (probably can be traced back to somewhere in this forum)

https://stackoverflow.com/questions/36208452/dynamic-allocation-in-shared-memory-in-opencl-on-nvidia?answertab=votes#tab-top

basically, I dynamically allocate a local/shared memory buffer use clSetKernelArg(…,NULL) and use it in a kernel

https://github.com/fangq/mcxcl/blob/mcx20197/src/mcx_host.cpp#L578
https://github.com/fangq/mcxcl/blob/mcx20197/src/mcx_core.cl#L1095
https://github.com/fangq/mcxcl/blob/mcx20197/src/mcx_core.cl#L1113

The requested shared mem size is only about 1280 bytes.

This kernel works perfectly fine on AMD/Intel CPUs and GPUs, but fails on all tested NVIDIA gpus (titan v, 1080ti, 1080 …). Running cuda-memcheck with my opencl code gave the following error:

========= Invalid __shared__ write of size 4
=========     at 0x000002b0 in mcx_main_loop
=========     by thread (24,0,0) in block (149,0,0)
[b]=========     Address 0x00000241 is misaligned
[/b]=========     Device Frame:mcx_main_loop (mcx_main_loop : 0x2b0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1 [0x102d2f]
=========     Host Frame:../../bin/mcxcl [0x567b]
=========     Host Frame:../../bin/mcxcl [0x208e]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
=========     Host Frame:../../bin/mcxcl [0x20f9]

so, the entire passed local/shared mem buffer can not be read/write.

From reading the above stackoverflow reply, it looks like the compiler asks the shared mem buffer to be 8-byte aligned, but I don’t know how to do that. I declared the local pointer using the align syntax, but not helpful

__local float *ppath  __attribute__ ((aligned (32)));

looks like the misalignment happens on the host side.

can someone tell me how to fix this? I can see a bunch of of similar reports for nvidia gpus

https://devtalk.nvidia.com/default/topic/911395/cl_invalid_command_queue-when-clfinish/
https://community.khronos.org/t/clfinish-is-returning-cl-invalid-command-queue/4012
https://devtalk.nvidia.com/default/topic/501409/cl_invalid_command_queue-error-on-clfinish-command-a-lot-of-operations-in-each-kernel-driver-crash/?offset=2
https://stackoverflow.com/questions/35190126/opencl-clfinish-returning-36

never mind, problem fixed after reading the stackoverflow post, need to define the sharedmem paramter using __local ulong *. (__local float seems to work as well).

A shared write of size 4 only requires alignment to a 4-byte boundary.

As suggested in the (first) SO link you provided, the problem there was declaration of a void pointer and expecting it to be aligned to some level. A proposed solution was not to declare a __local void pointer, but instead declare the __local pointer to a type that would be aligned as expected.

You appear to be doing this:

https://github.com/fangq/mcxcl/blob/mcx20197/src/mcx_core.cl#L1095

__local float *sharedmem

should provide the necessary alignment for 4 byte reads/writes.

However this is a bit worrisome, but may not be an issue:

https://github.com/fangq/mcxcl/blob/mcx20197/src/mcx_core.cl#L1113

You would have to convince yourself that the pointer arithmetic there (since you are casting internally to a char pointer) always produces an offset that is whole-number multiple of 4.

Anyway, the way I would debug this is:

  1. identify the line of code that is causing the misaligned access
  2. debug up to that point, and print out the numerical value of the __local pointer that is being dereferenced in the misaligned access. (Clearly, an address like 0x00000241 as reported in your cuda-memcheck output is obviously misaligned for a 4-byte access).
  3. If you find that the pointer (base + offset) is not misaligned, that would be quite weird. However presumably you will find that it is misaligned. If that is the case, then debug backward to the point where the pointer or pointer arithmetic produces a misaligned value.
  4. If you find that the base pointer passed to the kernel in:

__local float *sharedmem

is misaligned (inspect the numerical value of it, or print it out, or whatever) then that would be truly strange. At that point I would suggest creating a minimal reproducible example of it, and if that passes the sniff test, then file a bug. On the other hand, if the base pointer is aligned, then some sort of arithmetic that you are doing to it in your code is resulting in the misalignment. You would have to identify that and fix it. It is a bug in your code.

thanks Rob, your comments are helpful as always

again, what I previously had was __local char* sharedmem:

https://github.com/fangq/mcxcl/commit/136f3bf1f94882d388cc72283c031f4f80c73f6e#diff-3e7bff849d973dfbbbf2ff6591ee8862L1095

but after changing it to __local float*, the issue is gone.

incidentally, that was true. my shared mem buffer is a stack of floats, so it is always multiple of 4 bytes

https://github.com/fangq/mcxcl/blob/mcx20197/src/mcx_host.cpp#L313

is this requirement also true for CUDA? I declared shared char * in my cuda code, but has never experienced issues for the cuda version

https://github.com/fangq/mcx/blob/master/src/mcx_core.cu#L97

anyways, would be nice of the compiler warns about misalignment, especially it seems that such requirement is only specific to NVIDIA GPUs/compiler.

I’m not sure what “again” means. Nowhere in your original posting did you indicate that you had a __local char* sharedmem, and the link you provided and which I referred to:

https://github.com/fangq/mcxcl/blob/mcx20197/src/mcx_core.cl#L1095

indicates

__local float *sharedmem

I’m not sure how I was supposed to know you had a __local char *sharedmem somewhere.

Certainly CUDA requires natural alignment. That is covered in the CUDA programming guide. I guess your question is “does CUDA guarantee that the pointer passed on dynamic shared memory allocation is always aligned to a particular boundary” I really don’t know the answer. I would do the same thing in CUDA that I have suggested in OpenCL. If you need a dynamically allocated shared pointer whose base alignment is compliant to 8-byte boundary, I would declare it as pointing to an 8-byte type. Beyond that I would use align directives which are covered in the CUDA programming guide. Going from a char pointer (or void pointer) to another (larger-granularity) pointer is fraught with peril, if for no other reason because of the CUDA requirement for natural alignment. I just don’t see much of that type of pointer casting in CUDA and it needs to be handled carefully, even in CUDA.

The compiler can’t detect misalignment. It has no idea what the numerical value of a pointer is, and doesn’t in the general case know what your offset calculations will amount to.

Perhaps what you are saying is the compiler should warn any time you are casting from a pointer of lower granularity to a pointer of higher granularity. I haven’t given a lot of thought to that, but the usual suggestion is to file an RFE (follow the bug filing directions in the sticky post at the top of this sub-forum).

sorry for the confusion. perhaps I should stop referencing line of codes using the HEAD version, instead, using a specific commit, since HEAD is always changing.

at the time when I posted the initial question, the sharedmem variable was still __local char*, as in this

https://github.com/fangq/mcxcl/blob/c0d0c9aa9e97e53404c7f10df41dc7138b3cbb1a/src/mcx_core.cl#L1095

then, just as I posted my 2nd post, I modified it to __local ulong * as in

https://github.com/fangq/mcxcl/commit/136f3bf1f94882d388cc72283c031f4f80c73f6e

and, shortly after my 2nd post, and before your reply, I changed it to __local float*, as in

https://github.com/fangq/mcxcl/commit/53bcd772a3f1cd162c11a1c2ea21e4755e29202a

again, sorry that the HEAD link has changed several times.

good to know. I will change my sharedmem data type to something align with float to be on the safe side.

are you saying that the AMD/Intel OpenCL compiler worked and produced correct execution results were just by accident? When I searched the NVIDIA OpenCL best practices guide,I did not see such warnings, but it sounds like a general guideline that I should remember.

To my knowledge, x86 CPUs don’t generally have the alignment requirements that GPUs have, for non-vector-unit loads/stores.

https://stackoverflow.com/questions/3025125/cpu-and-data-alignment

For example, an x86 machine (AFAIK) can load a 4-byte integer from any byte offset, e.g. base_addr+0, or base_addr+1, or base_addr+2, or base_addr+3.

I don’t know how AMD or Intel GPUs work in this respect. I couldn’t say precisely what is happening in non-CUDA-GPU cases.

The most recent NVIDIA OpenCL programming guide I found is certainly not as clear as the CUDA programming guide in this respect. However it clearly states that reading unaligned 8 or 16-byte quantities from global memory will produce unexpected results:

http://developer.download.nvidia.com/compute/DevZone/docs/html/OpenCL/doc/OpenCL_Programming_Guide.pdf

3.3.2.1.1:

“Reading non-naturally aligned 8-byte or 16-byte words produces incorrect results”

However there is no recital for 4 byte quantities (not sure why) and there is no equivalent recital for __local accesses, that I could find.

I think for safety, an OpenCL programmer should follow the same natural alignment requirements that CUDA has. If you can find chapter and verse in the OpenCL specification that says that these kinds of unaligned accesses should be allowed (without adverse effects) then I would suggest to file a bug. You’re also welcome to file bugs against the NVIDIA OpenCL documentation.

Now I encountered a new issue.

I realized that in the shared memory buffer, I do not only write floats, but also RandType (typedef from ulong, 8-byte) buffers. All the RandType are located at the beginning of the shared-mem buffer, followed by 4-byte floats.

you can see from my __local float* ppath definition, it skips the first b[/b] bytes of data, reserved for the RandType data.

https://github.com/fangq/mcxcl/blob/c0d0c9aa9e97e53404c7f10df41dc7138b3cbb1a/src/mcx_core.cl#L1113

If I define sharedmem parameter as __local float* in the kernel argument, then, when writing to the RandType data buffer at the first half of my shared mem buffer, nvidia gpu gives the same misalignment error (again, such scenario also occurs in my CUDA version, but had never been an issue, only becomes a problem in opencl).

I could only resolve this by declaring sharedmem as __local RandType *sharedmem in the kernel argument.

I just want to make sure I understand this: when I intend to write multiple data types in the shared mem buffer, if these type sizes are in the multiple of each other(such as double, float, int), then the sharedmem pointer should be declared as the one with the largest size?

I am not currently using this feature, but, just trying to understand this better, is it possible to store a mixture of double, float and short numbers in the shared mem buffer?

on a side note, cuda-memcheck seems to work with opencl kernel on the NVIDIA GPU, but it does not print the kernel line numbers as it does for CUDA. For example, the previous error I got

========= Invalid __shared__ write of size 4
=========     at 0x000002b0 in mcx_main_loop
=========     by thread (24,0,0) in block (149,0,0)
=========     Address 0x00000241 is misaligned
=========     Device Frame:mcx_main_loop (<b>mcx_main_loop : 0x2b0</b>)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1 [0x102d2f]
=========     Host Frame:../../bin/mcxcl [0x567b]
=========     Host Frame:../../bin/mcxcl [0x208e]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
=========     Host Frame:../../bin/mcxcl [0x20f9]

on line#5, it only shows my kernel name and a hex number “0x2b0”.

I am wondering if there is a clBuildProgram flag on the NVIDIA ocl to backtrace the source code lines? I tried -g and -backtrace (seems to be supported with nvcc) but neither of these works (both caused an error: “Unrecognized build options”)

would be super useful if cuda-memcheck can backtrace the line number for opencl.

I expect it should be possible. If you do

__local double *data

then the pointer you receive should be 8-byte naturally aligned (specifically, the lower 3 bits of the pointer numerical value/address should all be zero). Thereafter you simply need to make sure any accesses within that pointer “space” are naturally aligned. You can cast that pointer to a float and use it safely for float quantities. You can cast that pointer to short and use it safely for short quantities. When I say “that pointer” I mean

(data + C)

exactly as I have written, where C is a non-negative integer index that is within the valid allocated range of the pointer data. You can take that pointer, and cast it to either float or short, and use it safely.

Other things are safely possible, but some other things are not.

This is acceptable:

*((double *)(((float *)data) + 2))

For example, this would be illegal:

*((double *)(((unsigned short *)data) + 1))

If you need to do intermixing of various types, it would be a good idea to obtain a solid grasp of the natural alignment requirement.

If you attempt to dereference an 8-byte quantity pointer (e.g. double, unsigned long long, etc.) and the numerical value of the pointer does not have all 3 least significant bits set to zero, then you have made a mistake in pointer arithmetic in your code. It is a bug in your code.

If you attempt to dereference a 4-byte quantity pointer (e.g. float, int, etc.) and the numerical value of the pointer does not have all 2 least significant bits set to zero, then you have made a mistake in pointer arithmetic in your code. It is a bug in your code.

If you attempt to dereference a 2-byte quantity pointer (e.g. short, etc.) and the numerical value of the pointer does not have the least significant bit set to zero, then you have made a mistake in pointer arithmetic in your code. It is a bug in your code.

These statements include the concept already covered in this thread: that evidently the __local pointer passed to an openCL kernel for NVIDIA may only be aligned to the type you specify. If you fail to account for this, it is a bug in your code. To repeat, as an example, if you specify:

__local float *data

then that pointer is only guaranteed to have the 2 least significant bits set to zero. The 3rd least significant bit may be set to 1. If you wish all 3 least significant bits to be set to zero, then choose a pointer type that refers to an 8-byte element.

If you wish, you’re welcome to file a bug (RFE) using the instructions linked in a sticky post at the top of this forum.

thanks, a feature request was submitted as Bug#2657654.