NVidia + OpenCL + structs glitches

I have noticed that the NVidia OpenCL implementation w.r.t. structs does not behave as expected. We have an application that passes the following struct as an argument in global memory:

typedef struct {
  float4 row1;
  float4 row2;
  float4 row3;
} mat3x3;

typedef struct {
  mat3x3 transform;
  mat3x3 inverse;
  float4 arg1;
  float4 arg2;
  float4 arg3;
  float4 arg4;
} transformation_t;

In the OpenCL code it is used like this:

float3 transform_coordinate(mat3x3 transformation, float3 coord) {
  float3 transformed_coord;
  transformed_coord.x = dot(transformation.row1.xyz, coord);
  transformed_coord.y = dot(transformation.row2.xyz, coord);
  transformed_coord.z = dot(transformation.row3.xyz, coord);
  return transformed_coord;
}

kernel void my_kernel(..., global transformation_t *trans, ...) {
  float3 float_coord = ...;
  // trans->transform is read from global memory and passed by value
  float3 transformed = transform_coordinate(trans->transform, float_coord);
  ...
}

In the CPU code we simply create a 160-byte array containing all the floats (i.e. no padding). The pointer passed to clEnqueueWriteBuffer is aligned to a 16-byte boundary.
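The 160-byte figure can be checked at compile time on the host. The following is a minimal sketch, not the application's actual code: host_float4, host_mat3x3, host_transformation_t and pack_transformation are hypothetical names, and host_float4 is an assumed stand-in for the 16-byte, 16-byte-aligned float4 of OpenCL C. Note that the OpenCL specification gives float3 the same size and alignment as float4 (16 bytes), so a float3-based mat3x3 is expected to occupy 48 bytes on the device as well.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side stand-in for OpenCL's float4:
   four floats, 16 bytes, 16-byte aligned (C11 _Alignas). */
typedef struct { _Alignas(16) float s[4]; } host_float4;

/* Host mirrors of the device structs from the post. */
typedef struct {
  host_float4 row1;
  host_float4 row2;
  host_float4 row3;
} host_mat3x3;

typedef struct {
  host_mat3x3 transform;
  host_mat3x3 inverse;
  host_float4 arg1;
  host_float4 arg2;
  host_float4 arg3;
  host_float4 arg4;
} host_transformation_t;

/* Compile-time checks: the build fails if the host layout
   differs from the 160-byte unpadded layout described above. */
_Static_assert(sizeof(host_float4) == 16, "float4 must be 16 bytes");
_Static_assert(sizeof(host_mat3x3) == 48, "mat3x3 must be 48 bytes");
_Static_assert(sizeof(host_transformation_t) == 160,
               "transformation_t must be 160 bytes");
_Static_assert(offsetof(host_transformation_t, inverse) == 48,
               "inverse must start at byte 48");
_Static_assert(offsetof(host_transformation_t, arg1) == 96,
               "arg1 must start at byte 96");
_Static_assert(offsetof(host_transformation_t, arg4) == 144,
               "arg4 must start at byte 144");

/* Copy a flat array of 40 floats (160 bytes) into the struct.
   This is only valid because the checks above show there is no padding. */
static void pack_transformation(const float src[40],
                                host_transformation_t *dst) {
  memcpy(dst, src, sizeof *dst);
}
```

These checks only cover the host compiler. The device compiler's view of the layout would have to be verified separately, for example by having a kernel write sizeof(transformation_t) into an output buffer.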

In the past I have already noticed some glitches (i.e. some kernel executions read or write wrong values):

  1. Linux + NV driver 331: using int4 for ‘arg1’ causes artifacts
    -> worked around by using float4
  2. Windows + NV driver 340: using float3 for mat3x3 causes artifacts
    -> worked around by using float4

Now I have a new issue after updating my driver:
  3. Windows + NV driver 352.63: using float4 for mat3x3 causes artifacts
    -> using float3 again seems to avoid the issue

So I wonder:

  • Can anybody explain this behavior?
  • Did anything change recently in the NVidia drivers that could explain this?
  • Is there anything that I might be doing wrong?

If you can create a simple code sample that demonstrates the issue, my suggestion would be to file a bug at developer.nvidia.com (if you are not already a registered developer, you would need to register first).

Yes, I know that a reproducible sample is very convenient for the NVidia devs, but creating one requires quite a bit of work on my end. As you can expect with such issues, it does not occur in a simple setup.

I also noticed that the behavior changed yet again with the most recent drivers:
  4. Windows + NV driver 361.75: I now have the artifacts in both the float4 and float3 case
    -> I have not found a workaround for this

I do not have any issues on AMD or Intel GPUs. The CPU implementations of both Intel and AMD also work fine.

Providing minimal example code that reproduces an issue is a standard requirement when reporting compiler issues, and it has been that way for the twenty years that I have been reporting such bugs against both open source and commercial compilers.

Obviously, producing a cut-down version of some misbehaving code is real work, so it’s a trade-off: Unreported bugs are not likely to get fixed.
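One ingredient of such a cut-down version is a host-side reference for transform_coordinate, so that values read back from the GPU can be compared against known-good results. The following is a rough sketch under stated assumptions: ref_vec3, ref_mat3x3, ref_transform_coordinate and ref_nearly_equal are hypothetical names, the rows are stored as four floats with the fourth ignored (mirroring the .xyz swizzle in the kernel), and the tolerance is left to the caller.

```c
/* Hypothetical host-side types for the reference computation. */
typedef struct { float x, y, z; } ref_vec3;

/* Each row holds four floats, as in the float4 version of mat3x3;
   the fourth component is ignored, like the .xyz swizzle in the kernel. */
typedef struct {
  float row1[4];
  float row2[4];
  float row3[4];
} ref_mat3x3;

/* Dot product of the first three row components with v. */
static float ref_dot3(const float row[4], ref_vec3 v) {
  return row[0] * v.x + row[1] * v.y + row[2] * v.z;
}

/* CPU counterpart of the transform_coordinate function in the kernel code. */
static ref_vec3 ref_transform_coordinate(const ref_mat3x3 *m, ref_vec3 c) {
  ref_vec3 out;
  out.x = ref_dot3(m->row1, c);
  out.y = ref_dot3(m->row2, c);
  out.z = ref_dot3(m->row3, c);
  return out;
}

/* Absolute value without depending on the math library. */
static float ref_absf(float v) {
  return v < 0.0f ? -v : v;
}

/* Returns 1 if every component of a and b differs by at most tol. */
static int ref_nearly_equal(ref_vec3 a, ref_vec3 b, float tol) {
  return ref_absf(a.x - b.x) <= tol &&
         ref_absf(a.y - b.y) <= tol &&
         ref_absf(a.z - b.z) <= tol;
}
```

A reproducer could run the kernel on a handful of coordinates, read the results back, and report the first index where ref_nearly_equal returns 0, together with the driver version.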