Builtin rotate() of 64-bit integer broken with NVIDIA CUDA 7.0 driver

I am getting wrong results when using the OpenCL builtin function rotate() on a 64-bit integer.

The following minimal program reproduces the error:

#include <CL/opencl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

const char source[] =
"__kernel void test_rotate(__global ulong *restrict d_count)"
"{"
"  const ulong n = 16;"
"  d_count[0] = (2004413935125273123UL << n) | (2004413935125273123UL >> (64 - n));"
"  d_count[1] = rotate(2004413935125273123UL, n);"
"}"
;

int main()
{
  cl_platform_id platform;
  cl_device_id device;
  cl_context context;
  cl_command_queue queue;
  cl_program program;
  cl_mem mem;
  cl_kernel kernel;

  clGetPlatformIDs(1, &platform, NULL);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
  context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
  queue = clCreateCommandQueue(context, device, 0, NULL);

  const char *sources[1] = {source};
  program = clCreateProgramWithSource(context, 1, sources, NULL, NULL);
  clBuildProgram(program, 1, &device, NULL, NULL, NULL);
  mem = clCreateBuffer(context, CL_MEM_READ_WRITE, 2*sizeof(cl_ulong), NULL, NULL);
  kernel = clCreateKernel(program, "test_rotate", NULL);
  clSetKernelArg(kernel, 0, sizeof(mem), &mem);

  const size_t work_size[1] = {1};
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, work_size, work_size, 0, NULL, NULL);

  cl_ulong *buf = (cl_ulong *)clEnqueueMapBuffer(queue, mem, true, CL_MAP_READ, 0, 2*sizeof(cl_ulong), 0, NULL, NULL, NULL);
  printf("expected: 0x%"PRIX64"\n", buf[0]);
  printf("rotate(): 0x%"PRIX64"\n", buf[1]);
  clEnqueueUnmapMemObject(queue, mem, buf, 0, NULL, NULL);
  return 0;
}
gcc -Wall -O2 -o rotate rotate.c -lOpenCL

The tests were run using a Tesla C2070 on CentOS 6.6 x86_64.

With NVIDIA driver version 346.59 (OpenCL 1.1 CUDA 7.0.35), rotate() returns a wrong result:

expected: 0x1BDAA9FC1A231BD1
rotate(): 0x1BDAC5CD35FDA9FC

With NVIDIA driver version 331.67 (OpenCL 1.1 CUDA 6.0.1), rotate() returns the correct result:

expected: 0x1BDAA9FC1A231BD1
rotate(): 0x1BDAA9FC1A231BD1

I also tested different bit shifts with NVIDIA driver version 346.59.

rotate() returns the wrong result for n = 1…31, while it is correct for n = 32…63.

Try testing with the newest linux x86_64 driver:

http://www.nvidia.com/Download/driverResults.aspx/84043/en-us

And you may want to file a bug at developer.nvidia.com

With the latest stable NVIDIA driver version 349.16, rotate() returns a wrong result:

expected: 0x1BDAA9FC1A231BD1
rotate(): 0x1BDAC5CD35FDA9FC

I would agree that it appears to be a bug. It was not clear to me whether you had filed a bug or not, so I have filed a bug with NVIDIA. I don’t have any further information at this time. If you desire additional, personalized communication, I suggest you file a bug at developer.nvidia.com.

I am unable to reproduce the bug on our cluster with the following configuration :

[mboisson@gpu-k20-15 tmp]$ ldd rotate
linux-vdso.so.1 => (0x00007fffe6fff000)
libOpenCL.so.1 => /usr/lib64/nvidia/libOpenCL.so.1 (0x00002b541919e000)
libc.so.6 => /lib64/libc.so.6 (0x00002b54193ac000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b5419740000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b5419944000)
/lib64/ld-linux-x86-64.so.2 (0x00002b5418f7c000)

[mboisson@gpu-k20-15 tmp]$ yum whatprovides /usr/lib64/nvidia/libOpenCL.so.1 | grep -B2 -A2 installed
nvidia-x11-drv-346.59-1.el6.elrepo.x86_64 : NVIDIA OpenGL X11 display driver
: files
Repo : installed
Matched from:
Other : Provides-match: /usr/lib64/nvidia/libOpenCL.so.1
[mboisson@gpu-k20-15 tmp]$ yum list installed | grep nvidia
kmod-nvidia.x86_64 346.59-1.el6.elrepo @elrepo
nvidia-x11-drv.x86_64 346.59-1.el6.elrepo @elrepo

[mboisson@gpu-k20-15 tmp]$ uname -a
Linux gpu-k20-15 2.6.32-504.16.2.el6.x86_64 #1 SMP Wed Apr 22 06:48:29 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

[mboisson@gpu-k20-15 tmp]$ ./rotate
expected: 0x1BDAA9FC1A231BD1
rotate(): 0x1BDAA9FC1A231BD1

I tried with multiple versions of cuda (always with the most recent drivers), and the results are always the same.
We have exactly the same driver versions as the OP, with the same OS, but we have K20 cards, not Tesla C2070.

The compiler performs architecture-independent as well as an architecture-dependent transformations. It is conceivable that a code generation bug only affects a particular architecture, e.g. Fermi (C0270) but not another e.g. Kepler (K20). Filing a bug report with NVIDIA would be the recommended course of action.

I was able to reproduce the issue identified by pc (and filed a bug). The issue was also duplicated by the NVIDIA QA team, and the NVIDIA driver team has identified root cause and created a fix. The fix will be available in a future r352 or higher driver. I can’t point you to a specific r352 public driver at this time that has the fix, but I assume one should be available soon - in the next 3 months or less.

I can anecodotally confirm that on my reproducer, it fails on a cc2.0 device but passes on a cc3.5 device. Therefore I believe the explanation given by njuffa is likely. There are different code generation paths taken by the driver, and those different paths have a bearing on this issue.