Bug in Driver 340.X?


The newest drivers 340.X are wasting shared memory using OpenCL, I do not why. I’ll show you an example:

__kernel kernel(float* dummy)
  __local double shared_aux[128*10];
  for(int i = 0; i < 10; i++)
    shared_aux[get_local_id(0)*10+i] = 0.0;

If you execute this OpenCL kernel in a device with driver 340.X, the profiler will tell you
that this kernel uses 128 * 10 * 8 + 8 bytes (considering the double as 8 bytes). This
value is not correct, the correct value is 128 * 10 * 8.
Well, if you execute this OpenCL kernel in a device with driver 325.X, the profiler will tell you
that this kernel uses 128 * 10 * 8, the correct value.

This is a problem of performance that only happens in OpenCL (in CUDA, the bug doesn’t exist) using the newest drivers 340.X.
So, Did you know this behaviour? and for Nvidia, will you fix it in the next driver? For now, I will
keep my old driver because this ‘bug/problem/whatever’ of the 340.X driver has a lot of penalty in
several applications (mainly those that exhaust the shared memory)

Thank you in advance

PS: 128 threads per block

reminds me of the old days when CUDA was passing arguments to the kernel via shared memory (nowadays it’s done through constant memory).

sizeof(float*) would be 8 on 64 bit systems. Just sayin’… ;)

add another argument to kernel to see if it affects reported shared memory usage…


I have added another argument and it happens the same as in the previous example, correct with the 325.X (128 * 10 * 8 bytes) and wrong with the 340.X (128 * 10 * 8 + 8 bytes).

I can confirm the behavior reported above, although the given kernel cannot be compiled.

In fact, all OpenCL kernels using shared memory seem to waste resources. Please take look at the minimal one given below:

kernel void test_sm(global int* out){	
    local int sm[1024]; 	
    sm[get_local_id(0)] = 1;	
    out[get_global_id(0)] = sm[get_local_id(0)];

It does just enough for the compiler not to remove the local memory (OpenCL term) array “sm”. It should obviously use 1024 * 4 = 4096 bytes of shared memory, which, at least in case of a consumer level graphics card, is true if driver 337.88 or an older one is used.

Now, let’s take a look at driver 344.11. Shared memory usage is 4100 bytes, as reported by the compiler, e.g. in case of a GTX 470:

ptxas info : 0 bytes gmem
ptxas info : Compiling entry function ‘test_sm’ for ‘sm_20’
ptxas info : Function properties for test_sm
ptxas . 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 4 registers, 4100 bytes smem, 36 bytes cmem[0]

The Problem occurs (at least) when using driver versions 340.52 and 344.11 on Windows 7 (64 Bit) or Windows 8.1 (64 Bit). I have evaluated it on the following graphics cards: Fermi (GTX 470, GTX 480), Kepler (GTX 670 and GTX Titan) and Maxwell (GTX 970).

As stated above, this seems to be true in case of all OpenCL kernels using shared memory and the performance-impact can be severe, due to the decreased number of active work-groups per SM.

Hi everybody,

NVidia confirmed me that it is not a bug, it is an extra word that the drivers r340.XX use for ‘internal use’. The problem is that, I do not know to do, if updating my Nvidia drivers or not. Because I only use OpenCL and this is not an improvement, unless for OpenCL it isn’t.

I think that we can close this post because my original problem was ‘solved’.

Thanks to all, bye.