OpenCL and local memory bank configuration

Is there a way to check the current local memory (i.e. shared memory) bank configuration in OpenCL? By this I mean whether successive 32-bit words or successive 64-bit words are assigned to successive banks. I know that in CUDA I can set the desired bank configuration using the cudaDeviceSetSharedMemConfig() function. Can I access this functionality through OpenCL? How?
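For reference, this is what the CUDA runtime calls in question look like on the host side; I am not aware of a documented OpenCL counterpart, so this is only a sketch of the CUDA API:

[code]
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaSharedMemConfig cfg;

    /* Query the current shared memory bank configuration. */
    cudaDeviceGetSharedMemConfig(&cfg);
    printf("bank size: %s\n",
           cfg == cudaSharedMemBankSizeEightByte ? "8 bytes" : "4 bytes");

    /* Request 64-bit (eight-byte) banks; this is a hint to the driver,
     * not a guarantee. */
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    return 0;
}
[/code]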

Another related question: The CUDA programming guide tells us that on Kepler GPUs each bank has a bandwidth of 64 bits per clock cycle. Is this also true when the local memory is in 32-bit mode? Based on my experience, the default appears to be 32-bit mode, and each bank appears to deliver only 32 bits per clock cycle. Have others had similar experiences? ADD: To clarify, I am trying to estimate the maximum theoretical local memory bandwidth, and I am wondering what happens when, for example, two threads from two different warps simultaneously access 32-bit words in the same memory bank.
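To make the estimate concrete, here is a back-of-the-envelope calculation, assuming a Tesla K40c with 15 SMXs, 32 banks per SMX, and a 745 MHz base clock (published specifications; boost clocks would raise both figures proportionally):

[code]
#include <stdio.h>

/* Rough theoretical local memory bandwidth: compute units * banks *
 * bytes-per-bank-per-clock * clock rate. Assumes a Tesla K40c at base
 * clock, no boost. */
int main(void)
{
    const double clock_hz = 745e6;
    const int smx = 15, banks = 32;

    double bw64 = smx * banks * 8.0 * clock_hz / 1e9; /* 64-bit bank mode */
    double bw32 = smx * banks * 4.0 * clock_hz / 1e9; /* 32-bit bank mode */

    /* Roughly 2861 GB/s vs. 1430 GB/s. */
    printf("64-bit mode: ~%.0f GB/s, 32-bit mode: ~%.0f GB/s\n", bw64, bw32);
    return 0;
}
[/code]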

ADD: I tried to call cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte) inside the OpenCL host code, and cudaDeviceGetSharedMemConfig() claimed that everything went OK. However, this had no effect on the measured local memory bandwidth: I am still measuring about 1250 GB/s, which is less than half of what I would expect.
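For completeness, the GB/s figure above is the usual effective-bandwidth ratio, total bytes accessed over elapsed time; a trivial sketch (the byte count and timing here are placeholders of mine, not the actual test program):

[code]
#include <stdio.h>

/* Effective bandwidth = total local memory bytes accessed / elapsed time. */
static double effective_bw_gbs(double bytes_accessed, double seconds)
{
    return bytes_accessed / seconds / 1e9;
}

int main(void)
{
    /* Example figures only: 1e12 bytes in 0.8 s -> 1250 GB/s. */
    printf("%.0f GB/s\n", effective_bw_gbs(1e12, 0.8));
    return 0;
}
[/code]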

Just in case anyone is interested… I found a “solution” to the problem. If you have an Nvidia Tesla K40c GPU and 64-bit Linux, and you use OpenCL, then you should use driver version 331.75 (Linux x64 (AMD64/EM64T) Display Driver, released 2014-05-22).

Here is why: [url]https://docs.google.com/file/d/0B7IMZIRnHA0_S3Q3Zzl2d3VDNGM/edit?usp=docslist_api[/url]

Explanation: The test program allocates ~16 kB of local memory (shared memory in CUDA terms), which means that only one work group can be active per compute unit, and each local memory access is 8 bytes wide (double). I compared an Nvidia GeForce GTX 580 GPU against an Nvidia Tesla K40c GPU. You can clearly see that the GTX 580 is significantly faster when the K40c is paired with driver version 340.32 (the latest driver for the GPU). However, the situation changes when the driver is downgraded to 331.75.
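For reference, a minimal sketch of the kind of kernel involved (the actual test program is in the linked document; the kernel name, loop counts, and the accumulation below are placeholders of mine):

[code]
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void local_bw_test(__global double *out)
{
    __local double lmem[2048];            /* 2048 * 8 B = 16 kB of local memory */
    const int lid   = get_local_id(0);
    const int lsize = get_local_size(0);

    /* Fill local memory. */
    for (int i = lid; i < 2048; i += lsize)
        lmem[i] = (double)i;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Repeatedly read 8-byte words. In 64-bit bank mode, consecutive
     * doubles map to distinct banks, so reads within a warp are
     * conflict-free; in 32-bit mode each double spans two banks. */
    double acc = 0.0;
    for (int r = 0; r < 1000; ++r)
        for (int i = lid; i < 2048; i += lsize)
            acc += lmem[i];

    out[get_global_id(0)] = acc;          /* keep the reads live */
}
[/code]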

I did a similar comparison using CUDA, and the results are consistent with the idea that the K40c GPU operates in 32-bit memory bank mode (cudaSharedMemBankSizeFourByte) under driver version 340.32 and in 64-bit memory bank mode (cudaSharedMemBankSizeEightByte) under driver version 331.75.