OpenCL and local memory bank configuration

Is there a way to check the current local memory (i.e. shared memory) bank configuration in OpenCL? By this I mean whether successive 32-bit words or successive 64-bit words are assigned to successive banks. I know that in CUDA I can set the desired bank configuration using the cudaDeviceSetSharedMemConfig() function. Can I access this functionality through OpenCL? How?
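For reference, this is what the CUDA runtime calls in question look like on the host side; I am not aware of a documented OpenCL counterpart, so this is only a sketch of the CUDA API:

[code]
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaSharedMemConfig cfg;

    /* Query the current shared memory bank configuration. */
    cudaDeviceGetSharedMemConfig(&cfg);
    printf("bank size: %s\n",
           cfg == cudaSharedMemBankSizeEightByte ? "8 bytes" : "4 bytes");

    /* Request 64-bit (eight-byte) banks; this is a hint to the driver,
     * not a guarantee. */
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    return 0;
}
[/code]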

Another related question: The CUDA programming guide tells us that on Kepler GPUs each bank has a bandwidth of 64 bits per clock cycle. Is this also true when the local memory is in 32-bit mode? Based on my experience, the default appears to be 32-bit mode, and each bank appears to deliver only 32 bits per clock cycle. Have others had similar experiences? ADD: To clarify, I am trying to estimate the maximum theoretical local memory bandwidth, and I am wondering what happens when, for example, two threads from two different warps simultaneously access 32-bit words in the same memory bank.
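To make the estimate concrete, here is a back-of-the-envelope calculation, assuming a Tesla K40c with 15 SMXs, 32 banks per SMX, and a 745 MHz base clock (published specifications; boost clocks would raise both figures proportionally):

[code]
#include <stdio.h>

/* Rough theoretical local memory bandwidth: compute units * banks *
 * bytes-per-bank-per-clock * clock rate. Assumes a Tesla K40c at base
 * clock, no boost. */
int main(void)
{
    const double clock_hz = 745e6;
    const int smx = 15, banks = 32;

    double bw64 = smx * banks * 8.0 * clock_hz / 1e9; /* 64-bit bank mode */
    double bw32 = smx * banks * 4.0 * clock_hz / 1e9; /* 32-bit bank mode */

    /* Roughly 2861 GB/s vs. 1430 GB/s. */
    printf("64-bit mode: ~%.0f GB/s, 32-bit mode: ~%.0f GB/s\n", bw64, bw32);
    return 0;
}
[/code]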

ADD: I tried to call cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte) inside the OpenCL host code, and cudaDeviceGetSharedMemConfig() claimed that everything went OK. However, this had no effect on the measured local memory bandwidth: I am still measuring about 1250 GB/s, which is less than half of what I would expect.
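For completeness, the GB/s figure above is the usual effective-bandwidth ratio, total bytes accessed over elapsed time; a trivial sketch (the byte count and timing here are placeholders of mine, not the actual test program):

[code]
#include <stdio.h>

/* Effective bandwidth = total local memory bytes accessed / elapsed time. */
static double effective_bw_gbs(double bytes_accessed, double seconds)
{
    return bytes_accessed / seconds / 1e9;
}

int main(void)
{
    /* Example figures only: 1e12 bytes in 0.8 s -> 1250 GB/s. */
    printf("%.0f GB/s\n", effective_bw_gbs(1e12, 0.8));
    return 0;
}
[/code]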

Just in case anyone is interested… I found a “solution” to the problem. If you have an Nvidia Tesla K40c GPU and 64-bit Linux, and you use OpenCL, then you should use driver version 331.75 (Linux x64 (AMD64/EM64T) Display Driver, released 2014-05-22).

Here is why: [url]https://docs.google.com/file/d/0B7IMZIRnHA0_S3Q3Zzl2d3VDNGM/edit?usp=docslist_api[/url]

Explanation: The test program allocates ~16 kB of local memory (shared memory in CUDA terms), which means that only one work group can be active per compute unit, and each local memory access is 8 bytes wide (double). I compared an Nvidia GeForce GTX 580 GPU against an Nvidia Tesla K40c GPU. You can clearly see that the GTX 580 is significantly faster when the K40c is paired with driver version 340.32 (the latest driver for the GPU). However, the situation changes when the driver is downgraded to 331.75.
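For reference, a minimal sketch of the kind of kernel involved (the actual test program is in the linked document; the kernel name, loop counts, and the accumulation below are placeholders of mine):

[code]
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void local_bw_test(__global double *out)
{
    __local double lmem[2048];            /* 2048 * 8 B = 16 kB of local memory */
    const int lid   = get_local_id(0);
    const int lsize = get_local_size(0);

    /* Fill local memory. */
    for (int i = lid; i < 2048; i += lsize)
        lmem[i] = (double)i;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Repeatedly read 8-byte words. In 64-bit bank mode, consecutive
     * doubles map to distinct banks, so reads within a warp are
     * conflict-free; in 32-bit mode each double spans two banks. */
    double acc = 0.0;
    for (int r = 0; r < 1000; ++r)
        for (int i = lid; i < 2048; i += lsize)
            acc += lmem[i];

    out[get_global_id(0)] = acc;          /* keep the reads live */
}
[/code]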

I did a similar comparison using CUDA, and the results are consistent with the idea that the K40c GPU operates in 32-bit memory bank mode (cudaSharedMemBankSizeFourByte) under driver version 340.32 and in 64-bit memory bank mode (cudaSharedMemBankSizeEightByte) under driver version 331.75.