Dear all!
My problem is the following: I have a simulation running on the GPU using double buffers, and I want the host application to poll the simulation arbitrarily often (not necessarily in every iteration). I have read the NV docs regarding OpenCL optimizations for global memory access, but they mainly talk about coalesced reads/writes. I am a lot more familiar with the APP SDK, and it seems kind of strange that getting cached reads is not mentioned anywhere. Is this because it is so natural that all reads go through the read cache under OpenCL?
The conclusion of the AMD forum thread was that for optimal performance in such situations one should create both buffers as READ_WRITE and, inside the kernel, declare the input and output as:
__kernel void anything(__global const double* restrict input, __global double* output, …) {…}
In addition, one should (I’m not quite sure whether it’s necessary) pass the “-fno-alias” option to the compiler to enable cached reads from __global memory.
Once this setup is in place, all one has to do is use clSetKernelArg() to swap the two buffers between kernel invocations. This way one avoids copying data from a WRITE_ONLY result buffer back into a cached READ_ONLY input buffer.
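For concreteness, here is a minimal host-side sketch of that ping-pong scheme. All names (ctx, queue, program, kernel, gws, host_wants_snapshot, etc.) are hypothetical, and the “-fno-alias” build option is the AMD-specific one from that thread, so other implementations may ignore or reject it:

#include <CL/cl.h>

/* Build with the no-alias hint mentioned above (AMD-specific; other
 * implementations may not accept it, in which case drop the option). */
clBuildProgram(program, 1, &device, "-fno-alias", NULL, NULL);

cl_mem buf[2];
buf[0] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
buf[1] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

int src = 0;                          /* index of the current input buffer */
for (int it = 0; it < iterations; ++it) {
    /* Swap roles by re-binding the arguments; no data is copied. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf[src]);      /* input  */
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf[1 - src]);  /* output */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    src = 1 - src;                    /* output becomes the next input */

    if (host_wants_snapshot(it))      /* hypothetical polling predicate */
        clEnqueueReadBuffer(queue, buf[src], CL_TRUE, 0, bytes,
                            host_copy, 0, NULL, NULL);
}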
This, however, seems somewhat non-trivial to me, and I’m quite sure things won’t be as optimal in this setup under the CUDA SDK. Setting both buffers to READ_WRITE seems somewhat strange, but since buffer flags cannot be changed at runtime, it seems logical enough. Anyhow, how would black-belt NV OCL programmers go about this problem? Is coalescing reads/writes really all it takes to optimize __global memory bandwidth?
Thanks in advance,
Máté
I was hoping for a bit more help. Is there really nobody who knows how caching works on NV cards under OCL?
On older NV cards (compute capability 1.x) there was only caching for texture memory, which to my knowledge is only used for images. So: no caching for buffers, no matter what you do.
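For what it’s worth, the image-based workaround implied here would look roughly like this. A sketch only, and it assumes float data, since OpenCL images have no double channel format:

/* Cached reads on compute capability 1.x via the texture path:
 * bind the data as an image instead of a buffer (floats only). */
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE   |
                           CLK_FILTER_NEAREST;

__kernel void step_img(__read_only image2d_t input,
                       __global float* output, int width) {
    int x = get_global_id(0);
    int y = get_global_id(1);
    /* read_imagef goes through the texture cache on 1.x hardware */
    float4 v = read_imagef(input, smp, (int2)(x, y));
    output[y * width + x] = v.x;
}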
For compute capability 2.x there is a two-level cache hierarchy that is used for all memory accesses. In C for CUDA there is the possibility to bypass the L1 cache by passing a flag to the compiler (as L1 has larger cache lines than L2, this can be advantageous for scattered reads). As it is a compiler flag, I doubt that cache usage is influenced by the read/write-ability of memory flags.
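If it helps, the flag I mean is (as far as I remember) the PTX assembler’s default load cache modifier, which has no documented OpenCL counterpart, so take this as CUDA-only:

/* nvcc compile lines (CUDA only, compute capability 2.x):
 *   nvcc -Xptxas -dlcm=ca kernel.cu   // default: cache in L1 and L2
 *   nvcc -Xptxas -dlcm=cg kernel.cu   // bypass L1, cache in L2 only
 * In the generated PTX this shows up as ld.global.ca vs. ld.global.cg loads. */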
I suspect that those read/writability flags play a role especially in multi-device contexts, where the runtime may have to move data between different devices.
Thank you for the answer. Is there any documentation on which CUDA compiler flags can be used in OCL? Since both compile to PTX, I suspect there is some interchangeability.
So if I understand correctly, whether reads pass through the cache (without any explicit declaration) depends on the compute capability.
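In case it helps anyone finding this thread: the compute capability can be queried from OpenCL through NVIDIA’s cl_nv_device_attribute_query extension, roughly like this (a sketch; check the device’s CL_DEVICE_EXTENSIONS string for the extension before relying on it):

#include <CL/cl.h>
#include <CL/cl_ext.h>   /* defines CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV */
#include <stdio.h>

/* Query an NVIDIA device's compute capability; on 2.x and later,
 * buffer accesses go through the L1/L2 cache hierarchy. */
void print_compute_capability(cl_device_id dev) {
    cl_uint major = 0, minor = 0;
    clGetDeviceInfo(dev, CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV,
                    sizeof(major), &major, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV,
                    sizeof(minor), &minor, NULL);
    printf("compute capability %u.%u -> buffer reads %s cached\n",
           major, minor, major >= 2 ? "are" : "are NOT");
}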