Dear all!
My problem is the following: I have a simulation running on the GPU using double buffers, and I want the host application to poll the simulation arbitrarily often (not necessarily in every iteration). I have read the NV docs regarding OpenCL optimizations for global memory access, but they mainly talk about coalesced reads/writes. I am a lot more familiar with the APP SDK, and it seems kind of strange that getting cached reads is not mentioned anywhere. Is this because it is so natural that all reads go through the read cache under OpenCL?
The conclusion of the AMD forum thread was that for optimal performance in such situations one should create both buffers as READ_WRITE and, inside the kernel, declare the input and output as:
__kernel void anything(__global const double* restrict input, __global double* output, …) {…}
In addition, one should (I’m not quite sure whether it’s necessary) pass the “-fno-alias” option to the compiler to enable cached reads from __global memory.
Once this setup is in place, all one has to do is use clSetKernelArg() to swap the two buffers between kernel invocations. This way one avoids copying data from a WRITE_ONLY result buffer back into a cached READ_ONLY input buffer.
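For concreteness, here is a minimal host-side sketch of that ping-pong scheme. All names (ctx, queue, program, kernel, gws, host_wants_snapshot, etc.) are hypothetical, and the “-fno-alias” build option is the AMD-specific one from that thread, so other implementations may ignore or reject it:

#include <CL/cl.h>

/* Build with the no-alias hint mentioned above (AMD-specific; other
 * implementations may not accept it, in which case drop the option). */
clBuildProgram(program, 1, &device, "-fno-alias", NULL, NULL);

cl_mem buf[2];
buf[0] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
buf[1] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

int src = 0;                          /* index of the current input buffer */
for (int it = 0; it < iterations; ++it) {
    /* Swap roles by re-binding the arguments; no data is copied. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf[src]);      /* input  */
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf[1 - src]);  /* output */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    src = 1 - src;                    /* output becomes the next input */

    if (host_wants_snapshot(it))      /* hypothetical polling predicate */
        clEnqueueReadBuffer(queue, buf[src], CL_TRUE, 0, bytes,
                            host_copy, 0, NULL, NULL);
}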
This, however, seems somewhat non-trivial to me, and I’m quite sure things won’t be as optimal in this setup under the CUDA SDK. Setting both buffers to READ_WRITE seems somewhat strange, but since buffer flags cannot be changed at runtime, it seems logical enough. Anyhow, how would black-belt NV OCL programmers go about this problem? Is coalescing reads/writes really all it takes to optimize __global memory bandwidth?
Thanks in advance,
Máté
I was hoping for a bit more help. Is there really nobody who knows how caching works on NV cards under OCL?
On older NV cards (compute capability 1.x) there was only caching for texture memory, which to my knowledge is only used for images. So: no caching for buffers, no matter what you do.
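For what it’s worth, the image-based workaround implied here would look roughly like this. A sketch only, and it assumes float data, since OpenCL images have no double channel format:

/* Cached reads on compute capability 1.x via the texture path:
 * bind the data as an image instead of a buffer (floats only). */
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE   |
                           CLK_FILTER_NEAREST;

__kernel void step_img(__read_only image2d_t input,
                       __global float* output, int width) {
    int x = get_global_id(0);
    int y = get_global_id(1);
    /* read_imagef goes through the texture cache on 1.x hardware */
    float4 v = read_imagef(input, smp, (int2)(x, y));
    output[y * width + x] = v.x;
}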
For compute capability 2.x there is a two-level cache hierarchy that is used for all memory accesses. In C for CUDA there is the possibility to bypass the L1 cache by passing a flag to the compiler (as L1 has larger cache lines than L2, this can be advantageous for scattered reads). As it is a compiler flag, I doubt that cache usage is influenced by the read/write-ability of memory flags.
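If it helps, the flag I mean is (as far as I remember) the PTX assembler’s default load cache modifier, which has no documented OpenCL counterpart, so take this as CUDA-only:

/* nvcc compile lines (CUDA only, compute capability 2.x):
 *   nvcc -Xptxas -dlcm=ca kernel.cu   // default: cache in L1 and L2
 *   nvcc -Xptxas -dlcm=cg kernel.cu   // bypass L1, cache in L2 only
 * In the generated PTX this shows up as ld.global.ca vs. ld.global.cg loads. */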
I suspect that those read/writability flags play a role especially in multi-device contexts, where the runtime may have to move data between different devices.
Thank you for the answer. Is there any documentation on which CUDA compiler flags can be used in OCL? Since both compile to PTX, I suspect there is some interchangeability.
So if I understand correctly, whether reads pass through the cache (without any explicit declaration) depends on the compute capability.
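In case it helps anyone finding this thread: the compute capability can be queried from OpenCL through NVIDIA’s cl_nv_device_attribute_query extension, roughly like this (a sketch; check the device’s CL_DEVICE_EXTENSIONS string for the extension before relying on it):

#include <CL/cl.h>
#include <CL/cl_ext.h>   /* defines CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV */
#include <stdio.h>

/* Query an NVIDIA device's compute capability; on 2.x and later,
 * buffer accesses go through the L1/L2 cache hierarchy. */
void print_compute_capability(cl_device_id dev) {
    cl_uint major = 0, minor = 0;
    clGetDeviceInfo(dev, CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV,
                    sizeof(major), &major, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV,
                    sizeof(minor), &minor, NULL);
    printf("compute capability %u.%u -> buffer reads %s cached\n",
           major, minor, major >= 2 ? "are" : "are NOT");
}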