GPU cache: optimizing for cache size on Win7 and Leopard

I have a Mean Shift clustering code which (on a Mac running Leopard, with the LLVM compiler) is slower on the GPU than on the CPU.

I hoped to fix the problem with some cache optimizations, using clGetDeviceInfo with the parameter CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,
but unfortunately the function returns 0 bytes for the GPU cache.

On the CPU the optimization makes sense, because there the query reports the CPU's existing 2 MB of cache.
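For reference, a minimal sketch of the query described above (assuming an OpenCL 1.x platform is installed; the first device found is used for illustration):

```c
#include <stdio.h>
#include <CL/cl.h>   /* on Mac OS X: #include <OpenCL/opencl.h> */

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_ulong cache_size = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    /* CL_DEVICE_GLOBAL_MEM_CACHE_SIZE returns a cl_ulong in bytes;
       pre-Fermi GPUs legitimately report 0 here. */
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,
                    sizeof(cache_size), &cache_size, NULL);
    printf("global mem cache: %llu bytes\n",
           (unsigned long long)cache_size);
    return 0;
}
```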

Don’t GPUs have a cache, or is this just another missing (not yet implemented) feature?
As far as I know they do, even if it is only a few kilobytes, which would already help!



In the current generation of GPUs, a cache is used only when accessing data from constant memory or from image memory via a sampler. I think the global memory cache queried by the above command refers to a cache used when you access global memory through buffer memory objects; in that case there really is no cache at all (see the NVIDIA documentation). I have read somewhere that a global memory cache might be included in GPUs based on the Fermi architecture, but for now I would suggest storing your data in an image object and accessing it through sampler objects.
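A hedged sketch of that suggestion: route reads through the texture cache by putting the data in a 2D image and sampling it in the kernel. The kernel name, data layout, and sampler settings here are illustrative, not from the original post (host-side, the image would be created with e.g. clCreateImage2D and a CL_RGBA / CL_FLOAT format):

```c
/* OpenCL C kernel source: reads go through the image sampler,
   which is cached on current GPUs, instead of uncached buffers. */
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE   |
                           CLK_FILTER_NEAREST;

__kernel void sum_rows(__read_only image2d_t points,
                       __global float4 *out)
{
    int row = get_global_id(0);
    float4 acc = (float4)(0.0f);

    for (int col = 0; col < get_image_width(points); ++col)
        acc += read_imagef(points, smp, (int2)(col, row)); /* cached read */

    out[row] = acc;
}
```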

Thank you!

I have no Fermi available (yet), so hopefully it is just another driver thing.
Let’s wait and see…

In fact there’s something really close to a CPU cache: the shared memory (implemented on nVidia GeForce 8xxx and later, as well as on ATI Radeon 56xx and above).

It’s 16 KB of local memory shared by each group of 8 SPs attached to an SM in the GeForce architecture.
Have a look at the CUDA threads and the CUDA documentation: this is not a cache, it is local memory that is far faster than the card’s main global memory, in both latency and bandwidth.