In my recent project I'm using constant memory that is accessed multiple times during a single kernel launch. For my application I measured a huge performance gain and much higher cache hit rates by using constant and texture memory for all read-only input data.
Unfortunately, the 64 KB limit on constant memory is a serious problem for the planned expansions of this application, so I'm considering alternatives. My idea is to replace all constant memory with regular global memory and access it through the new cuda::annotated_ptr introduced in CUDA 11.5.
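Roughly what I have in mind, as a sketch (the kernel, buffer names, and table size are placeholders, and this assumes CUDA 11.5 and a cc 8.0+ GPU):

```cuda
#include <cuda/annotated_ptr>

// Sketch only: replace the __constant__ table with a global buffer and
// mark it as persisting, so repeated reads tend to stay resident in L2.
__global__ void apply_table(const float* table, float* out, int n)
{
    // Annotate the raw pointer with the persisting access property.
    cuda::annotated_ptr<const float, cuda::access_property::persisting>
        table_p{table};

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table_p[i % 1024] * 2.0f;  // repeated read-only accesses
}
```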
My questions:
Is the performance of an annotated_ptr with the persisting access property similar to the performance of constant memory?
Is constant memory resident in the cache from the moment cudaMemcpyToSymbol is called, whereas persisting global memory is only cached once it has actually been accessed at least once?
What is the effect of the streaming access property? From the description in the code documentation I would assume that this data will not be cached, or at least not be held in the cache for long. Is this correct?
Since I haven't used annotated_ptr so far, I'm also interested in your experience with this new feature. Next week I'm going to update my system to CUDA 11.5 and run a few tests myself.
I don't think this is documented anywhere, and it will certainly be application- and usage-dependent. For a relatively small carveout (e.g. 64 KB) on a cc 8.0 or 8.6 GPU, I suspect the performance would be similar. If instead you did, say, a 4 MB carveout, you might significantly reduce the benefit the L2 cache is bringing to the "rest" of your application. There is no way to discover this from a forum post; it will require a trial.
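For reference, the carveout itself is reserved explicitly through the runtime API; a sketch of the calls involved (the sizes, `dev_buf`, and `stream` are placeholders):

```cuda
// Sketch: set aside part of L2 for persisting accesses (cc 8.0+).
// The 4 MB figure is just an example size.
size_t carveout = 4 * 1024 * 1024;
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, carveout);

// Optionally bind an access-policy window to a stream so that a
// specific global-memory region is treated as persisting.
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = dev_buf;   // hypothetical device buffer
attr.accessPolicyWindow.num_bytes = carveout;
attr.accessPolicyWindow.hitRatio  = 1.0f;      // fraction of accesses treated as hits
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
```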
No. The cudaMemcpyToSymbol operation does not necessarily populate the cache in any defined way. The cache for __constant__ memory is a per-SM entity, and I don't think its size is published anywhere, but I used to use 8 KB per SM as a guesstimate/rule of thumb. So the full 64 KB is not all cached at the same time anyway. You should assume the usual cache behavior: caching based on usage/access from device code.
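To illustrate the usage pattern being compared against, a toy example (array size and kernel are made up):

```cuda
// Toy example: a 64 KB __constant__ table, populated from the host.
__constant__ float coeffs[16384];   // 16384 * 4 bytes = 64 KB

__global__ void scale(const float* in, float* out, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The constant cache is populated on access, and the broadcast path is
    // only efficient when all threads in a warp read the same address
    // (coeffs[k] here, with k uniform across the warp).
    if (i < n) out[i] = in[i] * coeffs[k];
}

// Host side, before launching the kernel:
// cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(coeffs));
```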
If you intend to use this mechanism, even through the “library interface”, you can probably learn more about it by reading the relevant section in the programming guide.
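As one concrete example of the lower-level alternative described in that section, the per-load cache-hint intrinsics look roughly like this (a sketch, not a recommendation for your case):

```cuda
// Sketch: per-access streaming hints via the __ldcs/__stcs intrinsics,
// which mark data as "evict first" so it does not linger in the cache.
__global__ void copy_streaming(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = __ldcs(&in[i]);   // load with streaming (evict-first) hint
        __stcs(&out[i], v);         // store with the same streaming hint
    }
}
```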
Thank you for the detailed reply and the links to the relevant documentation pages. I already expected that the answers to my questions could not be generalized and would need further tests and measurements. I didn't know about the cache load functions, and as you suggested, I will start with those.