Hi CUDA developers,
in my recent project I’m using constant memory which is accessed multiple times during one kernel launch. For my application I measured a huge performance gain and much higher cache hit-rates by using constant and texture memory for all read-only input data.
Unfortunately, the 64kB limit on constant memory is a serious obstacle for the next expansions of this application, so I’m thinking about alternatives. My idea is to replace all constant memory with regular global memory and access it through the new annotated_ptr introduced with CUDA 11.5.
Is the performance of an annotated_ptr with the persisting access property similar to the performance of constant memory?
Is constant memory always stored in the cache from the moment cudaMemcpyToSymbol is called, whilst persisting global memory is only cached once it has been accessed at least once?
What is the effect of the streaming access property? From the description in the code documentation I would assume that this data will not be cached, or at least not be held in the cache for long. Is this correct?
Since I haven’t used annotated_ptr so far, I’m also interested in your experience with this new feature. Next week I’m going to update my system to CUDA 11.5 to run a few tests myself.
Before going down the path you suggest, if it were me, I would first try to see if I can get most of the benefit by decorating the pointer with const __restrict__ and/or taking advantage of the read-only cache load functions that are described in the CUDA programming guide. The read-only cache is a separate entity in the GPU memory hierarchy, i.e. distinct from the L2.
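As a rough illustration of that suggestion, here is a minimal sketch of a kernel that reads a lookup table through const __restrict__ pointers and the __ldg() read-only load intrinsic. The kernel name, the helper run_scale_example, and the table contents are made up for the example; only __ldg(), __restrict__, and the runtime calls come from CUDA itself. It assumes a device of compute capability 3.5 or newer.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Illustrative kernel (hypothetical names): multiplies each input element by
// a coefficient taken from a read-only lookup table in global memory.
__global__ void scale_kernel(const float* __restrict__ table,
                             const float* __restrict__ in,
                             float* __restrict__ out,
                             int n, int table_size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // const + __restrict__ tells the compiler the data is read-only and
        // not aliased; __ldg() asks for the read-only cache path explicitly.
        float c = __ldg(&table[i % table_size]);
        out[i] = c * in[i];
    }
}

// Runs the kernel and returns the number of mismatches against a CPU
// reference, or -1 if a CUDA runtime call fails.
int run_scale_example(int n, int table_size)
{
    std::vector<float> h_table(table_size), h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < table_size; ++i) h_table[i] = float(i % 7 + 1);
    for (int i = 0; i < n; ++i) h_in[i] = float(i % 13);

    float *d_table = nullptr, *d_in = nullptr, *d_out = nullptr;
    bool ok =
        cudaMalloc(&d_table, table_size * sizeof(float)) == cudaSuccess &&
        cudaMalloc(&d_in, n * sizeof(float)) == cudaSuccess &&
        cudaMalloc(&d_out, n * sizeof(float)) == cudaSuccess &&
        cudaMemcpy(d_table, h_table.data(), table_size * sizeof(float),
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_in, h_in.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int block = 256;
        int grid = (n + block - 1) / block;
        scale_kernel<<<grid, block>>>(d_table, d_in, d_out, n, table_size);
        ok = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_table);
    cudaFree(d_in);
    cudaFree(d_out);
    if (!ok) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_table[i % table_size] * h_in[i]) ++mismatches;
    return mismatches;
}

int main()
{
    std::printf("mismatches: %d\n", run_scale_example(1 << 16, 256));
    return 0;
}
```

Whether this recovers most of the constant/texture benefit depends on the access pattern, so it is worth profiling both variants before changing the memory layout.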
On whether a persisting annotated_ptr performs like constant memory: I don’t think this is documented anywhere, and it will certainly be application and usage dependent. For a relatively small carveout (e.g. 64KB) on a cc8.0 or 8.6 GPU, I suspect that the performance would be similar. If instead you did, say, a 4MB carveout, you might significantly reduce the benefit the L2 cache is bringing to the “rest” of your application. There is no way to discover this from a forum post; it will require a trial.
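For such a trial, a sketch along the following lines could serve as a starting point. It sets aside a small L2 carveout (64KB here, mirroring the figure mentioned above) with cudaDeviceSetLimit and reads the table through a cuda::annotated_ptr with the persisting property. The kernel and helper names are hypothetical, and it assumes CUDA 11.5 or newer and a cc 8.0+ device (compiled with something like nvcc -arch=sm_80); this is not a statement of how the property behaves on any particular GPU.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cuda/annotated_ptr>

// Pointer type whose accesses are hinted as "persisting" in L2.
using persisting_table =
    cuda::annotated_ptr<const float, cuda::access_property::persisting>;

// Illustrative kernel (hypothetical name): same computation as a plain
// table lookup, but the table is read through the annotated pointer.
__global__ void scale_persisting(persisting_table table, const float* in,
                                 float* out, int n, int table_size)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = table[i % table_size] * in[i];
}

// Returns the number of mismatches against a CPU reference, or -1 on error.
int run_persisting_example(int n, int table_size)
{
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;

    // Reserve a small part of L2 for persisting accesses, capped at what the
    // device supports (0 on devices without this feature).
    size_t carveout =
        std::min<size_t>(64 * 1024, prop.persistingL2CacheMaxSize);
    if (carveout > 0)
        cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, carveout);

    std::vector<float> h_table(table_size), h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < table_size; ++i) h_table[i] = float(i % 7 + 1);
    for (int i = 0; i < n; ++i) h_in[i] = float(i % 13);

    float *d_table = nullptr, *d_in = nullptr, *d_out = nullptr;
    bool ok =
        cudaMalloc(&d_table, table_size * sizeof(float)) == cudaSuccess &&
        cudaMalloc(&d_in, n * sizeof(float)) == cudaSuccess &&
        cudaMalloc(&d_out, n * sizeof(float)) == cudaSuccess &&
        cudaMemcpy(d_table, h_table.data(), table_size * sizeof(float),
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_in, h_in.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        persisting_table table(d_table);  // wrap the raw device pointer
        int block = 256;
        int grid = (n + block - 1) / block;
        scale_persisting<<<grid, block>>>(table, d_in, d_out, n, table_size);
        ok = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_table);
    cudaFree(d_in);
    cudaFree(d_out);
    if (!ok) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_table[i % table_size] * h_in[i]) ++mismatches;
    return mismatches;
}

int main()
{
    std::printf("mismatches: %d\n", run_persisting_example(1 << 16, 256));
    return 0;
}
```

Timing this against the __constant__ version with different carveout sizes is the kind of trial meant above; the sketch only checks correctness, not performance.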
The cudaMemcpyToSymbol operation does not necessarily populate the cache in any defined way. The cache for __constant__ memory is a per-SM entity, and I don’t think its size is published anywhere, but I used to use 8KB per SM as a guesstimate/rule of thumb. So the full 64KB is not cached at the same time anyway. You should assume the usual cache behavior: caching based on usage/access from device code.
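For reference, the constant-memory baseline being discussed looks roughly like the sketch below. The names are hypothetical, and the table is sized at 2048 floats (8KB) purely to match the rule-of-thumb working set mentioned above. The cudaMemcpyToSymbol call only copies the data to the device; the constant cache is then filled as the kernel reads it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kTableSize = 2048;          // 2048 floats = 8KB (illustrative)
__constant__ float c_table[kTableSize];   // lives in the 64KB constant space

// Illustrative kernel (hypothetical name). All threads of a block read the
// same table entry, which is the broadcast pattern constant memory favors.
__global__ void scale_constant(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = c_table[blockIdx.x % kTableSize] * in[i];
}

// Returns the number of mismatches against a CPU reference, or -1 on error.
int run_constant_example(int n)
{
    const int block = 256;
    std::vector<float> h_table(kTableSize), h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < kTableSize; ++i) h_table[i] = float(i % 7 + 1);
    for (int i = 0; i < n; ++i) h_in[i] = float(i % 13);

    float *d_in = nullptr, *d_out = nullptr;
    bool ok =
        // Copies the table to device constant memory; no cache warm-up implied.
        cudaMemcpyToSymbol(c_table, h_table.data(),
                           kTableSize * sizeof(float)) == cudaSuccess &&
        cudaMalloc(&d_in, n * sizeof(float)) == cudaSuccess &&
        cudaMalloc(&d_out, n * sizeof(float)) == cudaSuccess &&
        cudaMemcpy(d_in, h_in.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int grid = (n + block - 1) / block;
        scale_constant<<<grid, block>>>(d_in, d_out, n);
        ok = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    if (!ok) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_table[(i / block) % kTableSize] * h_in[i])
            ++mismatches;
    return mismatches;
}

int main()
{
    std::printf("mismatches: %d\n", run_constant_example(1 << 16));
    return 0;
}
```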
As for the streaming access property, I believe the expectation is that data accessed this way will be preferentially evicted.
If you intend to use this mechanism, even through the “library interface”, you can probably learn more about it by reading the relevant section in the programming guide.
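The streaming property asked about earlier can be tried through the same library interface. The sketch below is only an assumption-laden starting point: the kernel and helper names are hypothetical, it assumes CUDA 11.5 or newer and a cc 8.0+ device, and it demonstrates the syntax rather than any guaranteed eviction behavior.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cuda/annotated_ptr>

// Pointer type whose accesses are hinted as "streaming", i.e. data that is
// not expected to be reused and may be evicted from L2 first.
using streaming_input =
    cuda::annotated_ptr<const float, cuda::access_property::streaming>;

// Illustrative kernel (hypothetical name): each input element is read once.
__global__ void double_streaming(streaming_input in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Returns the number of mismatches against a CPU reference, or -1 on error.
int run_streaming_example(int n)
{
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = float(i % 13 + 1);

    float *d_in = nullptr, *d_out = nullptr;
    bool ok =
        cudaMalloc(&d_in, n * sizeof(float)) == cudaSuccess &&
        cudaMalloc(&d_out, n * sizeof(float)) == cudaSuccess &&
        cudaMemcpy(d_in, h_in.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        streaming_input in(d_in);  // wrap the raw device pointer
        int block = 256;
        int grid = (n + block - 1) / block;
        double_streaming<<<grid, block>>>(in, d_out, n);
        ok = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    if (!ok) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * h_in[i]) ++mismatches;
    return mismatches;
}

int main()
{
    std::printf("mismatches: %d\n", run_streaming_example(1 << 16));
    return 0;
}
```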
For all Compute Capabilities (except 6.0 @ 4kB), 8kB is correct. See Table 15, “Cache working set per SM for constant memory”, in the Programming Guide (CUDA Toolkit Documentation).
Thank you for the detailed reply and the links to the relevant documentation pages. I already expected that the answers to my questions could not be generalized and would need further tests and measurements. I didn’t know the cache load functions, and as you suggested, I will start with those.