Global memory access patterns - too slow

My application has to read chunks of consecutive data from various places inside an array.
When the chunks are 128B I see excellent performance.
When the chunks are 64B, performance degrades considerably. I did not expect this, since it is clearly stated that 32/64/128-byte memory transactions are supported.

I have tried all of these in different combinations to no avail:

  • each thread reads 4/8/16 bytes (consecutive threads read the 64B chunk)
  • load using default/LDG/No Cache modifiers

For example: with a 16B vector size, each thread reads 16 bytes, so 4 consecutive threads read one 64B chunk. Other groups of 4 threads read other chunks.
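
Roughly, the pattern looks like the sketch below (simplified; gather64B, chunk_idx and the array names are placeholders, not my real kernel):

    // Simplified sketch of the access pattern described above.
    // Each group of 4 consecutive threads reads one 64B chunk as 4 x 16B
    // vector loads; chunk_idx[] (a placeholder name) selects where each
    // chunk starts inside the source array.
    __global__ void gather64B(const float4 *__restrict__ src,
                              float4 *__restrict__ dst,
                              const int *__restrict__ chunk_idx,
                              int num_chunks)
    {
        int tid   = blockIdx.x * blockDim.x + threadIdx.x;
        int group = tid / 4;   // 4 threads cooperate on one 64B chunk
        int lane  = tid % 4;   // position of this thread's 16B within the chunk
        if (group >= num_chunks) return;

        // chunk_idx[group] is the chunk's starting offset in units of float4 (16B)
        const float4 *p = src + chunk_idx[group] + lane;

        // default load; __ldg(p) was also tried, with the same result
        dst[group * 4 + lane] = *p;
    }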

What is happening here?

What GPU is being used? Various recent GPU architectures feature an L2 cache with 128-byte cache lines, partitioned into four sectors of 32 bytes each. What you are observing may be the result of overfetch into L2: a request for one sector triggers the “prefetch” of additional sector(s). This can be helpful if accesses are mostly arranged in linear order, but can hurt performance if accesses are largely “random”, as data is loaded that is unlikely to be used.

I do not recall the per-architecture details; forum participants with a more in-depth knowledge of the memory hierarchy of various architectures should be able to supply the details or point to relevant documentation.

Examine the corresponding device limit: cudaLimitMaxL2FetchGranularity. For “random” access you would want to set this as small as possible, that is, 32 bytes. Note that the value supplied for this limit is treated as a hint: the hardware in question may only support a limited number of values, depending on GPU architecture.
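
A minimal sketch of querying and requesting it with the runtime API (error checking omitted; since the value is only a hint, read it back to see what was actually applied):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        size_t granularity = 0;

        // query the current (default) maximum L2 fetch granularity
        cudaDeviceGetLimit(&granularity, cudaLimitMaxL2FetchGranularity);
        printf("default L2 fetch granularity: %zu bytes\n", granularity);

        // request the smallest granularity; this is treated as a hint,
        // so read the limit back to see what the hardware accepted
        cudaDeviceSetLimit(cudaLimitMaxL2FetchGranularity, 32);
        cudaDeviceGetLimit(&granularity, cudaLimitMaxL2FetchGranularity);
        printf("L2 fetch granularity now:     %zu bytes\n", granularity);
        return 0;
    }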

Thanks!

My current development machine (Ampere laptop compute 8.6) has a default L2 granularity of 64B. I can also change it to 32B/128B using cudaDeviceSetLimit, but results are the same.
In any case, the default 64B should be perfect for my use case.
So it seems that the L2 granularity might not be the cause after all?!

Any ideas?

Use the CUDA profiler to point you at the bottleneck(s). One issue could be the access pattern. Generally speaking, it is difficult to analyze performance issues in code one has not seen.

From personal experience: you want large consecutive accesses to maximize memory throughput on the GPU. Strided access causes throughput to drop, with performance decreasing as the stride length increases until it levels off. “Random” access, e.g. controlled by an index array, may see even lower performance.
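
For illustration only (obviously not your code), a trivial copy kernel parameterized by stride shows the effect: at stride 1 adjacent threads touch adjacent elements and bandwidth is near peak, while larger strides spread each warp’s accesses over more 32B sectors and throughput drops:

    // Illustrative sketch: launch with enough threads to cover n/stride
    // elements, then time the kernel for different stride values.
    __global__ void strided_copy(const float *__restrict__ in,
                                 float *__restrict__ out,
                                 size_t n, size_t stride)
    {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i * stride < n) {
            out[i * stride] = in[i * stride];
        }
    }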

Even though the (32-byte) sectors of the L1 and L2 can be loaded independently (on a modern GPU, e.g. Pascal+), I don’t know if the cache design is such that every sector can map to an independent location in memory. It might be that with 64-byte group loads you still occupy a whole (128-byte) cache line from a tag-lookup perspective. This would effectively cut the available cache space in half in terms of cacheable bytes (unless you are actually loading adjacent groups). I don’t know that for certain.

I agree with njuffa, the profiler is a useful tool for perf analysis.

The profiler turned out to be a disappointment. It did not help to pin down the problems. The kernel is very highly optimized and uses almost all the available shared memory and registers. The time measured by the profiler is 50% higher than the actual time, and the metrics do not correctly show what is happening in a regular (non-profiled) run.
However, since I am only seeing a 15-20% performance degradation between 128B and 64B reads, the problem is probably not the L2 granularity or prefetch overhead. So that’s good to know.
I suspect the underlying problem is that 64B reads are marginally less efficient and that other parts of my code also suffer minor penalties due to the reduced size. That might be enough to explain the 15-20 percent.
Thank you for your help!

By default Nsight Compute sets the clocks to base clocks. I would recommend using an external tool to set the clocks to the boost clock when taking your measurements, and passing --clock-control none to Nsight Compute.

For GA10x

  • Each L2 slice can perform 1 tag lookup per cycle. This is used for loads and stores.
    • Stores may be impacting your load bandwidth.
  • Each L2 slice can return 1 x 32B sector per cycle.
  • Each L2 slice is organized as 128B cache lines comprised of 4x32B sectors. On a 64B cache miss the full line is allocated but only the missed sectors (or those promoted if you have cudaLimitMaxL2FetchGranularity > 32B) will be filled.
  • The SM L1TEX cache can return 1 x 128B per cycle.
    • If the 64B access is not 64B-aligned, there is additional performance loss, as you would be accessing 3 x 32B sectors, or 2 x 32B sectors in different cache lines (see the sketch after this list).
    • If each instruction does not load more than 64B, then you are not fully utilizing the L1TEX-to-RF (register file) bandwidth.
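
As a quick illustration of the alignment point, host-side arithmetic is enough to count how many 32B sectors a 64B access touches at a given starting byte offset (sectors_touched is just a helper made up for this post):

    #include <cstdio>

    // Number of 32B sectors touched by an access of 'size' bytes starting
    // at byte offset 'addr' (host-side arithmetic, for illustration only).
    static unsigned sectors_touched(unsigned long long addr, unsigned size)
    {
        unsigned long long first = addr / 32;
        unsigned long long last  = (addr + size - 1) / 32;
        return (unsigned)(last - first + 1);
    }

    int main()
    {
        printf("64B access at offset  0: %u sectors\n", sectors_touched(0, 64));   // 2 sectors, same 128B line
        printf("64B access at offset 16: %u sectors\n", sectors_touched(16, 64));  // 3 sectors
        printf("64B access at offset 96: %u sectors\n", sectors_touched(96, 64));  // 2 sectors, in two 128B lines
        return 0;
    }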

The GPU SOL breakdown for L1TEX and L2 should help identify issues.

The report and source code would likely be required to determine other optimization opportunities.