Finding more opportunities for accelerating SpMM using Sparse Tensor Cores

I am writing a kernel on an A100-PCIe-40GB that uses Sparse Tensor Cores to accelerate SpMM (the dense RHS matrix has to be loaded through an indices array, and I am using a special data format to store the matrix, so I can’t use cuSparseLt), and I am trying to compare its performance with cuBLAS and cuSparseLt.

At best I achieve half the performance of cuBLAS when the matrices get larger (A is 512x5120 and B is 5120x1024).

After profiling with NCU, my kernel shows several problems:

  1. Warp stalls can’t be fully eliminated, even when using async copy with a double/multi-stage shared memory buffer.
  2. The instruction count is much higher than cuBLAS, especially IMAD. I think this may be due to the indirect memory access on matrix B and the complex offset calculation for the storage format (a simplified sketch of what I mean is shown after this list).
  3. My kernel seems to put a heavy store-request load on the L2 cache, especially in the “L1/TEX Store” metric.
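To make point 2 concrete, here is a simplified sketch of the indirect access and offset arithmetic I mean. The names (col_idx, TILE_K, TILE_N) and layout are placeholders, not my actual kernel:

// Simplified sketch of the indirect gather on matrix B (placeholder names and
// layout, not the real kernel). Each non-zero column of the sparse A tile selects
// a row of dense B through an indices array, so every load needs an extra integer
// multiply-add for the address, which shows up as IMAD in SASS.
#include <cuda_fp16.h>

constexpr int TILE_K = 64;
constexpr int TILE_N = 64;

__global__ void gather_b_tile(const half* __restrict__ B,        // dense RHS, row-major [K x N]
                              const int*  __restrict__ col_idx,  // per-tile indices into the rows of B
                              half*       __restrict__ B_tile,   // gathered tile [TILE_K x TILE_N]
                              int ldb, int n_base)
{
    int k = blockIdx.y * blockDim.y + threadIdx.y;   // position inside the sparse tile
    int n = blockIdx.x * blockDim.x + threadIdx.x;   // column inside the tile
    if (k < TILE_K && n < TILE_N) {
        int src_row = col_idx[k];                                 // indirect lookup: one extra load
        B_tile[k * TILE_N + n] = B[src_row * ldb + n_base + n];   // src_row * ldb + ... -> IMAD
    }
}

Hoisting col_idx[k] and the src_row * ldb product out of the inner loops is the kind of change I am considering to reduce the IMAD count.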

How can I improve my kernel? I hope to get some advice, thanks a lot.

I am also wondering why the cuBLAS and cuSparseLt kernels have an L1 cache hit rate of 0%.

I don’t know that fully eliminating warp stalls is a possible or sensible objective. You would go after reducing warp stalls if the GPU is providing evidence of being latency bound, such as a low SM utilization report coupled with a low memory utilization report from nsight compute.

But your nsight compute report seems to indicate memory utilization of 60% or higher. That would be the area to focus on: improve memory access patterns and efficiency. If you are loading things multiple times, strive to reduce that to once only. If you are storing values multiple times, strive to reduce that to once only. If you have uncoalesced access patterns, seek to reduce that.
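As a generic illustration of that last point (not your code), a warp whose threads touch consecutive elements coalesces into a few 32-byte sectors per request, while a strided pattern touches roughly one sector per thread:

// Generic illustration of coalesced vs. uncoalesced global loads (not the kernel in question).
__global__ void coalesced_copy(const float* __restrict__ in, float* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                       // adjacent threads read adjacent floats: few sectors per warp
}

__global__ void strided_copy(const float* __restrict__ in, float* __restrict__ out, int n, int stride)
{
    // assumes 'in' holds at least n * stride elements
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(long long)i * stride];   // each thread lands in a different sector: many sectors per warp
}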

That (an L1 hit rate of 0%) would normally mean that the data is only being loaded once.

Hello, first of all, thank you for your reply! I still have some questions to ask you!
Memory utilization is around 60%; should I try to lower it? By the way, I’m still confused about memory utilization, the L2 cache, and pipe utilization. I’ve read the document on Pipe Utilization, but it’s still not very clear.
The first point is about my memory access pattern. I use the LDGSTS.E.BYPASS.128 instruction to load both arrays A and B. But for other auxiliary data, such as the Sparse Tensor Core metadata, the total amount needed per tile may be only 32 or 64 ints, so each thread issues an int- or int2-sized asynchronous copy, e.g. __pipeline_memcpy_async(dst, src, sizeof(int)). The src addresses I access are contiguous in memory, so the accesses should not be uncoalesced. Will this method lead to inefficient memory access?
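For concreteness, the metadata staging I am describing looks roughly like this (a simplified sketch with placeholder names and sizes, not my exact code):

#include <cuda_pipeline.h>

// Simplified sketch of how I stage the per-tile Sparse Tensor Core metadata.
// META_INTS and the pointer names are placeholders. Only the first few threads
// issue 4-byte async copies (LDGSTS.E, which goes through L1); A and B use
// 16-byte copies that bypass L1 (LDGSTS.E.BYPASS.128), e.g.
//   __pipeline_memcpy_async(&s_tileA[off], &g_A[off], sizeof(float4));
constexpr int META_INTS = 64;   // 32 or 64 ints of metadata per tile

__device__ void load_tile_metadata(const int* __restrict__ g_meta, int* s_meta)
{
    int t = threadIdx.x;
    if (t < META_INTS) {
        // consecutive threads read consecutive ints, so the accesses stay coalesced
        __pipeline_memcpy_async(&s_meta[t], &g_meta[t], sizeof(int));
    }
    __pipeline_commit();
    // __pipeline_wait_prior(...) happens later, together with the A/B tile copies
}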
The second point is about the “L1 cache hit rate = 0%” in cuBLAS and cuSparseLt. My understanding is that their memory accesses are all 128-bit and contiguous, so L1 is bypassed when loading data, and the same data is not loaded again in subsequent calculations, hence the 0% L1 cache hit rate.
In my kernel, all data is also loaded only once. However, some memory accesses are not 128-bit (such as __pipeline_memcpy_async(dst, src, sizeof(int))). This access mode is translated into LDGSTS.E [dst], [src] in SASS, which caches the data in L1.
According to the work division disclosed by CUTLASS, each thread block is responsible for one tile of the result matrix C, so many thread blocks will access the same metadata, which has already been cached in L1. As a result, my L1 cache hit rate is not 0.
Is my understanding correct?
The third point: there are in fact almost no stores to global memory in my kernel. Only after each thread block has computed its part of the result are the values written back to global memory, so my store operations should be few.

I will list the metrics that differ greatly from cuSparseLt/cuBLAS, explain my understanding and how I might improve them, and I hope you can help clear up my misunderstandings. By the way, I’m not sure whether such a detailed comparison is necessary, or how much it will help performance. But all in all, thank you very much for helping me.

The first is the Shared Memory part.

  • There are a lot of “shared memory load” instructions. Does this refer to the number of instructions that load data from shared memory into registers? This may be because a lot of data is brought into shared memory via LDGSTS and then loaded from shared memory into registers before use. Would using __ldg to load data directly from global memory into shared memory be an improvement?
  • At the same time, my “Shared Load Matrix” count is relatively high. This is because matrix B in my kernel cannot be reused in registers. In cuBLAS and cuSparseLt, when iterating over consecutive fragments of matrix A, matrix B does not need to be reloaded from shared memory, but in my kernel it must be reloaded (see the register-blocking sketch after this list). I don’t think this can be improved.
  • There are also a lot of LDGSTS instructions. This may be because each thread has to load part of the auxiliary data. But compared with cuSparseLt and cuBLAS, the tile my thread block is responsible for is relatively small, so I don’t understand why there are so many LDGSTS instructions.
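As mentioned above, this is roughly the register-blocking pattern I believe cuBLAS/cuSparseLt use and that my kernel cannot apply to B (a plain-FMA sketch, not actual library code):

// Plain-FMA sketch of register blocking (not actual cuBLAS/cuSparseLt code): a
// fragment of B is loaded from shared memory once per k-step and then reused
// against several fragments of A, so the shared-memory load count for B stays low.
constexpr int M_FRAG = 4;
constexpr int N_FRAG = 4;

__device__ void mma_tile(const float* __restrict__ sA,   // shared-memory tile of A
                         const float* __restrict__ sB,   // shared-memory tile of B
                         float acc[M_FRAG][N_FRAG],
                         int lda, int ldb, int k_tile)
{
    for (int k = 0; k < k_tile; ++k) {
        float b_reg[N_FRAG];
        for (int n = 0; n < N_FRAG; ++n)
            b_reg[n] = sB[k * ldb + n];          // B read from shared memory once per k
        for (int m = 0; m < M_FRAG; ++m) {
            float a_reg = sA[m * lda + k];       // A fragment
            for (int n = 0; n < N_FRAG; ++n)
                acc[m][n] += a_reg * b_reg[n];   // b_reg reused for every m
        }
    }
}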

Next is the L1/TEX cache part.

  • “Local Load (Loads from the local memory space in device memory)”: both cuBLAS and cuSparseLt are zero on this line. What does this metric mean, and under what conditions does it occur? I didn’t see any relevant explanation in the Nsight Compute documentation.
  • The two metrics “Global Load To Shared Store (access/bypass)” count LDGSTS copies from global memory to shared memory. The tile size in cuBLAS/cuSparseLt is 128x64x64, with 2x2x1 warps, each warp responsible for a 64x64 piece of matrix A and a 64x32 piece of matrix B. The tile size in my kernel is 32x64x64, so the number of LDGSTS instructions should not be higher than in cuBLAS and cuSparseLt. I don’t understand why it is.
  • “Global Store/Local Store”: what does this metric mean, and under what conditions does it occur?

L2 Cache part

  • “L1/TEX Load”: the large number of requests here may be caused by L1 cache misses, so fixing the L1 misses might fix this as well.
  • “L1/TEX Store” (the number of LTS requests from unit TEX for writes): are the requests counted here generated by the “Global Store/Local Store” traffic from L1?

Finally, there is the Global Memory part:

  • Store: the number of sectors here far exceeds cuBLAS (12) and cuSparseLt (0). I only write the results to global memory after the calculation is finished. However, I do not stage the registers in shared memory first and then write to global memory in larger transactions; instead, I store the results from registers to global memory directly and somewhat scattered. Is that why there are so many sectors? I mainly assumed that writing out the results would not be the bottleneck, so I have not optimized this part (a sketch of the staged alternative follows this list).
  • At the same time, I am curious why the number of sectors with which cuSparseLt writes its results to global memory is 0.
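The staged alternative I mentioned (but have not implemented) would look roughly like this; the tile sizes and the fragment-to-shared-memory mapping are placeholders:

// Sketch of a staged write-back (placeholder sizes, not my current code): the
// accumulators are first written to a shared-memory tile, then streamed out with
// contiguous 16-byte stores so that each 32-byte sector written toward L2 is full.
constexpr int TILE_M = 32;
constexpr int TILE_N = 64;

__device__ void store_tile_staged(const float* __restrict__ s_tile,  // TILE_M x TILE_N results already in shared memory,
                                                                      // assumed 16-byte aligned (e.g. declared __align__(16))
                                  float* __restrict__ C, int ldc,
                                  int tile_row, int tile_col)
{
    __syncthreads();                                  // all fragments must be in s_tile first
    int elems = TILE_M * TILE_N;
    for (int i = threadIdx.x * 4; i < elems; i += blockDim.x * 4) {
        int r = i / TILE_N;
        int c = i - r * TILE_N;
        float4 v = *reinterpret_cast<const float4*>(&s_tile[i]);
        // assumes ldc and tile_col keep these addresses 16-byte aligned
        *reinterpret_cast<float4*>(&C[(tile_row + r) * ldc + tile_col + c]) = v;
    }
}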

Thank you for your patience in reading this. I would be very grateful if you could answer my related questions.

A pipe is a functional unit in the SM. Pipes are listed in the SOL section of an Nsight Compute report. Utilization is pretty much what it sounds like: to what extent you are making use of that functional unit, expressed as a percentage of the maximum possible.

Not that I know of. It should be possible to arrange an async memcpy such that it makes efficient memory access. I don’t see anything so far in your description that would lead me to think otherwise.

There isn’t any particular connection between load size per thread, whether or not data will be accessed again, and a 0% L1 cache hit rate. 128-bit loads are not required to witness anything like that.

I don’t know whether the L1 cache hit rate in a particular section of CUTLASS code is zero or non-zero. If the data is only loaded once, it will probably be zero.

Yes, those should correspond to LDS instructions in SASS, which move data from shared memory to registers.

It’s not possible to use __ldg to move data directly from global to shared. A trip through a register would be needed.
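For illustration (placeholder names), these are the two paths: an __ldg load goes global -> register -> shared, whereas the async copy goes global -> shared directly:

#include <cuda_pipeline.h>

// Illustration only (placeholder names). s_buf must point to shared memory.
__device__ void stage_via_register(float* s_buf, const float* __restrict__ g, int i)
{
    float tmp = __ldg(&g[i]);   // global -> register (LDG)
    s_buf[i] = tmp;             // register -> shared (STS)
}

__device__ void stage_async(float* s_buf, const float* __restrict__ g, int i)
{
    // global -> shared with no register in between; compiles to LDGSTS on Ampere
    __pipeline_memcpy_async(&s_buf[i], &g[i], sizeof(float));
    __pipeline_commit();
    __pipeline_wait_prior(0);   // wait before s_buf[i] is used
}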

The relevant instruction is LDL. It may also occur on a generic load (LD) if the specified address is in the logical local space. I’d prefer not to give a full description of the logical local space here; it is covered in the PTX guide as well as in various forum threads. Roughly speaking, local (or "automatic") variables will get placed in the logical local space. For example:

int a = 5;

a is in the logical local space. Local means “thread local” i.e. private to a thread. A local load would be pulling data from the logical local space into a register.

The logical global space is also covered in the PTX guide and various forum questions. Pointers to data that you pass via kernel arguments are one example of data that lives in the logical global space. A global store is STG. A local store is STL. A generic store is ST, which could be storing to either the logical global or the logical local space, depending on the address.
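A small illustration (not related to your kernel) of code that touches both spaces:

// Illustration only: 'out' comes in as a kernel argument, so it points into the
// logical global space (stores to it are STG). 'scratch' is thread-private; because
// it is indexed with a value not known at compile time, the compiler will likely
// place it in the logical local space, producing STL/LDL instead of register accesses.
__global__ void spaces_example(int* out, int n)
{
    int scratch[32];
    for (int i = 0; i < 32; ++i)
        scratch[i] = i * i;                 // may become STL (local store)
    int idx = (n + threadIdx.x) & 31;       // runtime-dependent index
    out[threadIdx.x] = scratch[idx];        // LDL (local load) then STG (global store)
}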

This is probably at the edge of my knowledge, currently. Such things are often documented, and you may be able to get context help in Nsight Compute by hovering your mouse over something you need a definition of. Otherwise, for in-depth questions about interpreting Nsight Compute, I usually refer folks to the Nsight Compute forum. You might get additional help there.

I’m generally unable to answer questions like this without access to the code as well as access to nsight compute directly, interactively. My ability to help in a static fashion from pictures is fairly limited. Sometimes you can get additional information about data in a particular nsight compute report by studying any info or warn messages at the bottom of that report section, as well as studying the source page carefully. To get to the source page, you need to change a setting in the upper left corner of the report tab/page. This 3-part tutorial explains (in part 2) how to get there, and what you may find there.