I know that direct data movement from global to shared memory has been supported since sm_80. I wonder: is there a copy from shared memory to global memory that does not go through registers?
Yes, but as far as I know it is supported only from the Hopper architecture (sm_90) onwards. The new Tensor Memory Accelerator (TMA) feature includes the SASS instruction UTMASTG, which allows copying data from shared memory to global memory directly.
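For completeness, a minimal sketch of how this could be reached from CUDA C++ (untested; assumes sm_90 and PTX ISA 8.0+; the function name and parameters are mine). The cp.async.bulk PTX instruction should compile down to the UTMASTG SASS instruction:

```cpp
#include <cstdint>

// Bulk asynchronous copy shared -> global without the data passing through
// registers. Both addresses must be 16-byte aligned, `bytes` a multiple of
// 16, and the copy is typically issued by a single thread of the block.
__device__ void smem_to_gmem_bulk(void* gmem_dst, void* smem_src, uint32_t bytes)
{
    // Convert the generic shared-memory pointer into a 32-bit shared address.
    uint32_t smem_addr =
        static_cast<uint32_t>(__cvta_generic_to_shared(smem_src));

    asm volatile(
        "cp.async.bulk.global.shared::cta.bulk_group [%0], [%1], %2;"
        :: "l"(gmem_dst), "r"(smem_addr), "r"(bytes) : "memory");

    // Commit the bulk group and wait until the copy has completed.
    asm volatile("cp.async.bulk.commit_group;" ::: "memory");
    asm volatile("cp.async.bulk.wait_group 0;" ::: "memory");
}
```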
OK, got it. Thank you!
Is copying from shared memory that expensive register-wise? If it needs 3 registers for indexing, you could just store those 3 registers in shared memory beforehand and reload them after the copy from shared to global memory.
Or better: store the 3 registers in local memory (or let the assembler handle it) and copy from shared memory to global memory in streaming mode, where the L1 cache is circumvented, so that reloading from local memory is still served from L1? See the sketch below.
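For the streaming store, something like the __stcs() intrinsic (the .cs "streaming, likely accessed once" cache hint) could be used; a rough sketch with my own helper name and parameters:

```cpp
// Copy n floats from shared to global with streaming stores (st.global.cs),
// hinting that the written data should not displace other lines in L1.
// Assumes the whole block participates.
__device__ void smem_to_gmem_streaming(float* __restrict__ gmem,
                                       const float* __restrict__ smem,
                                       int n)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        __stcs(gmem + i, smem[i]);  // evict-first / streaming store
}
```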
An alternative would be to dedicate a warp within each block just for the copies and synchronize the other warps with it. You can devise the barriers in a way that all warps except the dedicated one just keep running (a finer barrier than a block-wide __syncthreads()); see the sketch below.
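One possible way to build such a finer barrier are PTX named barriers; a hedged sketch (untested; barrier 0 backs __syncthreads(), so barrier 1 is used here, and the helper names are mine):

```cpp
// The dedicated copy warp only *arrives* at the named barrier and keeps
// running; the compute warps *sync*, i.e. wait until all `count` expected
// threads (including the copy warp) have arrived. `count` must be a
// multiple of the warp size.
__device__ void copy_warp_arrive(unsigned count)
{
    asm volatile("bar.arrive 1, %0;" :: "r"(count) : "memory");
}

__device__ void compute_warps_wait(unsigned count)
{
    asm volatile("bar.sync 1, %0;" :: "r"(count) : "memory");
}
```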
Since Hopper (or rather only on Hopper, because it requires the sm_90a target) it is additionally possible, with setmaxnreg, to reduce the registers used by this dedicated warp and to increase the registers of the other warps in the block, so that the overall registers are not wasted, as copying memory in a dedicated fashion probably does not need many registers.
(This register transfer probably has to be made at a granularity of 4 warps, one per SMSP.)
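A hedged sketch of how this could look (sm_90a only; the register counts 40 and 232 are purely illustrative, they must be multiples of 8 in [24, 256]):

```cpp
// setmaxnreg acts warpgroup-wide (4 warps, one per SMSP), matching the
// granularity guessed at above; all threads of the warpgroup must execute it.
__device__ void copy_warps_release_regs()
{
    asm volatile("setmaxnreg.dec.sync.aligned.u32 40;");   // shrink to 40 regs
}

__device__ void compute_warps_claim_regs()
{
    asm volatile("setmaxnreg.inc.sync.aligned.u32 232;");  // grow to 232 regs
}
```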
Thank you for the explanation. I'm learning from the implementation of Flash Attention. I found that the final result is generated by the tensor cores, so the result is written into shared memory, then read back into registers and written to global memory in a vectorized way. So I asked this question to see if there is a direct way from shared to global memory.
The tensor cores always output their results in registers. The write to shared memory is a manual one. It is possible that you use the shared memory to reorder the data. Sometimes the shuffle instructions can be used for that, to 'circumvent' the shared memory. (Shuffle also uses some infrastructure of the shared memory, but the load would be halved.) Sometimes it is possible to write to global memory without having to reorder the data at all.
Do you have any blogs describing this?
To my knowledge, the shuffle instruction allows threads to access the registers of other threads directly.
Does it still use shared memory?
Thank you.
I'll try to quickly write some things together. Sorry if it is too confusing. :-)
Feel free to ask.
Within one warp of 32 threads, the shuffle instruction allows each thread to send one value and each thread to receive any one of the 32 values, including its own.
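A minimal example of those semantics: every lane offers its own value and receives the value of the source lane it names, here a rotation by one lane within the warp in a single instruction:

```cpp
// Each lane sends v and receives the v of lane (lane_id + 1) % 32.
__device__ float rotate_warp_values(float v)
{
    unsigned lane = threadIdx.x % 32;
    return __shfl_sync(0xffffffffu, v, (lane + 1) % 32);
}
```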
As can be tested with Nsight Compute, it uses some resource of the shared memory. My natural guess would be that it shares the data ports or the data redistribution network between the threads. (It does not use or occupy the memory itself.)
Compared to using shared memory, which you would first write into and then read from, you can reorder data with a single shuffle command. On the other hand, if more than 128 bytes are involved in the reordering, you have to do some additional reordering within the threads. That is possible and fast for up to 16 or 32 bytes per thread, but can get extensive above that, as you have to program out all the combinations.
The sectors of the cache are 32 bytes in size (each tag, however, covers a full 128-byte cache line).
For example, say you want to write 4 × 32 bytes to global memory, i.e. you group the 32 threads (lanes) of your warp into 4 groups of 8 threads. Each thread of a group will write 4 bytes of its group's 32 bytes.
Assume each calculation produces 4 bytes per thread, but each of the 32 results should go to a different memory region. The next calculation creates the next 4 bytes for each region, and so on.
That means we have to combine 8 calculations for one full write to one 32-byte sector.
So we do 8 matrix calculations, and then the groups of 8 threads have to reshuffle the data among themselves so that each group can do its 8 memory transactions.
That is a lot of shuffling, so it is better to use 64-bit or 128-bit memory transactions with groups of 4 or 2 threads.
2 threads:
Do 8 calculations each.
Shuffle, so that the first thread sends its results 5-8 over and the second thread sends its results 1-4 back (within the same shuffle command).
Write the computation results of the first thread (1-4 written by thread 1; 5-8, which were sent over, written by thread 2).
Write the computation results of the second thread (1-4, which were sent back, written by thread 1; 5-8 written by thread 2).
A rough sketch of these steps follows below.
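A hedged sketch of the steps above (untested; the names paired_store, regionA and regionB are mine, not from any library): lanes are paired via XOR with 1, regionA receives the 8 results of the even lane, regionB those of the odd lane, each region as two 128-bit stores.

```cpp
__device__ void paired_store(const float r[8], float4* regionA, float4* regionB)
{
    const unsigned FULL = 0xffffffffu;
    bool even = (threadIdx.x % 2u) == 0u;

    // One logical shuffle (four 32-bit shuffles): the even lane sends its
    // results 5-8 over, the odd lane sends its results 1-4 back.
    float recv[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        recv[i] = __shfl_xor_sync(FULL, even ? r[4 + i] : r[i], 1);

    if (even) {
        // Write 1: own results 1-4 into the first half of region A.
        regionA[0] = make_float4(r[0], r[1], r[2], r[3]);
        // Write 2: the odd lane's 1-4 (sent back) into the first half of B.
        regionB[0] = make_float4(recv[0], recv[1], recv[2], recv[3]);
    } else {
        // Write 1: the even lane's 5-8 (sent over) into the second half of A.
        regionA[1] = make_float4(recv[0], recv[1], recv[2], recv[3]);
        // Write 2: own results 5-8 into the second half of region B.
        regionB[1] = make_float4(r[4], r[5], r[6], r[7]);
    }
}
```

Each float4 store is one 128-bit transaction, so a pair of lanes covers its two 32-byte regions with four stores in total.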
Perhaps with newer architectures you can even do the following (it needs less shuffling of data around within the threads, but writes in non-consecutive order):
Do 8 calculations each.
Shuffle, so that the first thread sends its results 5-8 over and the second thread sends its results 5-8 back.
Write the computation results of the first thread (1-4 written by thread 1; 5-8, which were sent over, written by thread 2).
Write the computation results of the second thread (5-8, which were sent over, written by thread 1; 1-4 written by thread 2).
A sketch of this variant follows below.
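And a sketch of this second variant under the same assumptions as before: both lanes of a pair shuffle their results 5-8, so nothing has to be rearranged within a thread, at the price of the non-consecutive write order mentioned above.

```cpp
__device__ void paired_store_v2(const float r[8], float4* regionA, float4* regionB)
{
    const unsigned FULL = 0xffffffffu;
    bool even = (threadIdx.x % 2u) == 0u;

    // Both lanes send their results 5-8 and receive the partner's 5-8.
    float recv[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        recv[i] = __shfl_xor_sync(FULL, r[4 + i], 1);

    float4 own = make_float4(r[0], r[1], r[2], r[3]);              // own 1-4
    float4 got = make_float4(recv[0], recv[1], recv[2], recv[3]);  // partner's 5-8

    if (even) {
        regionA[0] = own;  // even lane's results 1-4
        regionB[1] = got;  // odd lane's results 5-8 (sent over)
    } else {
        regionB[0] = own;  // odd lane's results 1-4
        regionA[1] = got;  // even lane's results 5-8 (sent over)
    }
}
```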
Each thread writes 4 × 32-bit values, which corresponds to 128-bit transfers (vector types, e.g. int4 or float4).
Instead of writing and reading each data packet through shared memory (every byte would pass the ports twice), we shuffled only half of the data once and kept the other half within the threads that do the writing. Two full passes versus half a pass: we reduced the load on the shared memory infrastructure by a factor of 4x!
That is only useful in practice if you are actually limited here: either because you use the shared memory extensively for other purposes, or because you write (or read) a lot of data that way (with shuffling) and it mostly stays in L1 or L2 (global device memory is slow enough that you would be limited by its speed rather than by the shared memory bandwidth).
The shared memory is shared by all SMSPs (SM sub-partitions); if all warps use it extensively (e.g. all your warps and kernels running the same function at the same time), it can be the limiting factor.