I have some code that I want to make sure is protected from race conditions. Each warp functions independently, and could be thought of as its own thread block if only I could launch enough such blocks per SM. I bundle 6-11 warps into a thread block and assign each of them a unique index into various __shared__ memory arrays, which they access by that warp index. In short, the procedure is:
Phase I:
Read some information from __global__ arrays and compute some numbers
Have lane 0 write to the warp’s exclusive index of one of the __shared__ memory arrays
__syncwarp(); to ensure that any future accesses to the __shared__ memory written by lane 0 will be protected from race conditions
for i = 0 to [ number of iterations indicated by memory written by lane 0 ]
… More computations
… Several __shfl_sync() operations involving the entire warp (0xffffffff in the mask)
endfor
Phase II:
Read more information from __global__ arrays and compute more numbers
Have lane 0 again write to the warp’s exclusive index of the same __shared__ memory array as above
__syncwarp(); again, for the same reasons as in Phase I
for i = 0 to [ number of iterations indicated by memory written by lane 0 ]
… More computations
… Several more __shfl_sync() operations involving the entire warp
endfor
There is also a Phase III, but in terms of the pseudocode above it is actually the same as Phases I and II. Am I safe from race conditions if I do not put additional __syncthreads() operations between Phases I and II, or II and III? I think that I should be, as the __shfl_sync() operations will ensure that any reading of the __shared__ memory is done before those warp-wide synchronizations, and all threads will have to be in step by that point, having finished any accesses to that memory, before lane 0 can again write to it with new information.
However, I know that __shfl_sync() does not provide the same level of thread synchronization: it’s not as if that intrinsic has a __syncwarp() stuck at the end of it. Specifically, writes to __shared__ and __global__ memory are not guaranteed to be complete by the time a warp is allowed to continue past a __shfl_sync() operation, whereas with __syncthreads() they are. But I think that’s the only significant distinction, and so it should be good enough to protect the reads from __shared__ memory with one __syncwarp() immediately after writing the information in question.
In your pseudocode you left out the accesses to shared memory, except for having lane 0 write some values, which are not shown to be read back at any time.
If your warps access separate regions of shared memory (indexed by their warp index), they need no synchronization between warps, so __syncthreads() is never needed; __syncwarp() is enough. It is cheap. Use it wherever writes have to become visible (between writes and reads) and wherever memory locations get overwritten again (between reads and writes, i.e. after the old value, if any, has been read by every interested thread and before a new value is written).
If you read only data from the previous phase and write to different memory locations for the next phase (or switch back and forth between two buffers), then the single __syncwarp() is enough. The only exception would be that lane 0 writes to its exclusive index and afterwards (before the __syncwarp()) another lane of the warp, which is (very theoretically) still in the old phase, reads the new value instead of the old one. If this is an issue, I would put a __syncwarp() also before the write by lane 0. It could be that this case is already synchronized by __shfl_sync(), but as I said, __syncwarp() is cheap.
Thanks Curefab; I have amended the pseudocode to show where the access is happening. The question really is whether I can rely on the warp-wide __shfl_sync() to handle the synchronization without an additional __syncwarp(). And there is no __syncthreads(), sorry; typing this late last night I got my wires crossed.
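In concrete terms, each phase follows this pattern. This is only a sketch: the names (kPhaseDemo, glb_counts, MAX_WARPS) and the trivial arithmetic are placeholders for what the real kernel does.

#define MAX_WARPS 11                          // up to 11 warps per block, as above

__global__ void kPhaseDemo(const int *glb_counts, float *glb_out) {
  __shared__ int iter_count[MAX_WARPS];       // one exclusive slot per warp
  const int warp_idx = threadIdx.x >> 5;
  const int lane_idx = threadIdx.x & 31;
  float acc = 0.0f;

  // Phase I
  if (lane_idx == 0) {
    // Stand-in for "read __global__ data and compute the iteration count"
    iter_count[warp_idx] = glb_counts[2 * blockIdx.x];
  }
  __syncwarp();                               // make lane 0's write visible to the warp
  for (int i = 0; i < iter_count[warp_idx]; i++) {
    float t = acc + (float)(lane_idx + i);
    acc += __shfl_sync(0xffffffff, t, (lane_idx + 1) & 31);
  }

  // Phase II: the question is whether lane 0 may overwrite iter_count[warp_idx]
  // here without any extra synchronization after the Phase I loop.
  if (lane_idx == 0) {
    iter_count[warp_idx] = glb_counts[2 * blockIdx.x + 1];
  }
  __syncwarp();
  for (int i = 0; i < iter_count[warp_idx]; i++) {
    acc += __shfl_sync(0xffffffff, acc, 0);
  }
  glb_out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}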
Use __syncwarp() also after each loop, before the lane 0 write
If this is the only access (getting the number of iterations), then put another __syncwarp() after the loop. That makes sure that the other threads do not read the number of iterations for the next phase while they are still in the old phase.
On the other hand, in your case that could theoretically already be prevented by the __shfl_sync(), but your instructions could be reordered.
The CUDA C++ Programming Guide says: “These intrinsics do not imply a memory barrier. They do not guarantee any memory ordering.”
To be safe, and in general, I would add a __syncwarp(). I would expect the __syncwarp() not to slow down your program at all.
Possibly the compute sanitizer could detect such race conditions.
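Sketched against your phase structure (placeholder names; only the synchronization placement matters here):

#define MAX_WARPS 11                     // placeholder warp count per block

__global__ void kSyncAfterLoop(const int *counts) {
  __shared__ int iter_count[MAX_WARPS];
  const int warp_idx = threadIdx.x >> 5;
  const int lane_idx = threadIdx.x & 31;

  for (int phase = 0; phase < 3; phase++) {
    if (lane_idx == 0)
      iter_count[warp_idx] = counts[3 * blockIdx.x + phase];
    __syncwarp();                        // write -> read: lane 0's count is visible
    for (int i = 0; i < iter_count[warp_idx]; i++) {
      // ... warp-wide work with __shfl_sync() ...
    }
    __syncwarp();                        // read -> overwrite: every lane has finished
                                         // with this phase's count before lane 0
                                         // writes the next one
  }
}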
Use different shared memory locations for each phase, or at least two
Alternatively, just write the number of iterations into a different memory cell for each phase. I think the amount of data needed to store the number of iterations is small enough not to fill up the available shared memory. And the threads can count which phase they are in. If you have many phases, you could switch between just two memory cells.
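Switching between two cells per warp could look like this (again just a sketch, with placeholder names):

#define MAX_WARPS 11

__global__ void kTwoCells(const int *counts) {
  __shared__ int iter_count[MAX_WARPS][2];    // two cells per warp
  const int warp_idx = threadIdx.x >> 5;
  const int lane_idx = threadIdx.x & 31;

  for (int phase = 0; phase < 3; phase++) {
    const int buf = phase & 1;                // alternate between the two cells
    if (lane_idx == 0)
      iter_count[warp_idx][buf] = counts[3 * blockIdx.x + phase];
    __syncwarp();                             // only the write -> read sync remains
    for (int i = 0; i < iter_count[warp_idx][buf]; i++) {
      // ... warp-wide work ...
    }
    // No __syncwarp() after the loop: the next phase writes the other cell, and
    // the cell from two phases ago is protected by the intervening __syncwarp().
  }
}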
Compute number of iterations locally
Alternatively, have each lane other than lane 0 do the same calculation as lane 0 to get the number of iterations. Typically the other lanes would just be waiting anyway, and it takes the same time to do the computation 32 times in parallel.
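For example (countForPhase() is a placeholder for whatever lane 0 would otherwise have computed):

__device__ int countForPhase(const int *counts, int phase) {
  return counts[3 * blockIdx.x + phase];      // stand-in for the real computation
}

__global__ void kRedundantCompute(const int *counts) {
  for (int phase = 0; phase < 3; phase++) {
    const int n = countForPhase(counts, phase);   // done redundantly by all 32 lanes
    for (int i = 0; i < n; i++) {
      // ... warp-wide work with __shfl_sync() ...
    }
  }
}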
Distribute with shuffle instead of with shared memory
Alternatively, distribute the number of iterations with a shuffle instruction (broadcast from lane 0 to the other lanes) and keep it in a local variable. Then you would not have to do any added synchronization, and it puts less strain on your shared memory bandwidth (which, at least for this kernel, is no problem anyway).
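A sketch of the broadcast variant (again with placeholder names):

__global__ void kShuffleBroadcast(const int *counts) {
  const int lane_idx = threadIdx.x & 31;

  for (int phase = 0; phase < 3; phase++) {
    int n = 0;
    if (lane_idx == 0)
      n = counts[3 * blockIdx.x + phase];     // stand-in for the real computation
    n = __shfl_sync(0xffffffff, n, 0);        // every lane now holds lane 0's value
    for (int i = 0; i < n; i++) {
      // ... warp-wide work ...
    }
  }
}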
You should be fine with any of those solutions.
It may be worthwhile to write the iteration limit for the second (and third) phases to separate variables in __shared__ memory. I could also put the __syncwarp() after the first phase’s loop (the iteration limit is written once and not otherwise edited during the loop). The iteration limit is not that simple to compute, but Amdahl’s law certainly applies. I think I want to compute it once rather than locally with each loop iteration.
I do have room for the additional __shared__ memory allocations, but you may be surprised at how tight all of this is. I am trying to get as close to 64 kB as I can without going over, both for backwards compatibility with Turing, Pascal, and Maxwell cards (64 kB __shared__ plus an additional 32 kB L1) and to conserve the free L1 space on Ampere and later cards (don’t use 64.1 kB of __shared__ memory; you’re just wasting 35.9 kB of L1 space that way).
On Ampere and later, when I can allocate more than 64 kB, I do actually splurge and take some even lower-frequency-use accumulators into the __shared__ space rather than leaving them as block-exclusive memory in __global__, and that takes me from ~61 kB to ~87 kB of __shared__ usage.
Otherwise I am putting things in __shared__ or keeping local variables on threads, all with the sense that there is a budget of 256 kB of register space and 96-128 kB of L1 or __shared__ space. The register space is something to think about, but I try to avoid replicating a variable with low usage across threads if I can store just one value for reference in __shared__. It looks like I’ll be able to get up to 1408 threads per SM engaged with the most common form of the kernel, but if 1280 threads per SM would be just as fast I will pull some more things out of __shared__ and put them in registers.
If you have shared memory limitations but want to also keep register pressure low, then stay with the same shared memory cell for each phase, but use __syncwarp() after each loop, too. That should be enough to be correct.
I can afford 4 bytes x 44 warps per SM = 176 bytes additional __shared__ allocation. If I needed another 4 bytes for every thread it would be a different story.
BTW, later cards (Turing, compute capability 7.5, onwards) also have around 64 additional uniform registers per warp for data that is equal across the whole warp. Typical use cases are for-loops and array offsets. It is meant to reduce register pressure.
The instructions are recognizable in SASS code: they start with the letter U and use register names URZ, UR0, UR1, … I am not sure how to convince ptxas that a register has the same contents across the warp if it is not obviously initialized with a constant.
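A trivial case where I would expect ptxas to use the uniform datapath (not guaranteed; you would have to check the SASS, e.g. with cuobjdump -sass; the kernel itself is just an illustration):

__global__ void kUniformCandidate(float *out, const float *in, int n, int rows) {
  const int col = blockIdx.x * blockDim.x + threadIdx.x;
  for (int row = 0; row < rows; row++) {      // loop counter is warp-uniform
    const int offset = row * n;               // array offset is warp-uniform
    if (col < n)
      out[offset + col] = 2.0f * in[offset + col];
  }
}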
This is the kind of information that makes it worthwhile to write these long-form posts. Thanks for letting me know; I will look into how to utilize this register space, because it could be very useful indeed. When I make nested loops I can often wind up with very high register usage, so it’s more than just the loop control variables; the compiler can get carried away. But if there are ways to coax it into seeing a loop control variable as constant across the warp, that’s a big savings, as if the register space were expanded by 3% or so.
You are welcome.
There is not much documentation out there about this uniform datapath feature.
This presentation, given by Nvidia at the Hot Chips conference in 2019, covers it on pages 8 and 10: https://old.hotchips.org/hc31/HC31_2.12_NVIDIA_final.pdf
Before that, there was also a near-academic paper from Nvidia about the general concept (simulating the effect), which was either done at the same time as the architecture development or helped Nvidia decide to go with it.
You can only see this uniform datapath in SASS code, not in PTX code.
The uniform cores do logical and integer arithmetic (including multiplication), program flow, and data loads. They are interspersed with the other instructions and run in parallel. Each SMSP has its own uniform datapath, so it is quite closely coupled (regarding instruction latency and predictability). The throughput is 1 instruction per 2 clock cycles, the same as an INT32 or FP32 warp instruction on Turing (those are issued over 16 lanes per cycle per SMSP, taking 2 clock cycles for a warp).
A lot of normal instructions have variants that take a uniform register as one direct operand, or an address in constant memory indexed by a uniform register.
I’m not sure this is quite correct, assuming you mean 64 registers in addition to the 256 regular ones.
The paper “Dissecting Turing” discusses Uniform Registers on page 32, and the following paragraph, “Regular Registers”, mentions:
“We found that the cuobjdump -dump-resource-usage command (that prints a kernel’s register usage) reports a count that includes both regular and uniform registers. The upper limit of total registers used in any CUDA kernel is 256, unchanged from Volta.”
We confirmed this result by patching the register count in the section header of a CUDA kernel to values above 256, and determining that cuobjdump only recognizes 256 registers at most.
The instruction opcodes have different fields for normal registers and uniform registers. On the hardware they are in separate register files.
I would not think that multiplexers are shared between the normal and uniform register files, but that would be the only reason for a combined limit I could imagine.
We could test it with a kernel that uses 255 + 63 non-zero registers.