I noticed nvcc reads warpSize from a special register instead of assuming it to be 32, which keeps the compiler from some optimizations. While I understand that assuming warpSize to be 32 will raise potential portability issues, can we safely assume that warpSize is always 32 for certain CCs (e.g., sm_35)? Thanks.
Warp size is 32 for all currently available CUDA GPU architectures, up through cc 5.2 or 5.3.
All CUDA-capable GPUs that have shipped until now have a warp size of 32, so if your code does not have to be future-proof, you would be fine assuming 32 threads per warp in your code.
How much of a performance advantage at app level are you observing from using a hard-coded warp size of 32?
Your question is a good one.
Every CUDA architecture to this point has a warp size of 32.
It may be non-portable to define WARP_SIZE as 32 but I’m willing to accept the consequences so that compile-time optimizations can be made.
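For anyone who wants to hard-code the value but still fail loudly on a device that disagrees, here is a minimal sketch. The `WARP_SIZE` constant and the `check_warp_size` helper are names made up for illustration, not anything CUDA provides; the runtime calls (`cudaDeviceGetAttribute` with `cudaDevAttrWarpSize`) are the standard ones. It assumes device 0 and a compiler that accepts C++11 `constexpr`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumption: hard-coded so the compiler can treat it as a constant.
// This is NOT an official CUDA macro.
constexpr int WARP_SIZE = 32;

// Hypothetical helper: abort if the device's real warp size differs
// from the value the kernels were compiled against.
static void check_warp_size(int device)
{
    int actual = 0;
    cudaError_t err = cudaDeviceGetAttribute(&actual, cudaDevAttrWarpSize, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
    if (actual != WARP_SIZE) {
        std::fprintf(stderr, "Device warp size %d != compiled WARP_SIZE %d\n",
                     actual, WARP_SIZE);
        std::exit(EXIT_FAILURE);
    }
}

int main()
{
    check_warp_size(0);  // assumption: device 0 is the one being used
    std::printf("Device warp size matches compile-time WARP_SIZE\n");
    return 0;
}
```

Calling the check once at startup keeps the portability risk visible without paying for a special-register read inside the kernels.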
How large a performance difference have you observed due to the additional compiler optimizations enabled by this approach? I have never been able to get any real-life data from programmers worried about this.
I am not disputing that it may make a significant difference for some kernels, but in this age where data movement (that is, avoiding it!) is usually one of the biggest performance worries, I wonder whether this should be cataloged under “micro optimization”.
I have a pile of kernels that rely on conflict-free shared memory access patterns that are computed at compile-time and are a function of the device architecture and other parameters.
Computing this at runtime for every thread and every grid launch wouldn’t be a good idea.
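To make the compile-time point concrete, here is a small sketch of one such pattern: a padded shared-memory tile whose row stride is derived at compile time. This is only the textbook transpose case, not the actual kernels mentioned above. `WARP_SIZE`, `NUM_BANKS`, `padded_stride`, and `transpose_tile` are hypothetical names, and the 32-bank figure is an assumption about the target device. Error checking is left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;  // assumption: hard-coded, not queried from the device
constexpr int NUM_BANKS = 32;  // assumption: 32 four-byte shared memory banks

// Pad the row stride by one element when it is a multiple of the bank count,
// so that walking down a column touches a different bank on every row.
constexpr int padded_stride(int width)
{
    return (width % NUM_BANKS == 0) ? width + 1 : width;
}
constexpr int TILE_STRIDE = padded_stride(WARP_SIZE);
static_assert(TILE_STRIDE % NUM_BANKS != 0,
              "row stride must not be a multiple of the bank count");

// Transposes an n x n matrix; expects a WARP_SIZE x WARP_SIZE thread block.
__global__ void transpose_tile(const float* in, float* out, int n)
{
    __shared__ float tile[WARP_SIZE][TILE_STRIDE];  // size fixed at compile time

    int x = blockIdx.x * WARP_SIZE + threadIdx.x;
    int y = blockIdx.y * WARP_SIZE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    // Swap the block indices so the global write stays row-contiguous;
    // the shared-memory read is the column walk that the padding protects.
    int tx = blockIdx.y * WARP_SIZE + threadIdx.x;
    int ty = blockIdx.x * WARP_SIZE + threadIdx.y;
    if (tx < n && ty < n)
        out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}

int main()
{
    const int n = 64;
    std::vector<float> h_in(n * n), h_out(n * n, 0.0f);
    for (int i = 0; i < n * n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * n * sizeof(float));
    cudaMalloc(&d_out, n * n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(WARP_SIZE, WARP_SIZE);
    dim3 grid((n + WARP_SIZE - 1) / WARP_SIZE, (n + WARP_SIZE - 1) / WARP_SIZE);
    transpose_tile<<<grid, block>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * n * sizeof(float), cudaMemcpyDeviceToHost);

    // Verify out[x][y] == in[y][x] for every element.
    bool ok = true;
    for (int y = 0; y < n; ++y)
        for (int x = 0; x < n; ++x)
            if (h_out[x * n + y] != h_in[y * n + x]) ok = false;
    std::printf(ok ? "transpose ok\n" : "transpose MISMATCH\n");

    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

Because the tile dimensions depend on `WARP_SIZE`, the statically sized `__shared__` array would not compile with the built-in `warpSize`, which is not a compile-time constant.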
Does it verge on “micro optimization”? Possibly. But I think the real answer is that it’s exposing a historical wart in the CUDA/PTX API. Kernels are compiled for “virtual architectures” but somehow warp size is not considered an architectural feature? FAIL. :)
There are any number of “historical warts” in CUDA-land, but I am not sure I understand the point about warp size not being an architectural feature. Other than exposing it via a special register, what would be needed to make it an architectural feature? Coming at this from the other side, one could also claim that warpSize = 32 is a shared architectural feature of all shipping architectures.
Ah, I was on my way to implying that WARP_SIZE should be a compile-time macro similar to __CUDA_ARCH__ (and all the hassle that goes along with that macro).
When warpSize is used as a loop bound, it can affect loop unrolling (when #pragma unroll is specified), which has a big impact on performance.
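A rough sketch of the loop-bound case, using a warp-level sum as the example. The two device functions differ only in where the bound comes from: the built-in `warpSize` or a hard-coded constant. `WARP_SIZE`, `warp_sum_runtime`, `warp_sum_static`, and `demo` are hypothetical names, and `__shfl_down_sync` assumes CUDA 9 or later. Whether the compiler actually generates different code for the two is something to confirm in the SASS for your toolkit and target, not something this sketch proves.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;  // assumption: hard-coded, not an official macro

// Loop bound comes from the built-in warpSize, which is not a
// compile-time constant, so the trip count may not be known to the compiler.
__device__ float warp_sum_runtime(float v)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Same loop with a compile-time bound: the trip count is a known constant,
// so the unroll request can be honored fully.
__device__ float warp_sum_static(float v)
{
    #pragma unroll
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Launch with exactly one full warp so the full-warp mask above is valid.
__global__ void demo(float* out)
{
    float a = warp_sum_runtime(1.0f);
    float b = warp_sum_static(1.0f);
    if (threadIdx.x == 0) {  // lane 0 holds the complete sum
        out[0] = a;
        out[1] = b;
    }
}

int main()
{
    float* d_out = nullptr;
    float h_out[2] = {0.0f, 0.0f};
    cudaMalloc(&d_out, sizeof(h_out));
    demo<<<1, WARP_SIZE>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    std::printf("runtime bound: %.1f, static bound: %.1f\n", h_out[0], h_out[1]);
    cudaFree(d_out);
    return (h_out[0] == h_out[1]) ? 0 : 1;
}
```

Comparing the output of `cuobjdump -sass` for the two functions is probably the quickest way to see whether the hard-coded bound buys anything for a given kernel.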
I assume you have already pitched the WARP_SIZE macro idea to NVIDIA?