I noticed nvcc reads warpSize from a special register instead of assuming it to be 32, which prevents some compiler optimizations (e.g., turning division by warpSize into a shift). While I understand that assuming warpSize to be 32 raises potential portability issues, can we safely assume that warpSize is always 32 for certain CCs (e.g., sm_35)? Thanks.
All CUDA-capable GPUs that have shipped until now have a warp size of 32, so if your code does not have to be future-proof, you would be fine assuming 32 threads per warp in your code.
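A minimal sketch of what hard-coding buys you, written as plain host-compilable C++ since the arithmetic is the same in device code. `WARP_SIZE` is our own constant, not part of the CUDA API; the assumption that it equals 32 is exactly the portability bet discussed above.

```cpp
#include <cassert>

// Assumption: warp size is 32 on all shipping GPUs. Because this is a
// compile-time power-of-two constant, the compiler can fold the
// division and modulus below into a shift and a mask -- which it
// cannot do when the value comes from the %WARP_SZ special register.
constexpr unsigned WARP_SIZE = 32;

// Lane index and warp index from a linear thread id.
constexpr unsigned lane_id(unsigned tid) { return tid % WARP_SIZE; }
constexpr unsigned warp_id(unsigned tid) { return tid / WARP_SIZE; }

static_assert(lane_id(33) == 1, "thread 33 is lane 1 of warp 1");
static_assert(warp_id(33) == 1, "thread 33 is in warp 1");
```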
How much of a performance advantage at the application level have you observed from using a hard-coded warp size of 32, i.e., from the additional compiler optimizations this enables? I have never been able to get any real-life data from programmers worried about this.
I am not disputing that it may make a significant difference for some kernels, but in an age where data movement (or rather, avoiding it!) is usually one of the biggest performance worries, I wonder whether this belongs under “micro optimization”.
I have a pile of kernels that rely on conflict-free shared memory access patterns that are computed at compile-time and are a function of the device architecture and other parameters.
Computing this at runtime for every thread and every grid launch wouldn’t be a good idea.
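A hypothetical sketch of the kind of compile-time computation meant here (the names `WARP_SIZE`, `NUM_BANKS`, and `padded_pitch` are mine, not from the poster's code): padding a shared-memory row pitch so that column-wise accesses stay bank-conflict-free, with everything resolved at compile time rather than per thread at launch.

```cpp
#include <cassert>

// Assumptions baked in at compile time: 32 threads per warp and 32
// shared-memory banks (true of sm_35-class parts).
constexpr unsigned WARP_SIZE = 32;
constexpr unsigned NUM_BANKS = 32;

// Pad the pitch by one word whenever the natural width is a multiple
// of the bank count, so threads of a warp walking down a column hit
// distinct banks. constexpr means the result is a compile-time
// constant -- no per-thread arithmetic at runtime.
constexpr unsigned padded_pitch(unsigned width) {
    return (width % NUM_BANKS == 0) ? width + 1 : width;
}

static_assert(padded_pitch(WARP_SIZE) == 33, "multiple of banks gets padded");
static_assert(padded_pitch(48) == 48, "non-multiple is left unchanged");
```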
Does it verge on “micro optimization”? Possibly. But I think the real answer is that it’s exposing a historical wart in the CUDA/PTX API. Kernels are compiled for “virtual architectures” but somehow warp size is not considered an architectural feature? FAIL. :)
There are any number of “historical warts” in CUDA-land, but I am not sure I understand the point about warp size not being an architectural feature. Other than exposing it via a special register, what would be needed to make it an architectural feature? Coming at this from the other side, one could also claim that warpSize = 32 is a shared architectural feature of all shipping architectures.
Ah, what I was getting at is that WARP_SIZE should be a compile-time macro similar to __CUDA_ARCH__ (and all the hassle that goes along with that macro).
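A sketch of the hypothetical macro the toolchain does not provide (so we define it ourselves): defaulted to 32, overridable from the build line (e.g. `-DWARP_SIZE=64` for some future part), with a compile-time sanity check. On real hardware one would also verify it once at startup against `cudaDeviceProp::warpSize`.

```cpp
#include <cassert>

// Hypothetical compile-time warp size, not part of the CUDA API.
// Override with -DWARP_SIZE=... if the assumption ever breaks.
#ifndef WARP_SIZE
#define WARP_SIZE 32
#endif

// Warp-index arithmetic elsewhere assumes a power of two, so fail the
// build rather than silently miscompute if someone overrides it badly.
static_assert(WARP_SIZE > 0 && (WARP_SIZE & (WARP_SIZE - 1)) == 0,
              "WARP_SIZE is assumed to be a power of two");
```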