Hopper architecture: thread block reconfiguration

https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/ says that thread block reconfiguration can improve FFT performance. What exactly is thread block reconfiguration? I cannot find any details about this new feature in the CUDA Programming Guide :( Is there any reference (API documentation or examples) showing how to use this feature?

See slide 21 here: https://hc34.hotchips.org/assets/program/conference/day1/GPU%20HPC/HC2022.NVIDIA.Choquette.vfinal01.pdf

You can keep data in shared memory for different kernels to access, e.g. first a kernel with many threads and 64 registers per thread, then a second kernel with fewer threads and 256 registers per thread that works on the same shared memory contents.

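Normally, the contents of shared memory do not survive between kernel launches: shared memory only lives as long as the thread block that allocated it, so anything left there is undefined for the next kernel.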
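For comparison, here is a minimal sketch of the usual workaround without any such feature: the first kernel stages its shared memory tile out to global memory, and the second kernel, launched with a different thread count, reloads it. The names `pass1`, `pass2`, `run_two_pass` and `kTile` are made up for illustration, and the arithmetic is a placeholder, not an FFT. This is not the Hopper feature itself, just the round trip that it would presumably avoid.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kTile = 256;  // elements per block (hypothetical tile size)

// Pass 1: many threads per block, one element each. The tile lives in
// shared memory and must be written to global memory before the kernel ends.
__global__ void pass1(const float* in, float* staging, int n) {
    __shared__ float tile[kTile];
    int i = blockIdx.x * kTile + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] * 2.0f : 0.0f;  // placeholder work
    __syncthreads();
    if (i < n) staging[i] = tile[threadIdx.x];  // shared memory dies with the block
}

// Pass 2: fewer threads per block (each could afford more registers).
// The tile has to be reloaded from global memory into shared memory.
__global__ void pass2(const float* staging, float* out, int n) {
    __shared__ float tile[kTile];
    int base = blockIdx.x * kTile;
    for (int k = threadIdx.x; k < kTile; k += blockDim.x) {
        int i = base + k;
        tile[k] = (i < n) ? staging[i] : 0.0f;
    }
    __syncthreads();
    for (int k = threadIdx.x; k < kTile; k += blockDim.x) {
        int i = base + k;
        if (i < n) out[i] = tile[k] + 1.0f;  // placeholder work
    }
}

// Host helper: runs both passes and returns the result (error checks omitted).
std::vector<float> run_two_pass(const std::vector<float>& host_in) {
    int n = static_cast<int>(host_in.size());
    std::vector<float> host_out(n);
    if (n == 0) return host_out;

    size_t bytes = n * sizeof(float);
    float *in = nullptr, *staging = nullptr, *out = nullptr;
    cudaMalloc(&in, bytes);
    cudaMalloc(&staging, bytes);
    cudaMalloc(&out, bytes);
    cudaMemcpy(in, host_in.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + kTile - 1) / kTile;
    pass1<<<blocks, kTile>>>(in, staging, n);        // 256 threads per block
    pass2<<<blocks, kTile / 4>>>(staging, out, n);   // 64 threads per block

    cudaMemcpy(host_out.data(), out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(staging);
    cudaFree(out);
    return host_out;
}
```

Calling `run_two_pass` on a small vector goes through both launches; the `staging` buffer is exactly the extra global memory traffic that keeping the data resident in shared memory would save.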

Not sure whether this is related or independent:
https://docs.nvidia.com/cuda/parallel-thread-execution/#miscellaneous-instructions-setmaxnreg

BTW, this year’s Hot Chips conference could also have a very interesting Blackwell presentation. Save the date (26th of August)!

NVIDIA Blackwell GPU: Advancing Generative AI and Accelerated Computing by Ajay Tirumala (manager of SM core architecture) and Raymond Wong (director of hardware engineering), NVIDIA
