https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/ says thread block reconfiguration can improve the performance of FFT. What exactly is thread block reconfiguration? I cannot find the details of this new feature in the CUDA Programming Guide. Is there any reference for using this feature, such as an API or examples?
See slide 21 here: https://hc34.hotchips.org/assets/program/conference/day1/GPU%20HPC/HC2022.NVIDIA.Choquette.vfinal01.pdf
With it, you can keep data resident in shared memory for different kernels to access, e.g. first a kernel with many threads and 64 registers per thread, then a second kernel with fewer threads and 256 registers per thread.
Normally, shared memory contents would be undefined or discarded between kernel calls.
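As far as I can tell, the cross-kernel shared-memory hand-off itself is not exposed through a public CUDA C++ API yet, but the per-kernel register budgets mentioned above can already be expressed today. Here is a minimal sketch, assuming you just want two kernels compiled with different register ceilings; the kernel names and the exact numbers are mine, and the shared memory written in `stage1` is of course not visible to `stage2` with current CUDA:

```cuda
// Sketch only: two kernels with different register budgets, expressed via
// __launch_bounds__. A block may use at most 64K 32-bit registers, so capping
// threads per block indirectly changes the per-thread register ceiling the
// compiler may use. (An -maxrregcount compile flag is the other knob.)
#include <cuda_runtime.h>

// Many threads, modest register pressure: 1024 threads/block limits the
// compiler to at most 64 registers per thread.
__global__ void __launch_bounds__(1024) stage1(float* data, int n)
{
    __shared__ float tile[1024];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = data[i] * 2.0f;
    __syncthreads();
    if (i < n) data[i] = tile[threadIdx.x];
}

// Fewer threads, high register pressure: with 256 threads/block the compiler
// may go up to the architectural per-thread maximum of 255 registers.
__global__ void __launch_bounds__(256) stage2(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    stage1<<<n / 1024, 1024>>>(d, n);
    stage2<<<n / 256, 256>>>(d, n);  // shared memory from stage1 is NOT carried over here
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```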
Not sure if this is related or independent:
https://docs.nvidia.com/cuda/parallel-thread-execution/#miscellaneous-instructions-setmaxnreg
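setmaxnreg works within a single kernel rather than across kernels: all threads of a warpgroup cooperatively grow or shrink their per-thread register allocation at run time. It needs sm_90a, and the immediate must be a multiple of 8 in [24, 256]. A small sketch of how it is usually wrapped with inline PTX (the wrapper names and the 40/232 values are illustrative; CUTLASS uses a very similar pattern in its warp-specialized Hopper kernels):

```cuda
// Sketch: warpgroup-wide register reallocation via the PTX setmaxnreg
// instruction. Requires compiling for sm_90a, and every thread of the
// warpgroup must execute the same instruction.

// Producer warps that mostly issue async copies can give registers back...
__device__ __forceinline__ void warpgroup_reg_dealloc()
{
    asm volatile("setmaxnreg.dec.sync.aligned.u32 40;");
}

// ...so consumer warps doing the math can request a larger budget
// within the same thread block.
__device__ __forceinline__ void warpgroup_reg_alloc()
{
    asm volatile("setmaxnreg.inc.sync.aligned.u32 232;");
}
```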
BTW, this year’s Hot Chips conference could also have a very interesting Blackwell presentation. Save the date: 26th of August.
NVIDIA Blackwell GPU: Advancing Generative AI and Accelerated Computing by Ajay Tirumala (manager of SM core architecture) and Raymond Wong (director of hardware engineering), NVIDIA