Hello, I am using LibTorch 2.4.0 and CUDA 11.8 to develop a deep learning project. The project includes custom forward and backward CUDA functions written in *.cu files. It previously compiled and ran successfully, but after I added more code to the .cu file, the compilation hangs and never returns (no error code, it just gets stuck). Using the command "top", I found this command running: "ptxas -arch sm_89 -m64 -v /tmp/tmpxft_00006356_00000000-6_main.ptx -o /tmp/tmpxft_00006356_00000000-8_main.cubin". To confirm that ptxas is the culprit, I ran this command manually (i.e. typed it into a terminal and ran it), and it never returned either.
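In case it helps anyone reproduce this with a bounded wait instead of an open-ended hang, here is a minimal sketch of how the manual ptxas run could be wrapped in a timeout. The helper name `run_with_timeout` and the 600-second limit are only illustrative, and the sketch assumes the intermediate .ptx file still exists (for example because nvcc was run with `--keep`).

```python
import subprocess

# The ptxas invocation copied from "top". The temp file names change on every
# build, so these paths are only an example of what to substitute.
PTXAS_CMD = [
    "ptxas", "-arch", "sm_89", "-m64", "-v",
    "/tmp/tmpxft_00006356_00000000-6_main.ptx",
    "-o", "/tmp/tmpxft_00006356_00000000-8_main.cubin",
]


def run_with_timeout(cmd, seconds):
    """Run cmd and return (finished, returncode, output).

    finished is False if the command was still running after `seconds`;
    in that case the child process is killed and returncode is None.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # ptxas -v reports on stderr
            text=True,
            timeout=seconds,
        )
    except subprocess.TimeoutExpired as exc:
        partial = exc.stdout or ""
        # Partial output can arrive as bytes even when text=True is set.
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return False, None, partial
    return True, proc.returncode, proc.stdout


# Example usage (hypothetical 10-minute limit):
#   finished, code, output = run_with_timeout(PTXAS_CMD, 600)
#   print("finished" if finished else "timed out", code)
#   print(output)
```

With this, a run that times out at least shows whatever ptxas printed before it got stuck, which might narrow down which kernel it is working on.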
My guess is that some limit (register count, device code size, or constant memory) is being exceeded. Unfortunately, there is no output from ptxas since it never returns, so I have no clue how to optimize my code.
For your information, my CUDA code is around 800 lines, and I use cuco::static_map to accelerate it.
I would really appreciate any help. Thanks in advance!