I am using nvfortran, and my graphics card is an NVIDIA RTX A4000. The current CUDA version is 11.7 and the driver is 515.105.01. I would like to check the register usage of each thread at runtime. What can I do?
Add the flag “-gpu=ptxinfo”. This will print the ptxas info, including the number of registers used per thread for each kernel.
For example:
% nvfortran -gpu=ptxinfo test.cuf
ptxas info : 48 bytes gmem
ptxas info : Compiling entry function 'kernel_gpu_writepar_' for 'sm_80'
ptxas info : Function properties for kernel_gpu_writepar_
416 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 40 registers, 352 bytes cmem[0]
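If you need the number at runtime rather than at compile time, you can also query it through the CUDA runtime API. A minimal sketch using cudaFuncGetAttributes from the cudafor module (the kernel here is a placeholder for your own global subroutine):
module kernels
   use cudafor
contains
   ! placeholder kernel; substitute your own global subroutine
   attributes(global) subroutine addone(A)
      real, device :: A(*)
      A(threadIdx%x) = A(threadIdx%x) + 1.0
   end subroutine
end module kernels

program query_regs
   use cudafor
   use kernels
   type(cudaFuncAttributes) :: attr
   integer :: istat
   istat = cudaFuncGetAttributes(attr, addone)
   if (istat /= cudaSuccess) print *, cudaGetErrorString(istat)
   print *, 'registers per thread: ', attr%numRegs
end program query_regs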
Thank you, I have solved it.
Hello Mat, I'm sorry to bother you. I have an issue with the number of threads, the thread index, and how tasks are partitioned across threads. My code snippet is as follows:
Kernel launch:
call cycle_runoff<<<16,512>>>
Thread index code:
threadnum = blockDim%x*(blockIdx%x-1) + threadIdx%x
if (threadnum > 8712) return
if (threadnum <= 4356) then
   startnum = threadnum*2-1; endnum = threadnum*2
   if (threadnum == 4356) endnum = threadnum*2-1
   do ix = startnum, endnum
I am compiling with nvfortran. My current kernel launch configuration is 16 * 512, for a total of 8192 threads, but my computation has 8712 tasks. With my partitioning scheme, each thread is assigned two tasks, so the 8712 tasks should only need a bit more than 4000 threads; are the remaining roughly 4000 threads simply idle after the calculation is done? And if CUDA Fortran instead assigns one task per thread, then after the 8192 threads compute 8192 tasks, are the remaining 520 tasks never computed? I'm stuck on how to distribute these 8712 tasks. When my configuration is 16 * 512 (or any other configuration), what is the mapping between threads and tasks, and how should I assign the work? Can you help me?
If I understand correctly, you have fewer threads than iterations to compute. Typically this is handled by adding a loop to the kernel with a stride equal to the total number of threads (a grid-stride loop). Each thread keeps computing iterations until the maximum size is reached; some threads become inactive once they've exited the loop.
For example:
module foo
   use cudafor
   real, dimension(:), allocatable :: A
   real, dimension(:), allocatable, device :: Adev
contains
   attributes(global) subroutine addone(A,n)
      real, dimension(:) :: A
      integer, value :: n
      integer :: i, idx, stride
      ! global thread index (1-based) and total thread count
      idx = blockDim%x*(blockIdx%x-1) + threadIdx%x
      stride = gridDim%x * blockDim%x
      ! grid-stride loop: each thread handles i = idx, idx+stride, ...
      do i = idx, n, stride
         A(i) = A(i) + 1.0
      enddo
   end subroutine
end module foo

program main
   use foo
   integer :: N=8712
   integer :: summe
   allocate(A(N))
   allocate(Adev(N))
   A = 0.0
   Adev = A                        ! host -> device copy
   call addone<<<16,512>>>(Adev,N)
   A = Adev                        ! device -> host copy
   summe = sum(A)
   print *, summe
   deallocate(A)
   deallocate(Adev)
end program main
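Compiled with nvfortran and run, this should print 8712: even though only 8192 threads are launched, the stride loop lets the first 520 threads pick up a second element, so every element is incremented exactly once.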
Thank you, Mat. I have some other questions now; I've searched for a lot of information, but much of it is not very accurate. Can you tell me the scheduling principle of the scheduler, and how kernels distribute tasks in GPU parallel programming? Regardless of the number of thread blocks I set, are all the multiprocessors on the card used in the calculation, just with a different number of blocks assigned to each? When I run my program, increasing the number of thread blocks does not always give a significant reduction in computation time, and I don't know what is causing this.
Each multiprocessor (SM) can run a fixed maximum number of concurrent threads; on your A4000 (compute capability 8.6) this is 1536, while some other architectures allow 2048. How those threads are grouped into blocks often doesn't matter, which is why increasing the block size often doesn't help. There are cases where it can, but those are usually due to the algorithm and its use of shared memory.
What does matter is occupancy, i.e. how many of that maximum number of threads can actually be run concurrently. This depends on how much of the shared resources each thread uses (such as registers per thread) or each block uses (such as shared memory). The more resources each takes, the fewer concurrent threads can be run.
A secondary factor is the amount of work (i.e. the total number of threads) available. An A4000 has 48 SMs, so to achieve 100% theoretical occupancy you need a minimum of 73,728 (1536x48) threads. You only have 8192, which uses a fraction of the device. That said, reaching 100% doesn't guarantee the best performance, and 50% is often sufficient when each thread needs more resources.
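If you want to see how a given block size maps to resident blocks on an SM, the runtime can estimate it for you. A sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor, assuming the interface as documented in the CUDA Fortran guide (the kernel is a placeholder):
module kernels
   use cudafor
contains
   attributes(global) subroutine mykernel(A)   ! placeholder kernel
      real, device :: A(*)
      A(threadIdx%x) = A(threadIdx%x) + 1.0
   end subroutine
end module kernels

program occupancy
   use cudafor
   use kernels
   integer :: istat, numBlocks, blockSize
   integer(kind=cuda_count_kind) :: smem
   blockSize = 512
   smem = 0    ! no dynamic shared memory in this sketch
   ! max resident blocks per SM for this kernel at blockSize threads
   istat = cudaOccupancyMaxActiveBlocksPerMultiprocessor(numBlocks, mykernel, blockSize, smem)
   print *, 'resident blocks per SM:  ', numBlocks
   print *, 'resident threads per SM: ', numBlocks*blockSize
end program occupancy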
While I don’t know the specifics of your code or workload, if you increase the problem size by 10-20x, it will likely take about the same amount of time to compute.
You can get more details about occupancy here: Achieved Occupancy
For kernel performance, I'd suggest you run your code through Nsight-Compute to get a hardware-level profile. This should tell you where the bottlenecks are.
Mat, I have a question about the following code. Does idx, the computed thread index, start from 1? If it starts from 1, why not just write 1 directly in the do loop?
idx = blockDim%x*(blockIdx%x-1) + threadIdx%x
stride = gridDim%x * blockDim%x
do i = idx, n, stride
Only thread 1 starts at 1; thread 2 starts at 2, 3 at 3, 4 at 4, and so on.
If you use 1, then all threads will start at 1.
Unless that's what you want, i.e. all threads execute all iterations of the loop? In that case, remove the stride as well, otherwise every thread would skip over the same iterations.
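Concretely, with the earlier <<<16,512>>> launch (8192 threads, so stride = 8192) and n = 8712, the iterations map out as:
! thread    1: i = 1,   then 8193
! thread    2: i = 2,   then 8194
! ...
! thread  520: i = 520, then 8712
! thread  521: i = 521 only (8713 > n)
! ...
! thread 8192: i = 8192 only
So threads 1-520 each compute two iterations and threads 521-8192 each compute one, covering all 8712.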
Thank you, Mat. You have helped me solve many problems, and I really appreciate it. I'll ask again if I have more questions.
Hello Mat, I have two questions. The first: I am using the nvfortran compiler and I call print inside a kernel; sometimes the output appears and sometimes it doesn't. What is the reason for this? The second: according to my timing measurements, when the kernel is configured as 4 * 384, 8 * 384, or 16 * 384, there is a sudden change in time; the computation time drops significantly at one of these points, and the subsequent times are similar. What causes this?
Another question: can nvfortran binaries be analyzed with NVIDIA Nsight or other analysis tools under Linux? Can you recommend some resources?
Without more info, I’m not sure what’s happening in your case, but in general:
- Printing from device code is limited in that you can’t use formatted writes, only the basic “print *”.
- The print buffer is fixed size so you’re limited on how much can be printed from each kernel.
- Unless guarded, every thread will print and the output streams can get mixed.
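For the last point, you can guard the print so only one thread writes. A minimal sketch (hypothetical kernel; assumes it lives in a module that uses cudafor):
module debug_mod
   use cudafor
contains
   attributes(global) subroutine debugprint(A, n)
      real, device :: A(*)
      integer, value :: n
      ! only one thread in the entire grid prints, so output won't interleave
      if (blockIdx%x == 1 .and. threadIdx%x == 1) then
         print *, 'n = ', n, ' A(1) = ', A(1)
      endif
   end subroutine debugprint
end module debug_mod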
The second question: according to my timing measurements, when the kernel is configured as 4 * 384, 8 * 384, or 16 * 384, there is a sudden change in time; the computation time drops significantly, and the subsequent times are similar. What causes this?
I’m assuming these values are the launch configuration? i.e. a block size of 384 with 4, 8, or 16 blocks?
You are adding more parallelism by increasing the number of blocks, so it makes sense that the performance improves. With a small number of blocks, each block does more work; with more blocks, there's less work per block but more blocks can be run concurrently.
Another question: can nvfortran binaries be analyzed with NVIDIA Nsight or other analysis tools under Linux? Can you recommend some resources?
Binaries created by nvfortran can be profiled by Nsight-Systems or Nsight-Compute, both of which ship with the NVHPC SDK.
Mat, I would like to ask you a question. When I used NVIDIA Nsight Systems to analyze my code, my kernel launch configuration was written as <<<8,384>>>, but what I saw in the profiling tool was <<<17,1256>>>. My previous 8 * 256 configuration was displayed correctly in the tool. Do you know the reason, or possible causes?
That kernel is a memset operation inserted by the compiler for something like “Array_device = 1”; it's not a user-defined kernel.
It looks like there are two kernels hidden from the view, and yours is likely one of them. Click the plus sign (+) to unhide them.
Mat, I unhid them, but the three kernels displayed are <<<171,256>>>, <<<1,256>>>, and <<<1,256>>>, which don't match my kernel configuration of <<<8,512>>>. I also found that when the kernel is configured as <<<8,256>>> or <<<8,128>>>, it is displayed normally in the profiling tool; the problem occurs whenever the configuration exceeds <<<*,384>>>. Does this happen under the CUDA architecture when running in parallel? What's going on?
If the kernel is displayed at the 8x256 and 8x128 launch configurations but not at 8x512, it is most likely getting a launch error and not actually running.
Do you check for errors after launching the kernel, i.e. call “cudaGetLastError”?
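For example (a sketch; “mykernel”, “Adev”, and “n” are placeholders, and “use cudafor” plus an integer “istat” are assumed):
call mykernel<<<8,512>>>(Adev, n)
istat = cudaGetLastError()        ! reports launch/configuration errors
if (istat /= cudaSuccess) print *, 'launch: ', cudaGetErrorString(istat)
istat = cudaDeviceSynchronize()   ! reports errors raised during execution
if (istat /= cudaSuccess) print *, 'exec:   ', cudaGetErrorString(istat)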
At 8x512 you might be hitting a resource limit such as shared memory or registers. For example, if your kernel uses the maximum 255 registers per thread, only 256 threads can be run, given that the total registers per multiprocessor is 64K. At 384 threads, a maximum of 170 registers per thread can be used.
The back-end device assembler, ptxas, does the register allocation, so to see the registers per thread, add the flag “-gpu=ptxinfo”.
I called cudaGetLastError before and after launching the kernel; it showed no errors before the launch and “too many resources requested for launch” afterwards. However, my device is an NVIDIA RTX A4000 with 48 SMs, and the maximum number of threads per thread block is 1024. My kernel is configured as 8 * 512, which does not exceed the card's limits. What is the problem?
What is the problem?
Again, there are other resource limits besides the threads per block. Exceeding the shared memory (48K) or total registers (64K) per SM is the most likely cause of the “too many resources” error.
You can try limiting the register usage via the flag “-gpu=maxregcount:N” where “N” is either 64 or 128. However, this may be detrimental to performance as it will cause register spilling to local memory.
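For example:
% nvfortran -gpu=maxregcount:128 -gpu=ptxinfo test.cuf
Combining it with “-gpu=ptxinfo” lets you confirm the new per-thread register count in the ptxas output.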
For most codes there's little benefit to increasing the block size beyond 128, so the better option is to simply use smaller block sizes.