I am using nvfortran, and my graphics card is an NVIDIA RTX A4000. The current CUDA version is 11.7 and the driver is 515.105.01. I would like to check the register usage of each thread at runtime. What can I do?
Add the flag “-gpu=ptxinfo”. This will print the ptxas info, including the number of registers used per thread for each kernel.
For example:
% nvfortran -gpu=ptxinfo test.cuf
ptxas info : 48 bytes gmem
ptxas info : Compiling entry function 'kernel_gpu_writepar_' for 'sm_80'
ptxas info : Function properties for kernel_gpu_writepar_
416 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 40 registers, 352 bytes cmem[0]
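If you need the number at runtime rather than at compile time, you can also query it through the CUDA runtime API. A minimal sketch using cudaFuncGetAttributes from the cudafor module (the kernel here is a placeholder for your own global subroutine):
module kernels
   use cudafor
contains
   ! placeholder kernel; substitute your own global subroutine
   attributes(global) subroutine addone(A)
      real, device :: A(*)
      A(threadIdx%x) = A(threadIdx%x) + 1.0
   end subroutine
end module kernels

program query_regs
   use cudafor
   use kernels
   type(cudaFuncAttributes) :: attr
   integer :: istat
   istat = cudaFuncGetAttributes(attr, addone)
   if (istat /= cudaSuccess) print *, cudaGetErrorString(istat)
   print *, 'registers per thread: ', attr%numRegs
end program query_regs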
Thank you, I have solved it.
Hello Mat, I'm sorry to bother you. I have an issue with the number of threads, the thread index, and how tasks are partitioned across threads. My code snippet is as follows:
Kernel launch:
call cycle_runoff<<<16,512>>>
Thread index code:
threadnum = blockDim%x*(blockIdx%x-1) + threadIdx%x
if (threadnum > 8712) return
if (threadnum <= 4356) then
   startnum = threadnum*2-1; endnum = threadnum*2
   if (threadnum == 4356) endnum = threadnum*2-1
   do ix = startnum, endnum
I am compiling with nvfortran. My current kernel launch configuration is 16 * 512, for a total of 8192 threads, but my computation has 8712 tasks. With my partitioning scheme, each thread is assigned two tasks, so the 8712 tasks should only need a bit more than 4000 threads; are the remaining roughly 4000 threads simply idle after the calculation is done? And if CUDA Fortran instead assigns one task per thread, then after the 8192 threads compute 8192 tasks, are the remaining 520 tasks never computed? I'm stuck on how to distribute these 8712 tasks. When my configuration is 16 * 512 (or any other configuration), what is the mapping between threads and tasks, and how should I assign the work? Can you help me?
If I understand correctly, you have fewer threads than iterations to compute. Typically this is handled by adding a loop to the kernel with a stride equal to the total number of threads (a grid-stride loop). Each thread keeps computing iterations until the maximum size is reached; some threads become inactive once they've exited the loop.
For example:
module foo
   use cudafor
   real, dimension(:), allocatable :: A
   real, dimension(:), allocatable, device :: Adev
contains
   attributes(global) subroutine addone(A,n)
      real, dimension(:) :: A
      integer, value :: n
      integer :: i, idx, stride
      ! global thread index (1-based) and total thread count
      idx = blockDim%x*(blockIdx%x-1) + threadIdx%x
      stride = gridDim%x * blockDim%x
      ! grid-stride loop: each thread handles i = idx, idx+stride, ...
      do i = idx, n, stride
         A(i) = A(i) + 1.0
      enddo
   end subroutine
end module foo

program main
   use foo
   integer :: N=8712
   integer :: summe
   allocate(A(N))
   allocate(Adev(N))
   A = 0.0
   Adev = A                        ! host -> device copy
   call addone<<<16,512>>>(Adev,N)
   A = Adev                        ! device -> host copy
   summe = sum(A)
   print *, summe
   deallocate(A)
   deallocate(Adev)
end program main
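Compiled with nvfortran and run, this should print 8712: even though only 8192 threads are launched, the stride loop lets the first 520 threads pick up a second element, so every element is incremented exactly once.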
Thank you, Mat. I have some other questions now; I've searched for a lot of information, but much of it is not very accurate. Can you tell me the scheduling principle of the scheduler, and how kernels distribute tasks in GPU parallel programming? Regardless of the number of thread blocks I set, are all the multiprocessors on the card used in the calculation, just with a different number of blocks assigned to each? When I run my program, increasing the number of thread blocks does not always give a significant reduction in computation time, and I don't know what is causing this.
Each multiprocessor (SM) can run a fixed maximum number of concurrent threads; on your A4000 (compute capability 8.6) this is 1536, while some other architectures allow 2048. How those threads are grouped into blocks often doesn't matter, which is why increasing the block size often doesn't help. There are cases where it can, but those are usually due to the algorithm and its use of shared memory.
What does matter is occupancy, i.e. how many of that maximum number of threads can actually be run concurrently. This depends on how much of the shared resources each thread uses (such as registers per thread) or each block uses (such as shared memory). The more resources each takes, the fewer concurrent threads can be run.
A secondary factor is the amount of work (i.e. the total number of threads) available. An A4000 has 48 SMs, so to achieve 100% theoretical occupancy you need a minimum of 73,728 (1536x48) threads. You only have 8192, which uses a fraction of the device. That said, reaching 100% doesn't guarantee the best performance, and 50% is often sufficient when each thread needs more resources.
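If you want to see how a given block size maps to resident blocks on an SM, the runtime can estimate it for you. A sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor, assuming the interface as documented in the CUDA Fortran guide (the kernel is a placeholder):
module kernels
   use cudafor
contains
   attributes(global) subroutine mykernel(A)   ! placeholder kernel
      real, device :: A(*)
      A(threadIdx%x) = A(threadIdx%x) + 1.0
   end subroutine
end module kernels

program occupancy
   use cudafor
   use kernels
   integer :: istat, numBlocks, blockSize
   integer(kind=cuda_count_kind) :: smem
   blockSize = 512
   smem = 0    ! no dynamic shared memory in this sketch
   ! max resident blocks per SM for this kernel at blockSize threads
   istat = cudaOccupancyMaxActiveBlocksPerMultiprocessor(numBlocks, mykernel, blockSize, smem)
   print *, 'resident blocks per SM:  ', numBlocks
   print *, 'resident threads per SM: ', numBlocks*blockSize
end program occupancy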
While I don’t know the specifics of your code or workload, if you increase the problem size by 10-20x, it will likely take about the same amount of time to compute.
You can get more details about occupancy here: Achieved Occupancy
For kernel performance, I'd suggest you run your code through Nsight-Compute to get a hardware-level profile. This should tell you where the bottlenecks are.
Mat, I have a question about the following code. Does idx, the computed thread index, start from 1? If it starts from 1, why not just write 1 directly in the do loop?
idx = blockDim%x*(blockIdx%x-1) + threadIdx%x
stride = gridDim%x * blockDim%x
do i = idx, n, stride
Only thread 1 starts at 1; thread 2 starts at 2, 3 at 3, 4 at 4, and so on.
If you use 1, then all threads will start at 1.
Unless that's what you want, i.e. all threads execute all iterations of the loop? In that case, remove the stride as well, otherwise every thread would skip over the same iterations.
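Concretely, with the earlier <<<16,512>>> launch (8192 threads, so stride = 8192) and n = 8712, the iterations map out as:
! thread    1: i = 1,   then 8193
! thread    2: i = 2,   then 8194
! ...
! thread  520: i = 520, then 8712
! thread  521: i = 521 only (8713 > n)
! ...
! thread 8192: i = 8192 only
So threads 1-520 each compute two iterations and threads 521-8192 each compute one, covering all 8712.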
Thank you, Mat. You have helped me solve many problems, and I really appreciate it. I'll ask again if I have more questions.
Hello Mat, I have two questions. The first: I am using the nvfortran compiler and I call print inside a kernel; sometimes the output appears and sometimes it doesn't. What is the reason for this? The second: according to my timing measurements, when the kernel is configured as 4 * 384, 8 * 384, or 16 * 384, there is a sudden change in time; the computation time drops significantly at one of these points, and the subsequent times are similar. What causes this?
Another question: can nvfortran binaries be analyzed with NVIDIA Nsight or other analysis tools under Linux? Can you recommend some resources?
Without more info, I’m not sure what’s happening in your case, but in general:
- Printing from device code is limited in that you can’t use formatted writes, only the basic “print *”.
- The print buffer is fixed size so you’re limited on how much can be printed from each kernel.
- Unless guarded, every thread will print and the output streams can get mixed.
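For the last point, you can guard the print so only one thread writes. A minimal sketch (hypothetical kernel; assumes it lives in a module that uses cudafor):
module debug_mod
   use cudafor
contains
   attributes(global) subroutine debugprint(A, n)
      real, device :: A(*)
      integer, value :: n
      ! only one thread in the entire grid prints, so output won't interleave
      if (blockIdx%x == 1 .and. threadIdx%x == 1) then
         print *, 'n = ', n, ' A(1) = ', A(1)
      endif
   end subroutine debugprint
end module debug_mod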
The second question: according to my timing measurements, when the kernel is configured as 4 * 384, 8 * 384, or 16 * 384, there is a sudden change in time; the computation time drops significantly, and the subsequent times are similar. What causes this?
I’m assuming these values are the launch configuration? i.e. a block size of 384 with 4, 8, or 16 blocks?
You are adding more parallelism by increasing the number of blocks, so it makes sense that the performance improves. With a small number of blocks, each block does more work; with more blocks, there's less work per block but more blocks can be run concurrently.
Another question: can nvfortran binaries be analyzed with NVIDIA Nsight or other analysis tools under Linux? Can you recommend some resources?
Binaries created by nvfortran can be profiled by Nsight-Systems or Nsight-Compute, both of which ship with the NVHPC SDK.
Mat, I would like to ask you a question. When I used NVIDIA Nsight Systems to analyze my code, my kernel launch configuration was written as <<<8,384>>>, but what I saw in the profiling tool was <<<17,1256>>>. My previous 8 * 256 configuration was displayed correctly in the tool. Do you know the reason, or possible causes?
That kernel is a memset operation inserted by the compiler for something like “Array_device = 1”; it's not a user-defined kernel.
It looks like there are two kernels hidden from the view, and yours is likely one of them. Click the plus sign (+) to unhide them.
Mat, I unhid them, but the three kernels displayed are <<<171,256>>>, <<<1,256>>>, and <<<1,256>>>, which don't match my kernel configuration of <<<8,512>>>. I also found that when the kernel is configured as <<<8,256>>> or <<<8,128>>>, it is displayed normally in the profiling tool; the problem occurs whenever the configuration exceeds <<<*,384>>>. Does this happen under the CUDA architecture when running in parallel? What's going on?
If the kernel is displayed at the 8x256 and 8x128 launch configurations but not at 8x512, it is most likely getting a launch error and not actually running.
Do you check for errors after launching the kernel, i.e. call “cudaGetLastError”?
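For example (a sketch; “mykernel”, “Adev”, and “n” are placeholders, and “use cudafor” plus an integer “istat” are assumed):
call mykernel<<<8,512>>>(Adev, n)
istat = cudaGetLastError()        ! reports launch/configuration errors
if (istat /= cudaSuccess) print *, 'launch: ', cudaGetErrorString(istat)
istat = cudaDeviceSynchronize()   ! reports errors raised during execution
if (istat /= cudaSuccess) print *, 'exec:   ', cudaGetErrorString(istat)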
At 8x512 you might be hitting a resource limit such as shared memory or registers. For example, if your kernel uses the maximum 255 registers per thread, only 256 threads can be run, given that the total registers per multiprocessor is 64K. At 384 threads, a maximum of 170 registers per thread can be used.
The back-end device assembler, ptxas, does the register allocation, so to see the registers per thread, add the flag “-gpu=ptxinfo”.
I called cudaGetLastError before and after launching the kernel; it showed no errors before the launch and “too many resources requested for launch” afterwards. However, my device is an NVIDIA RTX A4000 with 48 SMs, and the maximum number of threads per thread block is 1024. My kernel is configured as 8 * 512, which does not exceed the card's limits. What is the problem?
What is the problem?
Again, there are other resource limits besides the threads per block. Exceeding the shared memory (48K) or total registers (64K) per SM is the most likely cause of the “too many resources” error.
You can try limiting the register usage via the flag “-gpu=maxregcount:N” where “N” is either 64 or 128. However, this may be detrimental to performance as it will cause register spilling to local memory.
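For example:
% nvfortran -gpu=maxregcount:128 -gpu=ptxinfo test.cuf
Combining it with “-gpu=ptxinfo” lets you confirm the new per-thread register count in the ptxas output.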
For most codes there's little benefit to increasing the block size beyond 128, so the better option is to simply use smaller block sizes.