Working with tall and skinny matrices

Hi,

I have some tall and skinny matrices, e.g. 30Kx1K, 100Kx1K, or 100Kx512.
I want to partition such a matrix and assign each partition to one SM. Let's assume we have 80 SMs and the GPU is an RTX 3090.

Is the configuration below correct?

The matrix is 50Kx512 and we have 80 SMs.

(50*1024)/80 = 640
Also, I am dividing the columns into 16 panels, so each panel has 32 columns.

So I will have 80 partitions with 640 rows and 32 columns for the first panel.

int cols = 32; 
int rows = 640; 

dim3 grid(cols, 1, 1);
dim3 threads(rows, 1, 1);

Are there limits on the rows or columns when I am defining the grid and threads for each SM?

Can I define more rows per partition, e.g. more than 1024?
Can I define any better configuration?
What limitations do I have when the number of rows of the original matrix is larger, e.g. 200Kx1K?

Well, the number of SMs is somewhat irrelevant; it only dictates the number of blocks that can execute concurrently. If the kernel uses more blocks than there are available resources, a second wave of blocks will be executed once the first wave begins to free resources. It's better to size kernels to the algorithm rather than the hardware, and it will also make the code more portable.

Are there limits on the rows or columns when I am defining the grid and threads for each SM?

If you run the utility ‘nvaccelinfo’, it will show you the block and thread size limits. For example:

Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
...
Max Threads Per SMP:           2048
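
If you would rather query these limits from within your program, the cudafor module's cudaGetDeviceProperties routine returns the same information. Here's a minimal sketch (assuming device 0):

program limits
   use cudafor
   implicit none
   type(cudaDeviceProp) :: prop
   integer :: istat
   ! Query the properties of device 0 and print the launch limits
   istat = cudaGetDeviceProperties(prop, 0)
   print *, 'Number of SMs:         ', prop%multiProcessorCount
   print *, 'Max threads per block: ', prop%maxThreadsPerBlock
   print *, 'Max block dimensions:  ', prop%maxThreadsDim
   print *, 'Max grid dimensions:   ', prop%maxGridSize
   print *, 'Max threads per SM:    ', prop%maxThreadsPerMultiProcessor
end program limits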

A block can contain up to 1024 threads (the product of the three dimensions), with a maximum of 2048 concurrent threads per SM. Each SM can execute up to 32 concurrent blocks, so the minimum useful block size is 64 threads, though 128 is often a good choice (or 32x4 using 2 dimensions). The actual number also depends on the number of registers used per thread and on shared memory usage. A max of 64K registers is available per SM, so a max of 32 registers per thread is needed to achieve 100% theoretical occupancy. Using more registers per thread will reduce the number of threads that can run concurrently.
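
As a worked example with those numbers (65536 registers per SM, 2048 thread slots per SM):

registers/thread    max concurrent threads/SM    theoretical occupancy
      32            65536/32  = 2048                    100%
      64            65536/64  = 1024                     50%
     128            65536/128 =  512                     25%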

While that post is about OpenACC rather than CUDA Fortran, I go into more depth on the topic of occupancy in my response here: OpenACC: Fine tuning accelerator performance

Can I define more rows per partition, e.g. more than 1024?

You can’t define more than 1024 threads per block, but you could write a kernel that processes multiple rows per thread, as in the first example below.

Can I define any better configuration?

I would suggest basing the grid and block size on the input data size and your algorithm.

For example, here’s a simple schedule which sets the number of blocks to the size of the Y dimension of the matrix and then has 128 threads process the X dimension, with each thread processing multiple elements. Note that the step of the do loop is equal to the block size.

% cat test1.CUF
#define NX 50000
#define NY 512
#define XBLK 128
#ifdef DEBUG
#define CheckLaunchError(s) if(cudaDeviceSynchronize().ge.0)then;\
associate(i => cudaGetLastError());\
if(i.ne.0)then;\
write(*,"('Error ',a,', File: ',a,', Line: ',i0)")s,__FILE__,__LINE__;\
write(*,"(4x,a)")cudaGetErrorString(i);\
stop;endif;\
end associate;endif
#else
#define CheckLaunchError(s)
#endif

module kernels

    use cudafor

    contains

    attributes(global) subroutine addone(adev,nx,ny)

        real(8), dimension(:,:) :: adev
        integer, value :: nx, ny
        integer :: i,j

        ! One block per column; the threads of the block stride
        ! across the rows in steps of the block size.
        j = blockIdx%x
        if (j.le.ny) then
            do i=threadIdx%x,nx,blockDim%x
              adev(i,j) = adev(i,j)+1.0_8
            enddo
        endif

    end subroutine addone

end module kernels


program main
   use cudafor
   use kernels
   real(8), allocatable, dimension(:,:) :: ahost
   real(8), allocatable, dimension(:,:),device :: adev
   real(8) :: sm
   type(dim3) :: grid, block

   allocate(ahost(NX,NY))
   allocate(adev(NX,NY))
   ahost=1.0_8
   sm = sum(ahost)
   print *, 'Initial Sum: ', sm
   adev=ahost

   grid = dim3(NY,1,1)
   block = dim3(XBLK,1,1)
   call addone<<<grid,block>>>(adev,NX,NY)
   CheckLaunchError('addone')
   ahost=adev
   sm = sum(ahost)
   print *, 'Final Sum:   ', sm

end program main

% nvfortran test1.CUF ; a.out
 Initial Sum:     25600000.00000000
 Final Sum:       51200000.00000000
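
As an aside, the CheckLaunchError macro is only active when the DEBUG symbol is defined at compile time, e.g.:

% nvfortran -DDEBUG test1.CUF ; a.out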

Or here's a version where each thread processes a single element by using a multi-dimensional kernel schedule.

% cat test2.CUF
#define NX 50000
#define NY 512
#define XBLK 32
#define YBLK 4
#ifdef DEBUG
#define CheckLaunchError(s) if(cudaDeviceSynchronize().ge.0)then;\
associate(i => cudaGetLastError());\
if(i.ne.0)then;\
write(*,"('Error ',a,', File: ',a,', Line: ',i0)")s,__FILE__,__LINE__;\
write(*,"(4x,a)")cudaGetErrorString(i);\
stop;endif;\
end associate;endif
#else
#define CheckLaunchError(s)
#endif

module kernels

    use cudafor

    contains

    attributes(global) subroutine addone(adev,nx,ny)

        real(8), dimension(:,:) :: adev
        integer, value :: nx, ny
        integer :: i,j

        ! One thread per element: map the 2-D grid of 2-D blocks
        ! directly onto the matrix and guard against overrun.
        i = (blockIdx%x-1)*blockDim%x + threadIdx%x
        j = (blockIdx%y-1)*blockDim%y + threadIdx%y
        if (i.le.nx.and.j.le.ny) then
            adev(i,j) = adev(i,j)+1.0_8
        endif

    end subroutine addone

end module kernels

program main
   use cudafor
   use kernels
   real(8), allocatable, dimension(:,:) :: ahost
   real(8), allocatable, dimension(:,:),device :: adev
   real(8) :: sm
   type(dim3) :: grid, block

   allocate(ahost(NX,NY))
   allocate(adev(NX,NY))
   ahost=1.0_8
   sm = sum(ahost)
   print *, 'Initial Sum: ', sm
   adev=ahost

   grid = dim3((NX+XBLK-1)/XBLK,(NY+YBLK-1)/YBLK,1)
   block = dim3(XBLK,YBLK,1)
   call addone<<<grid,block>>>(adev,NX,NY)
   CheckLaunchError('addone')
   ahost=adev
   sm = sum(ahost)
   print *, 'Final Sum:   ', sm

end program main

% nvfortran test2.CUF ; a.out
 Initial Sum:     25600000.00000000
 Final Sum:       51200000.00000000

Which is better (or even a different one) will depend on your algorithm.

What limitations do I have when the number of rows of the original matrix is larger, e.g. 200Kx1K?

You’ll probably run out of memory long before running out of threads. Even in the very simple second example, the grid can have up to 2147483647 x 65535 blocks, each covering a 32x4 tile of elements, which is far larger than any matrix that would fit in memory. If you do need to go bigger, you’d switch the kernel to process multiple elements per thread.
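
For example, here's a sketch of the second example's kernel rewritten with grid-stride loops in both dimensions (the name addone_big is just for illustration); the same fixed-size launch then covers a matrix of any size:

    attributes(global) subroutine addone_big(adev,nx,ny)

        real(8), dimension(:,:) :: adev
        integer, value :: nx, ny
        integer :: i,j

        ! Stride by the total thread count in each dimension so the
        ! kernel covers the matrix no matter how large it is.
        do j=(blockIdx%y-1)*blockDim%y+threadIdx%y,ny,gridDim%y*blockDim%y
            do i=(blockIdx%x-1)*blockDim%x+threadIdx%x,nx,gridDim%x*blockDim%x
                adev(i,j) = adev(i,j)+1.0_8
            enddo
        enddo

    end subroutine addone_big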

-Mat