Optimization suggestions for reading from main memory into registers and shared memory

I’m working on optimizing a CUDA kernel, and I would appreciate any advice or suggestions on how to improve its performance. Below is the current implementation of the kernel:

template<int N>
__device__
void bl3_sh_device(double *dA, int lda)
{
    const int tx  = threadIdx.x;
    const int gtx = blockIdx.x * blockDim.x + tx;

    double rA[N];
    double reg;
    __shared__ double s_y[N];

    // Load A into registers
#pragma unroll
    for (int i = 0; i < N; i++) {
        rA[i] = dA[gtx + i * lda];
    }

#pragma unroll
    for (int i = 0; i < N; i++) {
        if (gtx == i) {
#pragma unroll
            for (int j = i; j < N; j++) {
                s_y[j] = rA[j];
            }
        }
        __syncthreads();

        reg = 1 / s_y[i];

        if (gtx > i) {
            rA[i] *= reg;
#pragma unroll
            for (int j = i + 1; j < N; j++) {
                rA[j] -= rA[i] * s_y[j];
            }
        }
        __syncthreads();
    }

    // Store back to global memory
#pragma unroll
    for (int i = 0; i < N; i++) {
        dA[gtx + i * lda] = rA[i];
    }
}

My main questions are:

  1. How can I optimize the parallel reading from global memory to registers to improve memory throughput?
  2. Are there any specific strategies to better utilize shared memory in this context?
  3. How can I further optimize the use of registers?
  4. Are there any potential pitfalls or inefficiencies in the way synchronization is currently being handled?
  5. Would it be beneficial to restructure the kernel in a way that reduces the number of synchronization points?
  6. Is there any other kind of memory that could be put to use for this kernel?

Thank you in advance for your insights and recommendations!

Have you profiled the code yet? What are the bottlenecks?

What is the value of N?

Can you describe your code? What does it do at a higher level?

I do not have root access for profiling, only Nsight Systems, which does not show enough information.
N is the number of columns in the matrix, and at a higher level this code performs LU decomposition.

Could you share Nsight Compute data?

Also, with what parameters do you call the kernel (ideally as a fully working minimal viable example), and what is the value of N?

Which GPU are you running this on?

Have you confirmed that no local memory is used, and that reads from and writes to global memory are optimal?

Are you asking about other kinds of memory for their size (so you can increase the working set), or for possible performance gains?

You have a lot of __syncthreads() in your code. It could make sense to increase the number of blocks per SM as much as possible to reduce the potential for blocking. The disadvantage would be that the amount of shared memory per block is reduced, but as you hold the data mostly in registers, that could be fine.
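For example (only a sketch, and the numbers are illustrative, not tuned; this only pays off once you launch more than one block), the __launch_bounds__ qualifier lets you ask the compiler to keep register usage low enough that several blocks can be resident per SM:

template<int N>
__global__ void
__launch_bounds__(128, 8)   // illustrative: at most 128 threads/block, aim for 8 resident blocks/SM;
                            // the first value must be >= the block size you actually launch with
bl3_sh_kernel_bounded(double *dA, int lda)   // hypothetical wrapper name, not existing code
{
    // Same body as your current kernel; only the launch-bounds hint is new.
    bl3_sh_device<N>(dA, lda);
}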

gtx seems to go beyond the block size, yet it is compared with i. Does your algorithm still work correctly, given that shared memory and __syncthreads() are not shared between blocks?

That looks inefficient: loading shared memory from a single thread. At a minimum, you could use an entire warp to do so, adding a shuffle op to broadcast values from the thread that holds the pivot column to the other threads. This could cut your loop iterations by a factor of 32. A more involved suggestion could possibly be provided with a more complete example (e.g. actual values or ranges of N, along with actual values or ranges of blockIdx.x).
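Very roughly, the warp-level version of that shared-memory fill could look something like this (just a sketch, untested; it assumes a single block so gtx == tx, blockDim.x == N, and N a multiple of 32, and it would replace the "if (gtx == i)" block inside your loop):

const int lane       = tx & 31;   // lane index within the warp
const int warp_id    = tx >> 5;   // warp index within the block
const int pivot_warp = i >> 5;    // warp that contains the pivot thread (tx == i)
const int pivot_lane = i & 31;    // lane of the pivot thread inside that warp

if (warp_id == pivot_warp) {
#pragma unroll
    for (int base = 0; base < N; base += 32) {
        double v = 0.0;
        // Broadcast each pivot-row element from the pivot lane; the lane whose
        // index matches that element keeps it, so the store below becomes one
        // 32-wide shared-memory write instead of up to 32 serial writes.
#pragma unroll
        for (int k = 0; k < 32; k++) {
            double t = __shfl_sync(0xffffffffu, rA[base + k], pivot_lane);
            if (lane == k) v = t;
        }
        if (base + lane >= i)         // only columns j >= i are needed
            s_y[base + lane] = v;
    }
}
__syncthreads();                      // still required before other warps read s_y

Both loops need to stay fully unrolled so that the indexing into rA remains compile-time constant; otherwise the array gets pushed out to local memory.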

And if N is reasonably large, say bigger than 64, it's quite possible that your supposed use of "registers" is not doing anything like what you think, nor anything useful at all, but is instead simply thrashing global and local memory for no benefit. Or the compiler may have completely circumvented that code. It is difficult to say without an actual test case to inspect.
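A back-of-the-envelope check (my arithmetic, not from your post): a double occupies two 32-bit registers, so for N = 128

    rA[128]  ->  128 doubles x 2 registers  =  256 registers per thread

just for the rA array, which already exceeds the hardware limit of 255 registers per thread. At that point spilling to local memory is essentially guaranteed, regardless of what the source code calls a "register".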

@Robert_Crovella @striker159 @Curefab

Thanks for your response. Here is a driver to launch the code, which will help you better understand how it works.

In the current code, the size is 8x8, which you can change. I need to execute it for matrices with sizes larger than 32, such as 128 or even larger, but fast execution for size 32 is also appreciated.

Unfortunately, I do not have access to run Nsight Compute, and I am running this on a V100.

#include <cuda_runtime.h>
#include <iostream>
#include <vector>

inline int ceildiv(int a, int b) {
	return (a + b - 1) / b;
}



template<int N>
__device__
void scalger_device(int m, double *dA, int lda, int *info)
{
	const int tx = threadIdx.x;
	const int gtx = blockIdx.x * blockDim.x + tx;

	double rA[N];
	double reg;

	__shared__ double s_y[N];

	// Load A into registers

#pragma unroll
	for (int i = 0; i < N; i++) {
		rA[i] = dA[gtx + i * lda];
	}

#pragma unroll
	for (int i = 0; i < N; i++) {
		if (gtx == i) {

#pragma unroll
			for (int j = i; j < N; j++) {
				s_y[j] = rA[j];
			}
		}
		__syncthreads();

		reg = 1/s_y[i];

		if (gtx > i) {
			rA[i] *= reg;
#pragma unroll
			for (int j = i + 1; j < N; j++) {
				rA[j] -= rA[i] * s_y[j];
			}
		}
		__syncthreads();
	}

	// Store to global memory
#pragma unroll
	for (int i = 0; i < N; i++) {
		dA[gtx + i * lda] = rA[i];

	}

}


template<int N>
__global__
void scalger_kernel_native( int m,
		double *dA, int lda,
		int *info)
{
	scalger_device<N>(m, dA, lda, info);
}


int scalger_native(
		int m, int n,
		double *dA, int lda,
		int *info,
		cudaStream_t stream)
{
	if (n == 0) return 0;

	const int tbx = m;
	dim3 threads(tbx, 1, 1);
	dim3 grid(ceildiv(m, tbx), 1, 1);

	cudaError_t err;
	switch(n) {
		case 8:
			scalger_kernel_native<8><<<grid, threads, 0, stream>>>(m, dA, lda, info);
			break;
		case 16:
			scalger_kernel_native<16><<<grid, threads, 0, stream>>>(m, dA, lda, info);
			break;
		case 32:
			scalger_kernel_native<32><<<grid, threads, 0, stream>>>(m, dA, lda, info);
			break;
		case 64:
			scalger_kernel_native<64><<<grid, threads, 0, stream>>>(m, dA, lda, info);
			break;
		case 128:
			scalger_kernel_native<128><<<grid, threads, 0, stream>>>(m, dA, lda, info);
			break;
		case 256:
			scalger_kernel_native<256><<<grid, threads, 0, stream>>>(m, dA, lda, info);
			break;
		case 512:
			scalger_kernel_native<512><<<grid, threads, 0, stream>>>(m, dA, lda, info);
			break;
		case 1024:
			scalger_kernel_native<1024><<<grid, threads, 0, stream>>>(m, dA, lda, info);
			break;

	}

	err = cudaGetLastError();
	if (err != cudaSuccess) {
		*info = err;
		return -1;
	}

	return 0;
}

int main() {

	//Size of the Matrix for test 
	int m = 8;
	int n = 8;
	int lda = m;
	int info;

	//Set dbg=1 to see the Matrix before and after of computation 
	int dbg=1;
	// Create an 8x8 matrix and initialize it
	std::vector<double> hA(m * n, 0.0);
	if (dbg)std::cout << "Matrix before computation:" << std::endl;
	for (int i = 0; i < m; ++i) {
		for (int j = 0; j < n; ++j) {
			hA[i + j * lda] = i + j+10;
			if (i==j){
				hA[i + j * lda]+=200;
			}
			if (dbg) std::cout << hA[i + j * lda] << " ";
		}
		if (dbg) std::cout << std::endl;
	}

	double *dA;

	// Allocate device memory
	cudaMalloc((void**)&dA, m * n * sizeof(double));

	// Copy matrix to device
	cudaMemcpy(dA, hA.data(), m * n * sizeof(double), cudaMemcpyHostToDevice);

	// Create a CUDA stream
	cudaStream_t stream;
	cudaStreamCreate(&stream);

	// Run the function
	scalger_native(m, n, dA, lda, &info, stream);

	// Synchronize and destroy the stream
	cudaStreamSynchronize(stream);
	cudaStreamDestroy(stream);

	// Copy result back to host
	cudaMemcpy(hA.data(), dA, m * n * sizeof(double), cudaMemcpyDeviceToHost);

	// Free device memory
	cudaFree(dA);

	if(dbg){
		std::cout << "Matrix after computation:" << std::endl;
		for (int i = 0; i < m; ++i) {
			for (int j = 0; j < n; ++j) {
				std::cout << hA[i + j * lda] << " ";
			}
			std::cout << std::endl;
		}
	}

	return 0;
}

scalger.zip (1.4 KB)

Your kernel runs on an L4 GPU in less than 5 microseconds. (And you are launching 1 block of 8 threads…) An optimization activity seems unusual in that light. What do you hope to gain? The launch overhead for a kernel is on the order of 5 microseconds. That means that if you reduced this kernel's runtime to zero, you could hope for at most a 2x speedup in application performance.

Of course you will not achieve a 2x speedup. No matter what you do to this kernel, you are unlikely to witness anything interesting for the 8x8 case.

I notice your test harness seems to cover up to 1024x1024. Are you still planning to launch a kernel of a single block in that case (which is the way it appears)?

You’re not going to get interesting performance out of a GPU with that strategy.

Speaking for me, personally, there is no way I could offer advice on optimizing a microscopically small kernel. I would have no way of demonstrating that anything I suggested would be meaningful. And I think that there might be indications that your algorithm is suspect for deployment on a GPU. But I don’t really know what you are setting out to accomplish.

As an aside, you might wish to compile the code you have shown with -Xptxas=-v. Unless you include a switch like -maxrregcount=64, there is no way you will ever successfully launch that kernel in the 1024 or 512 case, according to my testing on CUDA 12.2, cc 8.9.
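For example, something along these lines (assuming the source file is scalger.cu and targeting your V100, i.e. sm_70):

    nvcc -arch=sm_70 -Xptxas=-v -c scalger.cu

The ptxas output reports registers, shared memory, and any local-memory spill bytes per kernel instantiation. Adding -maxrregcount=64, as mentioned above, caps the register count at the cost of possible spills to local memory.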

Thank you, Robert. The 8x8 example was simply to demonstrate the output, but the minimum size required is 32x32.

Even a 10% or 20% improvement would be perfect for me, since I am doing this computation many times.

Why is it unable to run in the case of 512?

The question is whether you have a single matrix to calculate at a time, or something like a list of a million matrices of that size.

I have many matrices to calculate, but not a million.

You should probably run the test I indicated. If the report is more than 128 registers for that kernel, the GPU SM will not have enough space to launch 512 threads. This topic is covered in many forum posts; a bit of searching, perhaps on the error string “too many resources requested for launch”, will give you additional reading.
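The arithmetic, roughly (using the 65,536-registers-per-block limit of your V100, cc 7.0):

    512  threads x 128 registers/thread = 65,536   -> exactly at the limit
    512  threads x 129 registers/thread = 66,048   -> "too many resources requested for launch"
    1024 threads x  64 registers/thread = 65,536   -> exactly at the limit

So a 512-thread block can only launch if the kernel uses at most 128 registers per thread, and a 1024-thread block needs 64 or fewer, hence the -maxrregcount=64 suggestion.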