Array + Array (1D or 2D): Why is the performance of my code TERRIBLE?

Hi All,

Fair warning: my CUDA knowledge is pretty entry-level. I can get things working, and have in the past had a fair amount of success getting some very respectable speedups when porting C/C++ code to CUDA.

At the moment, I’m working on a project that involves processing a video stream. I have a proof of concept running on some test data, but it is slow on the CPU (single-threaded). I’ve therefore started trying to parallelise it using CUDA.

The issue is that it relies heavily on adding large numbers of image frames together, i.e.:
2D_SUM[width][height] = 2D_Im1[width][height] + 2D_Im2[width][height]

So far, all my attempts to get this running at an acceptable speed have been fruitless.

I am hoping someone could take a look at a minimal working sample I wrote and let me know why the performance I’m getting is so bad. It does the sums as both 2D and equivalent 1D arrays, running on both CPU and GPU.

I’m running it on a Jetson Xavier NX (384 CUDA cores).

Times I’m getting on this are:

  • CPU 1D Arrays = 554ms
  • CPU 2D Arrays = 669ms
  • GPU 1D Arrays = 1278ms
  • GPU 2D Arrays = 1622ms to 4175ms (not sure if timer is reliable)

i.e. GPU is distinctly worse.

Code below should be copy/paste/compile/run-able.

#include <iostream>
#include <chrono>
#include <algorithm>   // for std::fill

using std::chrono::time_point;
using std::chrono::duration_cast;
using std::chrono::high_resolution_clock;
using std::chrono::milliseconds;

class timer {
	public:
		timer(void) : start(high_resolution_clock::now()) {}
		template<class To>
		auto time_past(void) {
			auto now = high_resolution_clock::now();
			auto d = duration_cast<To>(now - start);
			return d;
		}
	private:
		time_point<high_resolution_clock> start;
};


__global__ void Cu_1D_ADD(short * CUMULATIVE_MTX, short * NEW_MTX, ushort FrameHeight, ushort FrameWidth) {

	unsigned int Column = blockIdx.x * blockDim.x + threadIdx.x;
	if (Column >= FrameWidth) return;

	int Start = FrameHeight*Column;

	for (int Row = 0; Row < FrameHeight; ++Row) {
		CUMULATIVE_MTX[Start + Row] += NEW_MTX[Start + Row];
	}
}

__global__ void Cu_2D_ADD(short ** CUMULATIVE_MTX, short ** NEW_MTX, ushort FrameHeight, ushort FrameWidth) {

	unsigned int Column = blockIdx.x * blockDim.x + threadIdx.x;
	if (Column >= FrameWidth) return;

	for (int Row = 0; Row < FrameHeight; ++Row) {
		CUMULATIVE_MTX[Column][Row] += NEW_MTX[Column][Row];
	}
}


int main () {

	ushort FrameWidth = 6400;
	ushort FrameHeight = 4800;

	int BlockSize = 384;
	int NumBlocks = (FrameWidth + BlockSize - 1) / BlockSize;   // integer ceiling division

	std::cout << "NumBlocks  = " << NumBlocks << "\n\n";

	long Check;

// Define 1D Arrays
	short* CUMULATIVE_MTX_1D;
	cudaMallocManaged(&CUMULATIVE_MTX_1D, FrameWidth*FrameHeight*sizeof(short));

	short* NEW_MTX_1D;
	cudaMallocManaged(&NEW_MTX_1D, FrameWidth*FrameHeight*sizeof(short));



// Define 2D Arrays
	short** CUMULATIVE_MTX_2D;
	cudaMallocManaged(&CUMULATIVE_MTX_2D, FrameWidth*sizeof(short *));
	cudaMallocManaged(&CUMULATIVE_MTX_2D[0], FrameWidth*FrameHeight*sizeof(short));
	for (int col = 1; col < FrameWidth; ++ col) {
		CUMULATIVE_MTX_2D[col] = CUMULATIVE_MTX_2D[col-1] + FrameHeight;
	}

	short** NEW_MTX_2D;
	cudaMallocManaged(&NEW_MTX_2D, FrameWidth*sizeof(short *));
	cudaMallocManaged(&NEW_MTX_2D[0], FrameWidth*FrameHeight*sizeof(short));
	for (int col = 1; col < FrameWidth; ++ col) {
		NEW_MTX_2D[col] = NEW_MTX_2D[col-1] + FrameHeight;
	}



// Test 1D Runtime on CPU
	std::fill(CUMULATIVE_MTX_1D, CUMULATIVE_MTX_1D + FrameWidth*FrameHeight, 5);
	std::fill(NEW_MTX_1D, NEW_MTX_1D + FrameWidth*FrameHeight, 2);
	timer t_CPU1;
	for (int Repeat = 0; Repeat < 10; ++Repeat) {
		for (int Px = 0; Px < FrameWidth*FrameHeight; ++Px) {
			CUMULATIVE_MTX_1D[Px] += NEW_MTX_1D[Px];
		}
	}
	std::cout << "CPU 1D Time = " << t_CPU1.time_past<milliseconds>().count() << " ms\n";
	Check = 0;
	for (int Px = 0; Px < FrameWidth*FrameHeight; ++Px) {
		if (CUMULATIVE_MTX_1D[Px] != 25) ++Check;
	}
	std::cout << "CPU 1D [0]  = " << CUMULATIVE_MTX_1D[0] << " First Element\n";
	std::cout << "CPU 1D Sum  = " << Check << " Values Incorrect\n\n";


// Test 2D Runtime on CPU
	std::fill(CUMULATIVE_MTX_2D[0], CUMULATIVE_MTX_2D[0] + FrameWidth*FrameHeight, 5);
	std::fill(NEW_MTX_2D[0], NEW_MTX_2D[0] + FrameWidth*FrameHeight, 2);
	timer t_CPU2;
	for (int Repeat = 0; Repeat < 10; ++Repeat) {
		for (int Column = 0; Column < FrameWidth; ++Column) {
			for (int Row = 0; Row < FrameHeight; ++Row) {
				CUMULATIVE_MTX_2D[Column][Row] += NEW_MTX_2D[Column][Row];
			}
		}
	}
	std::cout << "CPU Time 2D = " << t_CPU2.time_past<milliseconds>().count() << " ms\n";
	Check = 0;
	for (int Column = 0; Column < FrameWidth; ++Column) {
		for (int Row = 0; Row < FrameHeight; ++Row) {
			if (CUMULATIVE_MTX_2D[Column][Row] != 25) ++Check;
		}
	}
	std::cout << "CPU 2D [0]  = " << CUMULATIVE_MTX_2D[0][0] << " First Element\n";
	std::cout << "CPU 2D Sum  = " << Check << " Values Incorrect\n\n";



// Test 1D Runtime on GPU
	std::fill(CUMULATIVE_MTX_1D, CUMULATIVE_MTX_1D + FrameWidth*FrameHeight, 5);
	std::fill(NEW_MTX_1D, NEW_MTX_1D + FrameWidth*FrameHeight, 2);
	timer t_GPU1;
	for (int Repeat = 0; Repeat < 10; ++Repeat) {
		Cu_1D_ADD<<<NumBlocks,BlockSize>>>(CUMULATIVE_MTX_1D, NEW_MTX_1D, FrameHeight, FrameWidth);
		cudaDeviceSynchronize();
	}
	std::cout << "GPU Time 1D = " << t_GPU1.time_past<milliseconds>().count() << " ms\n";
	Check = 0;
	for (int Px = 0; Px < FrameWidth*FrameHeight; ++Px) {
		if (CUMULATIVE_MTX_1D[Px] != 25) ++Check;
	}
	std::cout << "GPU 1D [0]  = " << CUMULATIVE_MTX_1D[0] << " First Element\n";
	std::cout << "GPU 1D Sum  = " << Check << " Values Incorrect\n\n";



// Test 2D Runtime on GPU
	std::fill(CUMULATIVE_MTX_2D[0], CUMULATIVE_MTX_2D[0] + FrameWidth*FrameHeight, 5);
	std::fill(NEW_MTX_2D[0], NEW_MTX_2D[0] + FrameWidth*FrameHeight, 2);
	timer t_GPU2;
	for (int Repeat = 0; Repeat < 10; ++Repeat) {
		Cu_2D_ADD<<<NumBlocks,BlockSize>>>(CUMULATIVE_MTX_2D, NEW_MTX_2D, FrameHeight, FrameWidth);
		cudaDeviceSynchronize();
	}
	std::cout << "GPU Time 2D = " << t_CPU2.time_past<milliseconds>().count() << " ms\n";
	Check = 0;
	for (int Column = 0; Column < FrameWidth; ++Column) {
		for (int Row = 0; Row < FrameHeight; ++Row) {
			if (CUMULATIVE_MTX_2D[Column][Row] != 25) ++Check;
		}
	}
	std::cout << "GPU 2D [0]  = " << CUMULATIVE_MTX_2D[0][0] << " First Element\n";
	std::cout << "GPU 2D Sum  = " << Check << " Values Incorrect\n\n";


	// Release the managed allocations (inner buffers before pointer arrays)
	cudaFree(CUMULATIVE_MTX_1D);
	cudaFree(NEW_MTX_1D);
	cudaFree(CUMULATIVE_MTX_2D[0]);
	cudaFree(CUMULATIVE_MTX_2D);
	cudaFree(NEW_MTX_2D[0]);
	cudaFree(NEW_MTX_2D);

	return 0;

}

Thank you for any assistance, I’m at the point where I don’t really know where to start looking for an answer.

This is an integrated device where CPU and GPU use one and the same physical memory, correct?

Given that the addition of images is an activity whose performance is bound by memory bandwidth and not by computational throughput, I would not expect any performance advantage from doing this processing on the GPU instead of the CPU on this platform.
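As a rough back-of-envelope estimate (my numbers, assuming the Xavier NX’s quoted ~51 GB/s of shared LPDDR4x bandwidth): each pass reads two 6400×4800 buffers of short and writes one back, about 3 × 61 MB ≈ 184 MB of traffic, so ten passes move roughly 1.8 GB and cannot complete much faster than ~36 ms no matter which processor executes the loop.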

I have not looked at the code to see why the GPU version is in fact noticeably slower. There may be additional memory transfers in the GPU version, or memory access patterns that are suboptimal for the GPU, or something else. You should be able to pinpoint the bottleneck(s) with the help of the Nsight profiler. Have you tried it?
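For example, assuming Nsight Systems is installed and on the PATH (your_app stands in for your binary):

nsys profile --stats=true ./your_app

This prints per-kernel and per-memory-operation timing summaries after the run.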

For best performance you need coalesced memory accesses. Simply put, adjacent threads should access adjacent memory locations. This is not the case in your code.

The following 1D kernel uses coalesced accesses; it improves the runtime on an RTX 4090 from 59ms to 37ms.

__global__ void Cu_1D_ADD(short * CUMULATIVE_MTX, short * NEW_MTX, ushort FrameHeight, ushort FrameWidth) {

	unsigned int Column = blockIdx.x * blockDim.x + threadIdx.x;
	if (Column < FrameWidth) {
		// Row-major indexing: adjacent threads (adjacent Column values)
		// now read and write adjacent shorts, so the accesses coalesce.
		for (int Row = 0; Row < FrameHeight; ++Row) {
			CUMULATIVE_MTX[Row * FrameWidth + Column] += NEW_MTX[Row * FrameWidth + Column];
		}
	}
}

Side note: Since the benchmark code uses managed memory, the runtimes on my machine include data transfer times.
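If those transfers matter to you, one option is to prefetch the managed buffers to the GPU before the timed loop. A minimal sketch, assuming the platform supports managed-memory prefetching and the GPU is device 0:

	// Migrate the managed allocations to the GPU (device 0) up front,
	// so the timed loop does not pay for on-demand page migration.
	size_t Bytes = FrameWidth * FrameHeight * sizeof(short);
	cudaMemPrefetchAsync(CUMULATIVE_MTX_1D, Bytes, 0);
	cudaMemPrefetchAsync(NEW_MTX_1D, Bytes, 0);
	cudaDeviceSynchronize();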

(Don’t use 2D pointer-of-pointer approaches on the GPU unless you really have to.)
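If you flatten everything to a single buffer, the most idiomatic form is a grid-stride elementwise kernel. A sketch (Cu_Flat_ADD and N are names I made up, not from the code above):

__global__ void Cu_Flat_ADD(short * CUMULATIVE_MTX, short * NEW_MTX, unsigned int N) {
	// Grid-stride loop: adjacent threads always touch adjacent shorts
	// (fully coalesced), and the launch configuration no longer has to
	// match the frame dimensions exactly.
	for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += gridDim.x * blockDim.x) {
		CUMULATIVE_MTX[i] += NEW_MTX[i];
	}
}

It would be launched over the whole frame, e.g. Cu_Flat_ADD<<<NumBlocks,BlockSize>>>(CUMULATIVE_MTX_1D, NEW_MTX_1D, FrameWidth*FrameHeight);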

Thanks for the response @njuffa, you are correct that the CPU and GPU should be using the same physical memory. I’ve been using nvprof; on your suggestion I looked into nsys, but it doesn’t appear to be installed on my Jetson (it’s an Axiomtek ruggedised box with their JetPack build on it).

Thank you @striker159, that does seem to be the issue. I had assumed that I would want to keep the memory accessed by each kernel thread as contiguous as possible, which turns out to be a very wrong assumption.

On the Jetson Xavier NX with its unified memory, the 10x loop of your 1D_ADD runs in ~80ms, much, much faster than the 1100-1200ms of my original.

I’ll try to implement this change in my main code, and will look into the performance impact of the double pointer. That will help me determine how aggressively to push for 1D vectors.

Thanks again!

@striker159, I can confirm that I was able to easily modify my main code (rather than the super simple representative example) in the same way that you suggested (effectively just swapping the roles of the Columns and Rows in the calculation) and was able to decrease the calculation time from around 30 seconds down to under 2 seconds.
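In case it helps anyone with the pointer-of-pointer layout, the same idea applied to the 2D kernel from my sample would look something like this (a sketch only; the launch then needs enough threads to cover FrameHeight rather than FrameWidth):

__global__ void Cu_2D_ADD(short ** CUMULATIVE_MTX, short ** NEW_MTX, ushort FrameHeight, ushort FrameWidth) {

	// One thread per Row instead of per Column: adjacent threads now hit
	// adjacent elements of a column's contiguous storage, so accesses coalesce.
	unsigned int Row = blockIdx.x * blockDim.x + threadIdx.x;
	if (Row < FrameHeight) {
		for (int Column = 0; Column < FrameWidth; ++Column) {
			CUMULATIVE_MTX[Column][Row] += NEW_MTX[Column][Row];
		}
	}
}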

I am going to investigate further speed improvements by looking into OpenCV’s CUDA modules where I can, but thank you again for your very simple and effective solution!
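For reference, the kind of thing I have in mind is along these lines (a sketch, assuming OpenCV has been built with its CUDA modules; add_frames is just a name I made up):

#include <opencv2/core.hpp>
#include <opencv2/cudaarithm.hpp>

// Sum two frames on the GPU. Both GpuMats are 16-bit single-channel
// (CV_16SC1) and already resident on the device.
void add_frames(const cv::cuda::GpuMat& im1, const cv::cuda::GpuMat& im2, cv::cuda::GpuMat& sum) {
	cv::cuda::add(im1, im2, sum);
}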
