Array + Array (1D or 2D): Why is the performance of my code TERRIBLE?

Hi All,

Fair warning: my CUDA knowledge is pretty entry-level. I can get things working, and have in the past had a fair amount of success getting some very respectable speedups when porting C/C++ code to CUDA.

At the moment, I’m working on a project that involves processing a video stream. I have a proof of concept running on some test data, but it is slow on the CPU (single-threaded). I’ve therefore started trying to parallelise it using CUDA.

The issue is that it relies heavily on adding large numbers of image frames together, i.e.:
2D_SUM[width][height] = 2D_Im1[width][height] + 2D_Im2[width][height]

So far, all my attempts to get this running at an acceptable speed have been fruitless.

I am hoping someone could take a look at a minimal working sample I wrote and let me know why the performance I’m getting is so bad. It does the sums as both 2D and equivalent 1D arrays, running on both CPU and GPU.

I’m running it on a Jetson Xavier NX (384 CUDA cores).

Times I’m getting on this are:

  • CPU 1D Arrays = 554ms
  • CPU 2D Arrays = 669ms
  • GPU 1D Arrays = 1278ms
  • GPU 2D Arrays = 1622ms to 4175ms (not sure if timer is reliable)

i.e. GPU is distinctly worse.

Code below should be copy/paste/compile/run-able.

#include <iostream>
#include <chrono>
#include <algorithm>   // for std::fill

using std::chrono::time_point;
using std::chrono::duration_cast;
using std::chrono::high_resolution_clock;
using std::chrono::milliseconds;

class timer {
	public:
		timer(void) : start(high_resolution_clock::now()) {}
		template<class To>
		auto time_past(void) {
			auto now = high_resolution_clock::now();
			auto d = duration_cast<To>(now - start);
			return d;
		}
	private:
		time_point<high_resolution_clock> start;
};


__global__ void Cu_1D_ADD(short * CUMULATIVE_MTX, short * NEW_MTX, ushort FrameHeight, ushort FrameWidth) {

	unsigned int Column = blockIdx.x * blockDim.x + threadIdx.x;
	if (Column >= FrameWidth) return;

	int Start = FrameHeight*Column;

	for (int Row = 0; Row < FrameHeight; ++Row) {
		CUMULATIVE_MTX[Start + Row] += NEW_MTX[Start + Row];
	}
}

__global__ void Cu_2D_ADD(short ** CUMULATIVE_MTX, short ** NEW_MTX, ushort FrameHeight, ushort FrameWidth) {

	unsigned int Column = blockIdx.x * blockDim.x + threadIdx.x;
	if (Column >= FrameWidth) return;

	for (int Row = 0; Row < FrameHeight; ++Row) {
		CUMULATIVE_MTX[Column][Row] += NEW_MTX[Column][Row];
	}
}


int main () {

	ushort FrameWidth = 6400;
	ushort FrameHeight = 4800;

	int BlockSize = 384;
	int NumBlocks = (FrameWidth + BlockSize - 1) / BlockSize;   // integer ceiling division

	std::cout << "NumBlocks  = " << NumBlocks << "\n\n";

	long Check;

// Define 1D Arrays
	short* CUMULATIVE_MTX_1D;
	cudaMallocManaged(&CUMULATIVE_MTX_1D, FrameWidth*FrameHeight*sizeof(short));

	short* NEW_MTX_1D;
	cudaMallocManaged(&NEW_MTX_1D, FrameWidth*FrameHeight*sizeof(short));



// Define 2D Arrays
	short** CUMULATIVE_MTX_2D;
	cudaMallocManaged(&CUMULATIVE_MTX_2D, FrameWidth*sizeof(short *));
	cudaMallocManaged(&CUMULATIVE_MTX_2D[0], FrameWidth*FrameHeight*sizeof(short));
	for (int col = 1; col < FrameWidth; ++ col) {
		CUMULATIVE_MTX_2D[col] = CUMULATIVE_MTX_2D[col-1] + FrameHeight;
	}

	short** NEW_MTX_2D;
	cudaMallocManaged(&NEW_MTX_2D, FrameWidth*sizeof(short *));
	cudaMallocManaged(&NEW_MTX_2D[0], FrameWidth*FrameHeight*sizeof(short));
	for (int col = 1; col < FrameWidth; ++ col) {
		NEW_MTX_2D[col] = NEW_MTX_2D[col-1] + FrameHeight;
	}



// Test 1D Runtime on CPU
	std::fill(CUMULATIVE_MTX_1D, CUMULATIVE_MTX_1D + FrameWidth*FrameHeight, 5);
	std::fill(NEW_MTX_1D, NEW_MTX_1D + FrameWidth*FrameHeight, 2);
	timer t_CPU1;
	for (int Repeat = 0; Repeat < 10; ++Repeat) {
		for (int Px = 0; Px < FrameWidth*FrameHeight; ++Px) {
			CUMULATIVE_MTX_1D[Px] += NEW_MTX_1D[Px];
		}
	}
	std::cout << "CPU 1D Time = " << t_CPU1.time_past<milliseconds>().count() << " ms\n";
	Check = 0;
	for (int Px = 0; Px < FrameWidth*FrameHeight; ++Px) {
		if (CUMULATIVE_MTX_1D[Px] != 25) ++Check;
	}
	std::cout << "CPU 1D [0]  = " << CUMULATIVE_MTX_1D[0] << " First Element\n";
	std::cout << "CPU 1D Sum  = " << Check << " Values Incorrect\n\n";


// Test 2D Runtime on CPU
	std::fill(CUMULATIVE_MTX_2D[0], CUMULATIVE_MTX_2D[0] + FrameWidth*FrameHeight, 5);
	std::fill(NEW_MTX_2D[0], NEW_MTX_2D[0] + FrameWidth*FrameHeight, 2);
	timer t_CPU2;
	for (int Repeat = 0; Repeat < 10; ++Repeat) {
		for (int Column = 0; Column < FrameWidth; ++Column) {
			for (int Row = 0; Row < FrameHeight; ++Row) {
				CUMULATIVE_MTX_2D[Column][Row] += NEW_MTX_2D[Column][Row];
			}
		}
	}
	std::cout << "CPU Time 2D = " << t_CPU2.time_past<milliseconds>().count() << " ms\n";
	Check = 0;
	for (int Column = 0; Column < FrameWidth; ++Column) {
		for (int Row = 0; Row < FrameHeight; ++Row) {
			if (CUMULATIVE_MTX_2D[Column][Row] != 25) ++Check;
		}
	}
	std::cout << "CPU 2D [0]  = " << CUMULATIVE_MTX_2D[0][0] << " First Element\n";
	std::cout << "CPU 2D Sum  = " << Check << " Values Incorrect\n\n";



// Test 1D Runtime on GPU
	std::fill(CUMULATIVE_MTX_1D, CUMULATIVE_MTX_1D + FrameWidth*FrameHeight, 5);
	std::fill(NEW_MTX_1D, NEW_MTX_1D + FrameWidth*FrameHeight, 2);
	timer t_GPU1;
	for (int Repeat = 0; Repeat < 10; ++Repeat) {
		Cu_1D_ADD<<<NumBlocks,BlockSize>>>(CUMULATIVE_MTX_1D, NEW_MTX_1D, FrameHeight, FrameWidth);
		cudaDeviceSynchronize();
	}
	std::cout << "GPU Time 1D = " << t_GPU1.time_past<milliseconds>().count() << " ms\n";
	Check = 0;
	for (int Px = 0; Px < FrameWidth*FrameHeight; ++Px) {
		if (CUMULATIVE_MTX_1D[Px] != 25) ++Check;
	}
	std::cout << "GPU 1D [0]  = " << CUMULATIVE_MTX_1D[0] << " First Element\n";
	std::cout << "GPU 1D Sum  = " << Check << " Values Incorrect\n\n";



// Test 2D Runtime on GPU
	std::fill(CUMULATIVE_MTX_2D[0], CUMULATIVE_MTX_2D[0] + FrameWidth*FrameHeight, 5);
	std::fill(NEW_MTX_2D[0], NEW_MTX_2D[0] + FrameWidth*FrameHeight, 2);
	timer t_GPU2;
	for (int Repeat = 0; Repeat < 10; ++Repeat) {
		Cu_2D_ADD<<<NumBlocks,BlockSize>>>(CUMULATIVE_MTX_2D, NEW_MTX_2D, FrameHeight, FrameWidth);
		cudaDeviceSynchronize();
	}
	std::cout << "GPU Time 2D = " << t_CPU2.time_past<milliseconds>().count() << " ms\n";
	Check = 0;
	for (int Column = 0; Column < FrameWidth; ++Column) {
		for (int Row = 0; Row < FrameHeight; ++Row) {
			if (CUMULATIVE_MTX_2D[Column][Row] != 25) ++Check;
		}
	}
	std::cout << "GPU 2D [0]  = " << CUMULATIVE_MTX_2D[0][0] << " First Element\n";
	std::cout << "GPU 2D Sum  = " << Check << " Values Incorrect\n\n";


	// Release the managed allocations (inner buffers before pointer arrays)
	cudaFree(CUMULATIVE_MTX_1D);
	cudaFree(NEW_MTX_1D);
	cudaFree(CUMULATIVE_MTX_2D[0]);
	cudaFree(CUMULATIVE_MTX_2D);
	cudaFree(NEW_MTX_2D[0]);
	cudaFree(NEW_MTX_2D);

	return 0;

}

Thank you for any assistance, I’m at the point where I don’t really know where to start looking for an answer.

This is an integrated device where CPU and GPU use one and the same physical memory, correct?

Given that the addition of images is an activity whose performance is bound by memory bandwidth and not by computational throughput, I would not expect any performance advantage from doing this processing on the GPU instead of the CPU on this platform.
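As a rough back-of-envelope estimate (my numbers, assuming the Xavier NX’s quoted ~51 GB/s of shared LPDDR4x bandwidth): each pass reads two 6400×4800 buffers of short and writes one back, about 3 × 61 MB ≈ 184 MB of traffic, so ten passes move roughly 1.8 GB and cannot complete much faster than ~36 ms no matter which processor executes the loop.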

I have not looked at the code to see why the GPU version is in fact noticeably slower. There may be additional memory transfers in the GPU version, or memory access patterns that are suboptimal for the GPU, or something else. You should be able to pinpoint the bottleneck(s) with the help of the Nsight profiler. Have you tried it?
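For example, assuming Nsight Systems is installed and on the PATH (your_app stands in for your binary):

nsys profile --stats=true ./your_app

This prints per-kernel and per-memory-operation timing summaries after the run.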

For best performance you need coalesced memory accesses. Simply put, adjacent threads should access adjacent memory locations. This is not the case in your code.

The following 1D kernel uses coalesced accesses; it improves the runtime on an RTX 4090 from 59ms to 37ms.

__global__ void Cu_1D_ADD(short * CUMULATIVE_MTX, short * NEW_MTX, ushort FrameHeight, ushort FrameWidth) {

	unsigned int Column = blockIdx.x * blockDim.x + threadIdx.x;
	if (Column < FrameWidth) {
		// Row-major indexing: adjacent threads (adjacent Column values)
		// now read and write adjacent shorts, so the accesses coalesce.
		for (int Row = 0; Row < FrameHeight; ++Row) {
			CUMULATIVE_MTX[Row * FrameWidth + Column] += NEW_MTX[Row * FrameWidth + Column];
		}
	}
}

Side note: Since the benchmark code uses managed memory, the runtimes on my machine include data transfer times.
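If those transfers matter to you, one option is to prefetch the managed buffers to the GPU before the timed loop. A minimal sketch, assuming the platform supports managed-memory prefetching and the GPU is device 0:

	// Migrate the managed allocations to the GPU (device 0) up front,
	// so the timed loop does not pay for on-demand page migration.
	size_t Bytes = FrameWidth * FrameHeight * sizeof(short);
	cudaMemPrefetchAsync(CUMULATIVE_MTX_1D, Bytes, 0);
	cudaMemPrefetchAsync(NEW_MTX_1D, Bytes, 0);
	cudaDeviceSynchronize();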

(Don’t use 2D pointer-of-pointer approaches on the GPU unless you really have to.)
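If you flatten everything to a single buffer, the most idiomatic form is a grid-stride elementwise kernel. A sketch (Cu_Flat_ADD and N are names I made up, not from the code above):

__global__ void Cu_Flat_ADD(short * CUMULATIVE_MTX, short * NEW_MTX, unsigned int N) {
	// Grid-stride loop: adjacent threads always touch adjacent shorts
	// (fully coalesced), and the launch configuration no longer has to
	// match the frame dimensions exactly.
	for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += gridDim.x * blockDim.x) {
		CUMULATIVE_MTX[i] += NEW_MTX[i];
	}
}

It would be launched over the whole frame, e.g. Cu_Flat_ADD<<<NumBlocks,BlockSize>>>(CUMULATIVE_MTX_1D, NEW_MTX_1D, FrameWidth*FrameHeight);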

Thanks for the response @njuffa, you are correct that the CPU and GPU should be using the same physical memory. I’ve been using nvprof; on your suggestion I looked into nsys, but it doesn’t appear to be installed on my Jetson (it’s an Axiomtek ruggedised box with their JetPack build on it).

Thank you @striker159, that does seem to be the issue. I had assumed that I would want to keep the memory accessed by each kernel thread as contiguous as possible, which turns out to be a very wrong assumption.

On the Jetson Xavier NX with its unified memory, the 10x loop of your 1D_ADD runs in ~80ms, much, much faster than the 1100-1200ms of my original.

I’ll try to implement this change in my main code, and will look into the performance impact of the double pointer. That will help me determine how aggressively to push for 1D vectors.

Thanks again!

@striker159, I can confirm that I was able to easily modify my main code (rather than the super simple representative example) in the same way that you suggested (effectively just swapping the roles of the Columns and Rows in the calculation) and was able to decrease the calculation time from around 30 seconds down to under 2 seconds.
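In case it helps anyone with the pointer-of-pointer layout, the same idea applied to the 2D kernel from my sample would look something like this (a sketch only; the launch then needs enough threads to cover FrameHeight rather than FrameWidth):

__global__ void Cu_2D_ADD(short ** CUMULATIVE_MTX, short ** NEW_MTX, ushort FrameHeight, ushort FrameWidth) {

	// One thread per Row instead of per Column: adjacent threads now hit
	// adjacent elements of a column's contiguous storage, so accesses coalesce.
	unsigned int Row = blockIdx.x * blockDim.x + threadIdx.x;
	if (Row < FrameHeight) {
		for (int Column = 0; Column < FrameWidth; ++Column) {
			CUMULATIVE_MTX[Column][Row] += NEW_MTX[Column][Row];
		}
	}
}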

I am going to investigate further speed improvements by looking into OpenCV’s CUDA modules where I can, but thank you again for your very simple and effective solution!
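For reference, the kind of thing I have in mind is along these lines (a sketch, assuming OpenCV has been built with its CUDA modules; add_frames is just a name I made up):

#include <opencv2/core.hpp>
#include <opencv2/cudaarithm.hpp>

// Sum two frames on the GPU. Both GpuMats are 16-bit single-channel
// (CV_16SC1) and already resident on the device.
void add_frames(const cv::cuda::GpuMat& im1, const cv::cuda::GpuMat& im2, cv::cuda::GpuMat& sum) {
	cv::cuda::add(im1, im2, sum);
}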
