Optimizing memory coalescing doesn't make my program faster

I am experimenting with memory coalescing in a program that applies a blur to an image. I found that the version of the program with better coalescing is actually slower.

I am testing the horizontal blur. The blur filter is symmetric, so to reduce the number of multiplications when computing an output value, it is more efficient to access, one after the other, the two elements that are multiplied by the same filter weight (the first and the last, and so on towards the middle of the filter).

I removed the filter and the multiplications to have fewer variables to deal with, but kept the access pattern the same.

If the image is in planar format (first all R values, then all G, then all B), memory coalescing is much better than if it is in pixel order (RGB, RGB, RGB, …), because the values read at the same time by consecutive threads are consecutive in memory.
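To make the difference concrete, this is roughly how one sample at column l, row j, channel z is addressed in each layout (illustrative helpers only, not part of the program; the pitches are in elements of unsigned short, as in the code below):

// Illustrative only: addressing one sample in each layout.
__device__ unsigned short load_planar(const unsigned short *img, size_t pitch,
                                      int height, int j, int l, int z) {
	// planar: channel z is a contiguous plane, so consecutive l means consecutive addresses
	return img[(z * height + j) * pitch + l];
}

__device__ unsigned short load_interleaved(const unsigned short *img, size_t pitch,
                                           int j, int l, int z) {
	// interleaved: samples of one channel are 3 elements apart, so consecutive l means a stride of 3
	return img[j * pitch + l * 3 + z];
}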

And indeed, nvprof confirms that the global memory load and store efficiencies are better.

Yet, the kernel that blurs in planar format is actually slower instead of faster.

Each value in the image is a 2-byte integer.

The code that uses planar format is:

#include <stdio.h>
#include <time.h>

__global__
void kernel1(unsigned short *__restrict__ result, const unsigned short *__restrict__ img, int width, int height, size_t result_pitch, size_t img_pitch, int n) {
	int i, j, z, k, l, c, m;
	z = blockIdx.x;
	i = blockIdx.y * blockDim.y + threadIdx.y;
	j = blockIdx.z * blockDim.z + threadIdx.z;
	if(i < width && j < height) {
		c = 0;
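		// add the k-th sample from the left and its mirror from the right: in the real (symmetric) filter both share the same weight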
		for(k = 0; k < n >> 1; k++) {
			l = i + k - n / 2;
			m = 0;
			if(0 <= l && l < width) {
				m = img[(z * height + j) * img_pitch + l];
			}
			l = i + n - 1 - k - n / 2;
			if(0 <= l && l < width) {
				m += img[(z * height + j) * img_pitch + l];
			}
			c += m;
		}
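		// after the loop k == n >> 1; for odd n this picks the center sample (l == i)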
		l = i + k - n / 2;
		if(0 <= l && l < width) {
			c += img[(z * height + j) * img_pitch + l];
		}
		result[(z * height + j) * result_pitch + i] = c / n;
	}
}

void blur(int width, int height) {
	int i;
	size_t img1_pitch, img2_pitch;
	unsigned short *__restrict__ img1, *__restrict__ img2;
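	// grid: blockIdx.x selects one of the 3 channels; each block's 32x32 threads cover a tile of columns (y) and rows (z)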
	dim3 blocks(3, (width + 31) / 32, (height + 31) / 32);
	dim3 threadsPerBlock(1, 32, 32);

	cudaMallocPitch((void **)&img1, &img1_pitch, sizeof(unsigned short) * width, height * 3);
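	// cudaMallocPitch returns the pitch in bytes; convert it to elements of unsigned short for indexing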
	img1_pitch /= sizeof(unsigned short);
	cudaMallocPitch((void **)&img2, &img2_pitch, sizeof(unsigned short) * width, height * 3);
	img2_pitch /= sizeof(unsigned short);
	for(i = 0; i < 1000; i++) {
		kernel1 <<<blocks, threadsPerBlock>>> (img2, img1, width, height, img2_pitch, img1_pitch, 17);
	}
	cudaFree(img1);
	cudaFree(img2);
	cudaDeviceSynchronize();
}

int main(void) {
	clock_t begin, end;
	begin = clock();
	blur(4096, 4096);
	end = clock();
	printf("Time: %f", (double)(end - begin) / CLOCKS_PER_SEC);
	return 0;
}

The code that uses pixel order is:

#include <stdio.h>
#include <time.h>

__global__
void kernel2(unsigned short *__restrict__ result, const unsigned short *__restrict__ img, int width, int height, size_t result_pitch, size_t img_pitch, int n) {
	int i, j, z, k, l, c, m;
	z = blockIdx.x;
	i = blockIdx.y * blockDim.y + threadIdx.y;
	j = blockIdx.z * blockDim.z + threadIdx.z;
	if(i < width && j < height) {
		c = 0;
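		// same pairing as the planar kernel, but samples of channel z are interleaved, 3 elements apart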
		for(k = 0; k < n >> 1; k++) {
			m = 0;
			l = i + k - n / 2;
			if(0 <= l && l < width) {
				m = img[(j * img_pitch + l * 3) + z];
			}
			l = i + n - 1 - k - n / 2;
			if(0 <= l && l < width) {
				m += img[(j * img_pitch + l * 3) + z];
			}
			c += m;
		}
		l = i + k - n / 2;
		if(0 <= l && l < width) {
			c += img[(j * img_pitch + l * 3) + z];
		}
		result[(j * result_pitch + i * 3) + z] = c / n;
	}
}

void blur(int width, int height) {
	int i;
	size_t img1_pitch, img2_pitch;
	unsigned short *__restrict__ img1, *__restrict__ img2;
	dim3 blocks(3, (width + 31) / 32, (height + 31) / 32);
	dim3 threadsPerBlock(1, 32, 32);

	cudaMallocPitch((void **)&img1, &img1_pitch, sizeof(unsigned short) * width * 3, height);
	img1_pitch /= sizeof(unsigned short);
	cudaMallocPitch((void **)&img2, &img2_pitch, sizeof(unsigned short) * width * 3, height);
	img2_pitch /= sizeof(unsigned short);
	for(i = 0; i < 1000; i++) {
		kernel2 <<<blocks, threadsPerBlock>>> (img2, img1, width, height, img2_pitch, img1_pitch, 17);
	}
	cudaFree(img1);
	cudaFree(img2);
	cudaDeviceSynchronize();
}

int main(void) {
	clock_t begin, end;
	begin = clock();
	blur(4096, 4096);
	end = clock();
	printf("Time: %f", (double)(end - begin) / CLOCKS_PER_SEC);
	return 0;
}

The kernel is run many times so that the measured time is long enough to be meaningful.

Does anyone have a clue as to why the version of the code with worse memory coalescing actually runs faster?

  • Your second code won’t compile. The kernel invocation does not match the declaration/prototype.
  • Please provide the GPU you are running on.
  • Please provide the actual command line you are using to compile, CUDA version, and operating system
  • Please provide the actual measurements/output.

When I fix the first item above and run your code with 100 iterations instead of 1000, I get 0.38 reported by the first program and 0.45 reported by the second. When I profile on a V100, I observe that the first (planar) kernel runs in about 1.0 ms and the second (pixel-order) kernel in about 1.7 ms. So all of that lines up with my expectations.

I can’t explain your observation. I suspect incorrect timing methodology and/or incorrect compilation settings (e.g. compiling a debug project with -G, although in my case, CUDA 11.2 on V100, this doesn’t affect the relative performance; the planar code is still faster).
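As an aside, a more reliable way to time just the kernel work is with CUDA events, roughly like this (a minimal sketch that would replace the clock() measurement around the launch loop):

cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (int i = 0; i < 1000; i++) {
	kernel1<<<blocks, threadsPerBlock>>>(img2, img1, width, height, img2_pitch, img1_pitch, 17);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);   // wait until all launched kernels have finished
cudaEventElapsedTime(&ms, start, stop);
printf("Kernel time: %f ms\n", ms);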


Thanks, Robert.

I apologize for the mistake; I will correct it now by editing the post.

I am not compiling with -G. It’s interesting that you get different results.

I will comment again when I am able to perform more tests.

Thanks again!

I tried the two programs with CUDA 11.1 on Windows64 with a Quadro RTX 4000 (sm_75):

blur1.exe (planar): Time: 2.661000
blur2.exe (pixels): Time: 2.978000

Compiler was invoked like so: nvcc -o blur{1|2}.exe -arch=sm_75 blur{1|2}.cu.


More data from CUDA 11.1, Windows64, Quadro P2000 (sm_61):

blur1.exe (planar): Time: 8.842000
blur2.exe (pixels): Time: 9.364000


Thanks all.
It turns out I was mistaken in removing the filter values: they seem to be the cause of the performance reversal (evidently there was some error on my part when I originally tested that they weren’t).

So, this is the full script I am using. It applies a Gaussian blur.

The only line to change to switch the kernel is the one immediately after the comment.

#include <stdio.h>
#include <time.h>

void pascal(int *p, int n) {
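	// Fills p with binomial coefficients from row n-1 of Pascal's triangle (here
	// C(16,0) .. C(16,8)), used as integer Gaussian weights; the full row sums to
	// 2^(n-1), which is why the kernels normalize with >> (n - 1).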
	n--;
	p[0] = 1;
	for(int k = 0; k < (n >> 1); k++) {
		p[k + 1] = p[k] * (n - k) / (k + 1);
	}
}

__global__
void kernel1(unsigned short *__restrict__ result, const unsigned short *__restrict__ img, int width, int height, size_t result_pitch, size_t img_pitch, int n, const int *__restrict__ filter) {
	int i, j, z, k, l, c, m;
	z = blockIdx.x;
	i = blockIdx.y * blockDim.y + threadIdx.y;
	j = blockIdx.z * blockDim.z + threadIdx.z;
	if(i < width && j < height) {
		c = 0;
		for(k = 0; k < n >> 1; k++) {
			l = i + k - n / 2;
			m = 0;
			if(0 <= l && l < width) {
				m = img[(z * height + j) * img_pitch + l];
			}
			l = i + n - 1 - k - n / 2;
			if(0 <= l && l < width) {
				m += img[(z * height + j) * img_pitch + l];
			}
			c += filter[k] * m;
		}
		l = i + k - n / 2;
		if(0 <= l && l < width) {
			c += filter[k] * img[(z * height + j) * img_pitch + l];
		}
		result[(z * height + j) * result_pitch + i] = c >> (n - 1);
	}
}

__global__
void kernel2(unsigned short *__restrict__ result, const unsigned short *__restrict__ img, int width, int height, size_t result_pitch, size_t img_pitch, int n, const int *__restrict__ filter) {
	int i, j, z, k, l, c, m;
	z = blockIdx.x;
	i = blockIdx.y * blockDim.y + threadIdx.y;
	j = blockIdx.z * blockDim.z + threadIdx.z;
	if(i < width && j < height) {
		c = 0;
		for(k = 0; k < n >> 1; k++) {
			m = 0;
			l = i + k - n / 2;
			if(0 <= l && l < width) {
				m = img[(j * img_pitch + l * 3) + z];
			}
			l = i + n - 1 - k - n / 2;
			if(0 <= l && l < width) {
				m += img[(j * img_pitch + l * 3) + z];
			}
			c += filter[k] * m;
		}
		l = i + k - n / 2;
		if(0 <= l && l < width) {
			c += filter[k] * img[(j * img_pitch + l * 3) + z];
		}
		result[(j * result_pitch + i * 3) + z] = c >> (n - 1);
	}
}

void blur(int width, int height) {
	int i, *__restrict__ filter, *__restrict__ filter_d;
	size_t img1_pitch, img2_pitch;
	unsigned short *__restrict__ img1, *__restrict__ img2;
	dim3 blocks(3, (width + 31) / 32, (height + 31) / 32);
	dim3 threadsPerBlock(1, 32, 32);

	filter = (int *)malloc(sizeof(int) * 9);
	pascal(filter, 17);
	cudaMalloc((void **)&filter_d, sizeof(int) * 9);
	cudaMemcpy(filter_d, filter, sizeof(int) * 9, cudaMemcpyHostToDevice);

	cudaMallocPitch((void **)&img1, &img1_pitch, sizeof(unsigned short) * width, height * 3);
	img1_pitch /= sizeof(unsigned short);
	cudaMallocPitch((void **)&img2, &img2_pitch, sizeof(unsigned short) * width, height * 3);
	img2_pitch /= sizeof(unsigned short);
	for(i = 0; i < 1000; i++) {
		// CALL kernel1 OR kernel2
		kernel1<<<blocks, threadsPerBlock>>>(img2, img1, width, height, img2_pitch, img1_pitch, 17, filter_d);
	}
	cudaFree(img1);
	cudaFree(img2);
	cudaFree(filter_d);
	free(filter);
	cudaDeviceSynchronize();
}

int main(void) {
	clock_t begin, end;
	begin = clock();
	blur(4096, 4096);
	end = clock();
	printf("Time: %f", (double)(end - begin) / CLOCKS_PER_SEC);
	return 0;
}

So, here is the information:

  • Operating system: “CentOS Linux 7 (Core)” (running in Docker)
  • GPU: Tesla P100-PCIE-12GB
  • CUDA version: 10.1

To build I am using CMake, with the following file:

CMAKE_MINIMUM_REQUIRED (VERSION 3.16 FATAL_ERROR)

PROJECT ("myscript" LANGUAGES CXX CUDA)

INCLUDE(FindCUDA)

CUDA_SELECT_NVCC_ARCH_FLAGS(CUDA_ARCH_FLAGS)

ADD_EXECUTABLE ("myscript" "myscript.cu" )

MESSAGE($<$<COMPILE_LANGUAGE:CUDA>:${CUDA_ARCH_FLAGS}>)

TARGET_COMPILE_OPTIONS("myscript" PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:${CUDA_ARCH_FLAGS}>)
TARGET_INCLUDE_DIRECTORIES("myscript" PRIVATE ${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES})

For kernel1 the result is 4.07, and for kernel2 it is 3.70.

When I run your code on V100, I get an output of 1.26 with kernel1 and 1.57 with kernel2. I’m not using CMake. I’m using the following command line, CUDA 11.2:

nvcc -o test test.cu


On my platform (CUDA 11.1, Windows64, Quadro RTX 4000) I see the following performance (fastest time of ten runs per kernel):

kernel1: Time 3.037000
kernel2: Time 3.310000

When I use the Quadro P2000, the execution times of the kernels are pretty much identical. I used the profiler to double check I am not accidentally running the wrong kernel. Best of ten runs:

kernel1: Time 10.066000
kernel2: Time 10.056000

I would suggest trying the latest CUDA version if possible to eliminate the possible influence of second-order effects from the compiler’s ordering of memory operations. If performance differences remain, try drilling down with the CUDA profiler, focusing on memory-related metrics.
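For example, something along these lines should collect the relevant metrics with nvprof (adjust the executable name as appropriate):

nvprof --metrics gld_efficiency,gst_efficiency,gld_transactions,gst_transactions ./myscript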


Thanks again.

I found that the results are different when compiling with nvcc directly: both kernels take about 4.20 seconds.

I would suggest trying the latest CUDA version if possible

Unfortunately I cannot install any software on that machine.

try drilling down with the CUDA profiler, focusing on memory-related metrics.

I checked with nvprof and indeed the planar version of the program coalesces memory accesses better, as expected. It uses fewer memory transactions and has better global memory load and store efficiencies.

CMake performs the compilation and linking steps separately and internally runs the following two commands:

nvcc   -I/usr/local/cuda/targets/x86_64-linux/include  -gencode arch=compute_60,code=sm_60 -std=c++03 -x cu -c myscript.cu -o myscript.o
g++  myscript.o -lcudadevrt -lcudart_static  -L"/usr/local/cuda/targets/x86_64-linux/lib/stubs" -L"/usr/local/cuda/targets/x86_64-linux/lib" -lcudadevrt -lcudart_static -lrt -lpthread -ldl -o myscript

For some reason, this produces faster code for both kernel versions than compiling with nvcc myscript.cu -o myscript does, yet with this build the second kernel appears faster than the first.

Is there anything which could explain this difference?

when compiling your code with nvcc, try this:

nvcc -gencode arch=compute_60,code=sm_60 myscript.cu -o myscript

It’s possible that that could make a difference.


Thanks.
I tried it as you said and, indeed, that is what leads to the faster performance. That command gives the same result as compiling with CMake (or with the two separate commands).

This still does not explain how kernel2 can be faster than kernel1 when, by the coalescing argument, it should be slower.

As an additional detail, here are the profiling results, which are independent of how I compile the programs. I modified the loop so that each kernel runs only once.

Kernel1 has a gld_efficiency of 59.60% and a gst_efficiency of 100.00%.
Kernel2 has a gld_efficiency of 28.11% and a gst_efficiency of 33.33%.

This confirms the theoretical expectations.

It’s evidently a code generation issue that is fairly specific to your GPU, your CUDA version, your GPU driver version, and your code. Changing the GPU, CUDA version, or driver version seems to have an effect, based on the data already accumulated in this thread.

The usual advice in such scenarios is to update your CUDA and driver versions to the latest available for your GPU. Since you can’t do that, the only other avenue is to begin the process of CUDA binary analysis, i.e. looking at the output of the compiler rather than its inputs. That might answer your question; however, it’s somewhat tedious. Any results observed there would likely lead back to the same suggestion: update to the latest.

There are plenty of questions on various forums that discuss how to do binary analysis, and the principal tools needed are documented here.

Good luck!

I noticed the bulk data processed here is unsigned short, a 16-bit type.

I have not looked into this recently, but it used to be that processing data in chunks smaller than 32 bits led to suboptimal performance, both because the load/store unit wasn’t optimized for narrow accesses and because additional widening instructions were needed: abstract C++ semantics require that integer types narrower than int be widened to int before entering into an expression.

For some simple image processing tasks with 8-bit channels, I would in the past recommend loading and storing as packed RGBA (thus 32 bits) and then using four-lane SIMD vector intrinsics for packed processing, even though most of these intrinsics are emulated on GPUs newer than the Kepler architecture.
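For the 16-bit data in this thread, the two-lane analogue would look roughly like this (a sketch only, with made-up names; __vadd2 adds the two 16-bit halves of a packed 32-bit word independently):

// Sketch: add corresponding samples of two buffers whose unsigned short values
// have been packed two per 32-bit word (n_pairs = number of 32-bit words).
__global__ void add_packed_u16(unsigned int *__restrict__ dst,
                               const unsigned int *__restrict__ a,
                               const unsigned int *__restrict__ b,
                               int n_pairs) {
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < n_pairs) {
		// one 32-bit load per operand instead of two 16-bit loads,
		// and one instruction for two independent 16-bit additions
		dst[idx] = __vadd2(a[idx], b[idx]);
	}
}

Whether this helps here would have to be measured; it mainly avoids widening conversions and halves the number of load instructions.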

You might want to conduct a roofline analysis to establish whether the performance of these two kernels is completely bound by memory bandwidth, or in fact bound by both memory bandwidth and instruction throughput.