How to achieve high utilization?

Hi all,
I’m working on an image-processing kernel for neural networks called im2col. It copies image data into shared memory and then writes it back to global memory in a different layout for matrix multiplication. When I profile the kernel with nvvp, it reports that utilization is low: both compute and memory utilization are only about 25%.

Shared memory:
shared loads: 17.5 GB/s
shared stores: 2.0 GB/s

L2 cache:
reads: 2.01 GB/s
writes: 20.651 GB/s

Unified cache:
local loads/stores: 0 B/s
global loads: 1.97 GB/s
global stores: 20.6 GB/s
texture reads: 1.971 GB/s

I compared it with an open-source implementation that does not use shared memory (I think it accesses global memory redundantly, which is why I added shared memory to reduce global memory accesses).

However, the open-source implementation achieves 85% instruction utilization and 95% memory utilization, even though its memory bandwidth is lower:
Shared memory: 0 B/s, since it uses no shared memory.
L2 cache: 3.6 GB/s read and 3.2 GB/s write, which is lower than mine.
Unified cache:
global loads: 3.6 GB/s
global stores: 3.2 GB/s
texture reads: 3.1 GB/s

My launch configuration is <<<512, 512>>>, and the open-source implementation also uses <<<512, 512>>>.

how to achieve high utilization?

Here’s my code:

__global__
void im2col_gpu_kernel_inference_3x3(const float *im, const int height, const int width,
        const int ksize, const int pad, int stride, int height_col, int width_col,
        int dataPerChannel, float *data_col) {

    // Each block handles one image row; each thread stages one pixel of that
    // row in shared memory.
    extern __shared__ float pixel_line[];
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int blockInnerIndex = __mul24(threadIdx.y, blockDim.x) + threadIdx.x;
    pixel_line[threadIdx.x] = im[index];
    __syncthreads();

    int h_Index = blockIdx.x;
    int block_mat_row = min(blockIdx.x + ksize - 1, ksize - 1);
    int block_mat_col = h_Index - block_mat_row;

    // Scatter the cached row into up to 3 rows of the im2col output matrix.
    for (int i = 0;
            i < 3 && block_mat_row >= 0 && block_mat_col < height_col;
            block_mat_row--, block_mat_col++, i++) {
        float *im_col_ptr = data_col
                + __mul24(__mul24(block_mat_row, ksize), dataPerChannel)
                + __mul24(block_mat_col, width_col);
#pragma unroll
        for (int j = 0; j < 3; j++) {
            float *im_col_ptr_current = im_col_ptr + __mul24(j, dataPerChannel);
            if (threadIdx.x < width_col)
                im_col_ptr_current[threadIdx.x] = pixel_line[threadIdx.x + j];
        }
    }
}

int main() {

    // some initialization code
    int height = 512;
    int width = 512;
    dim3 gDim(height, 1, 1);
    dim3 bDim(width, 1, 1);
    size_t sharedMemSize = width * sizeof(float);

    im2col_gpu_kernel_inference_3x3<<<gDim, bDim, sharedMemSize>>>(
            im, height, width, ksize, pad,
            stride, height_col,
            width_col, dataPerChannel, data_col);

}

The use of __mul24() is strongly discouraged on all platforms with compute capability > 1.x, as it requires expensive emulation using multiple instructions. It may have been a valid optimization technique a decade ago, but nowadays it certainly is not.
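For instance (a sketch of the change, reusing the variables from the kernel above), the address arithmetic can just use ordinary multiplies and let the compiler choose the instructions:

// plain 32-bit multiplies; on sm_20 and later these compile to native
// IMAD/IMUL instructions, whereas __mul24() has to be emulated
float *im_col_ptr = data_col
        + block_mat_row * ksize * dataPerChannel
        + block_mat_col * width_col;
float *im_col_ptr_current = im_col_ptr + j * dataPerChannel;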

I would try going through the texture cache, in the simplest case by declaring const float *im as const __restrict__ float *im.

You most probably got low utilization because occupancy is limited by the amount of shared memory available per SM, so utilization can be increased by reducing the amount of shared memory each block uses.
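As a rough way to check this (a host-side sketch, assuming the 512-thread blocks and 512 * sizeof(float) of dynamic shared memory from the original post), the occupancy API reports how many blocks per SM that shared-memory footprint allows:

// Hedged sketch: ask the runtime how many blocks of the kernel can be
// resident per SM given the block size and dynamic shared-memory allocation.
int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, im2col_gpu_kernel_inference_3x3,
        512 /* threads per block */, 512 * sizeof(float) /* dynamic smem */);
printf("resident blocks per SM: %d\n", blocksPerSM);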

But you don’t need to do that if your real goal is to improve performance. Instead, look into ways to improve the parallelism of memory accesses. In particular, you can read data into registers well ahead of storing it into shared memory - the GPU continues executing the program past an in-flight load as long as the loaded register isn’t used yet. By storing the data into shared memory immediately, you just lose this parallelization opportunity. To keep data in registers, use some local variables, like this:

l1 = a[i];
l2 = a[i+N];
b[i] = l1;
b[i+N] = l2;

// here you copy data from a to b, but the two loads are issued back to back
// and overlap in flight; l1 and l2 will be placed in registers by the compiler
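A minimal CUDA sketch of the same idea (my own illustration, not code from the thread; the name copy_batched and the BATCH factor are made up): each thread issues all of its loads before any of its stores, so the loads overlap.

#define BATCH 4

__global__ void copy_batched(const float * __restrict__ a, float * __restrict__ b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float r[BATCH];   // compiler keeps these in registers

    if (i + (BATCH - 1) * stride < n) {
        #pragma unroll
        for (int k = 0; k < BATCH; k++)   // all loads issued back to back...
            r[k] = a[i + k * stride];
        #pragma unroll
        for (int k = 0; k < BATCH; k++)   // ...so they are in flight before the stores need them
            b[i + k * stride] = r[k];
    }
}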

Another possibility is to avoid shared memory completely, replacing it with registers and using shuffle operations to rearrange data between the threads of a warp (a small sketch follows below). It’s a very good technique because

  1. the register file is 2x-4x larger than shared memory on NVIDIA GPUs
  2. the latency of a shuffle operation is an order of magnitude smaller than the latency of a shared memory load

Read a classic http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
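To illustrate the shuffle idea (my own hedged sketch, not code from the thread): a warp can rotate values across its 32 lanes with __shfl_sync(), the kind of rearrangement that would otherwise go through shared memory.

__global__ void rotate_within_warp(const float * __restrict__ in, float *out, int n)
{
    // assumes n is a multiple of 32 so every lane of each warp participates
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int lane = threadIdx.x & 31;        // lane index within the warp
        float v  = in[i];                   // value stays in a register
        // each lane receives the value held by the next lane, wrapping around
        float r  = __shfl_sync(0xffffffff, v, (lane + 1) & 31);
        out[i] = r;
    }
}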

Thank you cbuchner1, but when I tried const __restrict__ float *im, nvcc reported an error: “restrict” is not allowed.

Thank you BulatZiganshin, I’ll try it and post the result when I finish.

The __restrict__ qualifier needs to be placed between the pointer asterisk and the variable name:

const float * __restrict__ im

See the Programming Guide in the CUDA Toolkit Documentation.
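Applied to the kernel from the original post, the signature would look like this (a sketch; only the qualifiers change):

// Hedged sketch: marking the pointers with __restrict__ and the read-only
// input with const lets nvcc route its loads through the read-only (texture)
// cache on sm_35 and later.
__global__
void im2col_gpu_kernel_inference_3x3(const float * __restrict__ im,
        const int height, const int width, const int ksize, const int pad,
        int stride, int height_col, int width_col, int dataPerChannel,
        float * __restrict__ data_col)
{
    // ... kernel body unchanged ...
}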

Hi txbob,
It works, thank you! Routing the reads through the read-only cache does improve performance.