Hi all,
I’m working on an image-processing kernel for neural networks called im2col. It copies image data into shared memory and then writes it back to global memory in a different layout for matrix multiplication (a minimal CPU sketch of the transform follows the profiler numbers below). When I profile the kernel with nvvp, it reports low utilization: compute and memory utilization are both only about 25%. These are the bandwidth numbers it shows for my kernel:
Shared memory:
  Shared loads: 17.5 GB/s
  Shared stores: 2.0 GB/s
L2 Cache:
  Reads: 2.01 GB/s
  Writes: 20.651 GB/s
Unified Cache:
  Local loads/stores: 0 B/s
  Global loads: 1.97 GB/s
  Global stores: 20.6 GB/s
  Texture reads: 1.971 GB/s
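For context, here is the kind of layout I’m trying to produce. This is just a minimal single-channel CPU sketch, assuming a 3x3 kernel, stride 1 and no padding; it only illustrates the target layout and does not mirror the exact indexing of my GPU kernel below.

void im2col_cpu_3x3(const float *im, int height, int width, float *data_col)
{
    const int ksize      = 3;
    const int height_col = height - ksize + 1;   // output image height
    const int width_col  = width  - ksize + 1;   // output image width

    // Each (kr, kc) offset of the 3x3 window becomes one row of the column matrix.
    for (int kr = 0; kr < ksize; ++kr) {
        for (int kc = 0; kc < ksize; ++kc) {
            float *dst = data_col + (kr * ksize + kc) * height_col * width_col;
            for (int h = 0; h < height_col; ++h)
                for (int w = 0; w < width_col; ++w)
                    dst[h * width_col + w] = im[(h + kr) * width + (w + kc)];
        }
    }
}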
I compared it with an open-source implementation that does not use shared memory (I think it accesses global memory redundantly, which is why I added shared memory to cut down the global memory traffic; a rough sketch of that direct approach follows its numbers below). That implementation achieves 85% instruction utilization and 95% memory utilization, yet its reported memory bandwidth is lower than mine:
Shared memory: 0 B/s (it doesn’t use shared memory)
L2 Cache:
  Reads: 3.6 GB/s
  Writes: 3.2 GB/s (lower than my kernel’s write bandwidth)
Unified Cache:
  Global loads: 3.6 GB/s
  Global stores: 3.2 GB/s
  Texture reads: 3.1 GB/s
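To make the comparison concrete, the open-source kernel is (as far as I can tell) basically one thread per output pixel, each reading its 3x3 neighbourhood straight from global memory. This is my own simplified paraphrase of that approach, assuming a single channel, stride 1 and no padding; it is not their actual code.

__global__ void im2col_direct_3x3(const float *im, int width,
                                  int height_col, int width_col, float *data_col)
{
    // One thread per output pixel; each thread re-reads its 3x3 neighbourhood
    // from global memory (the redundant access I was trying to avoid).
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= height_col * width_col)
        return;

    int h = idx / width_col;   // output row
    int w = idx % width_col;   // output column

    for (int kr = 0; kr < 3; ++kr)
        for (int kc = 0; kc < 3; ++kc)
            data_col[(kr * 3 + kc) * height_col * width_col + idx] =
                im[(h + kr) * width + (w + kc)];
}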
Both kernels are launched with a <<<512, 512>>> configuration.
How can I achieve higher utilization with my kernel?
Here’s my code:
__global__
void im2col_gpu_kernel_inference_3x3(const float *im, const int height, const int width,
                                     const int ksize, const int pad, int stride,
                                     int height_col, int width_col, int dataPerChannel,
                                     float *data_col)
{
    // One block per image row: stage the row in shared memory,
    // then scatter it into the column matrix.
    extern __shared__ float pixel_line[];

    int index = blockIdx.x * blockDim.x + threadIdx.x;   // global pixel index
    pixel_line[threadIdx.x] = im[index];                 // load one row into shared memory
    __syncthreads();

    int h_Index = blockIdx.x;                            // input row handled by this block
    int block_mat_row = min(h_Index, ksize - 1);         // starting kernel-row offset
    int block_mat_col = h_Index - block_mat_row;         // corresponding output row

    // Each input row contributes to at most ksize (= 3) output rows.
    for (int i = 0;
         i < 3 && block_mat_row >= 0 && block_mat_col < height_col;
         block_mat_row--, block_mat_col++, i++) {
        float *im_col_ptr = data_col
                          + __mul24(__mul24(block_mat_row, ksize), dataPerChannel)
                          + __mul24(block_mat_col, width_col);
        // Scatter the shared-memory row into the three kernel-column positions.
        #pragma unroll
        for (int j = 0; j < 3; j++) {
            float *im_col_ptr_current = im_col_ptr + __mul24(j, dataPerChannel);
            if (threadIdx.x < width_col)
                im_col_ptr_current[threadIdx.x] = pixel_line[threadIdx.x + j];
        }
    }
}
int main() {
    // some initialization code (device buffers im/data_col, ksize, pad, stride,
    // height_col, width_col, dataPerChannel)
    int height = 512;
    int width = 512;

    dim3 gDim(height, 1, 1);   // one block per image row
    dim3 bDim(width, 1, 1);    // one thread per pixel in the row
    size_t sharedMemSize = width * sizeof(float);

    im2col_gpu_kernel_inference_3x3<<<gDim, bDim, sharedMemSize>>>(
        im, height, width, ksize, pad,
        stride, height_col,
        width_col, dataPerChannel, data_col);
}