I wrote a simple kernel that computes the sum of two vectors.
Following Chapter 4 of the book Professional CUDA C Programming, I tried unrolling to get better performance.
Code
full code here: https://github.com/SaoYan/Learning_CUDA/blob/master/Ch4/temp.cu
base code:
__global__ void sumArraysOnDevice(float *A, float *B, float *C, const int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) C[idx] = A[idx] + B[idx];
}
......
sumArraysOnDevice<<<grid, block>>>(d_A, d_B, d_C, nElem);
x2 unrolling:
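__global__ void sumArraysOnDeviceUnroll2(float *A, float *B, float *C, const int N) {
    // Each block covers 2 * blockDim.x elements; each thread handles two of them, blockDim.x apart.
    int idx = blockIdx.x * blockDim.x * 2 + threadIdx.x;
    // Guard on the last index this thread touches. If N is not a multiple of
    // 2 * blockDim.x, some tail elements are skipped.
    if (idx + blockDim.x < N) {
        C[idx] = A[idx] + B[idx];
        C[idx + blockDim.x] = A[idx + blockDim.x] + B[idx + blockDim.x];
    }
}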
......
sumArraysOnDeviceUnroll2<<<grid.x / 2, block>>>(d_A, d_B, d_C, nElem);
x4 unrolling:
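__global__ void sumArraysOnDeviceUnroll4(float *A, float *B, float *C, const int N) {
    // Each block covers 4 * blockDim.x elements; each thread handles four of them, blockDim.x apart.
    int idx = blockIdx.x * blockDim.x * 4 + threadIdx.x;
    // Guard on the last index this thread touches. If N is not a multiple of
    // 4 * blockDim.x, some tail elements are skipped.
    if (idx + 3 * blockDim.x < N) {
        C[idx] = A[idx] + B[idx];
        C[idx + blockDim.x] = A[idx + blockDim.x] + B[idx + blockDim.x];
        C[idx + 2 * blockDim.x] = A[idx + 2 * blockDim.x] + B[idx + 2 * blockDim.x];
        C[idx + 3 * blockDim.x] = A[idx + 3 * blockDim.x] + B[idx + 3 * blockDim.x];
    }
}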
......
sumArraysOnDeviceUnroll4<<<grid.x / 4, block>>>(d_A, d_B, d_C, nElem);
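One thing worth checking in the unrolled kernels above is whether every element still gets written. The guard tests only the last index a thread touches, and the launch uses integer division (grid.x / 2, grid.x / 4), so sizes that are not a multiple of the unroll factor times blockDim.x can leave some outputs untouched. Below is a minimal host-only sketch that emulates the same index math on the CPU and counts the written elements. The function name countWrittenElements and its parameters are hypothetical, introduced only for this illustration; it builds as plain C++ or as host code in a .cu file.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: emulates the indexing of an unroll-by-`factor` kernel
// on the host and returns how many of the n output elements would be written.
// It mirrors the guard used in the post: a thread writes its `factor`
// elements only if its last index, idx + (factor - 1) * blockDim, is < n.
// factor == 1 reproduces the base kernel's `idx < N` check.
std::size_t countWrittenElements(std::size_t n, unsigned blockDim,
                                 unsigned gridDim, unsigned factor) {
    assert(blockDim >= 1 && factor >= 1);
    std::vector<char> written(n, 0);

    for (unsigned b = 0; b < gridDim; ++b) {
        for (unsigned t = 0; t < blockDim; ++t) {
            // Same as: blockIdx.x * blockDim.x * factor + threadIdx.x
            const std::size_t idx =
                static_cast<std::size_t>(b) * blockDim * factor + t;
            const std::size_t last =
                idx + static_cast<std::size_t>(factor - 1) * blockDim;
            if (last < n) {
                for (unsigned k = 0; k < factor; ++k) {
                    written[idx + static_cast<std::size_t>(k) * blockDim] = 1;
                }
            }
        }
    }

    // Count the elements that at least one emulated thread wrote.
    std::size_t count = 0;
    for (char w : written) {
        if (w) ++count;
    }
    return count;
}
```

For example, with N = 1000 and blockDim.x = 128 the base grid has 8 blocks, and countWrittenElements(1000, 128, 8 / 2, 2) returns 976, so 24 elements are skipped. With N = 1024 the same call returns 1024. If nElem equals grid × block (268435456), as the launch configurations in the results suggest, the kernels in this post write every element.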
Results
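I observed no measurable performance gain from unrolling. Every configuration runs in about 8.9 ms, and the unrolled kernels are usually marginally slower (by less than 1% at each block size).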
block.x = 1024:
no unrolling <<< 262144, 1024 >>> elapsed 8.922000 ms
unroll2 <<< 131072, 1024 >>> elapsed 8.960000 ms
unroll4 <<< 65536, 1024 >>> elapsed 8.990000 ms
block.x = 512:
no unrolling <<< 524288, 512 >>> elapsed 8.934000 ms
unroll2 <<< 262144, 512 >>> elapsed 8.923000 ms
unroll4 <<< 131072, 512 >>> elapsed 8.963000 ms
block.x = 256:
no unrolling <<< 1048576, 256 >>> elapsed 8.881000 ms
unroll2 <<< 524288, 256 >>> elapsed 8.906000 ms
unroll4 <<< 262144, 256 >>> elapsed 8.932000 ms
block.x = 128:
no unrolling <<< 2097152, 128 >>> elapsed 8.911000 ms
unroll2 <<< 1048576, 128 >>> elapsed 8.944000 ms
unroll4 <<< 524288, 128 >>> elapsed 8.978000 ms
More info
Hardware: GTX 1080 Ti GPU
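To put the timings above in context, they can be converted into effective memory bandwidth. Below is a minimal host-only sketch, assuming each output element costs two 4-byte loads and one 4-byte store, and that nElem equals grid × block from the launch output (268435456). The helper name effectiveBandwidthGBs is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: effective bandwidth in GB/s (1 GB = 1e9 bytes) for
// C = A + B on float arrays of nElem elements.
// Assumption: every element moves 3 floats (load A, load B, store C).
double effectiveBandwidthGBs(std::size_t nElem, double elapsedMs) {
    assert(elapsedMs > 0.0);
    const double bytesMoved =
        3.0 * sizeof(float) * static_cast<double>(nElem);
    const double seconds = elapsedMs * 1e-3;
    return bytesMoved / seconds / 1e9;
}
```

With 8.922 ms this gives roughly 361 GB/s, and the other timings fall between about 358 and 363 GB/s. The GTX 1080 Ti's published peak memory bandwidth is 484 GB/s, so all variants already reach around 75% of it. That would be consistent with the kernel being limited by memory traffic rather than by instruction overhead, which unrolling does not reduce, though a profiler run would be needed to confirm it.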