CUDA #pragma unroll is slower than manually unrolled code

Hello, I’m a CUDA programming beginner. I’m writing a fractal (Mandelbrot set) computation program using CUDA through PyCUDA, and I found that #pragma unroll 5 is slower than code I unrolled by hand. Here is my test code.

Environment: i7-5820K, GTX970, Ubuntu 16.04.3, CUDA 9.0.176, Python 3.5.2, PyCUDA 2017.1.1

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np
import matplotlib.cm as cm
import colorcet as cc
from matplotlib.colors import Normalize
from PIL import Image
from time import time

cu = SourceModule("""
    __global__ void calcMandelbrot(double *matrix, int *result) {
        int idx_x = blockIdx.x * blockDim.x + threadIdx.x;
        int idx_y = blockIdx.y * blockDim.y + threadIdx.y;
        double a = matrix[0] + (float)idx_x * matrix[2];
        double b = matrix[1] + (float)idx_y * matrix[2];

        double an = a, bn = b;
        double aan = an * an, bbn = bn * bn;

    #pragma unroll 5
        for (int i = 0; i < 1000; i++) {
            bn = 2 * an * bn + b;
            an = aan - bbn + a;
            aan = an * an;
            bbn = bn * bn;
            
            if (an * an + bn * bn > 4.0f) {
                result[idx_x * gridDim.y * blockDim.y + idx_y] = i;
                break;
            }
        }
    }
""")

def calcMandelbrot(start, size, unit):
    s = time()
    # round size up to the next multiple of 256 (at least the requested size) so the 16x16 thread blocks tile it exactly
    size = ((size + 255) / 256).astype(np.uint32) * 256

    result = np.empty((size[0], size[1]), np.int32)

    func = cu.get_function("calcMandelbrot")
    func(cuda.In(np.array([start[0] - size[0] / 2 * unit,
                           start[1] - size[1] / 2 * unit, unit], np.float64)),
         cuda.Out(result),
         block=(16, 16, 1), grid=(int(size[0] / 16), int(size[1] / 16)))

    result = np.transpose(result)

    print("%f s" % (time() - s))

calcMandelbrot(np.array([0.39514148, -0.18187805]), np.array([800, 800]), 0.00015650431982753305)

It takes about 0.05 s. I also tried #pragma unroll with factors of 10, 100, and so on, but the results are similar.

However, the code I unrolled by hand is faster than the ‘#pragma unroll’ version.
Here is my unrolled code.

import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np
import matplotlib.cm as cm
import colorcet as cc
from matplotlib.colors import Normalize
from PIL import Image
from time import time

cu = SourceModule("""
    __global__ void calcMandelbrot(double *matrix, int *result) {
        int idx_x = blockIdx.x * blockDim.x + threadIdx.x;
        int idx_y = blockIdx.y * blockDim.y + threadIdx.y;
        double a = matrix[0] + (float)idx_x * matrix[2];
        double b = matrix[1] + (float)idx_y * matrix[2];

        double an = a, bn = b;
        double aan = an * an, bbn = bn * bn;

        for (int i = 0; i < 200; i+=5) {
            // iteration 1
            bn = 2 * an * bn + b;
            an = aan - bbn + a;
            aan = an * an;
            bbn = bn * bn;
            
            if (an * an + bn * bn > 4.0f) {
                result[idx_x * gridDim.y * blockDim.y + idx_y] = i;
                break;
            }

            // iteration 2
            bn = 2 * an * bn + b;
            an = aan - bbn + a;
            aan = an * an;
            bbn = bn * bn;
            
            if (an * an + bn * bn > 4.0f) {
                result[idx_x * gridDim.y * blockDim.y + idx_y] = i + 1;
                break;
            }

            // iteration 3
            bn = 2 * an * bn + b;
            an = aan - bbn + a;
            aan = an * an;
            bbn = bn * bn;
            
            if (an * an + bn * bn > 4.0f) {
                result[idx_x * gridDim.y * blockDim.y + idx_y] = i + 2;
                break;
            }

            // iteration 4
            bn = 2 * an * bn + b;
            an = aan - bbn + a;
            aan = an * an;
            bbn = bn * bn;
            
            if (an * an + bn * bn > 4.0f) {
                result[idx_x * gridDim.y * blockDim.y + idx_y] = i + 3;
                break;
            }

            // iteration 5
            bn = 2 * an * bn + b;
            an = aan - bbn + a;
            aan = an * an;
            bbn = bn * bn;
            
            if (an * an + bn * bn > 4.0f) {
                result[idx_x * gridDim.y * blockDim.y + idx_y] = i + 4;
                break;
            }
        }
    }
""")

def calcMandelbrot(start, size, unit):
    s = time()
    # round size up to the next multiple of 256 (at least the requested size) so the 16x16 thread blocks tile it exactly
    size = ((size + 255) / 256).astype(np.uint32) * 256

    result = np.empty((size[0], size[1]), np.int32)

    func = cu.get_function("calcMandelbrot")
    func(cuda.In(np.array([start[0] - size[0] / 2 * unit,
                           start[1] - size[1] / 2 * unit, unit], np.float64)),
         cuda.Out(result),
         block=(16, 16, 1), grid=(int(size[0] / 16), int(size[1] / 16)))

    result = np.transpose(result)

    print("%f s" % (time() - s))

calcMandelbrot(np.array([0.39514148, -0.18187805]), np.array([800, 800]), 0.00015650431982753305)

It takes about 0.02 s, which is much faster. Why does this happen?

What GPU target architecture is this being compiled for? It’s not clear how PyCUDA will compile this.
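
For reference, PyCUDA just invokes nvcc at run time, and SourceModule accepts arch, options, and keep keyword arguments, so you can pin the target architecture yourself and keep the intermediate files for inspection. A minimal sketch, assuming sm_52 for the GTX 970 and with kernel_source standing in for the kernel string above:

from pycuda.compiler import SourceModule

# Compile explicitly for compute capability 5.2 (GTX 970) and keep the
# generated .ptx/.cubin files on disk so they can be inspected afterwards.
cu = SourceModule(kernel_source, arch="sm_52", keep=True)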

Disassemble and examine the generated machine code (SASS) for the two cases to find out. My guess is this is because your manually unrolled second code is not exactly equivalent to a mechanically unrolled version of the first code. In particular, the handling of ‘i’ differs. Note how you pulled its intermediate increments into the conditional clauses. There may be other differences but I didn’t look at the codes very closely.
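
A minimal way to get at the SASS from the PyCUDA side is to compile to a cubin and disassemble it, assuming cuobjdump from the CUDA toolkit is on the PATH (kernel_source again stands in for the kernel string, and the file name is arbitrary):

import subprocess
from pycuda.compiler import compile as cuda_compile

# Compile the kernel to a cubin for sm_52 without launching anything,
# then disassemble it with cuobjdump (nvdisasm accepts the same file).
cubin = cuda_compile(kernel_source, arch="sm_52")
with open("mandelbrot.cubin", "wb") as f:
    f.write(cubin)
print(subprocess.check_output(["cuobjdump", "-sass", "mandelbrot.cubin"]).decode())

Do that for both versions and diff the output around the loop body.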

In general, the (partial) unrolling of loops by the compiler is unlikely to be (much of) a performance win when the loop body contains a conditional ‘break’, as compared to a fully rolled version of such a loop.

Compiling with the CUDA 8 offline compiler shows significant differences in the code generated for the two source versions. For sm_50, the compiler unrolls the manually unrolled loop (i.e. version #2) by another factor of two. For sm_20, the compiler uses different control-flow primitives for version #1 (PBRK, BRK) than for version #2 (BRA), which may have an influence on thread divergence. Also, the unconditional incrementing of ‘i’ is visible in the SASS for version #1.

In addition to comparing the generated SASS, you would also want to compare execution statistics from the CUDA profiler, using a third version that employs a fully-rolled loop as a baseline and control.
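
As a rough sketch of that kind of measurement from PyCUDA (time_kernel is a hypothetical helper; CUDA events measure only the kernel itself, excluding compilation, the host-side transfers, and the Python overhead that time() includes):

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

def time_kernel(kernel_source, matrix, size):
    # matrix: 3-element float64 array as in the question; size: (width, height),
    # each a multiple of 16. Builds the kernel, allocates the output on the
    # device, and times a single launch with CUDA events (returns GPU ms).
    func = SourceModule(kernel_source).get_function("calcMandelbrot")
    result_gpu = cuda.mem_alloc(int(size[0]) * int(size[1]) * 4)
    matrix_gpu = cuda.to_device(matrix)
    start, end = cuda.Event(), cuda.Event()
    start.record()
    func(matrix_gpu, result_gpu,
         block=(16, 16, 1), grid=(int(size[0]) // 16, int(size[1]) // 16))
    end.record()
    end.synchronize()
    return start.time_till(end)

Run it for a fully rolled baseline (no pragma), the #pragma unroll 5 version, and the manual unroll, and compare the numbers against the per-kernel statistics nvprof reports for the same launches.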

It may not make much of a difference here, but I would suggest changing the kernel signature as follows:

__global__ void calcMandelbrot (const double * __restrict__ matrix, int * __restrict__ result)