Optimising a memory-bound kernel

I am new to CUDA and have written a kernel (below) to run on a K40 GPU. I have profiled the kernel using nvprof (selected results below) and it seems that the kernel is memory bound (local memory overhead 96.66%) and operating quite a bit below peak FLOPs (1.71%); the L1/shared memory utilisation is low (1), and both the L1 (36.20%) and L2 (52.76%) hit rates are suboptimal. My question is whether I am missing any straightforward ways to optimise my kernel.

Details:

Each thread gets one 5x5 matrix of floats from d_A as input and returns one float, which is written to d_R. I have set the quantities N=5, cfg_N=32 and MSIZE=55 using #define (gridsize and elements are compile-time constants as well). I found gridsize=16384 and blocksize=256 to be optimal for the performance of this kernel. Only 43 registers are used per thread (ptxas output below), even though all of the necessary arrays/variables for each thread should fit well below the registers-per-thread limit (< 80 floats and < 15 integers in total). I call cudaDeviceSetCacheConfig(cudaFuncCachePreferL1). I tried to #pragma unroll some loops but didn't see any improvement.
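
For reference, the constants are defined along these lines (the value of elements, the load stride used for d_A, is not reproduced here):

#define N     5
#define cfg_N 32    // 2^N
#define MSIZE 55    // 5*5 + 4*4 + 3*3 + 2*2 + 1*1: room for the full stack of shrinking submatrices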

kernel:

__global__ void PMS_float(float *d_A, float *d_R)
    {
      int g_id = blockIdx.x*blockDim.x + threadIdx.x;
      if (g_id<gridsize)
      {
        float matrix_list[MSIZE];
        float D[cfg_N];
        D[0] = 1;

        unsigned int mpos = 0;
        unsigned int K = N-1;
        unsigned int cfg = 1;
        unsigned int visited[N] = {0};
        visited[N-1]=1;

        for (unsigned int i = 0; i<N; ++i)
        {
          for (unsigned int j = 0; j<N; ++j)
          {
            matrix_list[i*N+j] = d_A[(i*N+j)*elements+g_id];
          }
        }

        while(visited[0]!=cfg_N-1)
        {
          D[cfg] = matrix_list[mpos];
          if (K==0)
          {
            K++;
            cfg = visited[K];
            int K_1 = K+1;
            mpos-=K_1*K_1;
          }
          else if (visited[K-1]<cfg+(1<<(N-K-1)))
          {
            int K_1 = K+1;
            int pos = mpos + K_1*K_1;
            for (unsigned int i = 0; i<K; ++i)
            {
              int i_1 = i+1;
              for (unsigned int j = 0; j<K; ++j)
              {
                int j_1 = j+1;
                matrix_list[pos+i*K+j] = matrix_list[mpos+i_1*K_1+j_1];
              }
            }
            mpos = pos;
            cfg += (1<<(N-K-1));
            K--;
            visited[K] = cfg;
          }
          else if (visited[K-1]<cfg+(1<<(N-K)))
          {
            int K_1 = K+1;
            int pos = mpos + K_1*K_1;
            float inv = 1.0/D[cfg];
            for (unsigned int i = 0; i<K; ++i)
            {
              int i_1 = i+1;
              float inv1 = matrix_list[mpos+i_1*K_1] * inv;
              for (unsigned int j = 0; j<K; ++j)
              {
                int j_1 = j+1;
                matrix_list[pos+i+K*j] = matrix_list[mpos+i_1*K_1+j_1] - matrix_list[mpos+j_1] * inv1;
              }
            }
            mpos = pos;
            cfg += (1<<(N-K));
            K--;
            visited[K] = cfg;
          }
          else
          {
            visited[K-1] = 0;
            K++;
            cfg = visited[K];
            int K_1 = K+1;
            mpos-= K_1*K_1;
          }
        }
        D[cfg_N-1] = matrix_list[mpos];

        int cfg_K = 1;
        for (unsigned int level = 0; level < N; ++level)
        {
          for (unsigned int cfg = cfg_K; cfg < 2*cfg_K; ++cfg)
          {
            D[cfg] *= D[cfg-cfg_K];
          }
          cfg_K*=2;
        }
        d_R[g_id] = D[cfg_N-1];
      }
    }

compile line:

nvcc -std=c++11 --gpu-architecture sm_35 -o PMS PMS.cu -O3

nvprof:

Device "Tesla K40m (0)"
    Kernel: PMS_float(float*, float*)
          1                  l1_cache_global_hit_rate                           L1 Global Hit Rate       0.00%
          1                   l1_cache_local_hit_rate                            L1 Local Hit Rate      36.20%
          1                             sm_efficiency                      Multiprocessor Activity      90.74%
          1                        achieved_occupancy                           Achieved Occupancy    0.505555
          1                      dram_read_throughput                Device Memory Read Throughput  63.544GB/s
          1                     dram_write_throughput               Device Memory Write Throughput  104.91GB/s
          1                       l2_l1_read_hit_rate                       L2 Hit Rate (L1 Reads)      52.76%
          1                     l2_l1_read_throughput                     L2 Throughput (L1 Reads)  108.42GB/s
          1                     local_memory_overhead                        Local Memory Overhead      96.66%
          1                 warp_execution_efficiency                    Warp Execution Efficiency     100.00%
          1                     local_load_throughput                 Local Memory Load Throughput  159.09GB/s
          1                    local_store_throughput                Local Memory Store Throughput  72.510GB/s
          1                        l2_read_throughput                        L2 Throughput (Reads)  108.53GB/s
          1                       l2_write_throughput                       L2 Throughput (Writes)  89.155GB/s
          1                          stall_inst_fetch     Issue Stall Reasons (Instructions Fetch)       5.69%
          1                     stall_exec_dependency   Issue Stall Reasons (Execution Dependency)      24.77%
          1                   stall_memory_dependency           Issue Stall Reasons (Data Request)      48.15%
          1                     l1_shared_utilization                 L1/Shared Memory Utilization     Low (1)
          1                            l2_utilization                         L2 Cache Utilization     Mid (4)
          1                       ldst_fu_utilization         Load/Store Function Unit Utilization     Mid (4)
          1                        alu_fu_utilization         Arithmetic Function Unit Utilization     Low (2)
          1                         cf_fu_utilization       Control-Flow Function Unit Utilization     Low (1)
          1                        flop_sp_efficiency                 FLOP Efficiency(Peak Single)       1.71%

ptxas:

ptxas info    : 0 bytes gmem
    ptxas info    : Compiling entry function '_Z10PMS_floatfPfS_' for 'sm_35'
    ptxas info    : Function properties for _Z10PMS_floatfPfS_
        368 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info    : Used 43 registers, 344 bytes cmem[0], 12 bytes cmem[2]

Here's a trivial optimization to try:

const float* __restrict__ d_A

For an explanation, see:

https://devblogs.nvidia.com/cuda-pro-tip-optimize-pointer-aliasing/
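
i.e. declare the kernel as something like:

// telling the compiler the two pointers never alias; on sm_35 the const +
// __restrict__ combination also allows the d_A loads to go through the
// read-only data cache
__global__ void PMS_float(const float* __restrict__ d_A,
                          float*       __restrict__ d_R)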

The following arrays will always end up in (rather slow) local memory unless you make all indexed accesses use compile-time constants. This can be achieved by fully unrolling every loop that generates indices into any of these arrays.

float matrix_list[MSIZE];
float D[cfg_N];
unsigned int visited[N] = {0};

In my code I prefer to explicitly specify the unroll parameter whenever I intend to fully unroll a loop, e.g. #pragma unroll 5 for loops such as for (unsigned int i = 0; i<N; ++i). However, the pragmas need to be adjusted whenever #defines such as N are changed.
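
For example, for your initial load loop:

#pragma unroll 5                       // must be kept in sync with N by hand
for (unsigned int i = 0; i<N; ++i)
{
  #pragma unroll 5
  for (unsigned int j = 0; j<N; ++j)
  {
    matrix_list[i*N+j] = d_A[(i*N+j)*elements+g_id];
  }
}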

Be aware that matrix_list in particular seems too large to fit entirely into registers, so you may end up with lots of register spills instead. Maybe investigate the use of shared memory to store the thread-specific matrices.
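
A minimal sketch of what I mean, assuming blocksize is 256 and keeping only D per thread in shared memory (256 copies of matrix_list alone would already need 256*55*4 = 55 KB, more than the 48 KB of shared memory per block available on the K40):

__shared__ float s_D[256][cfg_N + 1];  // +1 padding so that D[cfg] accesses from a warp
                                       // hit different banks (no 32-way bank conflicts
                                       // with cfg_N == 32)
float *D = s_D[threadIdx.x];           // then use D[cfg] exactly as before

Note that 256*33 floats is already about 34 KB per block, which limits you to one block per SM, so this may well not pay off.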

EDIT: The following tip seems wrong, as visited is not used as a boolean array in your code.

The visited array could be turned into a bitmask fitting into a single (unsigned) integer for N <= 32 (use a uint64_t to support up to N=64).

Thanks for the reply.

  • What would be the right way to store thread-specific arrays in shared memory? Reserve memory of size (matrix_list + D + visited) * blocksize?
  • I tried using const float* __restrict__ d_A, but there was no visible improvement.
  • Indeed, the visited array can contain any integer < cfg_N; it is in fact already the minimal version, as opposed to storing visited[cfg_N] booleans.
  • What exactly do you mean by "all"?
  • For N=5 unrolling indeed helped the performance by about 15%. At N=10 it actually made the code 35% slower.
  • EDIT: I tried making the D array shared (D[256*cfg_N]) and rewrote all accesses to it accordingly. Surprisingly, the performance of the kernel didn't improve, and the % of peak FLOPs actually fell by 50%.

Can you try this version? I think I managed to fully unroll all loops that compute indices for accessing matrix_list, so matrix_list ends up in registers. It brings the register count to 69, but I force a 64-register limit to achieve better occupancy.

The Kepler compute capability 3.5 architecture allows up to 255 registers per thread if I remember correctly, but we don't want to max this out.

D and visited are still in local memory.

__global__ void PMS_float(float *d_A, float *d_R)
        {   
          int g_id = blockIdx.x*blockDim.x + threadIdx.x;
          if (g_id<gridsize)
          {
            float matrix_list[MSIZE];
            float D[cfg_N];
            D[0] = 1;

            unsigned int mpos = 0;
            unsigned int K = N-1;
            unsigned int cfg = 1;
            unsigned int visited[N] = {0};
            visited[N-1]=1;

            #pragma unroll 5
            for (unsigned int i = 0; i<N; ++i)
            {
              #pragma unroll 5
              for (unsigned int j = 0; j<N; ++j)
              {
                matrix_list[i*N+j] = d_A[(i*N+j)*elements+g_id];
              }
            }

            while(visited[0]!=cfg_N-1)
            {
              D[cfg] = matrix_list[mpos];
              if (K==0)
              {
                K++;
                cfg = visited[K];
                int K_1 = K+1;
                mpos-=K_1*K_1;
              }
              else if (visited[K-1]<cfg+(1<<(N-K-1)))
              {
                int K_1 = K+1;
                int pos = mpos + K_1*K_1;
                #pragma unroll 5
                for (unsigned int i = 0; i<N; ++i)
                {
                  if (i < K)
                  {
                    int i_1 = i+1;
                    #pragma unroll 5
                    for (unsigned int j = 0; j<N; ++j)
                    {
                      if (j < K)
                      {
                        int j_1 = j+1;
                        matrix_list[pos+i*K+j] = matrix_list[mpos+i_1*K_1+j_1];
                      } else break;
                    }
                  } else break;
                }
                mpos = pos;
                cfg += (1<<(N-K-1));
                K--;
                visited[K] = cfg;
              }
              else if (visited[K-1]<cfg+(1<<(N-K)))
              {
                int K_1 = K+1;
                int pos = mpos + K_1*K_1;
                float inv = 1.0/D[cfg];
                #pragma unroll 5
                for (unsigned int i = 0; i<N; ++i)
                {
                  if (i < K)
                  {
                    int i_1 = i+1;
                    float inv1 = matrix_list[mpos+i_1*K_1] * inv;
                    #pragma unroll 5
                    for (unsigned int j = 0; j<N; ++j)
                    {
                      if (j < K)
                      {
                        int j_1 = j+1;
                        matrix_list[pos+i+K*j] = matrix_list[mpos+i_1*K_1+j_1] - matrix_list[mpos+j_1] * inv1;
                      } else break;
                    }
                  } else break;
                }
                mpos = pos;
                cfg += (1<<(N-K));
                K--;
                visited[K] = cfg;
              }
              else
              {
                visited[K-1] = 0;
                K++;
                cfg = visited[K];
                int K_1 = K+1;
                mpos-= K_1*K_1;
              }
            }
            D[cfg_N-1] = matrix_list[mpos];

            int cfg_K = 1;
            for (unsigned int level = 0; level < N; ++level)
            {
                for (unsigned int cfg = cfg_K; cfg < 2*cfg_K; ++cfg)
                {
                    D[cfg] *= D[cfg-cfg_K];
                }
                cfg_K*=2;
            }

            d_R[g_id] = D[cfg_N-1];
          }
        }

Here's my build command (tested with CUDA 7.5):
nvcc -O3 -std=c++11 -D_FORCE_INLINES --gpu-architecture sm_35 -maxrregcount=64 --ptxas-options=-v -o PMS PMS.cu
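
As an aside: instead of the global -maxrregcount flag, the same limit can be requested per kernel with a __launch_bounds__ qualifier (sketch, assuming the block size stays at 256):

// 256 threads per block and a minimum of 4 resident blocks per SM keeps the
// compiler within 65536 / (256*4) = 64 registers per thread on sm_35
__global__ void __launch_bounds__(256, 4)
PMS_float(const float* __restrict__ d_A, float *d_R)
{
  // ... kernel body unchanged ...
}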

EDIT:

I just added the build flags -keep -src-in-ptx -lineinfo as well to look at the PTX code. Interestingly, the code uses more registers, yet it appears matrix_list is still kept in local memory (probably because it is indexed with the runtime variables pos and mpos). So right now I am no longer sure whether this modification will result in a speedup.

At this point I would probably take a step back: instead of trying to make this already complex piece of code run faster, I would see whether it can be rebuilt from components specifically engineered for performance, e.g. a 5x5 matrix class that is designed from the start to operate fully in registers and provides just the operators your algorithm needs.

The goal is to reduce the Local Memory Overhead from 96.66% to 0%.
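
To illustrate the idea (just a sketch, names of my own choosing, not tested against your full algorithm): if every operation is written for a fixed matrix size, all loops fully unroll, all indices become compile-time constants, and the compiler is free to keep each matrix in registers.

// A fixed-size matrix that can live entirely in registers as long as all
// element indices are compile-time constants after unrolling.
template <int DIM>
struct RegMatrix
{
  float a[DIM*DIM];
};

// One elimination step (your second "else if" branch), written for a fixed
// DIM so that both loops fully unroll.
template <int DIM>
__device__ __forceinline__ RegMatrix<DIM-1>
eliminate(const RegMatrix<DIM> &m, float inv)
{
  RegMatrix<DIM-1> r;
  #pragma unroll
  for (int i = 0; i < DIM-1; ++i)
  {
    float inv1 = m.a[(i+1)*DIM] * inv;
    #pragma unroll
    for (int j = 0; j < DIM-1; ++j)
      r.a[i + (DIM-1)*j] = m.a[(i+1)*DIM + (j+1)] - m.a[j+1] * inv1;
  }
  return r;
}

The caller would then hold one RegMatrix<5>, one RegMatrix<4>, and so on as separate local variables instead of indexing a single matrix_list with the runtime values mpos/pos; the surrounding while loop would of course need to be restructured (e.g. templated on the current size) to make that possible.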

float inv = 1.0/D[cfg];

Careful with double-precision literals in code: they trigger the use of double-precision arithmetic, which is rather slow on GeForce and some Titan class devices. I'd suggest using 1.0f/D[cfg]; instead.