I am new to CUDA and have written a kernel (below) to run on a K40 GPU. I have profiled the kernel using nvprof (selected result below) and it seems that the kernel is memory bound (local memory overhead 96.66%), operating quite a bit below peak FLOPs (1.71%), the L1 memory utilisation is low(1), and both L1 (36.20%) and L2 (52.76) hit rates are suboptimal. My question is whether I am missing any straightforward ways to optimise my kernel.

Details:

Each thread gets one 5x5 matrix of floats d_A as input and returns one float which is copied to d_R. I have set the quantities N=5, cfg_N=32 and MSIZE=55 using `#define`

. I found gridsize=16384 and blocksize=256 to be optimal for performance of this kernel. The used registers per thread are only 43 (ptxas details below) despite all of the necessary arrays/variables for each thread fitting well below the registers-per-thread limit (<80 floats total,<15 integers total). I use the setting `cudaDeviceSetCacheConfig(cudaFuncCachePreferL1)`

. I tried to `#pragma unroll`

some loops but didn’t see any improvement.

kernel:

```
__global__ void PMS_float(float *d_A, float *d_R)
{
int g_id = blockIdx.x*blockDim.x + threadIdx.x;
if (g_id<gridsize)
{
float matrix_list[MSIZE];
float D[cfg_N];
D[0] = 1;
unsigned int mpos = 0;
unsigned int K = N-1;
unsigned int cfg = 1;
unsigned int visited[N] = {0};
visited[N-1]=1;
for (unsigned int i = 0; i<N; ++i)
{
for (unsigned int j = 0; j<N; ++j)
{
matrix_list[i*N+j] = d_A[(i*N+j)*elements+g_id];
}
}
while(visited[0]!=cfg_N-1)
{
D[cfg] = matrix_list[mpos];
if (K==0)
{
K++;
cfg = visited[K];
int K_1 = K+1;
mpos-=K_1*K_1;
}
else if (visited[K-1]<cfg+(1<<(N-K-1)))
{
int K_1 = K+1;
int pos = mpos + K_1*K_1;
for (unsigned int i = 0; i<K; ++i)
{
int i_1 = i+1;
for (unsigned int j = 0; j<K; ++j)
{
int j_1 = j+1;
matrix_list[pos+i*K+j] = matrix_list[mpos+i_1*K_1+j_1];
}
}
mpos = pos;
cfg += (1<<(N-K-1));
K--;
visited[K] = cfg;
}
else if (visited[K-1]<cfg+(1<<(N-K)))
{
int K_1 = K+1;
int pos = mpos + K_1*K_1;
float inv = 1.0/D[cfg];
for (unsigned int i = 0; i<K; ++i)
{
int i_1 = i+1;
float inv1 = matrix_list[mpos+i_1*K_1] * inv;
for (unsigned int j = 0; j<K; ++j)
{
int j_1 = j+1;
matrix_list[pos+i+K*j] = matrix_list[mpos+i_1*K_1+j_1] - matrix_list[mpos+j_1] * inv1;
}
}
mpos = pos;
cfg += (1<<(N-K));
K--;
visited[K] = cfg;
}
else
{
visited[K-1] = 0;
K++;
cfg = visited[K];
int K_1 = K+1;
mpos-= K_1*K_1;
}
}
D[cfg_N-1] = matrix_list[mpos];
int cfg_K = 1;
for (unsigned int level = 0; level < N; ++level)
{
for (unsigned int cfg = cfg_K; cfg < 2*cfg_K; ++cfg)
{
D[cfg] *= D[cfg-cfg_K];
}
cfg_K*=2;
}
d_R[g_id] = D[cfg_N-1];
}
}
```

compile line:

```
nvcc -std=c++11 --gpu-architecture sm_35 -o PMS PMS.cu -O3
```

nvprof:

```
Device "Tesla K40m (0)"
Kernel: PMS_float(float*, float*)
1 l1_cache_global_hit_rate L1 Global Hit Rate 0.00%
1 l1_cache_local_hit_rate L1 Local Hit Rate 36.20%
1 sm_efficiency Multiprocessor Activity 90.74%
1 achieved_occupancy Achieved Occupancy 0.505555
1 dram_read_throughput Device Memory Read Throughput 63.544GB/s
1 dram_write_throughput Device Memory Write Throughput 104.91GB/s
1 l2_l1_read_hit_rate L2 Hit Rate (L1 Reads) 52.76%
1 l2_l1_read_throughput L2 Throughput (L1 Reads) 108.42GB/s
1 local_memory_overhead Local Memory Overhead 96.66%
1 warp_execution_efficiency Warp Execution Efficiency 100.00%
1 local_load_throughput Local Memory Load Throughput 159.09GB/s
1 local_store_throughput Local Memory Store Throughput 72.510GB/s
1 l2_read_throughput L2 Throughput (Reads) 108.53GB/s
1 l2_write_throughput L2 Throughput (Writes) 89.155GB/s
1 stall_inst_fetch Issue Stall Reasons (Instructions Fetch) 5.69%
1 stall_exec_dependency Issue Stall Reasons (Execution Dependency) 24.77%
1 stall_memory_dependency Issue Stall Reasons (Data Request) 48.15%
1 l1_shared_utilization L1/Shared Memory Utilization Low (1)
1 l2_utilization L2 Cache Utilization Mid (4)
1 ldst_fu_utilization Load/Store Function Unit Utilization Mid (4)
1 alu_fu_utilization Arithmetic Function Unit Utilization Low (2)
1 cf_fu_utilization Control-Flow Function Unit Utilization Low (1)
1 flop_sp_efficiency FLOP Efficiency(Peak Single) 1.71%
```

ptxas:

```
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z10PMS_floatfPfS_' for 'sm_35'
ptxas info : Function properties for _Z10PMS_floatfPfS_
368 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 43 registers, 344 bytes cmem[0], 12 bytes cmem[2]
```