I’ve benchmarked my kernel using Nsight Compute and I’m a little confused about what needs to be fixed to improve performance; I’ve attached the profiling output below. In particular, I don’t understand why the F64 pipeline is so heavily utilized when all of my input is Float32 (is the compiler automatically fusing or promoting something?). It looks like a latency issue, since I’m not maxing out either compute or memory bandwidth.
One obvious optimization I think I need to make: the cols array below is too large to fit in cache and should probably be staged through shared memory somehow. Before I spend time on that, I wanted to ask here to confirm whether it would even help much, since memory bandwidth utilization is only 20%. There might also be something I’m completely overlooking in the profiling output that indicates a bigger bottleneck.
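For reference, this is roughly the shared-memory staging I had in mind — just a sketch, untested. The element type `Int32` and the block size of 256 are assumptions on my part, and since each thread still only reads its own element once, I’m honestly not sure this buys anything (hence the question):

```julia
using CUDA

function mcc3_kernel_shmem!(K3, rows, cols, Ψ_ijk, phi_i, phi_j, phi_k, maxIdx)
    # Stage this block's slice of rows/cols into shared memory.
    # 256 is an assumed blockDim().x; Int32 is an assumed index type.
    tile_rows = CuStaticSharedArray(Int32, 256)
    tile_cols = CuStaticSharedArray(Int32, 256)
    tid = threadIdx().x
    thread_id = ((blockIdx().x - 1i32) * blockDim().x) + tid
    if thread_id <= maxIdx
        tile_rows[tid] = rows[thread_id]
        tile_cols[tid] = cols[thread_id]
    end
    sync_threads()
    if thread_id <= maxIdx
        idx1 = tile_rows[tid]
        idx2 = tile_cols[tid]
        idx3 = row_col_2_depth(idx1, idx2, thread_id)
        K3[thread_id] += Ψ_ijk * phi_i[idx1] * phi_j[idx2] * phi_k[idx3]
    end
    return nothing
end
```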
MCC_Kernel.ncu-rep (755.9 KB)
This is Julia, but the idea is basically the same as C++. Robert, if you happen to answer this: we discussed the tetrahedral-number indexing a week or so ago.
```julia
function mcc3_kernel!(K3, rows, cols, Ψ_ijk, phi_i, phi_j, phi_k, maxIdx)
    thread_id = ((blockIdx().x - 1i32) * blockDim().x) + threadIdx().x
    if thread_id <= maxIdx
        idx1 = rows[thread_id]
        idx2 = cols[thread_id]
        # recover the third (depth) index from the flattened tetrahedral layout
        idx3 = row_col_2_depth(idx1, idx2, thread_id)
        K3[thread_id] += (Ψ_ijk * phi_i[idx1] * phi_j[idx2] * phi_k[idx3])
    end
    return nothing
end
```
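One thing I wondered about regarding the F64 pipeline: could a stray Float64 scalar be the culprit rather than compiler fusion? E.g. if `Ψ_ijk` is passed in as a plain Julia literal, it defaults to Float64 and (I believe) the whole product would promote to Float64 even though the arrays are Float32. This is speculation on my part, not something I’ve confirmed:

```julia
# Hypothetical launch-side difference I'm wondering about:
Ψ_ijk = 0.5      # Float64 literal — might force the kernel arithmetic into F64?
Ψ_ijk = 0.5f0    # Float32 literal — should keep everything in the F32 pipeline
```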