Using Nsight Compute Dump to Optimize My Kernel


I’ve benchmarked my kernel using Nsight Compute and I’m a little confused as to what needs to be fixed in order to improve performance. I’ve attached the profiling output below. In particular, I don’t understand why the FP64 pipeline is so heavily utilized when all my input is Float32 (is the compiler automatically fusing things?). It looks like a latency issue, since I’m not maxing out either compute or memory bandwidth.

One obvious optimization: the K3, rows and cols arrays below are too large to fit in cache and should probably be loaded into shared memory somehow. Before I spend time doing that, I wanted to ask here whether it would even help much, since memory bandwidth utilization is only 20%. There might also be something I’m completely overlooking in the profiling output that indicates a bigger bottleneck.

MCC_Kernel.ncu-rep (755.9 KB)

This is Julia, but the idea is basically the same as in C++. Robert, if you happen to answer this question, we discussed the tetrahedral number stuff a week or so ago.

function mcc3_kernel!(K3, rows, cols, Ψ_ijk, phi_i, phi_j, phi_k, maxIdx)
    # Global 1-based thread index
    thread_id = ((blockIdx().x - 1i32) * blockDim().x) + threadIdx().x

    if thread_id <= maxIdx
        idx1 = rows[thread_id]
        idx2 = cols[thread_id]
        idx3 = row_col_2_depth(idx1, idx2, thread_id)
        K3[thread_id] += (Ψ_ijk * phi_i[idx1] * phi_j[idx2] * phi_k[idx3])
    end

    return nothing
end

How does Julia assign types to the function arguments? The answer to that question should reveal why there are FP64 computations.

Julia uses JIT compilation. I could specify the types in the function signature so that an error gets thrown if they’re not all Float32, but I double-checked that my inputs were all Float32 and Int32. I will add the types to the parameters to be sure, but I don’t think that’s it.

In a dynamically-typed language, it is best to examine the types as close to the point of use as possible. Can you dump the assigned types from within the function?

What about implicit type promotion rules, does Julia have any of those? If so, could any of them explain the use of FP64 operations?
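To make the promotion question concrete, here is a small CPU-only sketch of Julia rules that can produce Float64 values even when every input is Float32 or Int32. The body of row_col_2_depth is not shown in this thread, so whether it contains an integer division or square root like the ones below is purely an assumption.

```julia
# Mixing Float32 with a Float64 value (e.g. an untyped literal such as 2.0)
# promotes the result to Float64.
@show promote_type(Float32, Float64)     # Float64
@show typeof(1.0f0 * 2.0)                # Float64

# Mixing Float32 with an integer keeps Float32.
@show typeof(1.0f0 * 2)                  # Float32

# `/` and `sqrt` on integers return Float64, even for Int32 inputs.
@show typeof(Int32(3) / Int32(2))        # Float64
@show typeof(sqrt(Int32(16)))            # Float64

# Converting to Float32 first keeps the computation in single precision.
@show typeof(sqrt(Float32(Int32(16))))   # Float32
```

If an index helper does something along these lines, the FP64 work would come from the index computation rather than from the phi arrays.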

Is there a way for programmers to dump the intermediate PTX code passed by the Julia tool chain to the JIT compiler inside the CUDA driver?
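For what it's worth, CUDA.jl has reflection macros for this. Below is a minimal sketch that assumes CUDA.jl is installed and a CUDA-capable GPU is available. scale_kernel! is a hypothetical stand-in, not the kernel from the original post. The macros only print code for kernels compiled inside the wrapped expression, so each call here uses a different argument type to force a fresh compilation.

```julia
using CUDA

# Hypothetical stand-in kernel used only to illustrate the tooling.
function scale_kernel!(out, x, a)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        @inbounds out[i] = a * x[i]
    end
    return nothing
end

x   = CUDA.ones(Float32, 1024)
out = CUDA.zeros(Float32, 1024)

# Float64 scalar on purpose: the printed PTX should contain f64 instructions.
@device_code_ptx @cuda threads=256 blocks=4 scale_kernel!(out, x, 2.0)

# Float32 scalar: a new specialization is compiled, with f32 arithmetic only.
@device_code_ptx @cuda threads=256 blocks=4 scale_kernel!(out, x, 2.0f0)
```

Swapping @device_code_ptx for @device_code_warntype prints the inferred types for the device code instead, which is one way to look at the types at the point of use; @device_code_sass shows the final machine code.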

Once you manage to eliminate the use of FP64 arithmetic, I would expect this code to be severely limited by memory throughput, exacerbated by the use of an additional level of indirection, which may cause sub-optimal access patterns.
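A rough back-of-the-envelope estimate shows why memory throughput would be expected to dominate. It assumes 4-byte Float32 and Int32 elements (as stated earlier in the thread), no cache reuse between threads, Ψ_ijk passed as a scalar, and it ignores whatever row_col_2_depth computes, so treat it as a sketch rather than a measurement.

```julia
# Per-thread global memory traffic for one update of K3[thread_id].
elem_bytes = sizeof(Float32)     # 4 bytes; sizeof(Int32) is also 4

loads  = 2 + 3 + 1               # rows, cols; phi_i, phi_j, phi_k; K3 (read)
stores = 1                       # K3 (write)
bytes_per_thread = (loads + stores) * elem_bytes

flops_per_thread = 3 + 1         # three multiplies and one add

intensity = flops_per_thread / bytes_per_thread
println("bytes per thread: ", bytes_per_thread)
println("FLOP per byte:    ", round(intensity, digits = 3))
```

Under these assumptions each thread moves 28 bytes for 4 floating-point operations, about 0.14 FLOP per byte, and the three phi loads are gathers through rows, cols and the computed depth index, so they may not coalesce.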

You might get better help asking about julia on a julia forum.

Yeah I’ll ask there about the float64 stuff. I’m pretty sure my types are all correct though.

This isn’t that important anyway, as I found a library that will just do this on the tensor cores for me.