Nsight Compute and Roofline Model (And load-stores in Matrix Multiplications)


For educational reasons I am implementing the naive roofline model myself, since some of our matrix multiplications were suspiciously close to the limit in the roofline model of Nsight Compute.

To do this I use cudaGetDeviceProperties together with a look-up table mapping architectures to CUDA cores, and from that derive the theoretical peak memory bandwidth and peak floating-point performance.
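As an illustration (not my exact code), the ceiling calculation looks roughly like this; the device values below are placeholders standing in for what cudaGetDeviceProperties and the cores-per-SM table would return, not any specific GPU:

```python
# Placeholder device properties (normally taken from cudaGetDeviceProperties
# plus a cores-per-compute-capability lookup table).
mem_clock_hz = 7_000e6    # effective memory clock in Hz (assumed value)
bus_width_bits = 256      # memory bus width (assumed value)
cuda_cores = 3584         # SM count * cores per SM from the lookup table
core_clock_hz = 1_500e6   # boost clock in Hz (assumed value)

# Peak bandwidth: bus width in bytes per transfer times the effective clock.
# (If the reported memory clock is the base clock rather than the effective
# DDR rate, it needs to be doubled first.)
peak_bw = mem_clock_hz * bus_width_bits / 8     # bytes/s

# Peak FP32 throughput: one FMA (2 FLOPs) per CUDA core per cycle.
peak_flops = cuda_cores * core_clock_hz * 2     # FLOP/s

print(peak_bw / 1e9, "GB/s")        # 224.0 GB/s with these placeholder values
print(peak_flops / 1e12, "TFLOP/s") # 10.752 TFLOP/s with these values
```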

In Nsight Compute I get an arithmetic intensity of ~5 with my matrix multiplication kernel, whereas my own calculation results in an intensity of ~4. I know how to arrive at the ~5 intensity, but I want to ask whether that is a coincidence or not.

So, the kernel implements C = C + A * B.
I will use row_x and col_x to denote the number of rows or columns of a matrix (x=c for C, a for A, b for B). For the matrix multiplication I would compute the number of operations and required load/store bytes as follows:

The kernel is implemented with the sum-of-outer-products approach. Intermediate results are written to a temporary matrix. Regardless of the implementation, I would expect the number of arithmetic operations to be row_a * (col_a * col_b) * 2, as we need to multiply and add. Furthermore, these results need to be added to matrix C, so I would also add row_c * col_c operations. If you believe it is crucial, I can copy the source code or describe the kernel's approach in detail, but I think the more important part is the loads and stores.

On loads and stores I would expect to read matrix A (row_a * col_a elements), read B (row_b * col_b), and read and store C (2 * row_c * col_c). This results in an intensity of ~4 (every matrix has a row and column count of 32). But if I assume row_c * col_c instead of 2 * row_c * col_c, I get almost exactly the result Nsight Compute shows (I can't verify it precisely because the plot there doesn't show exact numbers). The code in Python looks like the following, where FLOAT_SIZE is 4, as C floats have 32 bits.

def calculate_ops_dense():
    # one multiply and one add per (row_a, col_a, col_b) triple
    ops = row_a * 2 * (col_a * col_b)
    # plus adding the result into C
    ops += row_c * col_c
    # loads/stores: read A, read B, and touch C once
    load_store = row_a * col_a + row_b * col_b + row_c * col_c
    load_store *= FLOAT_SIZE
    return (ops, load_store)

My question would be: is the reading of matrix C optimized away, since we are adding directly to it? That is what the Nsight Compute numbers seem to suggest.
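For reference, plugging in the 32 x 32 numbers makes the two hypotheses easy to compare; this is just the arithmetic from above, not profiler output:

```python
n = 32                                   # all matrices are 32 x 32
FLOAT_SIZE = 4                           # C float, 32 bits

flops = n * 2 * n * n + n * n            # multiply-adds plus the adds into C
ab_bytes = 2 * n * n * FLOAT_SIZE        # read A and read B

# hypothesis 1: C is both read and written
print(flops / (ab_bytes + 2 * n * n * FLOAT_SIZE))  # 4.0625
# hypothesis 2: C is only written (the read is optimized away)
print(flops / (ab_bytes + n * n * FLOAT_SIZE))      # ~5.42
```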

I haven’t looked at the code you described in detail yet, but here are some pointers that may help you determine the reason for the differences you are seeing:

  • You can find the metrics used in the roofline computations in the respective .section files. The one you are interested in is probably SpeedOfLight_HierarchicalSingleRooflineChart.section. You can find these files in your user documents directory under NVIDIA Nsight Compute\<version>\Sections.
  • You can look up the exact values of these metrics in the tool, e.g. on the UI’s Raw page.
  • The roofline computation in ncu is not based on the app’s algorithm, but purely on the executed instructions. You can find the overall instruction mix in the Instruction Statistics section on the Details page. If you build your application with -lineinfo and collect the full set, the Source page will show you the mapping from high-level C/C++ to SASS assembly code, along with metrics for how many times each instruction was executed. This can help you understand how the compiler translated your algorithm into assembly.

Thank you for the response, I will look into those. I am already compiling with -lineinfo, and collecting with --set full and --import-source yes, so I should have that information somewhere.

When I find new information I will post an update here.

Sounds good. As a further note, assuming you are doing any subsequent analysis in Python anyway, you can also access almost all of this data from a captured report programmatically using the Python Report Interface, which has several samples here.

I just added up the instructions for a very small matrix, running with 32 threads (1 block, a very small example). I had 1024 shared loads (LDS), 1024 FFMA, 96 LDG, 32 STS, 32 STG and 32 FADD.

The other instructions are IADD3, LEA, IMAD, ISETP, BRA, CS2R and S2R; I believe these should not need to be considered in the calculation.

If I compute FLOPs per byte, I get 32 x 128 floats loaded, therefore 32 * 128 * 4 = 16384 bytes.
The number of FLOPs I then count as 32 x 1024 x 2 (an FFMA counts as two operations, as I saw in the section files too) plus 32 x 32 FADD.

This would make ~4 floating-point operations per byte, but the Nsight Compute profiler indicates an arithmetic intensity of 5.36. Could it be due to sampling? Could it be that other instructions like the IADD3 used during address calculation are counted too? (Since it works on integers, I would not expect that to be the case.)
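To make the numbers concrete, here is the same ratio computed purely from the instruction counts above (FFMA counted as two FLOPs, FADD as one, and only global-memory traffic in the denominator). Counting only the LDG bytes and ignoring the STG would land near the profiler's figure, though I cannot tell from the plot whether that is what ncu actually does:

```python
threads = 32
ffma, fadd = 1024, 32         # per-thread FP instruction counts from the profile
ldg, stg = 96, 32             # per-thread global loads / stores
FLOAT_SIZE = 4

flops = threads * (ffma * 2 + fadd)               # FFMA = 2 FLOPs, FADD = 1
bytes_ldg_stg = threads * (ldg + stg) * FLOAT_SIZE
bytes_ldg_only = threads * ldg * FLOAT_SIZE

print(flops / bytes_ldg_stg)   # 4.0625
print(flops / bytes_ldg_only)  # ~5.42, not far from the reported 5.36
```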

I have attached screenshots: