Terrible performance from very simple kernel

I’ve got a trivial kernel I use as part of a larger application. I’m trying to improve performance and was surprised by how poorly this performs.

/// Zero out the given buffer
extern "C" __global__ void clear_buffer(float4* accum_buffer, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accum_buffer[i] = make_float4(0, 0, 0, 0);
    }
}

Profiling this in Nsight Compute tells me that my memory throughput is 18 MB/s (that’s megabytes!!)

What’s causing this?

Suggestion: Post complete minimal example code (i.e. buildable & runnable) if you would like others to attempt to reproduce your observations.
Side remark: Make sure you use a release build for performance experiments.
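
Something along these lines (an untested sketch; the buffer size and the event-based timing are my own choices, not taken from your application) would be self-contained, buildable and runnable, and would print a bandwidth number that others can compare against:

// Sketch (untested) of a self-contained version: times the kernel with CUDA
// events and prints the effective write bandwidth.
// Build as a release binary, e.g.:  nvcc -O3 -lineinfo test.cu -o test
// (-G produces an unoptimized debug build, not suitable for perf experiments.)
#include <cstdio>
#include <cuda_runtime.h>

extern "C" __global__ void clear_buffer(float4* accum_buffer, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accum_buffer[i] = make_float4(0, 0, 0, 0);
    }
}

int main() {
    const int block_size = 128;
    const int n = 1 << 20;   // ~1M float4 elements (~16 MB), similar in size to the buffer in question
    const int grid_size = (n + block_size - 1) / block_size;

    float4* buffer = nullptr;
    cudaMalloc(&buffer, n * sizeof(float4));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    clear_buffer<<<grid_size, block_size>>>(buffer, n);   // warm-up launch

    cudaEventRecord(start);
    clear_buffer<<<grid_size, block_size>>>(buffer, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    const double bytes = static_cast<double>(n) * sizeof(float4);
    printf("kernel time: %.3f ms, effective bandwidth: %.1f GB/s\n",
           ms, bytes / (ms * 1e-3) / 1e9);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(buffer);
    return 0;
}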

Something is evidently wrong. When I run your kernel with the supplied configuration <<<7975,128>>> on a V100, I get a kernel duration of ~20 microseconds. Your kernel duration of ~60 milliseconds suggests you may be timing a debug build, or something else is not working as expected in your setup. (Even when I run a debug build, the ~20 microseconds only goes up to ~80 microseconds.) My V100 has about 700-900 GB/s of memory throughput and your RTX 2070 Max-Q has about 384 GB/s, so the hardware difference only accounts for about a 2-3x factor.

We probably need more information, or you may need to simplify your test case (divide and conquer) to find the culprit.
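
For reference, the arithmetic behind the bandwidth comparison above, as a tiny host-only sketch (16 bytes per element is just sizeof(float4); the thread counts are the ones quoted above):

#include <cstdio>

int main() {
    const double elems = 7975.0 * 128.0;      // one float4 element written per thread
    const double bytes = elems * 16.0;        // sizeof(float4) == 16 bytes
    printf("buffer written : %.1f MB\n", bytes / 1e6);            // ~16.3 MB
    printf("at ~20 us      : %.0f GB/s\n", bytes / 20e-6 / 1e9);  // within the V100's 700-900 GB/s range
    printf("at 384 GB/s    : %.1f us best case\n", bytes / 384e9 * 1e6); // roughly what the 2070 Max-Q could do
    return 0;
}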

Thanks for the help! I was in the process of isolating that kernel in a simple C++ program, following njuffa’s advice. The actual program this is part of is a large Rust program running a sequence of kernels, so extracting exactly what’s happening will be more difficult.

Here is the kernel isolated in a simpler C++ version:

extern "C" __global__ void clear_buffer(float4* accum_buffer, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accum_buffer[i] = make_float4(0, 0, 0, 0);
    }
}

int main() {
    float4* buffer;
    int grid_size = 7579;
    int block_size = 128;
    int n = grid_size * block_size;

    cudaMalloc(&buffer, n * 4 * sizeof(float));
    
    clear_buffer<<<grid_size, block_size>>>(buffer, n);
    cudaDeviceSynchronize();  // wait for the kernel to finish before freeing

    cudaFree(buffer);
}

I was indeed using -G instead of --generate-line-info. Fixing that gives me ~347 GB/s, as expected.

Doing the same thing in my actual program still only nets me 1.5 GB/s and a total execution time of 1.2 ms, as opposed to 42 µs in the extracted version.

Inspecting the PTX of both versions shows identical code (modulo names), so I guess it must be how I’m running it in the larger program?

For context, I’m seeing abysmal memory throughput for my other kernels as well; this one just seemed like a good starting point because it’s so simple.

I’m a bit of a noob here - I don’t really know what to look for in the rest of my code to understand what’s going on. What sorts of things could potentially cause this?

Version in main program:
clear_buffer.pdf (994.7 KB)

Extracted version:
simple.pdf (984.7 KB)

It’s possible that your compute kernels are being “interrupted”, i.e. context-switched with graphics tasks as needed, because your RTX 2070 Max-Q notebook GPU is running in WDDM mode and must service a display.

I generally don’t have an expectation of hard performance guarantees on any GPU that is also servicing a display. Even if you think your GPU is not servicing a display, if it is in WDDM mode, then it is under the direct control of Windows and will be expected to service graphics tasks from time to time.

My guess is that you will most likely need to divide and conquer to discover which components (even if it is just overall application runtime) are necessary to produce the performance reduction. As already indicated, providing a short, complete test case here may also get you useful advice.
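
If you want to see whether WDDM interference is part of what’s happening, one option (an untested sketch; buffer size and repetition count are arbitrary) is to launch the same kernel many times in a standalone program and compare the fastest and slowest individual launches:

// Sketch (untested): launch the same kernel many times and record the fastest
// and slowest individual launch. A large spread (most launches fast, with
// occasional very slow ones) is a hint that the GPU is being shared with
// graphics/display work rather than the kernel itself being slow.
#include <cstdio>
#include <cuda_runtime.h>

extern "C" __global__ void clear_buffer(float4* accum_buffer, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accum_buffer[i] = make_float4(0, 0, 0, 0);
    }
}

int main() {
    const int block_size = 128;
    const int n = 1 << 20;                                   // ~16 MB of float4
    const int grid_size = (n + block_size - 1) / block_size;

    float4* buffer = nullptr;
    cudaMalloc(&buffer, n * sizeof(float4));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float min_ms = 1e30f, max_ms = 0.0f;
    for (int r = 0; r < 200; ++r) {
        cudaEventRecord(start);
        clear_buffer<<<grid_size, block_size>>>(buffer, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (r > 0) {                        // skip the first (warm-up) launch
            min_ms = ms < min_ms ? ms : min_ms;
            max_ms = ms > max_ms ? ms : max_ms;
        }
    }
    printf("fastest launch: %.3f ms, slowest launch: %.3f ms\n", min_ms, max_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(buffer);
    return 0;
}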