Unexpected Performance Behavior with CUDA Software Prefetcher, Warm-Up Kernel and GEMV

qikch · December 1, 2025, 2:08pm

Dear Everyone,

I have encountered an unusual observation while developing a CUDA software prefetcher for CUTLASS GEMV. My implementation runs the prefetcher concurrently in a separate stream alongside the CUTLASS GEMV. Here’s what I have observed so far:

Performance Improvement Without Warmup:
When I run only the software prefetcher and the CUTLASS GEMV concurrently (in separate streams), I observe a performance improvement compared to running only the CUTLASS GEMV:

prefetch<<<1, 1024, 0, prefetch_stream>>>(
    d_nemotron_q, elements_per_page, total_matrix_pages, total_elements
);

cudaEventRecord(start, stream);
run_cutlass_gemv_row_major<Gemv>(
    M, K, alpha, beta, d_nemotron_q, d_input_vector, output, empty, stream
);
cudaEventRecord(stop, stream);

Performance Drop After Warmup:
Surprisingly, when I run a warmup kernel for both the prefetcher and the CUTLASS GEMV, the performance improvement essentially disappears. This is counterintuitive because a warmup kernel usually only has a minor impact on benchmark timing, yet in this case, it significantly reduces the observed speedup.

float* warmup_output;

float* warmup_empty;

size_t warmup_size = 1 * 1024;

cudaMalloc(&warmup_output, sizeof(float) * warmup_size);

cudaMalloc(&warmup_empty, sizeof(float) * warmup_size);

prefetch<<<1, 512>>>(d_warmup_matrix, elements_per_page, 1, 1024*1024);

run_cutlass_gemv_row_major<Gemv>(1024, 1024, alpha, beta, d_warmup_matrix, d_warmup_vector, warmup_output, warmup_empty, nullptr);

# then run the benchmark

prefetch<<<1, 1024, 0, prefetch_stream>>>(
        d_nemotron_q, elements_per_page, total_matrix_pages, total_elements);    
cudaEventRecord(start, stream);run_cutlass_gemv_row_major<Gemv>(M, K, alpha, beta, d_nemotron_q, d_input_vector, output, empty, stream);
cudaEventRecord(stop, stream);

I’m trying to understand why the warmup kernel would have such a dramatic effect on the concurrent performance of the prefetcher and CUTLASS GEMV. Any insights or suggestions would be greatly appreciated.

Thank you for your time and help!

Curefab · December 1, 2025, 2:56pm

So with warmup it got faster. What is the unexpected? Why it specifically had an effect on concurrent execution? Or in other words, why the serial execution is so much worse without warmup?

qikch · December 1, 2025, 3:04pm

What’s unexpected isn’t that the warmup makes things faster, it’s how much faster they become. The warmup kernel itself only accounts for about 0.02 ms of overhead.

Yet without the warmup, the GEMV runs much slower, far more than that 0.02 ms can explain.

Even more strangely, when I run GEMV together with my software prefetcher, I see a large performance improvement, but this improvement largely disappears once I perform a warmup run before the prefetcher+GEMV experiment.

In other words, the warmup shouldn’t be able to influence performance to this extent, yet it does. That’s the unexpected behavior I’m trying to understand.

Robert_Crovella · December 1, 2025, 3:19pm

cudaEvent based timing might be giving you information that cannot be interpreted easily (and this is somewhat more likely in a multi-stream environment). If this were my experiment, I would probably double-check my interpretation by using nsight systems.

For a similar reason, I would probably also start by just inspecting the timing of the gemv kernel, without any usage of the prefetch kernel.

If that does indeed show a similar performance progression (~2.8ms to ~0.6ms) and there is nothing obvious in the nsight systems timeline clouding things, then the next step could be to look at nsight compute to find out the pareto of performance limiters. You may start to get some concrete insight that way.

The first time you run a kernel may indeed vary performance-wise from subsequent runs. I don’t have an exhaustive list of all the reasons, but one possible contributor is the state of the caches. Nsight compute will by default invalidate the caches (and you can modify this behavior with profiling switches), so I would also pay attention to the kernel duration performance reported by nsight compute when I got to that step. Another possible contributor to first time behavior is lazy loading. I don’t expect that lazy loading could impact things by 1ms or more, but it could affect things in the microsecond range. A side effect of lazy loading is synchronization, and this is one of the things I would consider. It’s hard to spot that in the nsight systems timeline, but you could look at the cudaEvent report vs. the nsight systems timeline to make inferences about where it may be impacting things, if anywhere.

Curefab · December 1, 2025, 3:20pm

And the measurement is across the complete program duration?

You could measure phases of the program separately.

You could try a different kernel for warmup instead to see if power state and frequency have an influence or this specific kernel.

You could disable LAZY_LOADING with environment variables.

Perhaps somebody else here has a direct explanation.

qikch · December 2, 2025, 12:33pm

Hi Robert,

Thank you for your detailed response. You’re absolutely right. When I inspect the timings in Nsight Systems, I do see a different time than I expected. The time matches the time with warm-up

However, I’m not entirely sure why this discrepancy occurs. I had assumed my software prefetcher was functioning as intended, but I want to diagnose the issue and understand the root cause of the timing difference.

Do you have any advice on how I could investigate this further? I’ve already reviewed my code and confirmed that I’m not recording any timing before the kernel launches.

Thank you for your guidance!

Robert_Crovella · December 2, 2025, 4:12pm

The profilers are the most advanced tools I know of.

qikch · December 2, 2025, 4:27pm

Hi Robert,

Thanks I am aware that this probably not easy to diagnose. I meant more diagnosing the root cause of why cudaEvents shows a different time measurement compared to Nsight Systems. Do you have any advice by any chance? If not then I completely understand

Greg · December 2, 2025, 5:09pm

cudaEvent timestamps is captured at the GPU front end. If you push the cudaEventRecord then have a large amount of additional work pushed by the driver to upload your kernel the recorded timestamp for the event will be significantly before the start of the kernel.

Developer tools use two techniques not available through the CUDA API.

For pre-Blackwell (and for some Blackwell environments) the tool can measure the start timestamp either directly before the kernel launch or the tool can instrument the kernel code to output the start timestamp. The end timestamp is taken when the work completes through a mechanism not available through the current CUDA API.
For Blackwell+ the new hardware event system supports tracing when a grid is launched and completed in the hardware.

cudaEventRecords can run into numerous issues especially when applications are using streams resulting in timestamp values consistent with the hardware front end but not accurate for measuring a kernel.

If you need accurate timing for CUDA kernels then I highly recommend you file a feature request to the CUDA API to add a more accurate method to collect timestamps.

qikch · December 3, 2025, 11:37am

Hi Greg,

Thank you for the extensive explanation, I was wondering if you could tell me how to file a feature request?

Curefab · December 3, 2025, 12:06pm

I think it is the same interface as to report bugs. See the pinned message at the top of the threads list.

Topic		Replies	Views
Software prefetch at kernel level CUDA Programming and Performance	0	434	December 20, 2020
A frustrating question ! call for help About the cuda timer CUDA Programming and Performance	4	1626	February 18, 2009
Strange Runtime behavior CUDA Programming and Performance	7	3254	December 18, 2009
Decreased performances if CUDA kernels are not run continuously Jetson TX2	1	509	June 8, 2018
Getting Different Execution Times of Running Same Kernel Twice CUDA Programming and Performance	1	93	August 13, 2024
Time measurement CUDA Programming and Performance	2	1235	September 13, 2009
NVPROF showing GPU Fault though I am using cudaPrefetch CUDA Programming and Performance	6	671	December 27, 2023
Why warm-up? CUDA Programming and Performance	15	12279	February 13, 2025
Why the measure time for second kernel is extremely short? CUDA Programming and Performance	5	109	May 13, 2025
Repeated CUDA kernel calls get slower, not faster CUDA Programming and Performance	2	82	April 22, 2026

Unexpected Performance Behavior with CUDA Software Prefetcher, Warm-Up Kernel and GEMV

Related topics