Improving GPU Performance by Reducing Instruction Cache Misses

Originally published at: https://developer.nvidia.com/blog/improving-gpu-performance-by-reducing-instruction-cache-misses-2/

GPUs are specially designed to crunch through massive amounts of data at high speed. They have a large number of compute resources, called streaming multiprocessors (SMs), and an array of facilities to keep them fed with data: high bandwidth to memory, sizable data caches, and the capability to switch to other teams of workers (warps)…

This post combines multiple features of the Nsight Compute tool to analyze the performance of a particular workload. Please let us know if you have questions about the presentation or the specifics of using the tool.

This is a question regarding homomorphic encryption (HE) in the federated learning framework developed by NVIDIA in Clara 4.0:

In a scenario where the goal is to foster collaboration among competing companies in a market, companies participating as clients in Federated Learning (FL) each hold their own decryption keys to access the model updates they receive from the server. However, I’m curious how updates encrypted by other clients are handled, given that no client possesses the keys to decrypt another client’s updates. Could someone please clarify this? Thank you!

@khaliliamir90 – Did you mean to post this on the Federated Learning with Homomorphic Encryption post?

Yes, my bad. I posted my message there after realizing my mistake. Thanks!