nvprof warning: Metric "gld_throughput" cannot be found on device 0

Dear Forum

I’m reading the Wrox book “Professional CUDA C Programming” and using the downloadable code samples. Chapter 3 introduces nvprof. However, when I try to profile, I get:

nvprof --metrics gld_throughput ./sumMatrix 32 32
======== Warning: Metric "gld_throughput" cannot be found on device 0.
==4539== NVPROF is profiling process 4539, command: ./sumMatrix 32 32
sumMatrixOnGPU2D <<<(128,128), (32,32)>>> elapsed 0 ms
==4539== Profiling application: ./sumMatrix 32 32
==4539== Profiling result:
No events/metrics were profiled.

My device should have the metric:

nvprof --query-metrics | grep gld_throughput
gld_throughput: Global memory load throughput
nc_gld_throughput: Non-coherent global memory load throughput
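
Note that a plain grep matches any line containing the string, which is why nc_gld_throughput shows up as well. A minimal sketch of a stricter check, assuming the listing prints each metric as "name: description" as in the two lines above (has_metric is a hypothetical helper name, not part of nvprof):

```shell
# has_metric NAME: read an `nvprof --query-metrics` listing on stdin and
# succeed only if NAME appears as a metric name at the start of a line
# (leading whitespace allowed). Hypothetical helper; assumes the
# "name: description" layout and a NAME made of letters, digits and "_".
has_metric() {
    grep -Eq "^[[:space:]]*$1:"
}

# Example (requires nvprof on PATH):
#   nvprof --query-metrics | has_metric gld_throughput && echo "listed"
```

This only confirms that the name is listed by the query; it says nothing about why the profiling run itself reports the metric as missing.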

I compile with Makefile options:

%: %.cu
nvcc -O2 -arch=sm_35 -o $@ $< -lcudadevrt --relocatable-device-code true
%: %.c
gcc -O2 -std=c99 -o $@ $<

I am running a GT730-based card on Ubuntu 14.04. The problem persists with both Toolkit 6.5 and 7.5.

I’ve tried Google and Forum search to no avail. Any ideas much appreciated!

Best regards,
mortennp

There are two different GT 730 products out there. Run the CUDA sample deviceQuery on your system and post the results here. Also try running your code with cuda-memcheck to see if any errors are reported.
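
The line of interest in the deviceQuery output is the compute capability, which should agree with the -arch value passed to nvcc (sm_35 corresponds to 3.5). A rough sketch for pulling it out, assuming the output contains a line of the form "CUDA Capability Major/Minor version number: X.Y" (compute_capability is a hypothetical helper name):

```shell
# compute_capability: read deviceQuery output on stdin and print the
# value from the first "CUDA Capability Major/Minor version number" line.
# Hypothetical helper; assumes that label is followed by ": X.Y".
compute_capability() {
    sed -n 's/^.*CUDA Capability Major\/Minor version number:[[:space:]]*\([0-9][0-9]*\.[0-9][0-9]*\).*$/\1/p' | head -n 1
}

# Example (run from the directory containing the deviceQuery sample):
#   ./deviceQuery | compute_capability
```

On a machine with more than one GPU this prints only the first device's value, which is a simplification made for this sketch.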

Thank you, txbob.

Memcheck seems fine:

cuda-memcheck ./sumMatrix 32 32
========= CUDA-MEMCHECK
sumMatrixOnGPU2D <<<(128,128), (32,32)>>> elapsed 0 ms
========= ERROR SUMMARY: 0 errors

Output from deviceQuery:

./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 730"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 2048 MBytes (2147287040 bytes)
( 2) Multiprocessors, (192) CUDA Cores/MP: 384 CUDA Cores
GPU Max Clock rate: 902 MHz (0.90 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GT 730
Result = PASS

/mortennp

I don’t have any great ideas then. I don’t happen to have a GT730, but I have a GT640 which is quite similar (cc3.5 GK208 GPU) on CUDA 7.5, and I can profile the metric gld_throughput on various codes without difficulty on it.

Do you get the same error message if you attempt to profile, for example, a CUDA sample code such as vectorAdd?

What is the output of nvidia-smi on your machine?

Yes, same behaviour with vectorAdd from Toolkit samples.

nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 730      Off  | 0000:01:00.0     N/A |                  N/A |
| N/A   44C    P8    N/A /  N/A |    365MiB /  2047MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
+-----------------------------------------------------------------------------+

Thank you for responding, txbob!

I’m pretty much out of ideas. I suspect either:

  1. A bug in nvprof
  2. A corrupted software install on your machine, of some sort.

If you want to pursue item 1, file a bug at developer.nvidia.com.
If you want to pursue item 2, try reinstalling CUDA or your OS.

I’m not sure how you installed CUDA, but you might want to make sure that the nouveau driver has been removed from your machine.

Instructions for that are in the CUDA installation guide for Linux:

http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-nouveau
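
A quick way to check whether nouveau is still loaded is to look for it in the kernel's module list. A minimal sketch, assuming a Linux system where /proc/modules is readable (nouveau_loaded is a hypothetical helper name):

```shell
# nouveau_loaded [FILE]: succeed if a module named "nouveau" is listed in
# FILE, which defaults to /proc/modules (the data that lsmod reports).
# Hypothetical helper; the optional FILE argument is only there so the
# check can be tried against a saved module listing.
nouveau_loaded() {
    grep -q '^nouveau ' "${1:-/proc/modules}"
}

# Example:
#   if nouveau_loaded; then echo "nouveau is still loaded"; fi
```

If the check succeeds, the guide linked above describes how to disable the driver.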

Try compiling without the Makefile options:

$ nvcc -arch=sm_35 kernel.cu -o kernel

Then run:

$ nvprof --devices 0 --metrics gld_throughput ./kernel

I am reading the same book (up to chapter 7 now) and all the code samples have worked as expected (with CUDA v7.0).