nvprof always shows the warning "This can happen if device ran out of memory or if a device kernel was stopped due to an assertion", even on a simple Hello World GPU program

This is the entire code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

__global__ void helloFromGPU(void)
{
	printf("Hello World from GPU!\n");
}

int main(void)
{
	helloFromGPU<<<1, 1>>>();
	cudaDeviceReset();
	return 0;
}

but I always get this warning:

D:\Programing\CudaTest\x64\Debug>nvprof CudaTest
==14844== NVPROF is profiling process 14844, command: CudaTest
Hello World from GPU!
==14844== Profiling application: CudaTest
==14844== Warning: Found 41 invalid records in the result.
==14844== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==14844== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  45.215us         1  45.215us  45.215us  45.215us  helloFromGPU(void)
      API calls:   77.18%  121.10ms         1  121.10ms  121.10ms  121.10ms  cudaLaunch
                   22.52%  35.337ms         1  35.337ms  35.337ms  35.337ms  cudaDeviceReset
                    0.23%  363.82us        55  6.6140us     255ns  166.32us  cuDeviceGetAttribute
                    0.06%  90.954us         1  90.954us  90.954us  90.954us  cuDeviceGetName
                    0.01%  9.4530us         1  9.4530us  9.4530us  9.4530us  cuDeviceTotalMem
                    0.00%  6.3870us         1  6.3870us  6.3870us  6.3870us  cudaConfigureCall
                    0.00%  2.0430us         2  1.0210us     255ns  1.7880us  cuDeviceGetCount
                    0.00%     767ns         1     767ns     767ns     767ns  cuDeviceGet

D:\Programing\CudaTest\x64\Debug>nvprof CudaTest
==13116== NVPROF is profiling process 13116, command: CudaTest
Hello World from GPU!
==13116== Profiling application: CudaTest
==13116== Warning: Found 46 invalid records in the result.
==13116== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==13116== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  45.183us         1  45.183us  45.183us  45.183us  helloFromGPU(void)
      API calls:   71.63%  99.948ms         1  99.948ms  99.948ms  99.948ms  cudaLaunch
                   28.05%  39.142ms         1  39.142ms  39.142ms  39.142ms  cudaDeviceReset
                    0.24%  340.82us        48  7.1000us     255ns  161.73us  cuDeviceGetAttribute
                    0.06%  87.377us         1  87.377us  87.377us  87.377us  cuDeviceGetName
                    0.00%  6.1310us         1  6.1310us  6.1310us  6.1310us  cudaConfigureCall
                    0.00%  5.8760us         1  5.8760us  5.8760us  5.8760us  cuDeviceTotalMem
                    0.00%  2.3000us         3     766ns     255ns  1.7890us  cuDeviceGetCount
                    0.00%     766ns         2     383ns     255ns     511ns  cuDeviceGet

D:\Programing\CudaTest\x64\Debug>nvprof CudaTest
==4692== NVPROF is profiling process 4692, command: CudaTest
Hello World from GPU!
==4692== Profiling application: CudaTest
==4692== Warning: Found 23 invalid records in the result.
==4692== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==4692== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  44.991us         1  44.991us  44.991us  44.991us  helloFromGPU(void)
      API calls:   68.18%  86.218ms         1  86.218ms  86.218ms  86.218ms  cudaLaunch
                   31.45%  39.769ms         1  39.769ms  39.769ms  39.769ms  cudaDeviceReset
                    0.29%  366.12us        73  5.0150us     255ns  165.56us  cuDeviceGetAttribute
                    0.07%  93.253us         1  93.253us  93.253us  93.253us  cuDeviceGetName
                    0.01%  6.8980us         1  6.8980us  6.8980us  6.8980us  cuDeviceTotalMem
                    0.00%  6.1320us         1  6.1320us  6.1320us  6.1320us  cudaConfigureCall
                    0.00%  2.2990us         2  1.1490us     255ns  2.0440us  cuDeviceGetCount
                    0.00%     766ns         1     766ns     766ns     766ns  cuDeviceGet

Even this file, which doesn't launch any kernel, produces the warning:

#include <cuda_runtime.h>
#include "device_launch_parameters.h"

#include "common.h"

int main(int argc, char *argv[]) {
	int iDev = 0;
	cudaDeviceProp iProp;
	cudaGetDeviceProperties(&iProp, iDev);
	printf("Device %d: %s\n", iDev, iProp.name);
	printf("Number of multiprocessors: %d\n", iProp.multiProcessorCount);
	printf("Total amount of constant memory: %4.2f KB\n",
		iProp.totalConstMem / 1024.0);
	printf("Total amount of shared memory per block: %4.2f KB\n",
		iProp.sharedMemPerBlock / 1024.0);
	printf("Total number of registers available per block: %d\n",
		iProp.regsPerBlock);
	printf("Warp size: %d\n", iProp.warpSize);
	printf("Maximum number of threads per block: %d\n", iProp.maxThreadsPerBlock);
	printf("Maximum number of threads per multiprocessor : %d\n",
		iProp.maxThreadsPerMultiProcessor);
	printf("Maximum number of warps per multiprocessor: %d\n",
		iProp.maxThreadsPerMultiProcessor / 32);
	cudaDeviceReset();
	return EXIT_SUCCESS;
}

D:\Programing\CudaTest\x64\Debug>nvprof CudaTest
==7204== NVPROF is profiling process 7204, command: CudaTest
Device 0: GeForce GTX 1080
Number of multiprocessors: 20
Total amount of constant memory: 64.00 KB
Total amount of shared memory per block: 48.00 KB
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum number of threads per multiprocessor : 2048
Maximum number of warps per multiprocessor: 64
==7204== Profiling application: CudaTest
==7204== Warning: Found 52 invalid records in the result.
==7204== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==7204== Profiling result:
No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   49.97%  478.27us        42  11.387us     255ns  259.32us  cuDeviceGetAttribute
                   34.57%  330.86us         1  330.86us  330.86us  330.86us  cudaGetDeviceProperties
                    9.69%  92.743us         1  92.743us  92.743us  92.743us  cuDeviceGetName
                    4.72%  45.221us         1  45.221us  45.221us  45.221us  cudaDeviceReset
                    0.64%  6.1310us         1  6.1310us  6.1310us  6.1310us  cuDeviceTotalMem
                    0.27%  2.5550us         3     851ns     255ns  1.7890us  cuDeviceGetCount
                    0.13%  1.2780us         2     639ns     256ns  1.0220us  cuDeviceGet

Also, when I try to profile gld_efficiancy (on another CUDA file), I get no result because of this warning.

This is the project: https://drive.google.com/open?id=18oYuKoFEG9yDlv9zTrkVly0-1BVAdTzP

I have no idea how to solve this now.

Any help is appreciated.
Thanks.

PS. I build with Visual Studio Community 2017 by clicking Build -> Rebuild Solution, then run nvprof on the resulting .exe. I have tried restarting the computer, reinstalling Visual Studio, reinstalling CUDA, and matching the GeForce Experience driver to the one shipped with CUDA, but nothing changed. Two days ago I didn't have this problem and nvprof was working fine; now every time I run nvprof it shows this warning.
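The warning names two possible causes: the device running out of memory, or a kernel being stopped by an assertion. One way to check whether either applies is to test every CUDA call in the Hello World program and print the free device memory. The following is only a minimal sketch: the helper `check` and the function `runHelloChecked` are hypothetical names introduced here, not part of the original project.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void helloFromGPU(void)
{
    printf("Hello World from GPU!\n");
}

// Hypothetical helper (not from the original post): reports a failed CUDA
// call on stderr and returns 1, or returns 0 if the call succeeded.
static int check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

// Hypothetical wrapper: runs the kernel with full error checking.
// Returns 0 only if every step succeeded.
int runHelloChecked(void)
{
    // Report how much device memory is free before launching anything.
    size_t freeBytes = 0, totalBytes = 0;
    if (check(cudaMemGetInfo(&freeBytes, &totalBytes), "cudaMemGetInfo"))
        return 1;
    printf("Device memory: %zu MiB free of %zu MiB\n",
           freeBytes >> 20, totalBytes >> 20);

    helloFromGPU<<<1, 1>>>();

    // Launch-time errors (for example an invalid configuration).
    if (check(cudaGetLastError(), "kernel launch"))
        return 1;

    // Execution-time errors; a device-side assert would be reported here.
    if (check(cudaDeviceSynchronize(), "kernel execution"))
        return 1;

    return check(cudaDeviceReset(), "cudaDeviceReset");
}

int main(void)
{
    return runHelloChecked();
}
```

If this program prints the greeting, reports plenty of free memory, and exits with code 0 while nvprof still shows the warning, that would suggest the warning does not come from the application itself.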

What happens if you profile a sample project such as vectorAdd?

What driver version is running on your GPU?

This is the result with deviceQuery:

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v9.1\bin\win64\Release>nvprof deviceQuery
deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

==6256== NVPROF is profiling process 6256, command: deviceQuery
Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1080"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8192 MBytes (8589934592 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1759 MHz (1.76 GHz)
  Memory Clock rate:                             5005 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.1, NumDevs = 1
Result = PASS
==6256== Profiling application: deviceQuery
==6256== Warning: Found 23 invalid records in the result.
==6256== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==6256== Profiling result:
No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   43.55%  368.93us         1  368.93us  368.93us  368.93us  cudaGetDeviceProperties
                   42.28%  358.20us        71  5.0450us     255ns  166.83us  cuDeviceGetAttribute
                   10.52%  89.165us         1  89.165us  89.165us  89.165us  cuDeviceGetName
                    2.38%  20.184us         1  20.184us  20.184us  20.184us  cudaSetDevice
                    0.66%  5.6210us         1  5.6210us  5.6210us  5.6210us  cuDeviceTotalMem
                    0.39%  3.3210us         3  1.1070us     511ns  2.2990us  cuDeviceGetCount
                    0.09%     767ns         2     383ns     256ns     511ns  cuDeviceGet
                    0.06%     511ns         1     511ns     511ns     511ns  cudaGetDeviceCount
                    0.03%     256ns         1     256ns     256ns     256ns  cudaDriverGetVersion
                    0.03%     256ns         1     256ns     256ns     256ns  cudaRuntimeGetVersion

And with vectorAdd:

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v9.1\bin\win64\Debug>nvprof vectorAdd
[Vector addition of 50000 elements]
==5052== NVPROF is profiling process 5052, command: vectorAdd
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
==5052== Profiling application: vectorAdd
==5052== Warning: Found 37 invalid records in the result.
==5052== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==5052== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   59.02%  34.144us         2  17.072us  16.288us  17.856us  [CUDA memcpy HtoD]
                   27.38%  15.839us         1  15.839us  15.839us  15.839us  [CUDA memcpy DtoH]
                   13.61%  7.8720us         1  7.8720us  7.8720us  7.8720us  vectorAdd(float const *, float const *, float*, int)
      API calls:   77.79%  129.28ms         3  43.093ms  4.3440us  129.27ms  cudaMalloc
                   21.25%  35.317ms         1  35.317ms  35.317ms  35.317ms  cuDevicePrimaryCtxRelease
                    0.35%  578.68us         1  578.68us  578.68us  578.68us  cuModuleUnload
                    0.22%  361.01us         3  120.34us  98.107us  154.83us  cudaMemcpy
                    0.21%  351.04us        59  5.9490us     255ns  165.30us  cuDeviceGetAttribute
                    0.09%  148.69us         3  49.564us  9.1970us  117.52us  cudaFree
                    0.06%  95.808us         1  95.808us  95.808us  95.808us  cuDeviceGetName
                    0.02%  29.892us         1  29.892us  29.892us  29.892us  cudaLaunch
                    0.00%  7.1530us         1  7.1530us  7.1530us  7.1530us  cuDeviceTotalMem
                    0.00%  4.5980us         1  4.5980us  4.5980us  4.5980us  cudaConfigureCall
                    0.00%  4.0880us         3  1.3620us     255ns  3.3220us  cudaSetupArgument
                    0.00%  3.0660us         1  3.0660us  3.0660us  3.0660us  cudaGetLastError
                    0.00%  2.2990us         3     766ns     255ns  1.7890us  cuDeviceGetCount
                    0.00%     511ns         1     511ns     511ns     511ns  cuDeviceGet

I use Visual Studio Community 2017 version 15.4.5 and GeForce Experience driver version 388.19.

PS. The gld_efficiency problem was my mistake: I had misspelled it as gld_efficiancy. With the correct argument I get the expected result, but I don't know whether the reported running time is affected by this warning.
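One way to cross-check the kernel time that nvprof reports is to time the kernel from inside the program with CUDA events. This is only a sketch under assumptions: the function name `timeKernelMs` is hypothetical, the Hello World kernel stands in for whatever kernel is being profiled, and on a WDDM device the event timing can include some driver batching overhead, so the two numbers should be expected to agree only roughly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void helloFromGPU(void)
{
    printf("Hello World from GPU!\n");
}

// Hypothetical helper: returns the elapsed time of one kernel launch in
// milliseconds, or a negative value if any CUDA call failed.
float timeKernelMs(void)
{
    cudaEvent_t start, stop;
    float ms = -1.0f;

    if (cudaEventCreate(&start) != cudaSuccess)
        return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    // Warm-up launch so one-time context setup is not included in the timing.
    helloFromGPU<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();

    if (err == cudaSuccess) {
        // Timed launch, bracketed by two events on the default stream.
        cudaEventRecord(start);
        helloFromGPU<<<1, 1>>>();
        cudaEventRecord(stop);

        // Wait until the stop event has completed, then read the interval.
        err = cudaEventSynchronize(stop);
        if (err == cudaSuccess)
            err = cudaEventElapsedTime(&ms, start, stop);
    }
    if (err != cudaSuccess)
        ms = -1.0f;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void)
{
    float ms = timeKernelMs();
    if (ms < 0.0f) {
        fprintf(stderr, "Timing failed\n");
        return 1;
    }
    printf("Kernel time measured with CUDA events: %.3f us\n", ms * 1000.0f);
    cudaDeviceReset();
    return 0;
}
```

Running this with and without nvprof and comparing the printed time against the "GPU activities" row gives a rough indication of whether the profiler's numbers are plausible despite the warning.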

What is the result when you run

nvidia-smi

in a command prompt?

What is the result when you run

nvprof --version

in a command prompt?

Here are the results of those two commands:

C:\Users\SaintTail>nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2017 NVIDIA Corporation
Release version 9.1.85 (21)

C:\Users\SaintTail>cd C:\Program Files\NVIDIA Corporation\NVSMI

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi
Sun Jan 21 01:28:30 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 388.19                 Driver Version: 388.19                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080   WDDM  | 00000000:01:00.0  On |                  N/A |
|  0%   45C    P8    14W / 240W |    402MiB /  8192MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1144    C+G   Insufficient Permissions                   N/A      |
|    0      2104    C+G   ...13411.0_x64__8wekyb3d8bbwe\Video.UI.exe N/A      |
|    0      4484    C+G   ...x64__8wekyb3d8bbwe\Microsoft.Photos.exe N/A      |
|    0      4924    C+G   ...6)\Google\Chrome\Application\chrome.exe N/A      |
|    0      5952    C+G   C:\Windows\explorer.exe                    N/A      |
|    0      6372    C+G   ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A      |
|    0      6528    C+G   ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A      |
|    0     10552    C+G   ...app\AgileBits.OnePassword.Desktop.exe N/A      |
|    0     12396    C+G   ...mmersiveControlPanel\SystemSettings.exe N/A      |
+-----------------------------------------------------------------------------+

You may want to file a bug at developer.nvidia.com

In your bug you can link back to this forum thread.

The title of the bug should be concise, something like:

nvprof on windows CUDA 9.1 always warns about invalid records

I created the bug report at https://developer.nvidia.com/nvidia_bug/2049717

Thanks for your help

What happened with this issue? Was it fixed in a newer release?
I'm using CUDA 9.2 and I still see these warning messages.

Thanks

This should be fixed in CUDA 10.

Great. Thanks.