CUDA invalid records warning

jeff.2346 · December 31, 2017, 1:27am

I just installed CUDA on Windows 10 with Visual Studio 2015 and copied the code for a simple example:

https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/

Everything compiles and runs without errors; however, when I run it with nvprof I get these warnings:

==10564== Warning: Found 48 invalid records in the result.
==10564== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==10564== Profiling result:
           Type  Time(%)      Time  Calls       Avg       Min       Max  Name
GPU activities:  100.00%  198.75ms      1  198.75ms  198.75ms  198.75ms  add(int, float*, float*)
     API calls:   51.51%  198.90ms      1  198.90ms  198.90ms  198.90ms  cudaDeviceSynchronize
                  33.78%  130.46ms      2  65.232ms  15.955ms  114.51ms  cudaMallocManaged
                   8.15%  31.483ms      1  31.483ms  31.483ms  31.483ms  cuDevicePrimaryCtxRelease
                   5.93%  22.895ms      1  22.895ms  22.895ms  22.895ms  cudaLaunch
                   0.42%  1.6040ms      2  801.99us  777.10us  826.88us  cudaFree
                   0.14%  548.70us     47  11.674us     284ns  275.63us  cuDeviceGetAttribute
                   0.04%  137.96us      1  137.96us  137.96us  137.96us  cuModuleUnload
                   0.03%  113.21us      1  113.21us  113.21us  113.21us  cuDeviceGetName
                   0.00%  15.645us      1  15.645us  15.645us  15.645us  cuDeviceGet
                   0.00%  5.1200us      1  5.1200us  5.1200us  5.1200us  cuDeviceTotalMem
                   0.00%  2.2750us      1  2.2750us  2.2750us  2.2750us  cudaConfigureCall
                   0.00%  1.7070us      3     569ns     285ns     853ns  cudaSetupArgument
                   0.00%  1.4220us      3     474ns     284ns     853ns  cuDeviceGetCount

==10564== Unified Memory profiling result:
Device "GeForce GTX 1060 3GB (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    2048  4.0000KB  4.0000KB  4.0000KB  8.000000MB  10.62628ms  Host To Device
     256  32.000KB  32.000KB  32.000KB  8.000000MB  2.450205ms  Device To Host

In the example, after the author incorporated GPU acceleration and multi-threading, the total run time was 94 microseconds. My kernel, by contrast, takes 198.75 ms. It seems although everything is running there is no acceleration. Any help would be appreciated.

Thanks

Robert_Crovella · December 31, 2017, 1:48am

Can you provide the exact code you are running? The tutorial goes through several variations of the code, and it’s not clear which one you are running. Please just copy and paste the code you are running, rather than trying to describe in English where you are.

Regarding this:

"It seems although everything is running there is no acceleration."

It’s unclear why you would say there is no acceleration.

The author started out with a naive piece of GPU code that took almost 500 milliseconds and ended up with code that takes 94 microseconds. It seems to me there is acceleration. If you are referring to the overall runtime of your program, that is a separate issue.

jeff.2346 · December 31, 2017, 1:59am

Sorry, I meant my program has no acceleration while the author did get acceleration.

Here is the code I am using:

#include <iostream>
#include <math.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  // Grid-stride loop: each thread starts at its global index and
  // advances by the total number of threads in the grid
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 256>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}

Robert_Crovella · December 31, 2017, 2:07am

Are you using CUDA 8?

jeff.2346 · December 31, 2017, 2:08am

Ah, I made a stupid mistake: I forgot to add the following launch configuration, so the kernel was running in a single block of 256 threads:

int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);

Now the run time is down to 86 microseconds. Sorry about that. It seems to be working as expected now. I am curious, though: what is the significance of these warnings?
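For readers following along, the arithmetic behind that launch-configuration change can be sketched on the host. This is plain C++ (it also compiles inside a .cu file), and the helper names blocksFor and elementsPerThread are made up for illustration; they are not part of CUDA or of the tutorial.

```cpp
// Host-only sketch of the launch arithmetic; helper names are hypothetical.

// Blocks needed so that numBlocks * blockSize >= n (integer ceiling division),
// i.e. the same expression as (N + blockSize - 1) / blockSize.
constexpr int blocksFor(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// Iterations the busiest thread performs in the grid-stride loop
//   for (i = index; i < n; i += stride)
// where stride = numBlocks * blockSize.
constexpr int elementsPerThread(int n, int numBlocks, int blockSize) {
    return (n + numBlocks * blockSize - 1) / (numBlocks * blockSize);
}

constexpr int N = 1 << 20;       // 1,048,576 elements, as in the posted code
constexpr int kBlockSize = 256;  // threads per block, as in the posted code
```

With add<<<1, 256>>>, elementsPerThread(N, 1, kBlockSize) is 4096, so each of the 256 threads walks 4096 elements serially. With blocksFor(N, kBlockSize), which is 4096 blocks, every thread handles a single element.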

Thanks

jeff.2346 · December 31, 2017, 2:08am

I’m using CUDA 9.1

Robert_Crovella · December 31, 2017, 2:17am

I’m not sure about the warnings. I’m not able to reproduce that observation.

It’s just a guess, but you could try putting:

#include <cuda_profiler_api.h>

at the top of the code and:

cudaProfilerStop();

at the end of the code (after the last call to cudaFree())

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#flush-profile-data

Other than that, I don’t have any ideas.
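A minimal sketch of where those two lines would go, assuming the corrected launch configuration from earlier in the thread. The CHECK macro and the error checks are an addition for illustration, not part of the tutorial or of the suggestion above; they are there only to help rule out the out-of-memory and kernel-abort causes named in the warning text. Whether cudaProfilerStop() makes the warning go away is not established here.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>  // declares cudaProfilerStop()

// Hypothetical helper (not part of CUDA): abort with a message if a runtime call fails.
#define CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(err_)); \
            std::exit(EXIT_FAILURE); \
        } \
    } while (0)

// Same grid-stride kernel as in the posted code.
__global__ void add(int n, float *x, float *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main()
{
    const int N = 1 << 20;
    float *x = nullptr, *y = nullptr;

    // A failed allocation would show up here rather than as a silent profiler warning.
    CHECK(cudaMallocManaged(&x, N * sizeof(float)));
    CHECK(cudaMallocManaged(&y, N * sizeof(float)));

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    const int blockSize = 256;
    const int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);
    CHECK(cudaGetLastError());       // errors detected at launch time
    CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = std::fmax(maxError, std::fabs(y[i] - 3.0f));
    std::printf("Max error: %f\n", maxError);

    CHECK(cudaFree(x));
    CHECK(cudaFree(y));

    // Placed after the last cudaFree(), as suggested, to flush profile data before exit.
    CHECK(cudaProfilerStop());
    return 0;
}
```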

jeff.2346 · December 31, 2017, 2:42am

Interesting. I tried adding the include and profiler stop but the warnings are still there. I’ll have to read a bit more about it.

Thanks again for your help

wlangdon · March 10, 2018, 3:51pm

Dear Jeff,
It seems nvprof (at least in CUDA 9.1 Release version 9.1.85 (21))
can issue the Warning “1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.”
when the GPU application terminates abnormally.

In my case using --device-buffer-size had no effect on the warning.

Bill

sanchezvr7 · March 29, 2018, 2:08pm

Hi there, I am getting a similar warning when profiling the DeepBench GEMM benchmark; please see below:

:~/DeepBench/code$ nvprof bin/gemm_bench
==8878== NVPROF is profiling process 8878, command: bin/gemm_bench
Running training benchmark
Times

   m     n     k  a_t  b_t  precision  time (usec)
1760    16  1760    0    0       half         1474
1760    32  1760    0    0       half         7668
1760    64  1760    0    0       half         7604
1760   128  1760    0    0       half        12516
1760  7000  1760    0    0       half       544465
2048    16  2048    0    0       half         2235
2048    32  2048    0    0       half        12382
2048    64  2048    0    0       half        11025
^C==8878== 2048 128 2048 0 0 halfProfiling application: bin/gemm_bench
==8878== Warning: 364 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.

Also, when I run GEMM without nvprof, the system throws this error: “Network error: Software caused connection abort”.

How can I fix it? Thanks - Vilmara

hackob.melconian · August 10, 2018, 6:32pm

FYI, when I launch my kernel with many threads, I get this warning: “Warning: 1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.” The warning is absent if I run that same kernel with just one thread (<<<1, 1>>> instead of <<<1, 1024>>>). I suppose this has something to do with the output produced directly from my device functions (i.e. printf() calls).
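One way to test that guess is a small reproducer along the following lines. This is only a sketch: the kernel name chatty and the command-line handling are invented for illustration, and nothing here confirms that device-side printf is actually the cause of the warning.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical test kernel: every thread prints one line via device-side printf.
__global__ void chatty()
{
    printf("hello from thread %d\n", threadIdx.x);
}

int main(int argc, char **argv)
{
    // Threads per block, taken from the command line; defaults to 1.
    int threads = (argc > 1) ? std::atoi(argv[1]) : 1;
    if (threads < 1 || threads > 1024) {
        std::fprintf(stderr, "threads must be in [1, 1024]\n");
        return 1;
    }

    chatty<<<1, threads>>>();

    // Synchronizing waits for the kernel and flushes the device printf buffer.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Running the resulting binary under nvprof once with 1 thread and once with 1024 threads, then repeating both runs with the printf call commented out, would show whether the warning follows the thread count, the printf output, or neither.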
