CUDA invalid records warning

I just installed CUDA on Windows 10 with Visual Studio 2015 and copied the code from a simple example.

Everything compiles and runs without errors; however, when I run it with nvprof I get these warnings:

==10564== Warning: Found 48 invalid records in the result.
==10564== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==10564== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 198.75ms 1 198.75ms 198.75ms 198.75ms add(int, float*, float*)
API calls: 51.51% 198.90ms 1 198.90ms 198.90ms 198.90ms cudaDeviceSynchronize
33.78% 130.46ms 2 65.232ms 15.955ms 114.51ms cudaMallocManaged
8.15% 31.483ms 1 31.483ms 31.483ms 31.483ms cuDevicePrimaryCtxRelease
5.93% 22.895ms 1 22.895ms 22.895ms 22.895ms cudaLaunch
0.42% 1.6040ms 2 801.99us 777.10us 826.88us cudaFree
0.14% 548.70us 47 11.674us 284ns 275.63us cuDeviceGetAttribute
0.04% 137.96us 1 137.96us 137.96us 137.96us cuModuleUnload
0.03% 113.21us 1 113.21us 113.21us 113.21us cuDeviceGetName
0.00% 15.645us 1 15.645us 15.645us 15.645us cuDeviceGet
0.00% 5.1200us 1 5.1200us 5.1200us 5.1200us cuDeviceTotalMem
0.00% 2.2750us 1 2.2750us 2.2750us 2.2750us cudaConfigureCall
0.00% 1.7070us 3 569ns 285ns 853ns cudaSetupArgument
0.00% 1.4220us 3 474ns 284ns 853ns cuDeviceGetCount

==10564== Unified Memory profiling result:
Device “GeForce GTX 1060 3GB (0)”
Count Avg Size Min Size Max Size Total Size Total Time Name
2048 4.0000KB 4.0000KB 4.0000KB 8.000000MB 10.62628ms Host To Device
256 32.000KB 32.000KB 32.000KB 8.000000MB 2.450205ms Device To Host

In the example, after the author incorporated GPU acceleration and multi-threading, the total run time was 94 microseconds. It seems that although everything is running, there is no acceleration. Any help would be appreciated.


Can you provide the exact code you are running? The tutorial goes through several variations of code, and it's not clear which one you are running. Please just copy and paste the code you are running, rather than trying to describe in English where you are at.

regarding this:

" It seems although everything is running there is no acceleration."

It’s unclear why you would say there is no acceleration.

The author started out with a naive piece of GPU code that took almost 500 milliseconds and ended up with code that takes 94 microseconds. It seems to me there is acceleration. If you are referring to the overall runtime of your program, that is a separate issue.

Sorry, I meant my program has no acceleration while the author did get acceleration.

Here is the code I am using:

#include <iostream>
#include <math.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 256>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}

Are you using CUDA 8?

Ah, I made a stupid mistake; I forgot to add the following:

int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);

Now the run time is down to 86 microseconds. Sorry about that. It seems to be working as expected now. I am curious though what is the significance of these warnings?


I’m using CUDA 9.1

I’m not sure about the warnings. I’m not able to reproduce that observation.

It’s just a guess, but you could try putting:

#include <cuda_profiler_api.h>

at the top of the code and:

cudaProfilerStop();

at the end of the code (after the last call to cudaFree()).

Other than that, I don’t have any ideas.

Interesting. I tried adding the include and profiler stop but the warnings are still there. I’ll have to read a bit more about it.

Thanks again for your help

Dear Jeff,
It seems nvprof (at least in CUDA 9.1, release version 9.1.85 (21)) can issue the warning “1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.” when the GPU application terminates abnormally.

In my case using --device-buffer-size had no effect on the warning.
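For anyone trying that option: as far as I can tell it takes its argument in megabytes, with a default of 8 MB per context. A sketch of the invocation (the binary name here is hypothetical):

```shell
# Raise the profiler's per-context device buffer from the default 8 MB to 64 MB
nvprof --device-buffer-size 64 ./my_app
```

That it had no effect in this case suggests the records were invalidated by the abnormal termination itself rather than by buffer pressure.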


Hi there, I am having a similar error while profiling the DeepBench GEMM benchmark; please see below:

:~/DeepBench/code$ nvprof bin/gemm_bench
==8878== NVPROF is profiling process 8878, command: bin/gemm_bench
Running training benchmark

m       n      k      a_t     b_t      precision        time (usec)

1760    16      1760    0    0    half    1474
1760    32      1760    0    0    half    7668
1760    64      1760    0    0    half    7604
1760    128     1760    0    0    half    12516
1760    7000    1760    0    0    half    544465
2048    16      2048    0    0    half    2235
2048    32      2048    0    0    half    12382
2048    64      2048    0    0    half    11025
^C==8878== 2048 128 2048 0 0 halfProfiling application: bin/gemm_bench
==8878== Warning: 364 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.

Also, when I run GEMM without nvprof, the system throws this error: “Network error: Software caused connection abort”.

How can I fix it? Thanks - Vilmara

FYI, when I execute my kernel function with many threads, I get the warning “Warning: 1 records have invalid timestamps due to insufficient device buffer space. You can configure the buffer space using the option --device-buffer-size.”, but the warning is absent if I run the same kernel with just one thread (e.g. <<<1, 1>>> instead of <<<1, 1024>>>). I suppose this has something to do with the output I produce directly from my device functions (i.e. printf() calls).