CUDA might not be working properly and other warnings

I have just started with CUDA. I was running this program from the tutorial page:

#include <iostream>
#include <math.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}
I got the following warnings:
==4312== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: (I can’t understand much on that page; I need directions in layman’s terms.)
==4312== Warning: Found 42 invalid records in the result.
==4312== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.

Also, even though I am using a 940MX (2 GB), my profiled time is slower than the author’s GT 740M, which is rated lower for CUDA.

What is it you are asking?

Do you have a multi-GPU setup? If so, please describe what GPUs you have.
What operating system are you running on?
Which CUDA version are you using?

This code probably isn’t testing any of the things about your GPU that you think it might be testing. Mostly this code is testing the CPU-to-GPU link, which has almost nothing to do with any differences between the 940MX and the GT 740M.

Code designed for beginners who are learning is rarely useful for performance analysis, or for assessing the relative performance of two setups. Furthermore, even when it does show something about the difference, it’s often not what the beginner expects.
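For reference, the tutorial this code comes from later parallelizes the same kernel with a grid-stride loop. A sketch of that version is below; its timing reflects the GPU itself far more than the single-thread `<<<1, 1>>>` launch in the code above.

```cuda
// Parallel version of the same kernel, using a grid-stride loop
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

// Launched with enough blocks to cover all N elements:
//   int blockSize = 256;
//   int numBlocks = (N + blockSize - 1) / blockSize;
//   add<<<numBlocks, blockSize>>>(N, x, y);
```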

Hi txbob.

I get this warning when I use the command line GPU profiler that comes with the CUDA Toolkit for the above-written program.


I’m using Windows 10 64 bit
I have an Intel 620 (4 GB) as the other graphics adapter
CUDA version is 9.2.88

This is the profile result:

==4312== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.23495s         1  1.23495s  1.23495s  1.23495s  add(int, float*, float*)

The time taken by my GPU is 1.23495s, while the author in the link [] mentions that his GT 740M takes about half a second.

I want to know how I can get rid of these warnings. Does it mean something important needs to be repaired, or is something wrong with my laptop?

I’ll be grateful for the help!

As per the documentation given here, my GPU, a GTX 1060 Max-Q, is not in the list. Is the GTX 1060 Max-Q design the same as the GTX 1060 (except for some slight performance difference)? Although I followed all the necessary steps to install tensorflow-gpu, when importing TensorFlow it doesn’t show me whether CUDA is operating or not. Please find below the screenshot. Confirmation is required on whether my GPU, the GTX 1060 Max-Q, is supported by CUDA or not.

Type “help”, “copyright”, “credits” or “license” for more information.

import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2018-06-30 22:13:29.571907: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\platform\] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-06-30 22:13:30.029579: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\] Found device 0 with properties:
name: GeForce GTX 1060 with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.3415
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.96GiB
2018-06-30 22:13:30.035259: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\] Adding visible gpu devices: 0
2018-06-30 22:13:31.874460: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-30 22:13:31.879478: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\] 0
2018-06-30 22:13:31.881696: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\] 0: N
2018-06-30 22:13:31.884738: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4724 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1
2018-06-30 22:13:32.211393: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\common_runtime\] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1



I don’t understand how anything that you have posted is relevant to this thread. If you are asking a new question of your own, it might be better to make a new thread (to maintain the quality of this thread). If you are replying to my query, I’m unable to understand it, because my problem has nothing to do with TensorFlow.

As you say, the GTX 1060 Max-Q is simply a variant of the standard GTX 1060. It supports CUDA with compute capability 6.1, as correctly shown by the log snippet you posted.

It looks like you are on a Windows 10 system, which uses a WDDM 2.x driver. If so, the amount of free memory (freeMemory: 4.96GiB) is as expected given the amount of physical memory (totalMemory: 6.00GiB) on the card.
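If you ever want to check those numbers programmatically, the CUDA runtime exposes them through cudaMemGetInfo(). A minimal sketch (not part of the original exchange):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  size_t freeBytes = 0, totalBytes = 0;

  // Query free and total memory on the current device
  cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  std::printf("free: %.2f GiB, total: %.2f GiB\n",
              freeBytes / (1024.0 * 1024.0 * 1024.0),
              totalBytes / (1024.0 * 1024.0 * 1024.0));
  return 0;
}
```

On a WDDM system the free figure will typically be noticeably lower than the total, because the Windows driver reserves part of the card's memory.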

How you need to configure TensorFlow to make use of this GPU I cannot say, because I do not use TensorFlow.


Please post your questions in new topics.

I can’t explain the warning concerning Unified Memory.

The invalid records warning most commonly comes about when a GPU kernel is still executing when a program terminates. However, that doesn’t appear to be the case here. You could try putting a

cudaProfilerStop();

or simply:

cudaDeviceReset();

statement at the end of your program. That may or may not help. cudaProfilerStop() will require inclusion of the cuda_profiler_api.h header file in your program.
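Concretely, the end of main() would then look something like this (a sketch, assuming the code from the first post):

```cuda
#include <cuda_profiler_api.h>  // needed for cudaProfilerStop()

  // ... at the end of main(), after the error check and cudaFree calls:

  // Flush any buffered profiler records before the process exits
  cudaProfilerStop();

  // Or, more simply, tear down the device context, which also
  // flushes outstanding records (no extra header required)
  cudaDeviceReset();

  return 0;
```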

Regarding performance: are you building a debug or a release project in Visual Studio?

I assume that when you say “command-line profiler” you are referring to nvprof.

Visual Studio also has a built-in profiling capability you can try if you wish.


cudaDeviceReset(); did not help. I am getting the same list of warnings.

This is what I’m doing:

  1. Copying the above code into Notepad and saving it as a file
  2. Running the command prompt and giving the following command: nvcc -o add
  3. Now I get an executable named add
  4. I run nvprof .\add in the command prompt
  5. The above warnings showed up
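Spelled out, those commands would look like this (assuming the source file was saved as add.cu; the actual filename is not shown above and is only an illustration):

```
nvcc -o add add.cu    # compile the CUDA source into an executable named "add"
nvprof .\add          # run the executable under the profiler (Windows path syntax)
```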

I guess what you are saying:
“The invalid records warning most commonly comes about when a GPU kernel is still executing when a program terminates”
may be correct, because of this warning:
==10744== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
I don’t think it should be running out of memory with this basic program.
I’m not using Visual Studio (it’s actually my first time around Visual Studio and CUDA, so kindly pardon me for silly mistakes).

What do you think should help?
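One common diagnostic that may help narrow this down (a sketch using standard CUDA runtime error checking; this was not suggested in the thread above) is to check the error state right after the kernel launch and again after synchronizing, to see whether the kernel itself is failing:

```cuda
  // After the kernel launch:
  add<<<1, 1>>>(N, x, y);

  // Check for launch errors (bad launch configuration, etc.)
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess)
    std::cout << "Launch error: " << cudaGetErrorString(err) << std::endl;

  // Check for errors that occur while the kernel is running
  // (out-of-bounds accesses, device-side asserts, ...)
  err = cudaDeviceSynchronize();
  if (err != cudaSuccess)
    std::cout << "Kernel error: " << cudaGetErrorString(err) << std::endl;
```

If both checks report cudaSuccess, the kernel ran to completion and the warnings are more likely a profiler or configuration issue than a problem with the program.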