CUDA might not be working properly and other warnings

I have just started with CUDA. I was running this program from the tutorial page:

#include <iostream>
#include <math.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}
I got the following warnings:
==4312== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: (I can’t understand much on that page; I need directions in layman’s terms.)
==4312== Warning: Found 42 invalid records in the result.
==4312== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.

Also, even though I am using a 940MX (2 GB), my profiled time is slower than the author’s GT 740M, which is rated lower for CUDA.

What is it you are asking?

Do you have a multi-GPU setup? If so, please describe what GPUs you have.
What operating system are you running on?
Which CUDA version are you using?

This code probably isn’t testing any of the things about your GPU that you think it might be testing. Mostly this code is testing the CPU-to-GPU link, which has almost nothing to do with any differences between the 940MX and the GT 740M.

Code designed for beginners who are learning is rarely useful for performance analysis, or for assessing the relative performance of two setups. Furthermore, even when it does show something about the difference, it’s often not what the beginner expects.
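For reference, the tutorial this code comes from later parallelizes the same kernel with a grid-stride loop. A sketch of that version is below; its timing reflects the GPU itself far more than the single-thread `<<<1, 1>>>` launch in the code above.

```cuda
// Parallel version of the same kernel, using a grid-stride loop
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

// Launched with enough blocks to cover all N elements:
//   int blockSize = 256;
//   int numBlocks = (N + blockSize - 1) / blockSize;
//   add<<<numBlocks, blockSize>>>(N, x, y);
```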

Hi txbob.

I get this warning when I use the command line GPU profiler that comes with the CUDA Toolkit for the above-written program.


I’m using Windows 10 64 bit
I have an Intel 620 (4 GB) as the other graphics adapter
CUDA version is 9.2.88

This is the profile result:

==4312== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.23495s         1  1.23495s  1.23495s  1.23495s  add(int, float*, float*)

The time taken by my GPU is 1.23495s, while the author in the link [] mentions that his GT 740M takes about half a second.

I want to know how I can get rid of these warnings. Does it mean something important needs to be repaired, or is something wrong with my laptop?

I’ll be grateful for the help!

As per the documentation given here, my GPU, a GTX 1060 Max-Q, is not in the list. Is the GTX 1060 Max-Q design the same as the GTX 1060 (except for some slight performance difference)? Although I followed all the necessary steps to install tensorflow-gpu, when importing TensorFlow it doesn’t show me whether CUDA is operating or not. Please find below the screenshot. Confirmation is required on whether my GPU, the GTX 1060 Max-Q, is supported by CUDA or not.

Type “help”, “copyright”, “credits” or “license” for more information.

import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2018-06-30 22:13:29.571907: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\platform\] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-06-30 22:13:30.029579: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\] Found device 0 with properties:
name: GeForce GTX 1060 with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.3415
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.96GiB
2018-06-30 22:13:30.035259: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\] Adding visible gpu devices: 0
2018-06-30 22:13:31.874460: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-30 22:13:31.879478: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\] 0
2018-06-30 22:13:31.881696: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\] 0: N
2018-06-30 22:13:31.884738: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4724 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1
2018-06-30 22:13:32.211393: I C:\users\nwani_bazel_nwani\swultrt5\execroot\org_tensorflow\tensorflow\core\common_runtime\] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1



I don’t understand how anything that you have posted is relevant to this thread. If you are asking a new question of your own, it might be better to make a new thread (to maintain the quality of this thread). If you are replying to my query, I’m unable to understand it, because my problem has nothing to do with TensorFlow.

As you say, the GTX 1060 Max-Q is simply a variant of the standard GTX 1060. It supports CUDA with compute capability 6.1, as correctly shown by the log snippet you posted.

It looks like you are on a Windows 10 system, which uses a WDDM 2.x driver. If so, the amount of free memory (freeMemory: 4.96GiB) is as expected given the amount of physical memory (totalMemory: 6.00GiB) on the card.
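If you ever want to check those numbers programmatically, the CUDA runtime exposes them through cudaMemGetInfo(). A minimal sketch (not part of the original exchange):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  size_t freeBytes = 0, totalBytes = 0;

  // Query free and total memory on the current device
  cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  std::printf("free: %.2f GiB, total: %.2f GiB\n",
              freeBytes / (1024.0 * 1024.0 * 1024.0),
              totalBytes / (1024.0 * 1024.0 * 1024.0));
  return 0;
}
```

On a WDDM system the free figure will typically be noticeably lower than the total, because the Windows driver reserves part of the card's memory.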

How you need to configure TensorFlow to make use of this GPU I cannot say, because I do not use TensorFlow.


Please post your questions in new topics.

I can’t explain the warning concerning Unified Memory.

The invalid records warning most commonly comes about when a GPU kernel is still executing when a program terminates. However, that doesn’t appear to be the case here. You could try putting a

cudaProfilerStop();

or simply:

cudaDeviceReset();

statement at the end of your program. That may or may not help. cudaProfilerStop() will require inclusion of the cuda_profiler_api.h header file in your program.
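Concretely, the end of main() would then look something like this (a sketch, assuming the code from the first post):

```cuda
#include <cuda_profiler_api.h>  // needed for cudaProfilerStop()

  // ... at the end of main(), after the error check and cudaFree calls:

  // Flush any buffered profiler records before the process exits
  cudaProfilerStop();

  // Or, more simply, tear down the device context, which also
  // flushes outstanding records (no extra header required)
  cudaDeviceReset();

  return 0;
```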

Regarding performance: are you building a debug or a release project in Visual Studio?

I assume that when you say “command-line profiler” you are referring to nvprof.

Visual Studio also has a built-in profiling capability you can try if you wish.


cudaDeviceReset(); did not help. I am getting the same list of warnings.

This is what I’m doing:

  1. Copying the above code into Notepad and saving it as a file
  2. Running the command prompt and giving the following command: nvcc -o add
  3. Now I get an executable named add
  4. I run nvprof .\add in the command prompt
  5. The above warnings showed up
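Spelled out, those commands would look like this (assuming the source file was saved as add.cu; the actual filename is not shown above and is only an illustration):

```
nvcc -o add add.cu    # compile the CUDA source into an executable named "add"
nvprof .\add          # run the executable under the profiler (Windows path syntax)
```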

I guess what you are saying:
“The invalid records warning most commonly comes about when a GPU kernel is still executing when a program terminates”
may be correct, because of this warning:
==10744== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
I don’t think it should be running out of memory with this basic program.
I’m not using Visual Studio (it’s actually my first time around Visual Studio and CUDA, so kindly pardon me for silly mistakes).

What do you think should help?
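One common diagnostic that may help narrow this down (a sketch using standard CUDA runtime error checking; this was not suggested in the thread above) is to check the error state right after the kernel launch and again after synchronizing, to see whether the kernel itself is failing:

```cuda
  // After the kernel launch:
  add<<<1, 1>>>(N, x, y);

  // Check for launch errors (bad launch configuration, etc.)
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess)
    std::cout << "Launch error: " << cudaGetErrorString(err) << std::endl;

  // Check for errors that occur while the kernel is running
  // (out-of-bounds accesses, device-side asserts, ...)
  err = cudaDeviceSynchronize();
  if (err != cudaSuccess)
    std::cout << "Kernel error: " << cudaGetErrorString(err) << std::endl;
```

If both checks report cudaSuccess, the kernel ran to completion and the warnings are more likely a profiler or configuration issue than a problem with the program.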