Hello all,
I am running the code from the example at https://developer.nvidia.com/blog/even-easier-introduction-cuda/. nvprof reports that this example's add kernel takes about 3.5 ms (4.5 ms in the run pasted below), while it took the author 680 µs on his MacBook's GT 750M. Several milliseconds does not seem right for a discrete desktop GTX 1070.
Attached below is the code I am using; I copied it essentially verbatim from the example:
#include <iostream>
#include <math.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}
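As a sanity check on nvprof's number, I am also planning to time the launch directly with CUDA events. Below is a minimal sketch of the same program with cudaEventRecord/cudaEventElapsedTime added around the kernel; this is the standard event API, but I haven't yet confirmed on my machine that its number matches nvprof's:

```cuda
#include <iostream>
#include <math.h>

__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));
  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  // CUDA events record timestamps on the GPU timeline, so the
  // elapsed time between them covers only the kernel execution.
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  cudaEventRecord(start);
  add<<<numBlocks, blockSize>>>(N, x, y);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);  // block host until the stop event completes

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
  std::cout << "Kernel time: " << ms << " ms" << std::endl;

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(x);
  cudaFree(y);
  return 0;
}
```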
I am using the following command to compile:
/usr/local/cuda-11.6/bin/nvcc add.cu -g -o cudaAdd
My results from nvprof:
matthew@sonOfAnton:~/cudaSandbox/simpleCudaExample$ /usr/local/cuda-11/bin/nvprof ./cudaAdd
==3514== NVPROF is profiling process 3514, command: ./cudaAdd
Max error: 0
==3514== Profiling application: ./cudaAdd
==3514== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 4.5185ms 1 4.5185ms 4.5185ms 4.5185ms add(int, float*, float*)
API calls: 95.96% 133.33ms 2 66.663ms 100.06us 133.23ms cudaMallocManaged
3.25% 4.5196ms 1 4.5196ms 4.5196ms 4.5196ms cudaDeviceSynchronize
0.47% 655.75us 2 327.88us 281.24us 374.51us cudaFree
0.25% 342.70us 101 3.3930us 410ns 200.93us cuDeviceGetAttribute
0.03% 43.863us 1 43.863us 43.863us 43.863us cudaLaunchKernel
0.03% 41.799us 1 41.799us 41.799us 41.799us cuDeviceGetName
0.00% 6.0910us 1 6.0910us 6.0910us 6.0910us cuDeviceGetPCIBusId
0.00% 4.1280us 3 1.3760us 711ns 2.3750us cuDeviceGetCount
0.00% 2.2040us 2 1.1020us 420ns 1.7840us cuDeviceGet
0.00% 832ns 1 832ns 832ns 832ns cuDeviceTotalMem
0.00% 742ns 1 742ns 742ns 742ns cuDeviceGetUuid
==3514== Unified Memory profiling result:
Device "NVIDIA GeForce GTX 1070 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
82 99.902KB 4.0000KB 996.00KB 8.000000MB 974.4760us Host To Device
24 170.67KB 4.0000KB 0.9961MB 4.000000MB 341.4110us Device To Host
12 - - - - 4.460260ms Gpu page fault groups
Total CPU Page faults: 36
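One thing I notice in the output above: the "Gpu page fault groups" total (4.46 ms) is almost exactly the reported kernel time, so I suspect the measurement includes servicing on-demand unified-memory page faults rather than pure compute. My plan is to prefetch the managed arrays to the GPU before the launch with cudaMemPrefetchAsync. A sketch of the changed portion of main (device 0 is assumed to be the GTX 1070; I haven't verified the effect on my machine yet):

```cuda
  // ... after initializing x and y on the host ...

  // Prefetch the managed arrays to the GPU so the kernel does not
  // have to demand-page them in on first touch.
  int device = 0;
  cudaGetDevice(&device);
  cudaMemPrefetchAsync(x, N*sizeof(float), device);
  cudaMemPrefetchAsync(y, N*sizeof(float), device);

  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);
  cudaDeviceSynchronize();
```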
Data from running nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:07:00.0 On | N/A |
| 28% 29C P8 6W / 151W | 37MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1065 G /usr/lib/xorg/Xorg 35MiB |
+-----------------------------------------------------------------------------+
Any help would be appreciated, thank you! :)