==12188== NVPROF is profiling process 12188, command: ./add_cuda
Max error: 0
==12188== Profiling application: ./add_cuda
==12188== Profiling result:
No kernels were profiled.
==12188== API calls:
No API activities were profiled.
==12188== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139
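The warning at the end suggests calling cudaProfilerStop() before exit to flush the data, and I do call it at the end of main (see the code below). As far as I understand cuda_profiler_api.h, the intended usage is roughly this sketch:

#include <cuda_profiler_api.h>

int main(void)
{
    cudaProfilerStart();        // start collecting profile data
    // ... allocate memory, launch kernels, etc. ...
    cudaDeviceSynchronize();    // make sure all GPU work has finished
    cudaProfilerStop();         // flush profile data before exiting
    return 0;
}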
I don't think it's nvprof's fault, because I also tested a sample program from NVIDIA's examples and it works fine.
So I then tried cuda-memcheck to see whether there are any problems in my program, but memcheck didn't give any useful information either.
cuda-memcheck ./add_cuda
========= CUDA-MEMCHECK
Max error: 0
========= ERROR SUMMARY: 0 errors
zns@zns-gpu:~/Public/test$ cuda-memcheck --leak-check full --error-exitcode ./add_cuda
========= CUDA-MEMCHECK
========= Nothing to check
========= No CUDA-MEMCHECK results found
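Looking at that second invocation again, I suspect I typed it wrong: --error-exitcode takes a numeric value, so ./add_cuda was presumably consumed as that value and no application was launched at all, which would explain the "Nothing to check" output. The corrected form should be something like:

cuda-memcheck --leak-check full --error-exitcode 1 ./add_cuda

Either way, the plain run above reported zero errors.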
So, what can I do to make nvprof work?
Here’s my code
#include <iostream>
#include <math.h>
#include <cuda_profiler_api.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  cudaProfilerStop();
  return 0;
}
I built the sample you gave and ran the profiler; everything works fine.
Maybe your build process has some problem.
How do you build the sample? Also, which toolkit/driver/GPU are you using?
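For reference, on my side I built and profiled it in the straightforward way (assuming the source file is named add.cu):

nvcc add.cu -o add_cuda
nvprof ./add_cuda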
==16896== Unified Memory profiling result:
Device “GeForce GTX 1070 (0)”
Count Avg Size Min Size Max Size Total Size Total Time Name
48 170.67KB 4.0000KB 0.9961MB 8.000000MB 1.329280ms Host To Device
24 170.67KB 4.0000KB 0.9961MB 4.000000MB 644.0640us Device To Host
24 - - - - 2.522912ms Gpu page fault groups
Total CPU Page faults: 36
Hi, I'm new; just adding a datapoint - still investigating. This is some code from online that I slightly modified. When N <= 16, nvprof works; when N >= 17, nvprof fails with error 139.
#include <iostream>
#include <math.h>

// CUDA kernel to add elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<10;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Launch kernel on N elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}
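One more observation: none of the CUDA calls in these snippets are checked, and error 139 generally means a segfault (exit status 128 + SIGSEGV). If, say, cudaMallocManaged failed, x and y would be invalid and the host loops would crash. A minimal checking sketch (my own helper, not part of the original code):

#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort with a message if a CUDA API call fails
#define CHECK_CUDA(call)                                          \
  do {                                                            \
    cudaError_t err_ = (call);                                    \
    if (err_ != cudaSuccess) {                                    \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
              cudaGetErrorString(err_), __FILE__, __LINE__);      \
      exit(EXIT_FAILURE);                                         \
    }                                                             \
  } while (0)

Used as, for example:

CHECK_CUDA(cudaMallocManaged(&x, N*sizeof(float)));
add<<<numBlocks, blockSize>>>(N, x, y);
CHECK_CUDA(cudaGetLastError());       // catch kernel launch errors
CHECK_CUDA(cudaDeviceSynchronize());  // catch asynchronous errors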
I had this exact same problem with a very simple cuBLAS program. Quite strange, because it was working fine; then I increased the matrix dimension from 1024 to 2048 and the problem started, and it didn't go away even after reverting back to 1024!
I tried --unified-memory-profiling off and also --concurrent-kernels off. Nothing helped.
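For concreteness, the runs looked something like the following (my_cublas_app is just a placeholder for the actual binary name):

nvprof --unified-memory-profiling off ./my_cublas_app
nvprof --concurrent-kernels off ./my_cublas_app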
The problem can sometimes be with the unified memory system.
// Added these two lines after kernel execution:
checkCuda(cudaStreamAttachMemAsync(NULL, C, 0, cudaMemAttachHost));
checkCuda(cudaStreamSynchronize(NULL));
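For anyone copying this: checkCuda is not defined in the snippet above; I assume it is the usual error-checking wrapper from the CUDA samples, along these lines:

#include <cstdio>
#include <cassert>

// Assumed definition: pass the result through, complaining on failure
inline cudaError_t checkCuda(cudaError_t result)
{
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
  return result;
}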