==12188== NVPROF is profiling process 12188, command: ./add_cuda
Max error: 0
==12188== Profiling application: ./add_cuda
==12188== Profiling result:
No kernels were profiled.
==12188== API calls:
No API activities were profiled.
==12188== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139
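The warning at the end suggests calling cudaProfilerStop() before exit to flush the data, and I do call it at the end of main (see the code below). As far as I understand cuda_profiler_api.h, the intended usage is roughly this sketch:

#include <cuda_profiler_api.h>

int main(void)
{
    cudaProfilerStart();        // start collecting profile data
    // ... allocate memory, launch kernels, etc. ...
    cudaDeviceSynchronize();    // make sure all GPU work has finished
    cudaProfilerStop();         // flush profile data before exiting
    return 0;
}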
I don't think it's nvprof's fault, because I also tested a sample program from NVIDIA's examples and it works fine.
So I then tried cuda-memcheck to see whether there are any problems in my program, but memcheck didn't give any useful information either.
cuda-memcheck ./add_cuda
========= CUDA-MEMCHECK
Max error: 0
========= ERROR SUMMARY: 0 errors
zns@zns-gpu:~/Public/test$ cuda-memcheck --leak-check full --error-exitcode ./add_cuda
========= CUDA-MEMCHECK
========= Nothing to check
========= No CUDA-MEMCHECK results found
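Looking at that second invocation again, I suspect I typed it wrong: --error-exitcode takes a numeric value, so ./add_cuda was presumably consumed as that value and no application was launched at all, which would explain the "Nothing to check" output. The corrected form should be something like:

cuda-memcheck --leak-check full --error-exitcode 1 ./add_cuda

Either way, the plain run above reported zero errors.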
So, what can I do to make nvprof work?
Here’s my code
#include <iostream>
#include <math.h>
#include <cuda_profiler_api.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  cudaProfilerStop();
  return 0;
}
I built the sample you gave and ran the profiler; everything works fine.
Maybe your build process has some problem.
How do you build the sample? Also, which toolkit/driver/GPU are you using?
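For reference, on my side I built and profiled it in the straightforward way (assuming the source file is named add.cu):

nvcc add.cu -o add_cuda
nvprof ./add_cuda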
==16896== Unified Memory profiling result:
Device “GeForce GTX 1070 (0)”
Count Avg Size Min Size Max Size Total Size Total Time Name
48 170.67KB 4.0000KB 0.9961MB 8.000000MB 1.329280ms Host To Device
24 170.67KB 4.0000KB 0.9961MB 4.000000MB 644.0640us Device To Host
24 - - - - 2.522912ms Gpu page fault groups
Total CPU Page faults: 36
Hi, I'm new; just adding a datapoint - still investigating. This is some code from online that I slightly modified. When N <= 16, nvprof works; when N >= 17, nvprof fails with error 139.
#include <iostream>
#include <math.h>

// CUDA kernel to add elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<10;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Launch kernel on N elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}
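One more observation: none of the CUDA calls in these snippets are checked, and error 139 generally means a segfault (exit status 128 + SIGSEGV). If, say, cudaMallocManaged failed, x and y would be invalid and the host loops would crash. A minimal checking sketch (my own helper, not part of the original code):

#include <cstdio>
#include <cstdlib>

// Hypothetical helper: abort with a message if a CUDA API call fails
#define CHECK_CUDA(call)                                          \
  do {                                                            \
    cudaError_t err_ = (call);                                    \
    if (err_ != cudaSuccess) {                                    \
      fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
              cudaGetErrorString(err_), __FILE__, __LINE__);      \
      exit(EXIT_FAILURE);                                         \
    }                                                             \
  } while (0)

Used as, for example:

CHECK_CUDA(cudaMallocManaged(&x, N*sizeof(float)));
add<<<numBlocks, blockSize>>>(N, x, y);
CHECK_CUDA(cudaGetLastError());       // catch kernel launch errors
CHECK_CUDA(cudaDeviceSynchronize());  // catch asynchronous errors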
I had this exact same problem with a very simple cuBLAS program. Quite strange, because it was working fine; then I increased the matrix dimension from 1024 to 2048 and the problem started, and it didn't go away even after reverting back to 1024!
I tried --unified-memory-profiling off and also --concurrent-kernels off. Nothing helped.
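For concreteness, the runs looked something like the following (my_cublas_app is just a placeholder for the actual binary name):

nvprof --unified-memory-profiling off ./my_cublas_app
nvprof --concurrent-kernels off ./my_cublas_app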
The problem can sometimes be with the unified memory system.
// Added these two lines after kernel execution:
checkCuda(cudaStreamAttachMemAsync(NULL, C, 0, cudaMemAttachHost));
checkCuda(cudaStreamSynchronize(NULL));
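For anyone copying this: checkCuda is not defined in the snippet above; I assume it is the usual error-checking wrapper from the CUDA samples, along these lines:

#include <cstdio>
#include <cassert>

// Assumed definition: pass the result through, complaining on failure
inline cudaError_t checkCuda(cudaError_t result)
{
  if (result != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
    assert(result == cudaSuccess);
  }
  return result;
}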