==ERROR== Failed to prepare kernel for profiling (0xc00000fd) but CUDA sample works

Hi,

I really need help.
I believe it’s an Nsight Compute bug, but one somehow specific to my code. The code has multiple kernels, and none of them can be profiled by Nsight Compute. It’s built with --rdc=true; I don’t know if that matters.

I have Windows 10 and 11 environments and have already tested with Visual Studio 2019, on an RTX 2080 and an RTX 3080, with CUDA 11.4, 11.1, and 10.0, trying compute/sm = 52, 70, 75, 86… all with the latest Nsight Compute.
The same error code, 0xc00000fd (3221225725), comes up every time, and I have no idea why.

Otherwise, the code runs fine, and cuda-memcheck detects no memory leaks.
Nsight Systems (nsys) works, but I need kernel-level profiling, and since neither GPU is supported by nvprof anymore, I have to use Nsight Compute.

I also tested SobolQRNG from the CUDA samples with the same setup; it works fine.

Nsight compute output:

Launched process: ncu.exe (pid: 24364)

C:/Program Files/NVIDIA Corporation/Nsight Compute 2021.2.2/target/windows-desktop-win7-x64/ncu.exe --export “C:/Users/gueux/Documents/NVIDIA Nsight Compute/profile_%i” --force-overwrite --target-processes application-only --replay-mode kernel --kernel-name-base function --launch-skip-before-match 0 --section ComputeWorkloadAnalysis --section LaunchStats --section Occupancy --section SpeedOfLight --sampling-interval auto --sampling-max-passes 5 --sampling-buffer-size 33554432 --profile-from-start 1 --cache-control all --clock-control base --apply-rules yes --import-source no --check-exit-code yes C:/Users/gueux/source/repos/patchV1/x64/Debug\patchV1.exe -c C:\Users\gueux\Repos\patchV1\src\profiling.cfg

Launch succeeded.

Profiling…

==PROF== Connected to process 25172 (C:\Users\gueux\source\repos\patchV1\x64\Debug\patchV1.exe)

==ERROR== Failed to prepare kernel for profiling

==ERROR== Failed to profile kernel “logRand_init” in process 25172

==ERROR== The application returned an error code (3221225725).

==ERROR== An error occurred while trying to profile.

==WARNING== No kernels were profiled.

==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

Process terminated.

This seems to be the generic Windows error code for a stack overflow in an application (NTSTATUS 0xC00000FD, STATUS_STACK_OVERFLOW).

Increasing the stack-size limit and recompiling does not help either.

I suggest first determining whether the issue is specific to memory save-and-restore, to a particular metric, or to something else about the application or setup. To do this, try collecting only a single-pass metric using the --metrics option. Most of the other options, except for --section and --force-overwrite, are the default values, so you don’t need to specify them explicitly.

ncu --metrics gpc__cycles_elapsed.sum <app>

If that works, check whether a more limited metric set, but with kernel replay, also works, e.g.

ncu --section SpeedOfLight <app>

Also, try the same set but with application replay instead of kernel replay by setting

ncu --section <sections> --replay-mode application

This will re-run the entire app process multiple times, collecting one set of metric counters in each pass. It assumes that the application execution is deterministic.


Thanks for the suggestions, but unfortunately “ncu --metrics gpc__cycles_elapsed.sum” and the later commands all give the same error as before.

This suggests that either there is an issue with your overall setup (which it seems is not the case, as you mentioned you can profile other samples on the same system, correct?), or there is an issue in the way that Nsight Compute interacts with your specific code.

  • Is it possible for you to share your code, assuming that it is straightforward enough for others to build?
  • Can you create a memory dump with Visual Studio in the debugger and share this with us?
  • Does your code do anything “unusual”, e.g. call CUDA from a library initializer, call other CUDA libraries, …?
  • Can you try to isolate any of your kernels in a minimal application and check if that can be profiled?
  • Which exact NVIDIA driver version are you using?
  • The code can be built easily with two libraries from Boost. But to run it, one needs a complicated set of input data files, and the code … may be hard for others to decipher.

  • It runs fine in Visual Studio, but anyway here’s the link to a memory dump (without the heap, which would otherwise be too big), taken mid-run: patchV1.dmp - Google Drive
    It seems I’m not allowed to take a dump while at a breakpoint inside a CUDA kernel, so this one was taken right after the kernel I want to profile finished its first iteration.

  • It does nothing unusual.

  • I shall definitely try that.

  • driver version: 472.12

it runs fine on visual studio, but anyway here’s the link to memory dump

From your previous log, I would have expected the target application that is launched by ncu to crash while being profiled. Are you saying it crashes when profiled outside of VS, but profiles fine when being run under the debugger? The minidump doesn’t appear to be captured at an exception.

Yes, that is the case, as quoted: it runs fine outside Nsight Compute, alone or under the VS debugger. The memory dump was taken at an inserted breakpoint, since no exception is raised.

New Clue!

Trying just the options below, however, gives at least one iteration of profiling: about 3 kernels plus one run of the target-of-interest kernel before it crashes in the same way (note the 0.1% in the output):

ncu --export <export_file> --force-overwrite --profile-from-start 0 <app_with_args>

The program output (the first bold line is where ncu with other options stopped working):

**logRand_init<<<1, 1024>>>**
LGN initialized
1024 V1 neurons, x:[-1, 1], y:[-1, 1] mm
 mE = 768, mI = 256
nChunk is reduced to 1 (nblock)
1 chunks in total, the first 0 chunks have 1 blocks each, the others have 1 blocks
reading connectome from D:/scratch/patchV1/resource/V1_conMat
single blockMat of conMat, delayMat and gapMat cost 8.25Mb
single chunk of conDelayGapMat requires at most 8.25Mb, smaller chunks require 8.25Mb
matConcurrency is reduced to 1, the same as nChunk
matConcurrency of 1 chunks requires 8.25 Mb, total device gmem = 16383.5Mb
matSize = 1x1x1024x1024x4=4Mb
gap_matSize = 1x1x256x256x4=0.25Mb
1 == 1, entire conMat, delayMat and gapMat are pinned
mean gapS = 0
conMat, delayMat and gapMat set
spikeTrain retains spikes for 1 time steps for each neuron, calculated from a maximum connection distance 0 mm
vector connections set
        nFar = 0, nGapFar = 0
vector gap junctions set
conductance setup in chunks
synFail, receptor ratio set
ExcRatio type 0: [1, 1]
ExcRatio type 1: [1, 1]
tonicDep = [0.3, 0.3, 0.3]
1024, 282
sLGN = [0, 0.1201, 0.16] < 0.48
maximum LGN per V1: 282, 282 on average
LGN->V1 surface constructed
implementing LGN_surface requires 2.20703 Mb
v0 = [-64.97, -57.4021, -50.014]
w0 = [-0.0346692, -0.000117227, 0.0335278]
gFF0 = [0, 0, 0]
w, v, gFF...
gE, gI...
mean(gE0) =  0
mean(gI0) =  0
spiking... V1 initialized
find V1 spike and gFF in D:/scratch/patchV1/rawData_profiling.bin for learnData_FF
output file check, done
parvo temporal kernel retraces 256.000000 ms, samples 256 points, sample rate = 1000 Hz
magno temporal kernel retraces 128.000000 ms, samples 128 points, sample rate = 1000 Hz
m = 30, n = 10
3 exact phases in [0,1]: 0/3, 1/3, 2/3
0 + 0 / 3
33 + 1 / 3
66 + 2 / 3
=== texture memory required: 2x128x128 = 0.25MB
Using 16.6218 Mb from a total of 16383.5 Mb, remaining 16366.9 Mb
perStackSize = 128 set to 204800
totalHeapSize = 48Mb.
Heap size now is 48.000000 Mb; Stack size now is 0.195312 Mb
cuda memory all set.
store_PM(1)<<<392x1x1, 32x32x1>>>
convol parameters stored
iFramePhaseTail = 32
mFramePhaseTail = 16
presend spikes:
no near-neighbor spiking events in the time step
current status: 1, (0.5, 0.5)
simulation start:
switchNow = 1
**0.1%**==ERROR== Failed to prepare kernel for profiling
==ERROR== Failed to profile kernel "LGN_nonlinear" in process 35448
==ERROR== The application returned an error code (3221225725).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

Without the --profile-from-start 0 option, the output always looks like this, stopping at the first kernel of the program, as shown in bold:

**logRand_init<<<1, 1024>>>**
==ERROR== Failed to prepare kernel for profiling
==ERROR== Failed to profile kernel "logRand_init" in process 8496
==ERROR== The application returned an error code (3221225725).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

I guess it’s just ** rather than bold font in preformatted text, XD. These are just the first and last lines of the program output before ncu crashes.

So… is this a bug in Nsight Compute?

The bug is somehow fixed in the newest release of Nsight Compute (released 10/20).

I suspect the original bug came from the cudaDeviceSetLimit calls that set the stack and heap sizes, since ncu in the newest release complains about the arguments to cudaDeviceSetLimit.
After removing the explicit control over the stack and heap sizes, ncu runs without a problem.
The error code also pointed to a stack overflow in the first place.
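For anyone hitting the same issue, here is a sketch of the kind of limit setup that was involved, with return codes checked so a rejected limit is at least visible. This is a hypothetical helper, not the actual patchV1 code; the 204800-byte stack and 48 MiB heap values are taken from the program output above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from patchV1): sets the device stack/heap
// limits as the log above suggests, checking each return code so a
// rejected limit is reported instead of failing silently.
static cudaError_t setDeviceLimits(bool underProfiler) {
    if (underProfiler) {
        // Workaround that resolved the issue: leave the stack and
        // heap limits at their defaults when profiling with ncu.
        return cudaSuccess;
    }
    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, 204800);  // per-thread stack
    if (err != cudaSuccess) {
        std::fprintf(stderr, "stack limit: %s\n", cudaGetErrorString(err));
        return err;
    }
    err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, 48u << 20);  // 48 MiB device heap
    if (err != cudaSuccess) {
        std::fprintf(stderr, "heap limit: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```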
