Code runs fine without debugging, is incredibly slow with cuda-memcheck, and no kernels are reported by nvprof

I’m not sure what information to provide, because my code was working an iteration or two ago. I rewrote my kernels and the code is now drastically slower (~5x), which it shouldn’t be. More concerningly, when I run nvprof on it, it quits instantly and says:

==147280== Profiling result:
No kernels were profiled.

==147280== API calls:
No API activities were profiled.
==147280== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139

I call cudaDeviceSynchronize() at the end of the code, but it clearly never reaches that point - it crashes before any kernels are called. And when I run it with cuda-memcheck, it’s very, very slow - on the order of 100s to 1000s of times slower.

Please let me know if there’s any helpful information I can provide. The only differences between my previously working code and the current code are that the kernels use memory differently and the grid size has changed.
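For reference, this is roughly the launch-and-check pattern I mean, as a stripped-down sketch - the kernel name and the sizes are placeholders, not my actual code:

// Minimal sketch; myKernel, nBlocks and nThreads are placeholders.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(double *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0;
}

int main()
{
    const int nBlocks = 64, nThreads = 256;
    double *d_out = nullptr;
    cudaMalloc(&d_out, nBlocks * nThreads * sizeof(double));

    myKernel<<<nBlocks, nThreads>>>(d_out);

    // Catches launch-configuration errors (bad grid size, too much shared memory, ...).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    // Catches errors raised while the kernel actually runs (e.g. bad memory accesses).
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("runtime error: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}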

Never mind; it turns out it was because I wasn’t declaring some arrays in my kernels as shared. My understanding was that not doing so would put the arrays in registers, since the indices are compile-time constants. Why isn’t this the case - why would it be slower? And why would it break things as above?

Edit: for concreteness, I am declaring an array simply as

double array[blockDim.x]

in which I put some computed values. This is what I was hoping would be stored in registers, but perhaps I am misunderstanding how registers work.

And nvprof still won’t run, no idea why.

How can the array be placed in registers if its size is unknown at compile time?

Of course; I was hasty in the above. I am in fact using a #define’d compile-time constant to set the shared array sizes. The size is anywhere between 2 x 6^3 and 2 x 10^3 doubles, so ~16 KB at most - is that too big?

My hope in defining an array within the kernel (without designating it shared or otherwise) was that the entire array would automatically be put in registers. Since that particular array is only used locally (i.e. within its own thread), I figured that would be optimal. I suppose I could just refer to the global array pointer and trust the compiler to use registers appropriately? A sketch of the two declarations I'm comparing is below.
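To make that concrete, here is a simplified sketch - ARR_SIZE and the kernel bodies are placeholders, not my real code:

#define ARR_SIZE (2 * 6 * 6 * 6)   // compile-time constant, ~3.4 KB of doubles

__global__ void withShared(double *out)
{
    // One copy per block, explicitly placed in shared memory.
    __shared__ double array[ARR_SIZE];
    array[threadIdx.x] = (double)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = array[threadIdx.x];
}

__global__ void withoutShared(double *out)
{
    // One copy per thread; this is the version I was hoping would live in registers.
    double array[ARR_SIZE];
    array[threadIdx.x] = (double)threadIdx.x;
    out[blockIdx.x * blockDim.x + threadIdx.x] = array[threadIdx.x];
}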

Also, nvprof still won’t run on my code, which by all indications runs properly and launches kernels when not using nvprof.

A shared array is by no means guaranteed to go to shared memory. It can’t be placed in registers, since each thread in the block has to be able to access it.

sm_50 and above support only 255 registers per thread, i.e. 127 doubles. Note that without the shared designator, each thread allocates its own copy of the array, independent of other threads.

Also, an array cannot be placed in registers if its indices are computed at runtime.
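For example (just a sketch with made-up kernels, not your code):

__global__ void canStayInRegisters(double *out)
{
    double a[4];
    // After unrolling, every index is a compile-time constant, so the compiler
    // is free to keep all four elements in registers.
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        a[i] = i * 2.0;
    out[threadIdx.x] = a[0] + a[1] + a[2] + a[3];
}

__global__ void forcedToLocalMemory(double *out, int k)
{
    double a[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        a[i] = i * 2.0;
    // k is only known at runtime; dynamic indexing like this generally forces
    // the array into (off-chip) local memory rather than registers.
    out[threadIdx.x] = a[k];
}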

Thank you for the response. Would you mind clarifying how declaring an array as shared might not put it in shared memory? If it’s not there, and not in registers, where does it go? I thought shared memory was for data that needs to be shared between multiple threads, and hence gets its own on-chip storage.

I suppose I have never thought about using data in a kernel that’s not global or shared. If I want to compute and store one value per thread, I can just declare it as

double x;

then do something like

x = threadIdx.x;

and each thread will have a local copy of x with its own value? I.e., x is private by default?

Thanks again for clarifying!

Yes, function locals are private to each thread, just as in usual CPU programming. Overall, read the CUDA manual or any handbook - this is a very basic thing you should know to use CUDA.

Shared variables are always placed into dedicated shared memory, which is limited to 48 KB per block.
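For example (minimal sketch, assuming blocks of 256 threads):

__global__ void example(double *out)
{
    double x = threadIdx.x;       // function local: private, one per thread
    __shared__ double s[256];     // shared: one copy per block, lives in shared memory
    s[threadIdx.x] = x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = s[threadIdx.x];
}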

Any idea why an application would break when run with nvprof (“signal 139”) but execute fine on its own?