I’m not sure what information to provide, because my code was working an iteration or two ago. I rewrote my kernels and the code is now drastically slower (about 5 times), which it shouldn’t be. More concerning, when I run nvprof on it, it quits instantly and prints:
==147280== Profiling result:
No kernels were profiled.
==147280== API calls:
No API activities were profiled.
==147280== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139
I call cudaDeviceSynchronize() at the end of the code, but under nvprof it clearly never reaches that point: it crashes before any kernels are called. And when I run it with cuda-memcheck, it’s very, very slow, on the order of hundreds to thousands of times slower.
Please let me know if there’s any other information I can provide that would help. The only differences between my previously working code and the current code are that the kernels use different memory and the grid size has changed.
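For reference, here is a minimal sketch of the kind of host-side checking that could help narrow down where a crash like this happens. It is not the actual code from this post: the kernel `dummyKernel`, the macro `CUDA_CHECK`, the helper `blocksFor`, and the problem size are all hypothetical placeholders. Signal 139 typically corresponds to a segmentation fault (128 + SIGSEGV 11), and since no API calls were recorded, the sketch assumes the fault is in host code before the first launch, so it prints progress markers to stderr and checks every runtime call.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with file/line if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: number of blocks needed to cover n elements.
int blocksFor(int n, int threads) {
    return (n + threads - 1) / threads;
}

// Placeholder kernel standing in for the real ones.
__global__ void dummyKernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = 2.0f * i;
    }
}

int main() {
    const int n = 1 << 20;      // assumed problem size
    const int threads = 256;    // assumed block size
    const int blocks = blocksFor(n, threads);

    // stderr is unbuffered, so these markers survive a later segfault.
    fprintf(stderr, "allocating device memory\n");
    float* d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(float)));

    fprintf(stderr, "launching kernel: %d blocks x %d threads\n", blocks, threads);
    dummyKernel<<<blocks, threads>>>(d_out, n);
    CUDA_CHECK(cudaGetLastError());       // catches invalid launch configurations
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised during execution

    CUDA_CHECK(cudaFree(d_out));
    CUDA_CHECK(cudaDeviceReset());        // lets the profiler flush its data on exit
    fprintf(stderr, "done\n");
    return 0;
}
```

If the last marker printed comes before the first launch, the segfault is most likely in host code (for example an out-of-bounds host array or a null pointer), and a host debugger such as gdb or cuda-gdb would give a backtrace faster than cuda-memcheck.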