I am trying to process a data file in parallel. I would like to skip thread zero on block zero.
if ((blockIdx.x + threadIdx.x) == 0)
    return;
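For clarity, that sum check only singles out thread zero of block zero because both indices are non-negative; an equivalent, more explicit way to write the same guard would be:

// equivalent, more explicit form of the same check
if (blockIdx.x == 0 && threadIdx.x == 0)
    return;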
I break on launch and debug-step until it hits the return statement. The debugger appears to pass over the code for that thread, but when I inspect my data struct, that thread had to have executed.
I initialized my struct data on the CPU before copying it to the device so I would know if it got changed.
The data in the struct after that thread returns is the data in the binary file that was supposed to get skipped.
The kernel completes, I copy the struct back to the host and stream it (<<) into a file.
Sure enough, that thread ran. My output file doesn’t have the XXX pattern I initialized the struct with; the data in the struct was exactly the data from the file.
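For reference, the host-side flow is roughly this (a simplified sketch, not my actual code; the struct, kernel, launch configuration, and file names are placeholders):

#include <cstring>
#include <fstream>
#include <cuda_runtime.h>

struct MyStruct { char bytes[256]; };                 // stand-in for my real struct

__global__ void processFile(MyStruct* d)              // stand-in for my real kernel
{
    if ((blockIdx.x + threadIdx.x) == 0)
        return;                                        // this thread should leave *d untouched
    // ... every other thread writes its part of the file data into *d ...
}

int main()
{
    MyStruct h;
    memset(&h, 'X', sizeof(h));                        // XXX sentinel so I can tell if the GPU overwrote it

    MyStruct* d = nullptr;
    cudaMalloc((void**)&d, sizeof(MyStruct));
    cudaMemcpy(d, &h, sizeof(MyStruct), cudaMemcpyHostToDevice);

    processFile<<<1, 256>>>(d);
    cudaDeviceSynchronize();

    cudaMemcpy(&h, d, sizeof(MyStruct), cudaMemcpyDeviceToHost);
    std::ofstream("out.bin", std::ios::binary).write(h.bytes, sizeof(h.bytes));
    cudaFree(d);
    return 0;
}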
I’m on Windows 10, fully updated, with an up-to-date MSVS Community 2022 and CUDA 12.6. I would post more specifics, but I hacked together a workaround and fixed a bug, so I’m moving on.
My main frustration is that somehow, when I make changes to my kernel.cu, those changes aren’t always reflected when debugging on the GPU.
For example, I added a block like this to my kernel:
int foo = (blockIdx.x * numOfThreadsPerBlock + threadIdx.x);
commented out that block, saved all files, ran Clean Solution and Rebuild Solution, called cudaDeviceReset() on entry, and still the Next-Gen debugger was hitting that ‘breakpoint’, even though I had removed the breakpoint. I could even watch the changes in ‘foo’.
Don’t get me wrong, I love much of what Nvidia has done. And I get that the integration with Windows is robust and the problem is on my end. I’m just saying I felt so hopeless that I fired up a dual boot of Ubuntu 20 and seriously considered abandoning MSVC and Windows in favor of Linux last night.
Things seem to be working fine this morning. One of my cards has been going N/A lately, even when running at 50% power and 67 degrees, so maybe that’s causing issues.
Thank you for responding so fast. If you have any tips on extra caches or temporary intermediate files I can clean out if I run into trouble, that might be helpful.
Now that you’ve clarified that you are talking about CUDA kernels (not Windows or Linux, etc.), I can refer you to the right place. We have a dedicated CUDA category for programming and performance issues; the folks there are really helpful and might have the suggestions you are looking for.
The breakpoint that still gets hit after commenting out a line could actually be code directly above or below the commented line being reached.
Also, when using the debugger, you should do a debug build so that the code is not overly optimized and rearranged.
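For reference, if you build from the command line, a device debug build with nvcc looks roughly like this (the Visual Studio CUDA project properties expose the same options):

nvcc -g -G -o mykernel kernel.cu

-g keeps host-side debug information, and -G generates device-side debug information and disables most device code optimization, so the debugger can map source lines reliably.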
Sometimes it also helps to put printf into your kernel code. It is comparatively slow, so use it only for debugging, but it is a nice feature from Nvidia. The output is not always sorted in order between threads, but you could select which threads print with an if condition on threadIdx and blockIdx.
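For example, a minimal sketch (the kernel and argument names are just illustrative):

#include <cstdio>

__global__ void debugKernel(const float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // only a handful of threads print, so the output stays readable
    if (blockIdx.x == 0 && threadIdx.x < 4 && i < n)
        printf("block %u, thread %u: data[%d] = %f\n", blockIdx.x, threadIdx.x, i, data[i]);
}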