I’m a PhD student in computer vision, and I’m in the process of converting pure C++ image processing programs to C++/CUDA.
I’m facing extreme difficulty, mainly in parallelising the programs. Perhaps my idea of the whole thing is a little off, but I assume that when a kernel needs random access to arbitrary locations in an image within a CUDA block, it is quicker to run it on a multicore CPU with a fast clock? I do notice that although my (probably poorly written) GPU programs perform much slower than the CPU in these cases, using the CPU causes one core to max out and another to run at about half load, and the notebook fan goes to full speed. I’m reluctant to run programs like this for extended periods in case I damage something inside.
I’m still struggling to understand how best to organise blocks and threads. Meanwhile, deviceQuery tells me the maximum block dimensions are 512x512x64. Does that mean I can load an entire 512x512x3-channel RGB image into a single block? I experimented with running on a single block and my program crashed, so clearly I did something terribly wrong.
Is there a way to launch a CUDA kernel asynchronously, so that the main C++ program continues doing other work while waiting for the CUDA results? If that’s possible, then I can squeeze some productivity even out of my slow GPU programs, simply by getting the CPU to accomplish other tasks while the GPU is at it.
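From what I’ve read, kernel launches are already asynchronous with respect to the host, and copies can be made asynchronous with streams, so something like this sketch should give the overlap I want (untested; `myKernel`, the device pointers, and `doOtherCpuWork` are placeholders for my own code, and I believe `cudaMemcpyAsync` only truly overlaps if the host buffers are pinned with `cudaMallocHost`):

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

// Both the async copy and the launch return control to the host immediately.
cudaMemcpyAsync(d_img, h_img, bytes, cudaMemcpyHostToDevice, stream);
myKernel<<<grid, block, 0, stream>>>(d_img, d_out, width, height);
cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);

doOtherCpuWork();              // CPU is free while the GPU works

cudaStreamSynchronize(stream); // block only when the results are actually needed
cudaStreamDestroy(stream);
```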
Lastly, could anyone point me in the right direction for organising blocks and threads for the following: for each pixel of a 3-channel image, calculate the Euclidean distance between the histogram of the 3x3 window centred on that pixel and the histogram of the entire image?