is your time measurement for cuda/ device side related work only?
accounting for the overhead to create another thread would likely result in a longer time span, although i can not think of a reason why the device side related work in itself should take longer
you mention ‘moving the work out’, so i assume main has no cuda work left?
you probably could/ should test whether the degradation is an ‘once-off cost’, or a ‘continuous cost’, by increasing the work within the new thread, by a factor of 2, or even 10 (doing the same work 2 or 10 times)
i suspect that discrepancy in time would normalize, indicating that it is merely a once-off ‘penalty’
the penalty may be host memory related (moving to a new core) or OS scheduling related perhaps
Well, I ran a simple experiment with some simpler CUDA code, and I didn’t find any difference in performance from running that code inside main() or inside a pthread.
So it must be a problem with my original code. I will have a deeper look into it in the next few days.
On my machine (i5-4690 + GTX 750 Ti), test1 runs in about 0.54ms and test2 runs in about 0.77ms (you may have to run several times, or increase the NRUNS constant).
This points to a consistent difference in performance between both versions for no apparent reason…
Looking at your code, the timed portion comprises more than just the FFT computation. As a starting point, you would want to figure out the time taken by each of the multiple activities currently in the timed section and determine which one of these runs slower with pthreads.
For example, there is currently memory allocation and de-allocation in the timed section. I would suggest moving them out of the loop altogether, making them one-time activities. Then there is CUFFT functionality that executes on the host, such as plan building and destruction. I would suggest adding more calls to the high-resolution timer to determine how much time each of those takes. It may also be a good idea to perform plan building and destruction only once, outside the loop. To get an accurate timing of the FFT computation itself (which happens on the device), I would suggest the sequence:
start = seconds();
cudaDeviceSynchronize(); // make sure all previous GPU activity has completed
cufftExecC2C(plan, d_c, d_fft, CUFFT_FORWARD);
cudaDeviceSynchronize(); // wait until FFT computation done
stop = seconds();
Note that cudaDeviceSynchronize() itself takes time, so you may want to calibrate the above sequence by timing just the back-to-back cudaDeviceSynchronize() calls.
Instead of (or in addition to) doing your own time measurements, you can also use the Visual Profiler.
My guess is that the timing differences you are seeing are due to host-side activities impacted by the use of phtreads, such as the speed of memory allocation. Just a working hypothesis for now, but in my experience malloc/free are often slower in multi-threaded environments since the allocators use locks.
The problem is in plan generation. Your theory may be right, if cufftPlan is doing some malloc/free on the host side, which is probably true.
I’m giving up hope of being able to overcome this performance difference, but just for the sake of going to the bottom of this, is there some way I could monitor what cufftPlan is doing on the host side?