cuFFT runs slower inside a pthread, but why?

I had my CUDA code in the main function:

int main()
{
… // CUDA code
}

and I just moved it to a (single) pthread:

void* thread_func(void* arg)
{
… // CUDA code
}

int main()
{
pthread_t thread_id;
pthread_create(&thread_id, NULL, thread_func, NULL);
pthread_join(thread_id, NULL);
return 0;
}

I was quite surprised to find out that it became noticeably slower (from 4ms in the main thread to 5ms in the pthread).

Given that both versions are single-threaded, what could be the explanation for this?

Is your time measurement for CUDA/device-side work only?

Accounting for the overhead of creating another thread would likely result in a longer time span, although I cannot think of a reason why the device-side work itself should take longer.

My CUDA code is the typical cudaSetDevice, cudaMalloc, cudaMemcpy, kernel calls, etc.

I’m timing the code inside thread_func(), so any overhead of creating the thread has already occurred before that.

It appears that the code is slower just because it’s running inside thread_func() instead of main()!?
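
To be concrete, the timing looks roughly like this (a simplified sketch; seconds() is just a gettimeofday() wrapper I use for wall-clock timing):

#include <stdio.h>
#include <sys/time.h>

static double seconds(void)                  // wall-clock time in seconds
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

void* thread_func(void* arg)
{
    double start = seconds();                // timer starts here, so thread
                                             // creation is not included
    // ... CUDA code ...
    double stop = seconds();
    printf("elapsed: %.3f ms\n", (stop - start) * 1e3);
    return NULL;
}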

You mention ‘moving the work out’, so I assume main() has no CUDA work left?

You probably could/should test whether the degradation is a once-off cost or a continuous cost by increasing the work within the new thread by a factor of 2, or even 10 (doing the same work 2 or 10 times).

I suspect that the discrepancy in time would normalize, indicating that it is merely a once-off ‘penalty’.
The penalty may be host-memory related (moving to a new core) or perhaps OS-scheduling related.
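
Concretely, the test could look something like this (a rough sketch; do_cuda_work() is a placeholder for your actual CUDA code, timed with the seconds() helper from your earlier snippet):

void* thread_func(void* arg)
{
    const int REPEAT = 10;                   // scale the work by 2x, 10x, ...
    double start = seconds();
    for (int i = 0; i < REPEAT; i++) {
        do_cuda_work();                      // the same CUDA work as before
    }
    double stop = seconds();
    // if the per-iteration time converges towards the main() figure as REPEAT
    // grows, the slowdown is a once-off penalty rather than a continuous cost
    printf("per iteration: %.3f ms\n", (stop - start) * 1e3 / REPEAT);
    return NULL;
}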

Well, I ran a simple experiment with some simpler CUDA code, and I didn’t find any difference in performance between running that code inside main() and running it inside a pthread.

So it must be a problem with my original code. I will have a deeper look into it in the next few days.

This recent post on the NVIDIA Parallel Forall blog might be relevant: GPU Pro Tip: CUDA 7 Streams Simplify Concurrency

It might also be helpful to run "nvprof --print-gpu-trace <app>" on each executable to see the duration of each individual kernel launch.

Thanks for the tips.

I have narrowed down the problem to cuFFT.

Please have a look at these two simple programs:

https://dl.dropboxusercontent.com/u/478898/test1.cu
https://dl.dropboxusercontent.com/u/478898/test2.cu

In test1 the code is inside main(), and in test2 it’s inside the pthread.
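
For reference, the timed section inside the NRUNS loop is the same in both programs and looks roughly like this (a simplified sketch rather than the exact code; the FFT size and batch count are placeholders, and seconds() is the wall-clock helper shown earlier; see the linked files for the actual values):

#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

double seconds(void);                        // wall-clock helper shown earlier

void run_once()                              // called NRUNS times from main() / the pthread
{
    const int N     = 256;                   // placeholder FFT size
    const int BATCH = 16;                    // placeholder batch count

    double start = seconds();

    cufftHandle plan;
    int n[1] = { N };
    cufftPlanMany(&plan, 1, n,
                  NULL, 1, N,                // input layout (contiguous)
                  NULL, 1, N,                // output layout (contiguous)
                  CUFFT_C2C, BATCH);

    cufftComplex *d_c, *d_fft;
    cudaMalloc((void**)&d_c,   sizeof(cufftComplex) * N * BATCH);
    cudaMalloc((void**)&d_fft, sizeof(cufftComplex) * N * BATCH);

    cufftExecC2C(plan, d_c, d_fft, CUFFT_FORWARD);

    cufftDestroy(plan);
    cudaFree(d_c);
    cudaFree(d_fft);

    double stop = seconds();
    printf("run: %.3f ms\n", (stop - start) * 1e3);
}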

Compilation is straightforward:

nvcc -arch=sm_50 test1.cu -lcufft -o test1
nvcc -arch=sm_50 test2.cu -lcufft -o test2

On my machine (i5-4690 + GTX 750 Ti), test1 runs in about 0.54ms and test2 runs in about 0.77ms (you may have to run several times, or increase the NRUNS constant).

This points to a consistent performance difference between the two versions, for no apparent reason…

Any explanation for this?

Looking at your code, the timed portion comprises more than just the FFT computation. As a starting point, you would want to figure out the time taken by each of the multiple activities currently in the timed section and determine which one of these runs slower with pthreads.

For example, there is currently memory allocation and de-allocation in the timed section. I would suggest moving them out of the loop altogether, making them one-time activities. Then there is CUFFT functionality that executes on the host, such as plan building and destruction. I would suggest adding more calls to the high-resolution timer to determine how much time each of those takes. It may also be a good idea to perform plan building and destruction only once, outside the loop. To get an accurate timing of the FFT computation itself (which happens on the device), I would suggest the sequence:

start = seconds();
cudaDeviceSynchronize();  // make sure all previous GPU activity has completed
cufftExecC2C(plan, d_c, d_fft, CUFFT_FORWARD);
cudaDeviceSynchronize();  // wait until FFT computation done
stop = seconds();

Note that cudaDeviceSynchronize() itself takes time, so you may want to calibrate the above sequence by timing just the back-to-back cudaDeviceSynchronize() calls.
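
For example (a rough sketch):

start = seconds();
cudaDeviceSynchronize();
cudaDeviceSynchronize();
stop = seconds();
double sync_overhead = stop - start;   // approximate cost of the two synchronization
                                       // calls; subtract it from the FFT timing above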

Instead of (or in addition to) doing your own time measurements, you can also use the Visual Profiler.

My guess is that the timing differences you are seeing are due to host-side activities impacted by the use of pthreads, such as the speed of memory allocation. Just a working hypothesis for now, but in my experience malloc/free are often slower in multi-threaded environments since the allocators use locks.
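
If you want to check that part of the hypothesis in isolation, you could time plain host allocations in both configurations with something like this (a rough sketch; the block size and iteration count are arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

void time_host_alloc(void)        // call this once from main(), once from a pthread
{
    double start = seconds();
    for (int i = 0; i < 10000; i++) {
        void *p = malloc(1 << 16);           // arbitrary 64 KB blocks
        free(p);
    }
    double stop = seconds();
    printf("malloc/free: %.3f ms total\n", (stop - start) * 1e3);
}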

@njuffa: thanks for the systematic approach.

Here are the results:

                main (s)    pthread (s)
cufftPlanMany   0.000364    0.000593
cudaMalloc      0.000059    0.000059
cufftExecC2C    0.000013    0.000013
cufftDestroy    0.000069    0.000069
cudaFree        0.000036    0.000036
Total           0.000541    0.000770

The problem is in plan generation. Your theory may well be right, since cufftPlan is probably doing some malloc/free on the host side.

I’m giving up hope of being able to overcome this performance difference, but just for the sake of getting to the bottom of this, is there some way I could monitor what cufftPlan is doing on the host side?