Too much runtime

Dear All

I have several consecutive kernels in each OpenMP thread, of the following type:

//Allocate 16 streams beforehand, one per OpenMP thread
#pragma omp parallel num_threads(16)
{
    int z5 = omp_get_thread_num();   // must be private to each thread
    kernel1<<<SYMB/32, 32, 0, stream[z5]>>>(…);
    process1<<<NRSAMPLES/32, 32, 0, stream[z5]>>>(…);
}
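(For reference, a minimal sketch of how the 16 streams referred to in the comment above might be created once at startup; only the array name stream is taken from the snippet, everything else is assumed:)

#include <cuda_runtime.h>

#define NSTREAMS 16
cudaStream_t stream[NSTREAMS];

void create_streams(void)
{
    // Create one stream per OpenMP thread, once, before any timed work
    for (int i = 0; i < NSTREAMS; i++)
        cudaStreamCreate(&stream[i]);
}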


I allocate and copy all the data to the GPU before starting the timer, and I copy back from the GPU only a small result array. I am using CUDA 6.5, but I am transferring the data to the GPU explicitly.
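(Presumably the measurement is structured something like the following sketch, with omp_get_wtime() standing in for whatever host timer is actually used:)

// Sketch: all input data already resides on the GPU at this point
cudaDeviceSynchronize();                 // ensure the preceding copies are done
double t0 = omp_get_wtime();

#pragma omp parallel num_threads(16)
{
    int z5 = omp_get_thread_num();
    kernel1<<<SYMB/32, 32, 0, stream[z5]>>>(…);
    process1<<<NRSAMPLES/32, 32, 0, stream[z5]>>>(…);
}

cudaDeviceSynchronize();                 // wait for all 16 streams to finish
double t1 = omp_get_wtime();             // (t1 - t0) is the reported time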

I get about 300 ms on a K40 and 400 ms on a GeForce 740M. I reduced the processing by a factor of 4 and switched from double-precision to single-precision computation, with a gain of only about 20-30 ms.

I suspect that perhaps this is compilation time or the time to load the kernels onto the GPU. Can that be? If so, how can I compile and load the kernels at the beginning of the host program, so that it is not necessary to do it again later?
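(One common pattern to keep such one-time initialization out of the measurement, sketched here as an assumption rather than something from the posts, is to touch the CUDA context before the timer starts:)

// Force context creation and module load before any timing
cudaFree(0);                             // conventional no-op that initializes the CUDA context
kernel1<<<1, 32, 0, stream[0]>>>(…);     // optional warm-up launch
cudaDeviceSynchronize();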

Thanks

Luis Gonçalves

I would suggest using the CUDA profiler to find out where the time is spent. Since you aren’t showing your code, it is not clear what is included in the timed portion of your code; for all we know it may include CUDA context creation time, or even file I/O. Are you running the GPU in persistence mode, to avoid unloading/reloading of the driver?
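(In addition to the profiler, CUDA events can bracket just the GPU work from inside the program; a minimal sketch:)

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// ... launch the kernels to be measured ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // GPU time only, excludes context creation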

If you use the -arch switch of nvcc to compile your code, both PTX and machine code (SASS) for your architecture will be embedded in the resulting executable. As long as SASS matching the GPU architecture is available in the binary, there won’t be any JIT compilation at CUDA context creation; the driver will simply load the binary machine code into GPU memory, which doesn’t take much time (certainly not hundreds of milliseconds). If you need to target more than one architecture, you can easily build a “fat binary” that includes SASS for multiple architectures using the -gencode switch. For example, for a CC 3.0 device like the GeForce 740M plus a CC 3.5 device like the K40:

-gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35
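(Putting that together into a complete build line, with a hypothetical source file name, and -Xcompiler -fopenmp passing the OpenMP flag through to the host compiler:)

nvcc -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -Xcompiler -fopenmp -o app app.cu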

I don’t know what you mean by “I reduced the processing by 4”. Do you mean you ran a problem of 1/4 the size, which (assuming the time complexity of the code is O(n)) should run in 1/4 the time? If so, the relatively small reduction in time that resulted would seem to indicate that kernel execution time is a minor portion of the overall measured execution time. Again, the profiler will help you find out how much time is spent on the various activities, such as kernel execution and host/device copies.
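To put rough numbers on that (a back-of-the-envelope estimate, not something from the measurements above): if cutting the problem to 1/4 of its size eliminates 3/4 of the kernel time, and that saves only 20-30 ms, total kernel time would be roughly 4/3 × (20-30) ms ≈ 27-40 ms, leaving most of the measured 300-400 ms attributable to something other than kernel execution.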