GPU and CPU don't run in (pure) parallel?

I am combining the GPU and CPU to solve a problem. A thread is created to control the CUDA computation, so that the GPU and CPU can run in parallel. I want the GPU to impose as little overhead on the CPU as possible, but in fact the GPU slows the CPU down a lot.

Suppose the GPU and CPU run in parallel: if the CPU program runs for 10 minutes, the GPU runs for 4 minutes, and they start at the same time, then the overall time should be about 10 minutes (roughly; we ignore initialization and kernel-launch overhead).

On the contrary, if the GPU is not running in parallel with the CPU (i.e. while the GPU is computing it needs some help from the CPU and therefore takes some CPU time), the overall running time in the above example could be, say, 13 minutes.

Here is how I reproduce this problem:

I run a CPU function that takes more than 10 minutes; at the same time, either a lightweight kernel or a heavyweight kernel is launched 1000 times. Both GPU tasks take less than 5 minutes to finish on their own.

The running time for CPU + light kernel is about 11 minutes, but the running time for CPU + heavy kernel is more than 12 minutes. It seems the heavy kernel takes more CPU time than the light kernel. In other words, while the GPU is running it slows the CPU down, i.e. the GPU and CPU are not running in (pure) parallel.

Below are the two kernels; they operate on GPU memory:

//light weight kernel: writes 4 elements
__global__ void light_kernel( int* memA, int* memB ) {
    for ( int i = 0; i < 4; ++i )
        memA[i] = memB[i] = 9;
}

//heavy weight kernel: writes 1024*256 elements
__global__ void heavy_kernel( int* memA, int* memB ) {
    for ( int i = 0; i < 1024 * 256; ++i )
        memA[i] = memB[i] = 99;
}

Any hints?
Thanks.

The CUDA runtime is completely synchronous in the current version - when you execute a kernel it doesn’t return control to the CPU until GPU processing is completed.

Asynchronous execution (and memory transfers) are planned for a future version.

I do know this. What I want to know is: if I create a thread that manages the CUDA computation and run this thread concurrently with the CPU work, will CUDA slow down the CPU?

So you mean you have one thread to run your CUDA context, and one for other CPU work (two threads total)? This will only give you a speedup if you have multiple cores in your CPU.

Mark

To be precise, I create only one extra thread (a GPU thread, which controls the GPU). That is, the main process runs the CPU work and creates a thread that runs CUDA. I hope that while CUDA is computing, the GPU thread will be blocked and won't take any CPU time, so that the CPU and GPU can run in parallel.

I believe what you wish to happen is exactly what will happen - the GPU thread will block while the CPU work thread merrily computes away. Keep in mind that you'd need a kernel that consumes more time than the startup overhead of these threads to make this worthwhile. Your example above should serve nicely.

Sorry for not making myself clear. The problem is that the GPU is not PURELY running in parallel with the CPU, meaning that while the GPU is executing, it needs the CPU's help and slows the CPU down. This is really weird. Can anybody tell me why this happens?

thanks.

It sounds like you’re somehow not actually running on the GPU. So how are you breaking this task up? What does your grid look like, ditto for threads per grid element?

Here is my code:

/****************************************
 These are the interface functions between the CPU and GPU
*****************************************/

//transfer data from GPU to CPU
extern "C" void cuda_g2c( int* src, int* dest, int mem_size ) {
    CUDA_SAFE_CALL( cudaMemcpy( dest, src, mem_size,
                                cudaMemcpyDeviceToHost ) );
    CUT_CHECK_ERROR( "memory copy cuda_g2c failed" );
}

//transfer data from CPU to GPU
extern "C" void cuda_c2g( int* src, int* dest, int mem_size ) {
    CUDA_SAFE_CALL( cudaMemcpy( dest, src, mem_size,
                                cudaMemcpyHostToDevice ) );
    CUT_CHECK_ERROR( "memory copy cuda_c2g failed" );
}

//free GPU memory
extern "C" void cuda_free_mem( char* mem ) {
    CUT_CHECK_DEVICE();
    CUDA_SAFE_CALL( cudaFree( mem ) );
    CUT_CHECK_ERROR( "Free memory failed" );
}

//allocate GPU memory
extern "C" char* cuda_alloc_mem( int mem_size ) {
    CUT_CHECK_DEVICE();
    char* d_data;
    CUDA_SAFE_CALL( cudaMalloc( (void**) &d_data, mem_size ) );
    CUDA_SAFE_CALL( cudaMemset( d_data, 0, mem_size ) );
    CUT_CHECK_ERROR( "Allocate memory failed" );
    return d_data;
}

//start the lightweight kernel
extern "C" void run_light_kernel( int* mem_a, int* mem_b ) {
    CUT_CHECK_DEVICE();
    dim3 grid( 1, 1 );
    dim3 threads( 8, 16 );
    light_kernel<<< grid, threads >>>( mem_a, mem_b );
    CUT_CHECK_ERROR( "after run_light_kernel" );
}

//start the heavyweight kernel
extern "C" void run_heavy_kernel( int* mem_a, int* mem_b ) {
    CUT_CHECK_DEVICE();
    dim3 grid( 1, 1 );
    dim3 threads( 8, 16 );
    heavy_kernel<<< grid, threads >>>( mem_a, mem_b );
    CUT_CHECK_ERROR( "after run_heavy_kernel" );
}

/******************************************************
 This is the main program. It creates a thread using the pthread library and
 then executes a CPU function which takes much longer than the CUDA
 initialization plus kernel execution.
******************************************************/

#include <stdio.h>
#include <time.h>
#include "Windows.h"
#include "pthread.h"

//#define CPU_ALONE
//#define LIGHT_KERNEL

int *memA, *memB, *cpu_mem;

//runs on the CPU; just keeps the CPU busy
//(running time >> CUDA initialization + kernel execution)
void cpu_idle() {
    int* A = new int[1024 * 1024 * 4];
    int* B = new int[1024 * 1024 * 4];

    if ( (!A) || (!B) ) exit( -100 );

    for ( int j = 0; j < 6 * 6 * 128; ++j ) {
        for ( int i = 0; i < 1024 * 1024 * 4; ++i ) {
            A[i] = B[i%128] + B[i%512] + B[i%2048] + B[i%4096];
            B[i] = A[i%128] + A[i%512] + A[i%2048] + A[i%4096];
        }
    }

    delete [] A;
    delete [] B;
}

/******************************************************
 This thread (gpu_thread) starts the GPU computation.
*******************************************************/
void* slave_thread( void* S ) {
    int loop_count = 0;

    //allocate two chunks of GPU memory
    memA = (int*)cuda_alloc_mem( 1024 * 1024 * 2 );
    memB = (int*)cuda_alloc_mem( 1024 * 1024 * 2 );

    //temporary buffer in host memory, used for reading results back
    cpu_mem = new int[1024 * 1024 * 4];

    //let the GPU compute a fixed number of times;
    //you can choose either the light kernel or the heavy kernel
    while ( loop_count++ < 1000 ) {
#ifdef LIGHT_KERNEL
        run_light_kernel( memA, memB );
#else
        run_heavy_kernel( memA, memB );
#endif
    }

    //you can read back the data and verify that the GPU really does the computation
    // cuda_g2c( memA, cpu_mem, 1024 * 1024 * 2 );

    cuda_free_mem( (char*)memA );
    cuda_free_mem( (char*)memB );
    delete [] cpu_mem;

    fprintf( stdout, " Thread running times : %i\n ", loop_count );

    return S;
}

int main( int argc, char** argv ) {
    double begin_time, end_time;
    pthread_t thread;

    begin_time = clock();

#ifndef CPU_ALONE
    int Slave = pthread_create( &thread, NULL, slave_thread, NULL );
#endif

    cpu_idle();

#ifndef CPU_ALONE
    pthread_join( thread, NULL );
#endif

    end_time = clock();
    fprintf( stdout, "Total time : %lf", end_time - begin_time );

    return 1;
}

BTW, I am using the Release configuration in Visual C++ 7.0, and the data read back from the GPU shows that the GPU really is doing the computation.

Thanks, I’ll try running this locally and see what I see.

One comment though: you're only running on one of the 16 processors on the G80. You're probably aware of that, but just in case you aren't, I'm pointing out that you're getting ~1/16th or so of what you could get out of it :-).

So could you try changing the grid size to (4,4) and see if that has any effect on running time (it mostly shouldn’t). If you see the GPU task take ~16x longer, then somehow you’re not running on your GPU. Are you running from within X windows BTW?
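For concreteness, here's a rough sketch of how the heavy kernel could actually spread its work over a (4,4) grid (the name heavy_kernel_spread and the indexing below are my own illustration, not code from this thread): each thread handles a strided slice of the arrays instead of every thread writing all 1024*256 elements, so all multiprocessors get a share of the work.

//hypothetical sketch: partition the work across all blocks and threads of the grid
__global__ void heavy_kernel_spread( int* memA, int* memB )
{
    const int n       = 1024 * 256;
    const int nthread = blockDim.x * blockDim.y;                 //threads per block
    const int tid     = threadIdx.y * blockDim.x + threadIdx.x;  //thread index within the block
    const int block   = blockIdx.y * gridDim.x + blockIdx.x;     //block index within the grid
    const int stride  = nthread * gridDim.x * gridDim.y;         //total threads in the grid

    //each thread walks a disjoint, strided subset of the arrays
    for ( int i = block * nthread + tid; i < n; i += stride )
        memA[i] = memB[i] = 99;
}

//launched e.g. as: heavy_kernel_spread<<< dim3(4,4), dim3(8,16) >>>( memA, memB );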

Thanks, it seems you are right.

I changed the grid size to (4,4), and now it takes twice the original time, i.e. with grid (1,1) it is 10 minutes; with grid (4,4) it is 20 minutes. Really weird.

If I don’t run CPU program, i.e. run GPU function alone. Kernel with grid(4,4) is about 6x times slower than the same kernel with grid(1,1).

My mail is ymzhang@cc.gatech.edu; send me a mail and I can send the whole source code to you.

BTW, I am running WinXP 32-bit, and I made a mistake above: I am actually compiling the project in Debug mode.

Hi,

looking at your code and the description in one of your posts of what you are doing with the kernels (repeating them 1000 times), I think your issue is that the two CPU threads are competing for the CPU. The reason is that your GPU thread requires CPU time for each of the 1000 while-loop iterations. It only blocks while the CUDA function call is executing, but the CPU is needed to call the device kernel function each time.

I’m willing to bet that if you had a dual-core machine, you’d see the speedup you expect. Also, you can try to modify your kernel codes to repeat them 1000 times wihout returning to the CPU (at the same time getting rid of the while-loop and calling the cuda kernel once). That too should solve the problem. Let me know if it works.

Paulius

It seems you misunderstand my question. I do know that starting the GPU computation takes some CPU time; what I want to know is why the GPU keeps consuming CPU time while it is computing.

While the G80 is computing, the CPU thread that controls CUDA should be blocked and swapped out by the OS, thus not consuming CPU cycles. However, it will consume CPU cycles every time it's swapped back in to launch the kernel, and these swaps are not necessarily instantaneous. Try my suggestion to run all 1000 iterations from within the kernel and launch the kernel exactly once (so that the amount of GPU work stays the same). You should observe a much better speedup from the CPU's point of view; let me know if that happens.

Synchronization is part of the overhead in any parallel programming. If you want to test how well the GPU and CPU overlap their computations, keep the number of communications between the two to a minimum. In your code you're asking the CPU to do two things - run its own computation and invoke the CUDA kernel 1000 times. Think of it this way: you have to read a chapter of a book. You'll finish it faster if you have to get up and open the door to let someone in only once, rather than having to get up 1000 different times.

Paulius

Paulius is right. I learned that when I started writing CUDA apps. Also, ymzhang's CUDA app does not seem to spread the load across all the threads. Moving data to and from the GPU, and kernel invocations, are very expensive.

I have observed with my own CUDA test kernels that the current implementation loads the host CPU at 100% while the CUDA kernel is running, at least in the Linux version. My GPU kernels often take 2 seconds per invocation, so this CPU load is definitely coming from something within the CUDA runtime itself, as my code is simply waiting on the CUDA kernel to finish. I've already reported this bug, and it was reproduced by one of the NVIDIA staff (last week).

John Stone

John, did you get any indication from Nvidia whether this "bug" will be fixed in next month's release? It's pretty important!
It sounds like the CUDA driver is spinning on device status while a kernel is running - is it possible to fix this properly, i.e. can the 8800 cards issue CPU interrupts?
(Spinning is OK for a beta but would be a problem afterwards.)

Hi,

I don’t have any additional information, but I’m sure that the NVIDIA engineers will address this when they can. I did watch my processes with ‘strace’ and from gdb etc to see if I could determine what the host thread was doing while the GPU kernel is running, it seems to be spinning in a CUDA runtime function of some sort. Just the fact that the Linux kernel is counting this as user time rather than system time seemed to indicate to me that the spinning is occuring in userland code, perhaps reading a memory mapped device register or something like that. In any case, this type of thing used to occur with OpenGL many years ago, and they obviously know how to avoid it in that case, so probably they just need time to work on it. For all we know they’ve already long since solved this and have moved onto other feature requests :-)

Cheers,

John Stone

Bump… Can we please have some guidance from Nvidia!
There should be a disclaimer in BIG FRIENDLY LETTERS at the front of the Programming Guide warning system designers about this feature.
IMHO the correct solution for Unix is to have an open fd for each device (or per context, if we are allowed more than one per device at some point in the future) and allow device code to write single bytes across it. Then we can use select/poll to synchronise. On Windows I think one needs to expose an I/O completion port (I am not a Windows engineer). All this is NOT a quick fix and might entail a separate CUDA driver (no longer piggybacking on the video driver). The present system might already work this way and there is really just a typo somewhere, but that seems unlikely. Any fixes in the upcoming release? A rough sketch of what I mean is below.
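Purely hypothetical sketch of that suggestion (the per-device fd and the completion byte are invented for illustration; no such CUDA API exists today): the host thread would sleep in select() until the device signals completion, instead of spinning on a device register.

/* dev_fd: the imagined per-device file descriptor provided by the driver */
#include <sys/select.h>
#include <unistd.h>

void wait_for_kernel( int dev_fd )
{
    fd_set readfds;
    FD_ZERO( &readfds );
    FD_SET( dev_fd, &readfds );

    /* block (consuming no CPU) until the device writes a completion byte */
    select( dev_fd + 1, &readfds, NULL, NULL, NULL );

    char done;
    read( dev_fd, &done, 1 );   /* drain the notification byte */
}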
Thanks, Eric

Rest assured, we are working on this problem. It will likely get better incrementally from release to release.

Mark