Kernels are NOT queuing, bug in CUDA 2.0? cudaThreadSynchronize() makes no difference?

Hi,

I have a strange problem… I have the following kernel calls, which I think should be queued in ~3 microseconds.

gpu_ford_phia<<<dimGrid,dimBlock>>>((vec_space*)adds[0],(BIGspace*)adds[3],(Bspace*)adds[1]);

for(int i=0;i<12;i++)
{
	gpu_mtranT_prod<<<dimGrid5,dimBlock5>>>((Bspace*)adds[2],(BIGspace*)adds[3],i);
	gpu_T_prodmtran<<<dimGrid6,dimBlock6>>>((Bspace*)adds[2],(BIGspace*)adds[3],i);
	gpu_mT_prod<<<dimGrid4,dimBlock4>>>((Bspace*)adds[2],(BIGspace*)adds[3],i);
	gpu_ford_phic<<<dimGrid3,dimBlock3>>>((vec_space*)adds[0],(BIGspace*)adds[3],(Bspace*)adds[2],(Bspace*)adds[1],i);
}

And they are queued in 4e-3 seconds with CUDA/nvcc version 2.0, which is fine.

Now if I add this kernel call

gpu_R<<<dimGridR,dimBlockR>>>((Bspace*)adds[2], upd*(N/Block_sizeR), (sol_space*)adds[4]); // the gpu_R kernel does reduction-type operations on matrices

after the for loop of the previous kernels, the queuing takes 0.1 seconds??

Hence, I am losing all the speedup here, as I want to do some CPU calculations before the GPU finishes.

I then inserted cudaThreadSynchronize(); and saw that it takes the same time, 0.1 seconds.

Hence the program actually waits for the last kernel to finish before transferring control to the CPU, even though I am not using any synchronization. Why would this happen?

I don't know why this is happening; does anyone have any ideas?

Here is the pastebin link to the gpu_R kernel, in case anyone thinks it's due to something in the kernel (which I doubt)…

thanks all

CUDA 2.0 is very old. The driver queuing has been significantly improved since then. Try this with CUDA 2.3 and see if you get the same problem.

Mr. Anderson is correct.

There is some undocumented limit to the number of queued kernels. You're queuing up 48 of them already. Probably your change, which boosted it to 60, pushed it over this limit, so the final kernels you queue will block until a few of the first kernels finish and there's room to push the last ones.

This limit was increased in CUDA 2.2 if I remember correctly… perhaps it's even better in 2.3.

Well, I would have been happy if this were the case, but apparently even if I just call this one kernel like this (only once):

gpu_R<<<dimGridR,dimBlockR>>>((Bspace*)adds[2], upd*(N/Block_sizeR), (sol_space*)adds[4] );

it still takes 0.1 seconds (ideally it should take < 1 microsecond) :(? And the execution time for this kernel is also 0.1 seconds (I checked with CUDA events).

So I guess the number of kernels queued is not the problem. I have no idea why this is happening or how it can be corrected. Is this a bug?

Also, I tried the single-precision version on CUDA 2.3: same behavior :(.

thanks very much for all your help

Are you using a profiler? If so, try disabling it; in my experience profilers can sometimes cause kernel launches to become synchronous.

No, I am not using a profiler; I am compiling the code like this:

nvcc -c -O3 -arch=sm_13 -keep CUDA_DRIVER2.cu

and linking it to my FORTRAN code. The strangest thing is that only this kernel launch behaves like this; the others work just fine.

Also, does anyone have any clue why kernel launches can become synchronous?

thanks all

Post more of your code, especially everything between your timer start and end calls. It’s hard to analyze without seeing that.

Perhaps you're calling some function like cudaMemcpy, which gives you an implicit thread synchronization.

Thanks for the help. I have pasted both the kernel and host code on pastebin; below are the two links with some description.

Here is the link to my HOST CODE in C. This code is called by my FORTRAN code:

  1. Once for allocating memory on the GPU and getting GPU memory pointers ("allocate_gpu"). These pointers are then passed around from FORTRAN to the other host CUDA routines for all GPU work.

  2. Again, to copy the data to the GPU and launch the kernels asynchronously (calulate_arrays_gpu).

I am just timing around the kernels in my host code using a CPU timer (not shown in the host code), but it's like this:

clock_t t1, t2;

float ratio = 1.0/CLOCKS_PER_SEC;

t1 = clock();

//gpu_ford_phia<<<dimGrid,dimBlock>>>((vec_space*)adds[0],(BIGspace*)adds[3],(Bspace*)adds[1]);
//for(int i=0;i<12;i++)
//{
//	gpu_ford_phib<<<dimGrid2,dimBlock2>>>((vec_space*)adds[0],(BIGspace*)adds[3],(Bspace*)adds[2],(Bspace*)adds[1],i); // this routine also has the same problem, but it is very fast so I don't mind it for now
//	gpu_mtranT_prod<<<dimGrid5,dimBlock5>>>((Bspace*)adds[2],(BIGspace*)adds[3],i);
//	gpu_T_prodmtran<<<dimGrid6,dimBlock6>>>((Bspace*)adds[2],(BIGspace*)adds[3],i);
//	gpu_mT_prod<<<dimGrid4,dimBlock4>>>((Bspace*)adds[2],(BIGspace*)adds[3],i); // not updating global memory
//	gpu_ford_phic<<<dimGrid3,dimBlock3>>>((vec_space*)adds[0],(BIGspace*)adds[3],(Bspace*)adds[2],(Bspace*)adds[1],i);
//}

gpu_R<<<dimGridR,dimBlockR>>>((Bspace*)adds[2], upd*(N/Block_sizeR), (sol_space*)adds[4]); // here I am just timing the main problem routine

t2 = clock();

printf("\nTime  %f  -> non blocking call time in seconds\n", (t2-t1)*ratio);

Here is the link to my KERNEL CODE (the main culprit routine is the last function, gpu_R).

I found out that another routine in the middle is doing the same blocking launch, but it was so fast that I didn't realize it. The name of the other routine is "gpu_ford_phib".

It has been highlighted with a comment following it in the host code. I tried to find similarities between the two culprit routines but I couldn't find any (one of them uses a lot of local memory, the other uses zero local memory and no communication between threads), so I don't know why this would happen in these two routines.

I do have a memcpy to the device, but it should only affect the next call to the host code from the FORTRAN main code, as I only copy memory to the GPU and not from the GPU (I am storing results in GPU memory until the very end).

I am thankful for your help on this.

Thanks for the code… but the host code you linked to doesn’t have the timers in it, and the host code snippet you pasted is unclear because of the commented-out code… do you mean that the effective code:

cudaMemcpy((vec_space*)adds[0],cpu_space,sizeof(vec_space),cudaMemcpyHostToDevice); 

t1 = clock();

gpu_R<<<dimGridR,dimBlockR>>>((Bspace*)adds[2], upd*(N/Block_sizeR), (sol_space*)adds[4] ); // here am just timing the main problem routine

t2 = clock();

shows a timer difference of > 100ms? (I included the memcpy because that’s important to show the queue is flushed before you started timing and launching kernels)

What makes you think your clock() timer call is accurate or precise? I’ve been burned before by surprisingly coarse timer resolution (thank you, OSX!).

Even in Windows, CLOCKS_PER_SEC is a coarse 1000.

If you use the CUDA library timer, at least it will be a double-check that what you're seeing is a launch issue and not a timer issue. Those use somewhat rawer OS timers (in Windows, the performance counters).

The other dumb-check to try is to make a no-op kernel, one that just immediately returns. Replace all your kernel calls with the no-op. Do you get similar delays?

If so, you know it's entirely a host-side problem.

It probably should be host-side, BTW… even if you had an uber-kernel, the launch should be asynchronous. The no-op kernel is just a data point for debugging.
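As a standalone program, the no-op test might look like this (the kernel name, grid size, and timing around the launch are my placeholders; it uses the CUDA 2.x-era cudaThreadSynchronize(), which later toolkits replaced with cudaDeviceSynchronize(), and it needs an nvcc build plus a CUDA device to run):

```cuda
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

// A kernel that does nothing: launching it exercises only the
// driver's launch path, with no device-side work at all.
__global__ void noop(void) {}

int main(void)
{
    noop<<<1, 1>>>();          // warm-up: pays the context-creation cost
    cudaThreadSynchronize();   // flush the queue before timing

    clock_t t1 = clock();
    noop<<<64, 256>>>();       // the launch being timed
    clock_t t2 = clock();      // an async launch should return almost instantly

    cudaThreadSynchronize();
    printf("launch call returned after %.6f s\n",
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}
```

If even this shows ~0.1 s between t1 and t2, the delay is in the host-side launch path; if it drops to microseconds, the original kernel itself is implicated.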

Hmm, thanks for the input. I have the results of the uber- and no-op-kernel tests.

Now here is what I did to test.

  1. I used a FORTRAN-based timer (cpu_time()) over the CUDA host function call; resolution is 1e-3. (Reliable; I have used it many times before.)

  2. I removed the memcpy operations to the GPU so that no sync occurs between multiple function calls (I know the answers will be wrong, but I just want to check).

The full kernel call with all kernels active:

gpu_ford_phia<<<dimGrid,dimBlock>>>((vec_space*)adds[0],(BIGspace*)adds[3],(Bspace*)adds[1]);

for(int i=0;i<12;i++)
{
	gpu_ford_phib<<<dimGrid2,dimBlock2>>>((vec_space*)adds[0],(BIGspace*)adds[3],(Bspace*)adds[2],(Bspace*)adds[1],i);
	gpu_mtranT_prod<<<dimGrid5,dimBlock5>>>((Bspace*)adds[2],(BIGspace*)adds[3],i);
	gpu_T_prodmtran<<<dimGrid6,dimBlock6>>>((Bspace*)adds[2],(BIGspace*)adds[3],i);
	gpu_mT_prod<<<dimGrid4,dimBlock4>>>((Bspace*)adds[2],(BIGspace*)adds[3],i); // not updating global memory
	gpu_ford_phic<<<dimGrid3,dimBlock3>>>((vec_space*)adds[0],(BIGspace*)adds[3],(Bspace*)adds[2],(Bspace*)adds[1],i);
}

gpu_R<<<dimGridR,dimBlockR>>>((Bspace*)adds[2], upd*(N/Block_sizeR), (sol_space*)adds[4]);

I call this big sequence "FULLKERNEL".

Now I give the results:

I time it using CUDA events like this:

cudaEventRecord(start,0);

FULLKERNEL

cudaEventRecord(stop,0);
cudaEventSynchronize(stop); // commented out if we are not synchronizing
cudaEventElapsedTime(&elapsedTime,start,stop);

On top of this I measure with the FORTRAN timer, which wraps this C function immediately (note there are no memcpy operations in the host code in this test).

Test 1: WITH cudaEventSynchronize and the full gpu_R kernel:

time measured with CUDA events, with cudaEventSynchronize (sec): 0.180971558

time measured by the CPU FORTRAN timer: 0.181972000000000

Test 2: WITHOUT synchronization, with the full gpu_R kernel:

time measured with CUDA events, without cudaEventSynchronize (sec): 0.000000000

time measured by the CPU FORTRAN timer: 0.182972000000000

Test 3: WITH synchronization and with the EMPTY (no-op) gpu_R kernel:

time measured with CUDA events, with cudaEventSynchronize (sec): 0.114623459

time measured by the CPU FORTRAN timer: 0.115982000000000

Test 4: WITHOUT synchronization and with the EMPTY (no-op) gpu_R kernel:

time measured with CUDA events, without cudaEventSynchronize (sec): 0.000000000

time measured by the CPU FORTRAN timer: 6.998000000000004E-003

/------------------------------------------------/

Now, after seeing this, I am confused. For the no-op gpu_R kernel the timings make sense.

But with the full kernel, the CPU stalls and waits for the kernels to finish, even though according to CUDA events control is passed back to the CPU. It's like the CPU waits to take control back from the GPU for some reason… so confused :-(

These results are free of memcpy, and they are for just one call to the host function from FORTRAN.

Do you have any better ideas about these results?

thanks very much for all your help

The FORTRAN and CUDA timers are interesting; they show you have decent precision, but the mismatches in some cases are further clues too.

Important point: if you removed the memcpy call, then you increased, not decreased, the uncertainty… how do you know the GPU work queue was empty before you started computing? That memcpy was giving you an implicit thread synchronize.
In fact, your Test 3 kind of shows that there IS work to do even with a no-op kernel… the synchronize call was waiting for SOMETHING, and a no-op kernel should take a trivial 20 us or so.

Here are the next quickie tests:

First, with your setup as before, try a test with NO kernel… not a no-op, but no kernel call at all. I bet that will act very close to the no-op case.

Next, after that test, add a cudaThreadSynchronize BEFORE the timer start. This will flush copies and kernels, so the queue is empty and you know there is no pending work that could be influencing the timing test.

This smells like either startup costs and context initialization (which you seem to be aware of, since you have a warmup call) or some unflushed kernels or transactions that are somehow still pending.
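Concretely, the flush-then-time sequence I'm suggesting would look like this (with a stand-in kernel in place of gpu_R, since I don't have your types; this assumes the CUDA 2.x runtime API and needs a device to run):

```cuda
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

__global__ void stand_in(float *p) { *p = 1.0f; }  // placeholder for gpu_R

int main(void)
{
    float *d;
    cudaMalloc((void **)&d, sizeof(float));

    cudaThreadSynchronize();   // flush ALL pending copies/kernels first

    clock_t t1 = clock();
    stand_in<<<64, 256>>>(d);  // async launch: t2 - t1 should be tiny
    clock_t t2 = clock();

    cudaThreadSynchronize();   // now actually wait for the kernel
    clock_t t3 = clock();

    printf("queueing: %.6f s, queueing + execution: %.6f s\n",
           (double)(t2 - t1) / CLOCKS_PER_SEC,
           (double)(t3 - t1) / CLOCKS_PER_SEC);
    cudaFree(d);
    return 0;
}
```

With the queue provably empty at t1, a large t2 - t1 can only come from the launch itself.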

Hmm, I don't think it can be startup cost. I create the context in a previous C function, which is called from FORTRAN to allocate GPU memory. I assume the context is created for the lifetime of the application, hence I don't call cudaSetDevice again. But I get the same results as above if I call the function twice; if it were context creation, I should have seen ideal performance on the second function call.

Now for the new tests you wanted:

Test 1: NO kernel calls, NO synchronization:

time measured with CUDA events, without cudaEventSynchronize (sec): 0.000000000

time measured by the CPU FORTRAN timer: 9.989999999999999E-004

Test 2: NO kernel calls, WITH synchronization:

time measured with CUDA events, with cudaEventSynchronize (sec): 0.000002976

time measured by the CPU FORTRAN timer: 1.000000000000001E-003

The first two tests look fine, I guess, as you said.

Test 3: cudaThreadSynchronize before the FORTRAN timer (hence before the call to the host function), full kernel call, WITH cudaEventSynchronize in the host function:

time measured with CUDA events, with cudaEventSynchronize (sec): 0.098825027

time measured by the CPU FORTRAN timer: 9.898399999999999E-002

Test 4: cudaThreadSynchronize before the FORTRAN timer (hence before the call to the host function), full kernel call, WITHOUT cudaEventSynchronize in the host function:

time measured with CUDA events, without cudaEventSynchronize (sec): 0.000000000

time measured by the CPU FORTRAN timer: 9.998499999999999E-002

:( It looks like there is no previous work waiting (I call the mallocs a few seconds before this computation, so I think they finish well before this).

Test 5: cudaThreadSynchronize before the FORTRAN timer (hence before the call to the host function), kernel call WITHOUT gpu_R active and WITHOUT cudaEventSynchronize in the host function:

time measured with CUDA events, without cudaEventSynchronize (sec): 0.000000000

time measured by the CPU FORTRAN timer: 3.999000000000003E-003

Again, only with the gpu_R kernel am I having problems…

I am now really starting to wonder why this is the case; some CUDA bug? Or nvcc can't hand control back to the CPU because of that kernel? I do have a lot of thread syncs in that gpu_R routine, as I am doing thread recursion on some matrix (6 by 6) times tensor (6 by 6 by 6) times matrix transpose, plus more calculations…

But I don't see any reason why this kernel would affect the thread scheduler or the kernel launcher.

Also, I even tried the single-precision implementation on CUDA 2.3… same bad results…

Thanks for all the help, man, really appreciate it.

Any updates from anyone on this problem?

:( :( :(

… I still can't solve the issue… I think I will submit it to NVIDIA as a bug, as I have no idea why this would happen (after doing the above tests)…

Please, can any NVIDIA guys tell me why a kernel without "cudaEventSynchronize(stop)" would synchronize? Any hidden truths here? (I am not using memcpy, and I know that memcpy synchronizes.)

thanks