Concurrent Kernel Execution on Fermi - confusion

I am trying to use concurrent kernels to improve the performance of my application.

First, the problem space was divided evenly and concurrent kernels were launched in different streams. My results show a significant performance gain from overlapping kernel execution with device-to-host data transfers.
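Roughly, the structure looks like this (a simplified sketch with placeholder names and sizes, not my actual code): each stream processes its own slice of the data and copies it back, so the D-to-H copy of one slice can overlap the kernel of another.

#include <cuda_runtime.h>

// Placeholder kernel, standing in for my real one.
__global__ void myKernel(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * 2.0f;
}

int main(void)
{
    const int NUM_STREAMS = 4;
    const int totalElems  = 1 << 20;
    const int chunk       = totalElems / NUM_STREAMS;   // problem space divided evenly

    float *h_buf, *d_buf;
    cudaHostAlloc((void **)&h_buf, totalElems * sizeof(float), cudaHostAllocDefault); // pinned, needed for async copies
    cudaMalloc((void **)&d_buf, totalElems * sizeof(float));

    cudaStream_t streams[NUM_STREAMS];
    for (int s = 0; s < NUM_STREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < NUM_STREAMS; ++s) {
        int offset = s * chunk;
        // kernel for one slice and its D-to-H copy go into the same stream;
        // the copy of one slice can overlap the kernel of another stream
        myKernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_buf + offset, chunk);
        cudaMemcpyAsync(h_buf + offset, d_buf + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    cudaDeviceSynchronize();
    for (int s = 0; s < NUM_STREAMS; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}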

However, when I sum up the execution times of the individual kernels, the total is higher than the original run that used the default stream. That suggests the kernels are not truly running simultaneously; I believe this might be due to resource sharing.

I am confident I have implemented the concurrent kernel execution correctly, so I would appreciate experts' advice on this.

In my application I use a global buffer where each thread writes to a different location. Since every thread writes to its own location, I do not think any thread has to wait for another while writing to it.

I also use shared memory: each thread reads from another global array into shared memory, processes the data there, and writes the result back to global memory.
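A stripped-down sketch of that access pattern (made-up names, not my real kernel):

// Each thread stages one element in shared memory, processes it, and writes
// the result to its own unique slot of a global output buffer.
__global__ void processKernel(const unsigned int *g_in, unsigned int *g_out, int n)
{
    __shared__ unsigned int s_buf[256];          // one element per thread of the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    if (gid < n)
        s_buf[threadIdx.x] = g_in[gid];          // global -> shared
    __syncthreads();

    if (gid < n) {
        unsigned int v = s_buf[threadIdx.x];
        v ^= v >> 8;                             // stand-in for the real processing
        g_out[gid] = v;                          // every thread writes its own location
    }
}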

Please advise me.

Thanks

Post some source?

Kernels from different streams do not run simultaneously when one kernel already occupies all the MPs… Another book I read says that kernels from different streams must finish before the next one can be launched, meaning there is no such thing as overlapping kernels, only overlapping a kernel with memcpys. So I'm a bit in doubt… The first sentence is actually my own interpretation; I never tried small kernel launches to test it out.

As for the increased kernel time, I'm guessing that you have small launch sizes and each thread uses few resources, so that one kernel launch does not occupy all the MPs. If that is the case, getting a few idle MPs to run the concurrent kernel may increase global memory bandwidth pressure and thus increase global memory operation time for all running threads.

The book probably predates Fermi, at which time the statement actually was true.

So are the other guesses I made correct?

Thanks for your comments. I will post some code here later; give me some time.

Is there a way to schedule different kernels onto specific SMs?

No. Scheduling heuristics on the GPU are completely undefined in the CUDA programming model.

Not defined… maybe. But you could do multiple launches of the same kernel, each time increasing the number of blocks by 1, to see how blocks are distributed to the different MPs. That order will remain the same if there is only one context and one stream. You can check the second half of the discussion here

With that method, you can, with a bit of manual adjustment, launch blocks on the exact MPs you want by inspecting blockIdx and the %smid register.
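For example, something like this reads %smid via inline PTX so you can observe the block-to-SM mapping (a minimal sketch, illustrative names only):

#include <cstdio>
#include <cuda_runtime.h>

// Each block records the SM it ran on by reading the %smid special register.
__device__ unsigned int get_smid(void)
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ void report_smid(unsigned int *block_to_sm)
{
    if (threadIdx.x == 0)
        block_to_sm[blockIdx.x] = get_smid();
}

int main(void)
{
    const int num_blocks = 8;
    unsigned int h_map[num_blocks], *d_map;
    cudaMalloc((void **)&d_map, num_blocks * sizeof(unsigned int));

    report_smid<<<num_blocks, 32>>>(d_map);
    cudaMemcpy(h_map, d_map, num_blocks * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    for (int b = 0; b < num_blocks; ++b)
        printf("block %d ran on SM %u\n", b, h_map[b]);

    cudaFree(d_map);
    return 0;
}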

For your GPU, with the current assembler and driver you are using, that is. NVIDIA makes no guarantees about this stuff at all. It can and does change depending on things which are not defined in the programming model and not controllable with the APIs. It isn't much fun designing and writing code that is predicated on assumptions which prove to be false in the next driver release…

What’s the run time of your kernels?

As others have already pointed out, the sum of the individual kernel runtimes will increase when you overlap them, as they contend for resources. The overall execution time should decrease, though.
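To make that concrete, you can time each kernel with events in its own stream and the whole batch with a separate pair of events (a sketch with a placeholder kernel, not your code):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, just to have something to time.
__global__ void dummyKernel(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * 1.5f + 2.0f;
}

int main(void)
{
    const int NUM_STREAMS = 4, N = 1 << 20;
    float *d_buf;
    cudaMalloc((void **)&d_buf, NUM_STREAMS * N * sizeof(float));

    cudaStream_t streams[NUM_STREAMS];
    cudaEvent_t start[NUM_STREAMS], stop[NUM_STREAMS], allStart, allStop;
    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaEventCreate(&start[s]);
        cudaEventCreate(&stop[s]);
    }
    cudaEventCreate(&allStart);
    cudaEventCreate(&allStop);

    cudaEventRecord(allStart);                      // overall start (default stream)
    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaEventRecord(start[s], streams[s]);      // per-kernel start in its stream
        dummyKernel<<<(N + 255) / 256, 256, 0, streams[s]>>>(d_buf + s * N, N);
        cudaEventRecord(stop[s], streams[s]);       // per-kernel stop in its stream
    }
    cudaEventRecord(allStop);                       // default stream waits for all prior work
    cudaEventSynchronize(allStop);

    float sum = 0.0f, total = 0.0f, t;
    for (int s = 0; s < NUM_STREAMS; ++s) {
        cudaEventElapsedTime(&t, start[s], stop[s]);
        sum += t;
    }
    cudaEventElapsedTime(&total, allStart, allStop);
    printf("sum of per-kernel times: %.2f ms, overall time: %.2f ms\n", sum, total);

    cudaFree(d_buf);
    return 0;
}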

Thanks everyone, here is my kernel,

When I run this as a single kernel (default stream) with <<<8,256>>>, execution time: 8.40 ms

When I run 2 kernels in 2 streams, each with <<<8,256>>>, execution time: 8.56 ms

When I run 4 kernels in 4 streams, each with <<<8,256>>>, execution time: 9.10 ms

Why is this not running concurrently? My GPU is compute capability 2.1, and I have tested concurrent kernel execution with simpler kernels; those do run concurrently.

The "status" buffer is in shared memory; sh1_tbl, temper_tbl, … and all the other tables are in constant memory.
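For reference, the constant-memory tables are declared roughly like this (BLOCK_NUM_MAX and the exact sizes here are placeholders, not my real declarations; the 16-entry second dimension follows from the & 0x0f indexing in the kernel):

#include <stdint.h>

#define BLOCK_NUM_MAX 128   // placeholder upper bound on the total number of blocks

__constant__ int      pos_tbl[BLOCK_NUM_MAX];
__constant__ int      sh1_tbl[BLOCK_NUM_MAX];
__constant__ int      sh2_tbl[BLOCK_NUM_MAX];
__constant__ uint32_t param_tbl[BLOCK_NUM_MAX][16];
__constant__ uint32_t temper_tbl[BLOCK_NUM_MAX][16];
__constant__ uint32_t mask;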

Please have a look into this and let me know your views. Thank you so much…

__global__ void MTkernel(mtgp32_kernel_status_t* d_status, uint32_t* d_data, int size, int abid) {
    const int bid = blockIdx.x;
    const int tid = threadIdx.x;
    const int actualBid = abid + bid;
    int pos = pos_tbl[actualBid];
    uint32_t r;

    // per-block generator state kept in shared memory
    // (declaration restored; it was dropped from the post)
    __shared__ uint32_t status[LARGE_SIZE];

    // copy status data from global memory to shared memory.
    status_read(status, d_status, actualBid, tid);

    // main loop
    for (int i = 0; i < size; i += LARGE_SIZE) {
        uint32_t Y;
        uint32_t X = (status[LARGE_SIZE - N + tid] & mask) ^ status[LARGE_SIZE - N + tid + 1];
        X ^= X << sh1_tbl[actualBid];
        Y = X ^ (status[LARGE_SIZE - N + tid + pos] >> sh2_tbl[actualBid]);
        r = Y ^ param_tbl[actualBid][Y & 0x0f];
        status[tid] = r;

        uint32_t T = status[LARGE_SIZE - N + tid + pos - 1];
        T ^= T >> 16;
        T ^= T >> 8;
        // tempered output to global memory (the subscripts on d_data were lost
        // in the post; indices here follow the reference MTGP32 kernel)
        d_data[size * bid + i + tid] = r ^ temper_tbl[actualBid][T & 0x0f];
        __syncthreads();

        X = (status[(4 * THREAD_NUM - N + tid) % LARGE_SIZE] & mask) ^ status[(4 * THREAD_NUM - N + tid + 1) % LARGE_SIZE];
        X ^= X << sh1_tbl[actualBid];
        Y = X ^ (status[(4 * THREAD_NUM - N + tid + pos) % LARGE_SIZE] >> sh2_tbl[actualBid]);
        r = Y ^ param_tbl[actualBid][Y & 0x0f];
        status[tid + THREAD_NUM] = r;

        T = status[(4 * THREAD_NUM - N + tid + pos - 1) % LARGE_SIZE];
        T ^= T >> 16;
        T ^= T >> 8;
        d_data[size * bid + THREAD_NUM + i + tid] = r ^ temper_tbl[actualBid][T & 0x0f];
        __syncthreads();

        X = (status[2 * THREAD_NUM - N + tid] & mask) ^ status[2 * THREAD_NUM - N + tid + 1];
        X ^= X << sh1_tbl[actualBid];
        Y = X ^ (status[2 * THREAD_NUM - N + tid + pos] >> sh2_tbl[actualBid]);
        r = Y ^ param_tbl[actualBid][Y & 0x0f];
        status[tid + 2 * THREAD_NUM] = r;

        T = status[tid + pos - 1 + 2 * THREAD_NUM - N];
        T ^= T >> 16;
        T ^= T >> 8;
        d_data[size * bid + 2 * THREAD_NUM + i + tid] = r ^ temper_tbl[actualBid][T & 0x0f];
        __syncthreads();
    }

    // write back status for next call
    status_write(d_status, status, actualBid, tid);
}

Hi, I have posted my kernel… I am waiting for your response…

Hi, I have the same problem.

I tried to use concurrent kernels but I don't see any speedup.

But when I overlap memory copies with kernel execution, it does work.

I also looked at the concurrent kernels sample in the SDK.

I do not think it does the right thing.