Concurrent Kernel Execution on Fermi - confusion

I am trying to use concurrent kernels to improve the performance of my application.

First, the problem space was divided evenly and the kernels were executed concurrently using different streams. My results show a significant performance gain, thanks to the overlap of kernel execution and device-to-host data transfers.

However, when I sum up the individual kernel execution times, the total is higher than the original execution time with the default stream. That suggests the kernels are not really running simultaneously; I believe this might be due to resource sharing.

I am confident I have implemented the concurrent kernel execution correctly, so I need experts' advice on this. It would be really appreciated.
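For reference, the launch pattern I describe above is roughly the following. This is only a minimal sketch: myKernel, d_out, h_out, CHUNK and NSTREAMS are placeholders, not my real code, and h_out must be pinned (cudaMallocHost) for the copy to actually overlap.

```cuda
// Sketch: split the problem space across NSTREAMS streams and overlap
// each stream's kernel with the other streams' D->H copies.
const int NSTREAMS = 4;
cudaStream_t streams[NSTREAMS];
for (int i = 0; i < NSTREAMS; ++i)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < NSTREAMS; ++i) {
    // each stream works on its own chunk of the problem space
    myKernel<<<8, 256, 0, streams[i]>>>(d_out + i * CHUNK, CHUNK);
    // async D->H copy in the same stream; overlaps other streams' kernels
    cudaMemcpyAsync(h_out + i * CHUNK, d_out + i * CHUNK,
                    CHUNK * sizeof(uint32_t), cudaMemcpyDeviceToHost,
                    streams[i]);
}
cudaDeviceSynchronize();

for (int i = 0; i < NSTREAMS; ++i)
    cudaStreamDestroy(streams[i]);
```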

In my application I use a global buffer where every thread writes to a different location. Since each thread writes to its own element, I do not think any thread has to wait for another.

I also use shared memory: each thread reads from another global array into shared memory, processes the data there, and writes the result back to global memory.

Please advise me.


Post some source?

Kernels from different streams do not run simultaneously when one kernel already occupies all the MPs… A book I read says that kernels from different streams must finish before the next one can be launched, meaning there is no such thing as overlapping kernels, only overlapping a kernel with memcpys. So I'm a bit in doubt… The first sentence is actually my own interpretation; I've never tried small kernel launches to test it out.

As for the increased kernel time, I'm guessing that you're using small launch sizes and each thread uses so few resources that one kernel launch does not occupy all the MPs. If that is the case, putting a concurrent kernel on the few idle MPs may increase pressure on global memory bandwidth and thus increase global memory operation time for all running threads.
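One way to test whether kernels from different streams really overlap is to launch deliberately tiny, long-running kernels in separate streams and compare the wall time against the single-kernel time. This is a sketch; spinKernel is a hypothetical busy-loop kernel, not anything from the thread.

```cuda
// Hypothetical busy-loop kernel: one block burns a fixed number of
// cycles on a single MP, leaving the other MPs idle.
__global__ void spinKernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles)
        ;  // spin
}

void testConcurrency() {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    cudaEvent_t begin, end;
    cudaEventCreate(&begin);
    cudaEventCreate(&end);

    cudaEventRecord(begin);
    spinKernel<<<1, 32, 0, s[0]>>>(100000000LL);  // 1 block: many MPs idle
    spinKernel<<<1, 32, 0, s[1]>>>(100000000LL);
    cudaEventRecord(end);
    cudaEventSynchronize(end);

    float ms;
    cudaEventElapsedTime(&ms, begin, end);
    // If the kernels overlap, ms is roughly the time of ONE kernel;
    // if they serialize, it is roughly double.
}
```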

The book probably predates Fermi, at which time the statement actually was true.

So are the other guesses I made correct?

Thanks for your comments, I will post some code here later, give me some time.

Is there a way to schedule SMs to different kernels?

No. Scheduling heuristics on the GPU are completely undefined in the CUDA programming model.

Not defined… maybe. But you could do multiple launches of the same kernel, each time increasing the number of blocks by 1, to see the way in which blocks are distributed to different MPs. That order will remain the same if there is only 1 context and 1 stream. You can check the second half of the discussion here

With that method, you can, with a bit of manual adjustment, launch blocks on the exact MPs you want by inspecting blockIdx and the %smid register.
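Reading %smid from a kernel can be done with inline PTX. A minimal sketch (whereAmI and d_smid are illustrative names): launch it with an increasing block count and inspect the output array to see how blocks land on MPs, keeping in mind the caveat below that none of this is guaranteed.

```cuda
// Sketch: record which MP (SM) each block runs on by reading the
// %smid special register via inline PTX.
__device__ unsigned int getSmid() {
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ void whereAmI(unsigned int *d_smid) {
    // one write per block is enough to map block -> MP
    if (threadIdx.x == 0)
        d_smid[blockIdx.x] = getSmid();
}
```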

For your GPU, with the current assembler and driver you are using. NVIDIA makes no guarantees about this stuff at all. It can and does change depending on things which are not defined in the programming model and not controllable with the APIs. It isn't much fun designing and writing code which is predicated on assumptions that prove to be false in the next driver release…

What’s the run time of your kernels?

As others already pointed out, the sum of the individual kernel runtimes will increase when overlapping them, as they contend for resources. The overall execution time should decrease, though.

Thanks everyone, here is my kernel.

When I run this as a single kernel (the default) with <<<8,256>>>, execution time: 8.40 ms.

When I run 2 kernels in 2 streams with <<<8,256>>>, execution time: 8.56 ms.

When I run 4 kernels in 4 streams with <<<8,256>>>, execution time: 9.10 ms.

Why is this not running concurrently? My GPU has compute capability 2.1, and I have tested concurrent kernel execution with simpler kernels; those do run concurrently.
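To make sure the numbers above compare like with like, it is worth measuring the wall time of the whole batch rather than summing per-kernel times. A sketch of such a harness (not the poster's actual code; NKERNELS and the streams array are assumed):

```cuda
// Sketch: time the whole batch of stream launches with events on the
// legacy default stream, so the measurement spans all streams.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);           // precedes all work in all streams
for (int i = 0; i < NKERNELS; ++i)
    MTkernel<<<8, 256, 0, streams[i]>>>(d_status, d_data, size, i * 8);
cudaEventRecord(stop, 0);            // waits for all preceding work
cudaEventSynchronize(stop);

float ms;
cudaEventElapsedTime(&ms, start, stop);
// ms is the wall time of the whole batch, which is what should shrink
// if the kernels truly overlap, even as per-kernel times grow.
```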

The “status” buffer is in shared memory; sh1_tbl, temper_tbl, … and all the other tables are in constant memory.

Please have a look at this and let me know your views. Thank you so much…

__global__ void MTkernel(mtgp32_kernel_status_t* d_status, uint32_t* d_data,
                         int size, int abid) {
    const int bid = blockIdx.x;
    const int tid = threadIdx.x;
    const int actualBid = abid + bid;
    int pos = pos_tbl[actualBid];
    uint32_t r;
    __shared__ uint32_t status[LARGE_SIZE];  // per-block state (see note above)

    // copy status data from global memory to shared memory.
    status_read(status, d_status, actualBid, tid);

    // main loop
    for (int i = 0; i < size; i += LARGE_SIZE) {
        uint32_t Y;
        uint32_t X = (status[LARGE_SIZE - N + tid] & mask)
                   ^ status[LARGE_SIZE - N + tid + 1];
        X ^= X << sh1_tbl[actualBid];
        Y = X ^ (status[LARGE_SIZE - N + tid + pos] >> sh2_tbl[actualBid]);
        r = Y ^ param_tbl[actualBid][Y & 0x0f];
        status[tid] = r;
        uint32_t T = status[LARGE_SIZE - N + tid + pos - 1];
        T ^= T >> 16;
        T ^= T >> 8;
        // the array index was eaten by the forum software; this follows
        // the usual MTGP32 output layout:
        d_data[size * actualBid + i + tid] = r ^ temper_tbl[actualBid][T & 0x0f];
        __syncthreads();  // status[] entries written above are read by other threads

        X = (status[(4 * THREAD_NUM - N + tid) % LARGE_SIZE] & mask)
          ^ status[(4 * THREAD_NUM - N + tid + 1) % LARGE_SIZE];
        X ^= X << sh1_tbl[actualBid];
        Y = X ^ (status[(4 * THREAD_NUM - N + tid + pos) % LARGE_SIZE] >> sh2_tbl[actualBid]);
        r = Y ^ param_tbl[actualBid][Y & 0x0f];
        status[tid + THREAD_NUM] = r;
        T = status[(4 * THREAD_NUM - N + tid + pos - 1) % LARGE_SIZE];
        T ^= T >> 16;
        T ^= T >> 8;
        d_data[size * actualBid + THREAD_NUM + i + tid] = r ^ temper_tbl[actualBid][T & 0x0f];
        __syncthreads();

        X = (status[2 * THREAD_NUM - N + tid] & mask)
          ^ status[2 * THREAD_NUM - N + tid + 1];
        X ^= X << sh1_tbl[actualBid];
        Y = X ^ (status[2 * THREAD_NUM - N + tid + pos] >> sh2_tbl[actualBid]);
        r = Y ^ param_tbl[actualBid][Y & 0x0f];
        status[tid + 2 * THREAD_NUM] = r;
        T = status[tid + pos - 1 + 2 * THREAD_NUM - N];
        T ^= T >> 16;
        T ^= T >> 8;
        d_data[size * actualBid + 2 * THREAD_NUM + i + tid] = r ^ temper_tbl[actualBid][T & 0x0f];
        __syncthreads();
    }

    // write back status for next call
    status_write(d_status, status, actualBid, tid);
}


Hi I have posted my kernel… I am waiting for your response…

Hi I have the same problem.

I tried to use concurrent kernels but I don't see any speedup.

Overlapping memory copies with kernel execution, however, did work for me.

I also looked at the concurrent kernels sample in the SDK.

I do not think it does the right thing.