Dynamic Block Scheduling and hardware latencies

Thanks for the code, Mr. Anderson. I intend to add more arithmetic to the kernel to expose the register-hazard latency (which needs a minimum of 192 threads to hide). Let's see how it goes…

Does anyone know what kind of operations cause register hazards in the pipeline?

Maybe, if I find time, I will post my findings next week…
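To illustrate what I mean, here is a rough sketch of the kind of dependent arithmetic I intend to add (the kernel name and constants are my own, just an illustration): every operation reads the register written by the previous one, so the register read-after-write latency cannot be hidden by independent instructions, and only enough resident warps (the 192-thread figure from the programming guide) can cover it.

// Sketch only: a serial chain of multiply-adds where each step depends on
// the previous result, so the register read-after-write latency is exposed
// whenever too few warps are resident to cover it.
__global__ void raw_chain(float* out, float seed, int iters)
{
    float x = seed + threadIdx.x;
    for (int i = 0; i < iters; i++)
    {
        // x from the previous iteration is needed before this line can issue
        x = x * 1.000001f + 0.5f;
    }
    out[threadIdx.x + blockIdx.x * blockDim.x] = x;
}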

Mr. Anderson,

I think I have found a bug in the following code! It is not actually causing the “skips” to happen at the desired rate.

The function:

void setup_skips(int len, int skips)
{
    // clear all the flags
    for (int i = 0; i < len; i++)
    {
        h_idata[i] = 0;
    }

    // mark "skips" distinct random elements
    for (int i = 0; i < skips; i++)
    {
        int skip = rand() % len;
        if (h_idata[skip])
        {
            i--;        // already marked, retry with another index
            continue;
        }
        h_idata[skip] = 1;
    }
}

This function scatters the SKIP flags throughout all “len” elements!

The “len” passed from the app covers the whole array, which means the skip flags are spread across the whole array!

However, in the kernel code the following check is being done:

__global__ void copy_gmem(int* g_idata, int* g_odata, int work)
{
    const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // read in per-block flag: true => quit processing immediately
    __shared__ int a;
    if (threadIdx.x == 0)
    {
        a = g_idata[blockIdx.x];
    }
    __syncthreads();

    if (a)
        return;

The decision on whether to return or not is based on g_idata[blockIdx.x]. Thus only the first gridDim.x elements of this array determine which blocks are skipped and which are not… However, “setup_skips” marks skips across the ENTIRE array and NOT just the first gridDim.x elements!
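If my understanding is right, the fix is to set up the skip flags over only as many elements as there are blocks. Something like this on the host side (num_blocks, skip_percent, d_idata, d_odata and work are just placeholder names of mine):

// Sketch of what I believe the intended setup is:
const int num_blocks   = 20000;   // same as the gridDim.x used for the launch
const int skip_percent = 50;      // desired skip rate

setup_skips(num_blocks, num_blocks * skip_percent / 100);   // one flag per block
// ... copy the first num_blocks elements of h_idata to d_idata ...
copy_gmem<<<num_blocks, 32>>>(d_idata, d_odata, work);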

I hope you understand my point!

Kindly let me know if my understanding is wrong here.

I am trying to run the application here, but somehow it just hangs on my setup! Let me see what is going on.

Meanwhile, I would appreciate your feedback. I am in the process of writing a whitepaper, and the discussion happening here could help me put together some meaningful inputs.

Thanks,

Best Regards,

Sarnath

Well, I fixed the bug in the code. So, here are the results:

"
$ ./DynamicBlock.exe
no skips
30000 619.348633

50% skips
30000 350.690918

90% skips
30000 134.198273
"

Isn't that interesting? (The 50% skip case takes about 57% of the no-skip time, 350.7/619.3, and the 90% skip case about 22%, 134.2/619.3.)

This is with 32 threads per block…

You are correct: the idata array and setup_skips should only operate on an array whose length equals the number of blocks.

However, the bug is only that the array is made too large. Since the number of skips is len*some_factor, statistically the used portion of the array contains the same proportion of skips as the rest, so the results I posted still stand.

Any whitepaper should account for the apparent overhead, which I think is due to a non-uniform distribution of active working blocks across the MPs. Here are my previous comments on the subject:

A quick Monte Carlo simulation should be able to verify this. The 192-thread block size had a different amount of overhead (presumably because fewer blocks run concurrently on each MP), so that could be used as another data point to validate the simulation.
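To make the Monte Carlo idea concrete, here is a rough host-side sketch (all numbers and names are my own assumptions, not measurements): each MP holds a fixed number of concurrent blocks, a skipped block costs a small fraction of a full block, and the next block always goes to whichever slot frees first. Comparing the predicted totals for 0%, 50% and 90% skips against the measured times would show how much of the overhead is just scheduling.

// Rough Monte Carlo sketch of block scheduling (assumed numbers: 16 MPs,
// 8 concurrent blocks per MP, 20,000 blocks, skipped block = 1% of the cost
// of a full block).  Host-only code; compiles with g++ or nvcc.
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <functional>
#include <queue>
#include <vector>

int main()
{
    const int    num_blocks = 20000;
    const int    slots      = 16 * 8;   // MPs * active blocks per MP
    const double full_cost  = 1.0;      // block that does its work
    const double skip_cost  = 0.01;     // block that returns early

    for (double skip_frac : {0.0, 0.5, 0.9})
    {
        // min-heap of the times at which each block slot becomes free
        std::priority_queue<double, std::vector<double>,
                            std::greater<double> > free_at;
        for (int s = 0; s < slots; s++) free_at.push(0.0);

        double finish = 0.0;
        for (int b = 0; b < num_blocks; b++)
        {
            double start = free_at.top();   // earliest available slot
            free_at.pop();
            double cost  = (rand() / (double)RAND_MAX < skip_frac)
                               ? skip_cost : full_cost;
            double end   = start + cost;
            finish = std::max(finish, end);
            free_at.push(end);
        }
        printf("skip fraction %.0f%% -> total time %.1f block-units\n",
               skip_frac * 100.0, finish);
    }
    return 0;
}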

I don't think so. The array size is 32*20000 and the number of blocks is 20,000.

50% of 32*20000 and 50% of 20,000 are two different things! You can't just take the results at face value; you don't know what percentage of blocks actually ran! That's the problem!

---- edit ----

What you said there still holds --some-- water! The statistical-proportion argument, yes, I understand what you are saying… But much still depends on how the "rand" function behaves, what it is seeded with, and so on… Anyway, it is still non-deterministic, is what I would say.

---- edit over ----

Anyway, the latencies will get exposed when you add more multiplication, division and some floating-point arithmetic plus global memory accesses! Let me see if I can get it to show more meaningful output.

I do see your point about the non-uniform distribution of blocks among the MPs, but that should hit only the last few bunches of blocks (on my GTX, with 8 active blocks per MP, only the last 128 blocks to be scheduled would expose the latencies… the others should still run fine if active blocks are replaced dynamically).

Anyway, I am going to dig further with more experiments! I will keep you all posted.

I added a few floating point computations and some coalesced global memory access to the same kernel. Here is what I get.

"
$ ./DynamicBlock.exe
no skips
30000 4308.401855

50% skips
30000 2944.312012

90% skips
30000 1503.874634
"

The 50% skip case runs in about 68% of the no-skip time (2944.3 / 4308.4).
The 10% work case (90% skips) runs in about 35% of the no-skip time (1503.9 / 4308.4).

Note that we are spawning 20,000 blocks! For the 10% case, only about 2,000 blocks execute in full!

Since the kernel runs for a long time, if active blocks were replaced dynamically the MPs would settle down to running a full set of non-skipped blocks, and the skipped blocks would cost almost nothing. But that does NOT seem to be the case! That's my point!

Well, what do you guys think?

Can you use the block id rather than global memory data to out-schedule blocks? For example, use something like this at the beginning of the kernel:

if ((blockIdx.x & 1) + (blockIdx.y & 1) == 1) return;

This is a cool idea as well; it will work well for the 50% case.

The global memory approach lets us simulate any percentage we would like, which is more convenient!

You are right. But in this case you don't have to access global memory to out-schedule blocks…
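For example (just a sketch of the idea; the kernel name and WORK_PERCENT are mine), any skip percentage can be approximated from the block index alone:

// Decide from the block index alone whether this block does any work.
#define WORK_PERCENT 10   // roughly 10% of blocks work, the rest return early

__global__ void copy_gmem_idx(int* g_idata, int* g_odata, int work)
{
    // blocks whose index falls outside the working fraction quit immediately
    if (blockIdx.x % 100 >= WORK_PERCENT)
        return;

    // ... rest of the kernel stays the same ...
}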