Dynamic block scheduling and exposed hardware latencies

I have a kernel which is launched with some 20,000 blocks.

However, nearly a quarter to a half of the blocks immediately discover that there is no computation left for them (this is determined by the user input in global memory) and they simply exit, creating “block holes” on the multi-processors.
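The pattern looks roughly like this (just an illustrative sketch, not my actual kernel; `work_flags` and the output array are made-up names):

__global__ void myKernel(const int *work_flags, float *out)
{
    // One flag per block, filled in by the host from the user input.
    // If it says there is nothing to do, the whole block exits at once,
    // leaving a "block hole" on its multi-processor.
    if (work_flags[blockIdx.x] == 0)
        return;

    // ... otherwise the real computation for this block follows here ...
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;   // placeholder
}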

My question is: “Will the CUDA hardware immediately dispatch another block to fill the block hole?” This matters because block holes tend to expose latencies in the kernel, which kills performance.

My experiments tend to imply that the CUDA hardware is NOT filling block holes immediately. It looks like the CUDA hardware first dispatches “N” blocks on a multi-processor, then waits for all N to complete, and only then schedules another set of “N” blocks, and so on.

Can someone shed some light on this hardware behaviour?

I understand perfectly well that one should NOT code against such hardware details, but I am just wondering whether NVIDIA could consider dynamic block scheduling on multi-processors (assuming my finding is right).

Best Regards,
Sarnath

I’m curious too. If what you say is right, then it makes no sense to have some blocks finish sooner than others (which would be kind of bad?)

This could well be true, because I wrote a CPU helper function to identify such block holes and compact them away, and it gave a good speed-up: I was able to save 10 ms while processing 1000 binomial options. The idea is sketched below.
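For reference, the compaction step looks roughly like this (only a sketch under my own naming assumptions; `flags` is the per-block user input and `myKernel` stands in for the real kernel):

// Host side: build a compacted list of the block indices that actually have
// work, launch only that many blocks, and let each block look up its real
// index through the map instead of using blockIdx.x directly.
void launch_compacted(const int *flags, int num_blocks, int threads_per_block,
                      int *d_out)
{
    int *active_blocks = (int *)malloc(num_blocks * sizeof(int));
    int count = 0;
    for (int b = 0; b < num_blocks; ++b)
        if (flags[b] != 0)                  // this block really has work
            active_blocks[count++] = b;

    if (count > 0)
    {
        int *d_active_blocks;
        cudaMalloc((void **)&d_active_blocks, count * sizeof(int));
        cudaMemcpy(d_active_blocks, active_blocks, count * sizeof(int),
                   cudaMemcpyHostToDevice);

        // Inside the kernel the real index is then
        //     int realBlock = active_blocks[blockIdx.x];
        myKernel<<<count, threads_per_block>>>(d_active_blocks, d_out);

        cudaFree(d_active_blocks);
    }
    free(active_blocks);
}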

BUT THEN,

I also developed an alternate version which would spawn a fixed number of blocks to process all of the non-compacted blocks (see the sketch below). That did NOT give me the required speed-up.
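Roughly, that version looked like this (again only a sketch with made-up names): launch a fixed grid and let every block stride over all the logical work items, skipping the holes itself instead of relying on the scheduler:

__global__ void fixedBlocksKernel(const int *flags, int *out, int num_items)
{
    // A fixed number of blocks is launched; each block walks over the full
    // range of logical work items with a grid-sized stride.
    for (int item = blockIdx.x; item < num_items; item += gridDim.x)
    {
        if (flags[item] == 0)
            continue;                       // a hole: nothing to compute

        // placeholder for the real per-item computation
        out[item * blockDim.x + threadIdx.x] = item;
    }
}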

So my claim need not be 100% true. If an NVIDIA person could clarify this, it would be great!

Hmm, that is indeed interesting, and I hope to be able to give some input into this thread soon (meaning that I am ready to scale my code up to large numbers; I am still missing one piece of code to do that). For me it is even worse: I will have some blocks that find out they have nothing to update and exit early, with no way to tell beforehand in my host code. The difference in time between a block that has something to do and a block that has nothing to do will be around under 18 vs 180 microseconds. I was kind of hoping that I could process 10 blocks that do nothing in the time of 1 block doing something (in reality it will be 4-32 blocks doing something vs 4-32 blocks doing nothing, depending on how many threads I manage to stuff into a block).

So I will have a gridDim of shape (number_of_parallel_computations, number_of_blocks_per_computation, 1), and my ‘holes’ will be along the number_of_parallel_computations dimension.

If first all blocks with blockIdx.x == 0 get scheduled, and only afterwards all blocks with blockIdx.x == 1, I will probably not have too much trouble when I use 16 blocks or more per computation. Does anybody know which dimension is ‘processed’ first?

I guess a lot of benchmarking will be required to find out my optimal grid & block dimensions.

I assume you have to terminate all threads in a half-warp to free any computational resources, e.g. if threads 16-31 terminate, one (execution slot of a) SIMD processor is liberated.

Edit: Oh, you are talking about blocks here. Sorry, my mistake. (mental note to self - no thinking before first cup of coffee)

To add something to this: in my case, the block holes appeared at constant intervals. This would mean that some blocks would see only holes and some blocks only compute. That again is bad, because a block that sees a hole exits early, exposing latencies. So my original inference could still hold water.

Even if the hardware does replace a block as soon as it terminates, it is a bad idea to have many blocks in a grid that do no work at all.

The time to start a new block must be non-zero. Even assuming a perfect scheduler, compared to a program with P working blocks, a similar program with P working blocks plus M “block hole” blocks would take more time, simply because of the time spent creating the additional blocks.

What would be interesting to know is the number of cycles required to schedule a new block, relative to the cost of other operations on the card (such as a global memory access).

Yes, it is a bad idea if the programmer can eliminate it. But for some problems the user input determines the computation, NOT your program. In such cases it is inevitable; we have to live with it.

Not really. Block holes cause blocks to exit early, leaving fewer active blocks on your multi-processor. And if there are fewer than 192 active threads on the multi-processor, you end up exposing register latencies and global memory latencies, which hurts performance. In my kernel in particular there are only 32 threads per block, so I need the multi-processor to be saturated with at least 6 blocks at a time. So, if the scheduler is NOT dynamically replacing block holes but instead waits for all blocks on a multi-processor to complete, I end up exposing latencies and getting bad performance. I am seeing this in my kernel now; that is why I raised this topic.

Yeah, that would be nice to know. I also feel that creating a new block must be a very simple operation: just reset the instruction pointer of all WARPs, change the “ctaid” register, and start execution. That’s all.

Denis,

Did you ever get a chance to experiment with the “block hole” code?

Can someone from NVIDIA shed some light on this dynamic block scheduling? Does the latest Tesla model have any improvements in this area?

Thank you
Best Regards,
Sarnath

No, not yet. I will have to integrate my code into a large framework before I can make this happen. I am now in the process of learning the basics on which the framework is built… so it will not happen quickly :D

Thank you Denis!

Can someone from NVIDIA clarify things about this dynamic block scheduling? I could still boost my performance a lot if the hardware handles this well.

Can someone throw some light on this, please?

I’ll think about it today and see if I can come up with a toy benchmark to make some measurements. Maybe I’ll have some results tonight.

Thank you so much for your time. Appreciate it very much!!!

I ran a kernel where each block either immediately returned or performed a large number of additions in a for loop (code can be posted if you really want it).

When all blocks perform the calculation, it takes 61.5 ms.

When only 50% of the blocks perform the calculation (which ones are determined randomly), it takes 33.7 ms. 33.7 / 61.5 = 0.5480. This is close to 0.5, the ideal value if blocks fill in holes immediately. The extra time over 0.5 might be due to some sort of scheduling overhead.

I repeated this for various amounts of work performed in each block and got the same result.

When 90% of the blocks skip the calculation, it takes only 9.6 ms to complete.
9.6 / 61.5 = 0.1561. Again, this is close to the ideal value of 0.1, but with a little bit of overhead.

It seems clear that the hardware does fill block holes immediately, though there is some extra time that might be explained by a small scheduling overhead.

Mr.Anderson,

Thank you for your experiments.

May I know –

  1. How many blocks were active on each multi-processor at any given time?
  2. What was the shared memory usage of each block?
  3. How many threads per block were you running?

These factors still affect dynamic block scheduling.

The idea is: when the block holes are NOT filled, the latencies should get exposed.

For example, you could try with 32 threads per block and 6 active blocks per multi-processor (192 threads are enough to hide the register latencies). Then let 25% or 50% of the blocks exit early.

Can you share this data so that we can scrutinize it further?

Once again, Thank you very much for your time.

Best Regards,
Sarnath

Thank you very much for this information!

Indeed, this is very helpful knowledge for my coming project, where I will have blocks with BIG computation times that sometimes skip themselves after only a very short computation time.

No problem guys, glad I could help.

The test I posted above was with 192 threads per block and a shared memory usage of 192*4 bytes plus arguments per block. This is 100% occupancy with 4 blocks per MP. Per the original post, I ran 20,000 blocks.

As requested, here are the numbers for a block size of 32 with 8 blocks on each MP. There are still only 20,000 blocks, but I increased the amount of “work” done by each block by a factor of 192/32 so that the numbers between the two tests are comparable.

No skips: 62.16 ms
50% skips: 34.8 ms :: 34.8 / 62.16 = 0.56
90% skips: 15.05 ms :: 15.05 / 62.16 = 0.242

These results are very similar to before, but with slightly more overhead.

Thinking about it now, the extra overhead with 90% of the blocks skipped might not be due to a scheduling overhead, but rather to a non-uniform distribution of “active” blocks among the MPs. I.e. one MP might get 8 blocks that actually perform calculations while another keeps getting blocks that exit early, so that MP effectively does nothing.

A quick simulation with randomly scheduled blocks could confirm that hypothesis, but I only have so much free time ;) In any case, a non-uniform distribution of active blocks should only be a major overhead when a very large fraction of the blocks is skipped. Everyone in this thread so far seems to indicate that their application has only a small fraction of skipped blocks, so this is a non-issue.
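In case someone does want to test it, a toy host-side simulation could look like the following. It assumes blocks are handed out in launch order to whichever execution slot frees up first (which is exactly the behaviour under discussion, so treat the numbers as illustrative only); NUM_SLOTS, WORK and the skip fraction are made-up parameters, not measurements of real hardware:

#include <stdio.h>
#include <stdlib.h>

#define NUM_BLOCKS 20000
#define NUM_SLOTS  (16 * 8)   /* e.g. 16 MPs x 8 resident blocks            */
#define WORK       100        /* cost of a working block, vs. 1 for a skip  */

int main(void)
{
    double slot_free[NUM_SLOTS] = {0};
    double total = 0;
    double skip_fraction = 0.9;

    for (int b = 0; b < NUM_BLOCKS; ++b)
    {
        /* give the next block to the slot that becomes free earliest */
        int best = 0;
        for (int s = 1; s < NUM_SLOTS; ++s)
            if (slot_free[s] < slot_free[best])
                best = s;

        double cost = ((double)rand() / RAND_MAX < skip_fraction) ? 1 : WORK;
        slot_free[best] += cost;
        total += cost;
    }

    /* makespan = when the last slot finishes; ideal = perfectly balanced load */
    double makespan = 0;
    for (int s = 0; s < NUM_SLOTS; ++s)
        if (slot_free[s] > makespan)
            makespan = slot_free[s];

    printf("makespan %.0f vs ideal %.0f (ratio %.3f)\n",
           makespan, total / NUM_SLOTS, makespan / (total / NUM_SLOTS));
    return 0;
}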

Thanks a lot Mr.Anderson.

Can you kindly post the source code that you used for the test, so that I can build on it if I find time?

Once again, Really appreciate your time and interest.

Thank you

Best Regards,
Sarnath

Code attached.

Edit: odd, whenever I attach the file it only adds a 0-byte file. I’m including it in a code block below. Apologies for any lack of tab formatting in the resulting version.

#include <stdio.h>
#include <stdlib.h>

#define CUDA_SAFE_CALL(call) do {                                            \
    cudaError_t err = call;                                                  \
    if (cudaSuccess != err) {                                                \
        fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",        \
                __FILE__, __LINE__, cudaGetErrorString(err));                \
        exit(EXIT_FAILURE);                                                  \
    } } while (0)

#ifdef NDEBUG
#define CUT_CHECK_ERROR(errorMessage)
#else
#define CUT_CHECK_ERROR(errorMessage) do {                                   \
    cudaThreadSynchronize();                                                 \
    cudaError_t err = cudaGetLastError();                                    \
    if (cudaSuccess != err) {                                                \
        fprintf(stderr, "Cuda error: %s in file '%s' in line %i : %s.\n",    \
                errorMessage, __FILE__, __LINE__, cudaGetErrorString(err));  \
        exit(EXIT_FAILURE);                                                  \
    } } while (0)
#endif

#define BLOCK_SIZE 32

int *d_odata, *d_idata, *h_idata;

__global__ void copy_gmem(int *g_idata, int *g_odata, int work)
{
    const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // Read the per-block flag: true => quit processing immediately
    __shared__ int a;
    if (threadIdx.x == 0)
    {
        a = g_idata[blockIdx.x];
    }
    __syncthreads();

    if (a)
        return;

    // The flag was false: perform "work" sums on shared memory
    __shared__ int sdata[BLOCK_SIZE];
    sdata[threadIdx.x] = 0;
    for (int i = 0; i < work; i++)
    {
        sdata[threadIdx.x] += work;
    }
    g_odata[idx] = sdata[threadIdx.x];
}

void do_bmark(dim3 grid, dim3 threads, int work)
{
    // Warm-up launch so the timed loop does not include one-time overheads
    copy_gmem<<<grid, threads>>>(d_idata, d_odata, work);

    cudaEvent_t start, end;
    CUDA_SAFE_CALL(cudaEventCreate(&start));
    CUDA_SAFE_CALL(cudaEventCreate(&end));

    CUDA_SAFE_CALL(cudaEventRecord(start, 0));
    for (int i = 0; i < 100; ++i)
    {
        copy_gmem<<<grid, threads>>>(d_idata, d_odata, work);
    }
    CUDA_SAFE_CALL(cudaEventRecord(end, 0));
    CUDA_SAFE_CALL(cudaEventSynchronize(end));

    float runTime;
    CUDA_SAFE_CALL(cudaEventElapsedTime(&runTime, start, end));
    runTime /= 100.0f;
    printf("%d %f\n", work, runTime);

    CUDA_SAFE_CALL(cudaEventDestroy(start));
    CUDA_SAFE_CALL(cudaEventDestroy(end));
}

void setup_skips(int len, int skips)
{
    // Clear all flags, then randomly mark "skips" distinct entries as holes
    for (int i = 0; i < len; i++)
    {
        h_idata[i] = 0;
    }
    for (int i = 0; i < skips; i++)
    {
        int skip = rand() % len;
        if (h_idata[skip])
        {
            i--;
            continue;
        }
        h_idata[skip] = 1;
    }
    CUDA_SAFE_CALL(cudaMemcpy(d_idata, h_idata, sizeof(int) * len,
                              cudaMemcpyHostToDevice));
}

int main()
{
    int len = BLOCK_SIZE * 20000;
    int num_threads = BLOCK_SIZE;

    CUDA_SAFE_CALL(cudaMalloc((void **)&d_idata, sizeof(int) * len));
    CUDA_SAFE_CALL(cudaMalloc((void **)&d_odata, sizeof(int) * len));
    h_idata = (int *)malloc(sizeof(int) * len);

    dim3 threads(num_threads, 1, 1);
    dim3 grid(len / num_threads, 1, 1);

    printf("no skips\n");
    setup_skips(len, 0);
    int work = 30000;
    do_bmark(grid, threads, work);

    printf("\n50%% skips\n");
    setup_skips(len, len * 0.5);
    do_bmark(grid, threads, work);

    printf("\n90%% skips\n");
    setup_skips(len, len * 0.9);
    do_bmark(grid, threads, work);

    return 0;
}