Howdy, Stranger!
It looks like you're new here. If you want to get involved, click one of these buttons!
Categories
- All Discussions1,524
- General534
- Graphics109
- GPU Computing419
- Mobile141
- Pro Graphics163
- Tools158
In this Discussion
- fwende February 10
Concurrent Kernel Execution - Scheduling Mechanism
-
currently, i'am doing some experiments on how to efficiently adapt concurrent kernel execution to my projects. to get a deeper insight into how kernels are scheduled on the gpu/device, i played around with a small kernel doing some simple math on a vector of O(10^6) floats. the kernel itself is designed such that it uses just one threadblock consisting of 256 threads, so that each kernel can be mapped onto one multiprocessor.
if i now run n=32 of these kernels (each of them manipulating its own vector) using the same stream, i get a total runtime for all kernels which is t_0=432ms. running all 32 kernels using 32 different streams gives a total runtime t_1=28ms, which is approximately t_0/15, which is close to 16 (i run the program on a tesla m2090, which is capable of running up to 16 kernels concurrently). using just 16 different streams also results in t_1=28ms, which is clear, as the device allows for at most 16 concurrent kernels at the same time.
#define NUM_KERNELS 32
#define NUM_STREAMS 16
#define NUM_VEC_ELEMENTS ( 1024*1024 )
static float *d_array[NUM_KERNELS];
static float *h_array[NUM_KERNELS];
static __global__ void kernel( float *ptr ) {
// do some math
}
int main() {
cudaStream_t
streamPool[NUM_KERNELS],
*stream[NUM_KERNELS];
// create arrays on device & host.
for( int i=0; i<NUM_KERNELS; i++ ) {
cudaMalloc( (void **) &d_array[i], NUM_VEC_ELEMENTS*sizeof( float ) );
h_array[i] = new float[NUM_VEC_ELEMENTS];
}
// fill 'h_array' at random using drand48(),
// and copy data to device ('d_array').
// create 'NUM_KERNELS' streams.
for( int i=0; i<NUM_KERNELS; i++ )
cudaStreamCreate( &streamPool[i] );
// map streams onto 'streamPool' using this pattern:
// 0 1 2 3 ... 15 0 1 2 3 ... 15
for( int i=0; i<NUM_KERNELS; i++ )
stream[i] = &streamPool[i%NUM_STREAMS];
// start timer, and then run kernels.
for( int i=0; i<NUM_KERNELS; i++ )
kernel<<< 1, 256, 0, ( *stream[i] ) >>>( d_array[i] );
// cudaDeviceSynchronize & stop timer.
// copy data from device in order to check results.
// free memory.
return 0;
}
up to this point, all is well. if i now switch to the following stream mapping
// code as above
int main() {
...
// map streams onto 'streamPool' using this pattern.
// 0 0 1 1 2 2 3 3 ... 15 15
for( int i=0; i<NUM_KERNELS; i++ )
stream[i] = &streamPool[i/( NUM_KERNELS/NUM_STREAMS )];
...
}
the total runtime is t_2=230ms, which gives a speedup of almost a factor 2 compared with serial kernel execution. my understanding of this is as follows:
kernels sent to gpu are placed into some kind of gpu-internal queue (i have a dequeue in mind). say we have some kind of scheduler on the gpu, then this scheduler takes kernels from the queue's front and makes them run on any available multiprocessor. if successive kernels in the queue should run on different streams, the scheduler is free to run them concurrently on the device, but at most 16 of them. if two (or more) successive kernels should run on the same stream, the scheduler takes the first of them and then stops dequeuing kernels from the queue, since they may be not independent (it also stops dequeuing even if after the stop-dequeuing-kernel there is a kernel that should run on a different stream). it restarts dequeuing kernels if the previous kernel (which makes the scheduler stop dequeuing) is finished.
with respect to my second code-sample (which gives a speedup of almost 2, although 16 different streams are used), this would explain the bad performance. instead of 16 kernels, there are just 2 kernels that can run concurrently on the device. all in all, this happens 15 times, so that the speedup over serial execution is 32/(32-32/2+1)=1.88 -> execution time t_2=(432/1.88)ms=230ms.
my question now is: are my considerations right? is there a scheduler on the device, which acts similar to what i described above.
if yes: the speedup that can be achieved using concurrent kernel execution significantly depends on the order in which kernels are send to gpu/device. -
1 Comment sorted by
-
recently, i found a slide-show-pdf directly from nvidia which in a certain sense supports what i described.
'maximizing gpu efficiency in extreme throughput applications' (2009)
its a litle bit old, but the main principles should not have changed (and according to my measurements they did not).