Hi all,
Our application processes large amounts of data using many small kernels. We process the data in bursts because some kernels run faster when they operate on a whole burst (others do not, and their number depends on the burst size). So the larger the burst, the more performance we can get.
But we found that we cannot always increase the burst size as much as we would like. When many kernels are waiting in the queue, new kernel launches become blocking. We definitely need to avoid this, because while the kernels run we need to prepare further bursts on the CPU.
We cannot simply pick an "ok" burst size, because we do not know how many kernels are in the queue at a given moment. Device performance also varies, so on different devices a different number of kernels will have finished by the same point in the code.
We have now found that launches start to block once 1024 kernels are in the queue. You can try it with a simple example (a slightly modified vectorAdd):
/**
 * Copyright 1993-2013 NVIDIA Corporation. All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 */

/**
 * Vector addition: C = A + B.
 *
 * This sample is a very basic sample that implements element by element
 * vector addition. It is the same as the sample illustrating Chapter 2
 * of the programming guide with some additions like error checking.
 */

#include <stdio.h>
#include <stdlib.h>  // malloc, free, rand, exit
#include <math.h>    // fabs

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

/**
 * CUDA Kernel Device code
 *
 * Computes the vector addition of A and B into C. The 3 vectors have the same
 * number of elements numElements.
 */
__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

/**
 * Host main routine
 */
int
main(void)
{
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    // Print the vector length to be used, and compute its size.
    // The size is large enough for each kernel to run for some time
    // (to outweigh the CPU kernel launch overhead of ~7 us).
    int numElements = 50000000;
    size_t size = numElements * sizeof(float);
    printf("[Vector addition of %d elements]\n", numElements);

    // Allocate the host input vector A
    float *h_A = (float *)malloc(size);
    // Allocate the host input vector B
    float *h_B = (float *)malloc(size);
    // Allocate the host output vector C
    float *h_C = (float *)malloc(size);

    // Verify that allocations succeeded
    if (h_A == NULL || h_B == NULL || h_C == NULL)
    {
        fprintf(stderr, "Failed to allocate host vectors!\n");
        exit(EXIT_FAILURE);
    }

    // Initialize the host input vectors
    for (int i = 0; i < numElements; ++i)
    {
        h_A[i] = rand()/(float)RAND_MAX;
        h_B[i] = rand()/(float)RAND_MAX;
    }

    // Allocate the device input vector A
    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device input vector B
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Allocate the device output vector C
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Copy the host input vectors A and B in host memory to the device input vectors in
    // device memory
    printf("Copy input data from the host memory to the CUDA device\n");
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Launch 1030 kernels (more than the 1024 at which launches start to block)
    for (int i = 0; i < 1030; i++)
    {
        // Launch the Vector Add CUDA Kernel
        int threadsPerBlock = 256;
        int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
        // printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
        vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
        err = cudaGetLastError();
        if (err != cudaSuccess)
        {
            fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }

    // Copy the device result vector in device memory to the host result vector
    // in host memory.
    printf("Copy output data from the CUDA device to the host memory\n");
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Verify that the result vector is correct
    for (int i = 0; i < numElements; ++i)
    {
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
        {
            fprintf(stderr, "Result verification failed at element %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }
    printf("Test PASSED\n");

    // Free device global memory
    err = cudaFree(d_A);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_B);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    err = cudaFree(d_C);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);

    // Reset the device and exit
    err = cudaDeviceReset();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to deinitialize the device! error=%s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    return 0;
}
Here’s what the profiler shows:
[External image: profiler screenshot, not included]
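The blocking can also be observed without a profiler by timing each launch call on the host. The following is only a minimal sketch of that idea: `timeHostCalls` and `firstSlowCall` are hypothetical helper names that are not part of the sample above, and any threshold passed to `firstSlowCall` is an assumption that would need tuning per machine.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the vectorAdd sample): calls launch(i)
// `count` times and records how long each call kept the host thread busy,
// in microseconds.
template <typename LaunchFn>
std::vector<double> timeHostCalls(LaunchFn launch, std::size_t count)
{
    using Clock = std::chrono::steady_clock;
    std::vector<double> hostMicros;
    hostMicros.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        const Clock::time_point start = Clock::now();
        launch(i);  // e.g. a lambda wrapping the vectorAdd<<<...>>>(...) line
        const Clock::time_point stop = Clock::now();
        hostMicros.push_back(
            std::chrono::duration<double, std::micro>(stop - start).count());
    }
    return hostMicros;
}

// Hypothetical helper: index of the first call that took longer than
// thresholdMicros on the host, or -1 if no call did.
long firstSlowCall(const std::vector<double>& hostMicros, double thresholdMicros)
{
    for (std::size_t i = 0; i < hostMicros.size(); ++i) {
        if (hostMicros[i] > thresholdMicros) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

In the example above, the callable would wrap the kernel launch inside the 1030-iteration loop. A normal launch costs about 7 us on the host, so a threshold well above that (for instance 1000 us, an assumed value) should single out the first launch that had to wait for space in the queue.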
So our questions are:
Thanks a lot.