I have a Quadro K4000 GPU (compute capability 3.0) with two monitors attached. With no GPGPU code running, the NVIDIA Control Panel reports GPU usage somewhere below 10%.
I am trying to run four copies of a kernel I wrote concurrently. Initially I just repeated the kernel call four times, but the NVIDIA profiler showed no concurrent execution, i.e., all of the calls ran sequentially. To cut a long story short, I wrote a dummy kernel that takes two int arrays, adds them element-wise, and stores the result in a third. Here is that kernel:
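__global__ void dummy_kernel(int *a, int *b, int *c, int size)
{
    // Only threadIdx.x is used, so every block works on the same elements.
    int tidx = threadIdx.x;
    // Repeat the addition many times to make the kernel run long enough to profile.
    for (int jj = 0; jj < 1024*8; ++jj)
    {
        // temp only feeds the bounds check below; it never enters the sum.
        int temp = jj % 64;
        // Each thread handles elements tidx, tidx + 32, tidx + 64, ...
        for (int ii = tidx; ii < size; ii += 32)
        {
            if (ii + temp >= size)
                temp = 0;
            c[ii] = a[ii] + b[ii];
        }
    }
}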
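To give a sense of how much work one launch does, here is a minimal host-side sketch that counts the additions a single thread performs. It mirrors the loop bounds of dummy_kernel (8192 outer passes, stride 32); the function names are hypothetical and exist only for this illustration.

```cpp
// Host-side arithmetic only; compiles as plain C++ or inside a .cu file.
// Function names are hypothetical and not part of the original code.

// Inner-loop iterations for one outer pass: ii = tidx, tidx + stride, ... while ii < size.
constexpr long long adds_per_pass(int tidx, int size, int stride) {
    return tidx < size ? (size - tidx + stride - 1) / stride : 0;
}

// Total additions for one thread: 1024*8 outer passes, stride 32, as in dummy_kernel.
constexpr long long adds_per_thread(int tidx, int size) {
    return 1024LL * 8 * adds_per_pass(tidx, size, 32);
}
```

With size = 1024, each of the 32 threads in a block performs 262,144 additions. Because the kernel indexes only by threadIdx.x, every block in the grid repeats that same work on the same 1024 elements.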
The kernels are launched as follows:
cudaStream_t cuda0, cuda1, cuda2, cuda3;
cudaStreamCreateWithFlags(&cuda0, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&cuda1, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&cuda2, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&cuda3, cudaStreamNonBlocking);
fprintf(stdout, "Starting GPU code (with dummy kernel).\n");
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
clock_t begin = clock();
cudaEventRecord(start);
my_kernel<<<400, 32, 0, cuda0>>>(parameter_list);
dummy_kernel<<<400, 32, 0, cuda1>>>(a1, b1, c1, 1024);
dummy_kernel<<<400, 32, 0, cuda2>>>(a2, b2, c2, 1024);
dummy_kernel<<<400, 32, 0, cuda3>>>(a3, b3, c3, 1024);
cudaStreamDestroy(cuda0);
cudaStreamDestroy(cuda1);
cudaStreamDestroy(cuda2);
cudaStreamDestroy(cuda3);
When I compiled this and ran it in the NVIDIA profiler, there was very little execution overlap between the kernels. Am I doing something wrong in how I set up the streams, or is there something else I need to do? During the run, the GPU utilization reported by the NVIDIA Control Panel jumps to 100%. The profiler reports that each kernel has a maximum theoretical occupancy of 25%. I assume that means each kernel was launched on its own SMX (the Quadro K4000 has 4 SMXs).
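As a sanity check on the 25% figure, here is a minimal sketch of the arithmetic. It assumes the per-SMX limits usually documented for compute capability 3.0 (64 resident warps and 16 resident blocks) and ignores register and shared-memory limits. The constant and function names are my own and do not come from the profiler.

```cpp
// Host-side arithmetic only; compiles as plain C++ or inside a .cu file.
// Assumed per-SMX limits for compute capability 3.0 (hypothetical constant names).
constexpr int kWarpSize = 32;
constexpr int kMaxWarpsPerSmx = 64;
constexpr int kMaxBlocksPerSmx = 16;

// Warps needed by one block, rounding partial warps up.
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Blocks that fit on one SMX, limited by warp slots and block slots only.
constexpr int resident_blocks_per_smx(int threads_per_block) {
    return (kMaxWarpsPerSmx / warps_per_block(threads_per_block)) < kMaxBlocksPerSmx
               ? (kMaxWarpsPerSmx / warps_per_block(threads_per_block))
               : kMaxBlocksPerSmx;
}

// Resident warps as a percentage of the SMX warp capacity.
constexpr int theoretical_occupancy_percent(int threads_per_block) {
    return 100 * resident_blocks_per_smx(threads_per_block)
               * warps_per_block(threads_per_block) / kMaxWarpsPerSmx;
}

// Blocks that can be resident on the whole device at once.
constexpr int resident_blocks_on_device(int threads_per_block, int smx_count) {
    return resident_blocks_per_smx(threads_per_block) * smx_count;
}
```

Under these assumptions, 32 threads per block gives theoretical_occupancy_percent(32) == 25 and resident_blocks_on_device(32, 4) == 64. If the limits are right, the 25% may come from the 16-block-per-SMX cap combined with one-warp blocks rather than from each kernel being confined to a single SMX, and a single 400-block grid would already exceed the 64 resident block slots.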
I am compiling in Visual Studio 2013 with the -default-stream per-thread command-line flag passed to nvcc.