Overlapping GPU and CPU computation?

Hi,

I am experiencing a problem running CUDA and CPU code concurrently. My understanding (after looking at page 13 of http://www.nvidia.com/content/GTC/documents/1122_GTC09.pdf) is that it is possible to launch a CUDA kernel asynchronously and carry out work on the CPU while the kernel is executing. Consider the example below:

Sample.cu

__global__ void CUDA_Long_Kernel(float* num)
{
    // Global 2D thread coordinates.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Linear index into the buffer; the row stride is the full grid width
    // (gridDim.x * blockDim.x), not just blockDim.x.
    int idx = y * gridDim.x * blockDim.x + x;

    // Busy loop just to keep the GPU occupied for a while.
    for (int i = 0; i < 90000; i++)
    {
        num[idx] = 0;
        __syncthreads();
    }
}

extern "C" void CUDA_Long_Kernel(dim3 threadsPerBlock, dim3 numBlocks, cudaStream_t stream, float* pos)
{
    // Asynchronous launch into the given stream; control should return to the host immediately.
    CUDA_Long_Kernel<<<numBlocks, threadsPerBlock, 0, stream>>>(pos);
}

main.cpp

#include <stdio.h>
#include <math.h>               // for pow()
#include <cuda.h>
#include <cutil.h>
#include <cuda_runtime_api.h>
#include <cutil_inline_drvapi.h>
#include <cutil_inline_runtime.h>
#include <windows.h>

extern "C" void CUDA_Long_Kernel(dim3 threadsPerBlock, dim3 numBlocks, cudaStream_t stream, float* pos);

int main(int argc, char* argv[])
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(30, 30, 1);

    // One float per thread in the grid (note the sizeof(float): the kernel writes floats).
    float* pos;
    cudaMalloc((void**)&pos, numBlocks.x * numBlocks.y * threadsPerBlock.x * threadsPerBlock.y * sizeof(float));

    unsigned int hTimer;
    cutCreateTimer(&hTimer);
    cutResetTimer(hTimer);
    cutStartTimer(hTimer);

    // GPU work - asynchronous launch (this takes around 530 ms on my machine).
    CUDA_Long_Kernel(threadsPerBlock, numBlocks, stream, pos);

    // CPU work - just waste some time (this takes around 93 ms on my machine).
    int x = 0;
    for (int i = 0; i < 5000000; i++)
    {
        x = (int)pow((double)x, 2.0);
    }

    // Wait for the kernel to finish, then stop the timer and report.
    cudaThreadSynchronize();
    cutStopTimer(hTimer);
    printf("Processing time: %f msec - %i\n", cutGetTimerValue(hTimer), x);
    cutDeleteTimer(hTimer);

    cudaFree(pos);
    cudaStreamDestroy(stream);

    getchar();
    return 0;
}

If I execute the CPU work alone it takes about 93 ms, and the GPU work alone takes around 530 ms (on a Core i7 930 with a GeForce GTX 470). When I run the example above, the code takes about 630 ms, which is roughly the sum of the two. Shouldn't the GPU work hide the work being carried out on the CPU? That is, I would expect the total runtime to be somewhere around 530 ms. Can anyone please shed some light on why this is happening?
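
For what it's worth, here is one way to check the overlap directly. This is a sketch rather than part of the code above; it reuses the stream, launch wrapper, and buffer from main.cpp, and the event variable names are mine. It times the kernel on the GPU with CUDA events so that the kernel time can be compared against the wall-clock total:

cudaEvent_t kernelStart, kernelStop;
cudaEventCreate(&kernelStart);
cudaEventCreate(&kernelStop);

cudaEventRecord(kernelStart, stream);   // queued in the stream just before the kernel
CUDA_Long_Kernel(threadsPerBlock, numBlocks, stream, pos);
cudaEventRecord(kernelStop, stream);    // queued in the stream just after the kernel

// ... CPU busy loop goes here, as in main.cpp above ...

cudaEventSynchronize(kernelStop);       // wait until the kernel has finished
float kernelMs = 0.0f;
cudaEventElapsedTime(&kernelMs, kernelStart, kernelStop);

// If the wall-clock total is close to max(kernelMs, CPU time), the work overlapped;
// if it is close to kernelMs + CPU time, it did not.

cudaEventDestroy(kernelStart);
cudaEventDestroy(kernelStop);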

Thanks

Steven

Are you compiling in Debug mode? There are reports that this disables the asynchronous launch (as does using the profiler).

No, I'm compiling in Release mode with no profiler attached. The kernel launch is technically asynchronous: if I put a printf after the CPU code and a printf in the CUDA kernel, the CPU printf shows up first. The problem is that the device apparently waits for the CPU to finish before it starts executing the kernel.
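
To pin down where the host actually spends its time, something like the following sketch could help. It is not part of my original code; it reuses the launch wrapper and variables from main.cpp and the same cutil timer calls, and times the launch call and the synchronize separately:

unsigned int t;
cutCreateTimer(&t);

cutStartTimer(t);
CUDA_Long_Kernel(threadsPerBlock, numBlocks, stream, pos);
cutStopTimer(t);
printf("launch call: %f ms\n", cutGetTimerValue(t));   // should be well under 1 ms if the launch is asynchronous

// ... the CPU busy loop from main.cpp goes here ...

cutResetTimer(t);
cutStartTimer(t);
cudaThreadSynchronize();
cutStopTimer(t);
printf("sync wait:   %f ms\n", cutGetTimerValue(t));   // roughly (kernel time - CPU time) if the kernel overlapped
                                                       // the CPU work, roughly the full kernel time if it only
                                                       // started once the CPU loop was done
cutDeleteTimer(t);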

Steven

I also tried executing the simpleStreams example provided in the SDK. As you can see in the attached image, something is terribly wrong: the time taken when using streams is practically identical to the time taken without streams. Can anyone please shed some light on what is wrong?
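
For reference, the pattern simpleStreams is meant to demonstrate looks roughly like the sketch below (variable names and sizes here are illustrative, not the SDK sample's own): pinned host memory from cudaMallocHost plus cudaMemcpyAsync issued into several streams, so that copies and kernels in different streams can overlap. If the host buffer is pageable rather than pinned, the async copies effectively become synchronous and the streamed and non-streamed times come out about the same.

const int nStreams = 4;
const int n = 1 << 20;                         // elements per chunk (example size)
cudaStream_t streams[nStreams];
int *h_data, *d_data;

cudaMallocHost((void**)&h_data, nStreams * n * sizeof(int));   // pinned host memory, required for real async copies
cudaMalloc((void**)&d_data, nStreams * n * sizeof(int));
for (int i = 0; i < nStreams; i++)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < nStreams; i++)
{
    cudaMemcpyAsync(d_data + i * n, h_data + i * n, n * sizeof(int),
                    cudaMemcpyHostToDevice, streams[i]);
    // a kernel processing chunk i would be launched into streams[i] here
}
cudaThreadSynchronize();                       // wait for all streams to finish

for (int i = 0; i < nStreams; i++)
    cudaStreamDestroy(streams[i]);
cudaFreeHost(h_data);
cudaFree(d_data);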

Thanks in advance,

Steven
simplestream.png

Try launching the kernel once before starting the timer; maybe there is some CPU-bound one-time initialization going on.
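
Something along these lines, reusing the launch wrapper and timer from the code above (a sketch, not tested):

// Warm-up: launch the kernel once and wait for it, so that any one-time
// context/module initialization happens outside the timed region.
CUDA_Long_Kernel(threadsPerBlock, numBlocks, stream, pos);
cudaThreadSynchronize();

cutResetTimer(hTimer);
cutStartTimer(hTimer);
// ... timed launch + CPU work, exactly as before ...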
