ASYNCHRONOUS CALLS: how do we figure this out?

Hi all,
We are using an 8800 GTX. Some say it does not support asynchronous calls, but our experiments suggest that it does, and the asyncAPI sample in the SDK's projects directory runs fine.

Can anybody please clarify:
1) How to use these asynchronous calls with events? We got wrong results with this.
2) How to calculate GPU time and CPU time?
Thanks for your reply,
Dev

The 8800 GTX does support async calls, and I believe the asyncAPI demo contains timing examples. cudaEventRecord is used for measuring GPU time; the CUT_TIMER_* utilities (as in the bandwidthTest example) measure CPU time.

What the 8800 GTX does not support is transferring page-locked memory to/from the device while the device is executing a kernel.
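As a rough sketch of the two timing approaches (the kernel, sizes and names below are just placeholders, and I'm using gettimeofday on the CPU side instead of the CUT_TIMER_* cutil macros):

#include <stdio.h>
#include <sys/time.h>

__global__ void dummyKernel(float *d, int n)
    {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;
    }

int main()
    {
    int n = 1 << 20;
    float *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    // GPU time: events are recorded into the GPU's command stream,
    // so they bracket only the device-side work.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // CPU time: plain wall-clock timer on the host side.
    timeval cpu_start, cpu_end;
    gettimeofday(&cpu_start, NULL);

    cudaEventRecord(start, 0);
    dummyKernel<<<n / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);          // wait until the GPU reaches 'stop'

    gettimeofday(&cpu_end, NULL);

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);
    long cpu_us = (cpu_end.tv_sec - cpu_start.tv_sec) * 1000000L
                + (cpu_end.tv_usec - cpu_start.tv_usec);

    printf("GPU time: %f ms, CPU (wall) time: %ld us\n", gpu_ms, cpu_us);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
    }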

DenisR is correct.

Paulius

To add a little more to what Denis said:

  1. When transferring from/to page-locked memory, the cudaMemcpy function offloads the job to the GPU; the GPU DMAs the data in/out of host memory (the GPU does the transfer).

  2. In the normal (pageable) case, the cudaMemcpy path copies the data itself (the CPU is involved in the transfer).

That's the difference.

Indeed. In the case of non-page-locked memory, CUDA uses a pre-allocated DMA buffer, copies your data there, initiates a DMA transfer, synchronizes, then copies the next chunk… and so on. This is much more involved than a single fire-and-forget DMA operation, which is what happens when the memory is page-locked.

Both the 8800 and later cards support this kind of asynchronous operation; the only thing that changed with later cards is that they can overlap DMA and kernel execution.
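As a rough sketch of the difference (the buffer size, stream usage and names below are my own choices, just for illustration):

#include <stdio.h>
#include <stdlib.h>

int main()
    {
    size_t bytes = 16 * 1024 * 1024;
    float *d_buf;
    cudaMalloc((void**)&d_buf, bytes);

    // Pageable host memory: cudaMemcpy stages the data through an internal
    // DMA buffer chunk by chunk, and the copy cannot be truly asynchronous.
    float *h_pageable = (float*)malloc(bytes);
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);

    // Page-locked (pinned) host memory: the GPU can DMA directly from it,
    // so the copy is a single fire-and-forget operation and cudaMemcpyAsync
    // can return before the transfer has finished.
    float *h_pinned;
    cudaMallocHost((void**)&h_pinned, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);

    // ... CPU work could go here; note that on the 8800 GTX the copy will
    // not overlap with a running kernel, only with host code ...

    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    cudaFree(d_buf);
    return 0;
    }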

I’m having a strange experience with asynchronous kernel launches on the 8800 GTX. When I make two different kernel launches consecutively in a stream, only the last launch behaves asynchronously. For example, say I launch two kernels in a stream that take 1 s and 2 s respectively to finish, and afterwards I do some CPU computation that takes 5 s. I expect the whole thing to take 5 s (the 3 s of GPU computation running in parallel with the CPU computation). However, it actually takes 6 s: only the second kernel launch (2 s) runs concurrently with the CPU computation.

This only happens when the two kernels are different. If I use the same kernel (changing only the grid/block size), it works as expected. Any explanation?
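Roughly, the pattern I mean looks like this (kernelA, kernelB, cpu_work and the grid sizes stand in for my real code; the 1 s / 2 s / 5 s durations above are only illustrative):

#include <stdio.h>
#include <sys/time.h>

__global__ void kernelA(float *d) { /* ~1 s of GPU work */ }
__global__ void kernelB(float *d) { /* ~2 s of GPU work */ }

void cpu_work() { /* ~5 s of host-side computation */ }

int main()
    {
    float *d_data;
    cudaMalloc((void**)&d_data, 1024 * sizeof(float));

    timeval t0, t1;
    gettimeofday(&t0, NULL);

    // Both launches should return immediately and queue up on the GPU...
    kernelA<<<128, 256>>>(d_data);
    kernelB<<<128, 256>>>(d_data);

    // ...so this should overlap with all 3 s of GPU work.
    cpu_work();

    cudaThreadSynchronize();       // wait for whatever GPU work is left

    gettimeofday(&t1, NULL);
    long us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
    printf("total: %ld us (I expect ~5 s, but I see ~6 s)\n", us);

    cudaFree(d_data);
    return 0;
    }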

But when I tried to query the device attribute with the following call,

cuDeviceGetAttribute( &Value, CU_DEVICE_ATTRIBUTE_GPU_OVERLAP, cuDevice );

the returned value is 1, which means the device supposedly DOES support copying memory between host and device concurrently with kernel execution (I’m using a GeForce 8800 GTX). So, does cuDeviceGetAttribute give the wrong result?
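For reference, the full query I'm running is roughly this (driver API, device 0 assumed, error checking omitted):

#include <stdio.h>
#include <cuda.h>

int main()
    {
    CUdevice cuDevice;
    int value = 0;

    // Initialize the driver API and pick device 0.
    cuInit(0);
    cuDeviceGet(&cuDevice, 0);

    // 1 means the device reports it can overlap host<->device copies
    // with kernel execution.
    cuDeviceGetAttribute(&value, CU_DEVICE_ATTRIBUTE_GPU_OVERLAP, cuDevice);

    printf("CU_DEVICE_ATTRIBUTE_GPU_OVERLAP = %d\n", value);
    return 0;
    }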

This is apparently a known driver bug. Check out this entry here: http://forums.nvidia.com/index.php?showtop…65&#entry300365

Thanks a lot for this useful information :)

The kernels return control to your main thread very quickly, stacking themselves in a little queue on the GPU. So it should be 5 seconds if it’s the same kernel.

However, if it’s two different kernels, I believe there’s an implicit synchronization so the GPU can load the new kernel information.

Thus:

Kernel1<<<>>>()
Kernel2<<<>>>()
CPU_Computation()

is really

Kernel1<<<>>>()
cudaThreadSynchronize()   // blocks until Kernel1 finishes
Kernel2<<<>>>()
CPU_Computation()

I don’t remember whether this is solved by warming up both kernels first or not… In other words, whether doing the following would change your timing results.

// warm up
Kernel1<<<>>>
Kernel2<<<>>>
cudaThreadSynchronize()

// time results
StartTimer()
Kernel1<<<>>>
Kernel2<<<>>>
cudaThreadSynchronize()
EndTimer()

You can perform 16 kernel calls in a row before there is an implicit cudaThreadSynchronize().

Yes, 16 queue slots for G80 in my testing.

For the G92 cards the queue seems to be 24 slots deep.

Cheers,

John

Clearly I was wrong about the implicit Synchronize, then. How to explain the curious results of the poster I was replying to, though?

OK, I guess I didn’t answer tt3346’s situation explicitly. The queue depth is 16 kernels, even if two different kernels are called. I verified it with this simple Linux test program.

#include <stdio.h>
#include <sys/time.h>

__global__ void kernel_a(int *data)
    {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    int a = data[idx];
    for (int i = 0; i < 1000; i++)
        a = a * 25;
    data[idx] = a;
    }

__global__ void kernel_b(int *data)
    {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    int a = data[idx];
    for (int i = 0; i < 1000; i++)
        a = a * 50;
    data[idx] = a;
    }

int main()
    {
    int nblocks = 1000;
    int block_size = 256;

    // allocate data
    int *data;
    cudaMalloc((void**)&data, nblocks*block_size*sizeof(int));

    // warm up
    kernel_a<<<nblocks, block_size>>>(data);
    kernel_b<<<nblocks, block_size>>>(data);
    cudaThreadSynchronize();

    // print timings: all kernel_a
    timeval start;
    gettimeofday(&start, NULL);
    for (int i = 1; i < 25; i++)
        {
        kernel_a<<<nblocks, block_size>>>(data);

        timeval end;
        gettimeofday(&end, NULL);
        int diff = (end.tv_sec - start.tv_sec)*1000000 + (end.tv_usec - start.tv_usec);
        printf("kernel_a / i=%d / Time: %d us\n", i, diff);
        }

    // print timings: alternating kernel_a and kernel_b
    cudaThreadSynchronize();
    gettimeofday(&start, NULL);
    for (int i = 1; i < 15; i++)
        {
        kernel_a<<<nblocks, block_size>>>(data);
        kernel_b<<<nblocks, block_size>>>(data);

        timeval end;
        gettimeofday(&end, NULL);
        int diff = (end.tv_sec - start.tv_sec)*1000000 + (end.tv_usec - start.tv_usec);
        printf("kernel_a&b / i=%d / Time: %d us\n", i, diff);
        }

    cudaFree(data);
    return 0;
    }

And I got the following output, as expected for a queue depth of 16 (or is it 17?) kernel calls, in both test cases.

kernel_a / i=1 / Time: 16 us
kernel_a / i=2 / Time: 254 us
kernel_a / i=3 / Time: 341 us
kernel_a / i=4 / Time: 424 us
kernel_a / i=5 / Time: 493 us
kernel_a / i=6 / Time: 559 us
kernel_a / i=7 / Time: 625 us
kernel_a / i=8 / Time: 691 us
kernel_a / i=9 / Time: 757 us
kernel_a / i=10 / Time: 827 us
kernel_a / i=11 / Time: 894 us
kernel_a / i=12 / Time: 960 us
kernel_a / i=13 / Time: 1025 us
kernel_a / i=14 / Time: 1092 us
kernel_a / i=15 / Time: 1158 us
kernel_a / i=16 / Time: 1224 us
kernel_a / i=17 / Time: 1290 us
kernel_a / i=18 / Time: 10244 us
kernel_a / i=19 / Time: 20454 us
kernel_a / i=20 / Time: 30691 us
kernel_a / i=21 / Time: 40903 us
kernel_a / i=22 / Time: 51111 us
kernel_a / i=23 / Time: 61319 us
kernel_a / i=24 / Time: 71540 us
kernel_a&b / i=1 / Time: 35 us
kernel_a&b / i=2 / Time: 132 us
kernel_a&b / i=3 / Time: 224 us
kernel_a&b / i=4 / Time: 316 us
kernel_a&b / i=5 / Time: 408 us
kernel_a&b / i=6 / Time: 499 us
kernel_a&b / i=7 / Time: 591 us
kernel_a&b / i=8 / Time: 675 us
kernel_a&b / i=9 / Time: 10245 us
kernel_a&b / i=10 / Time: 30681 us
kernel_a&b / i=11 / Time: 51097 us
kernel_a&b / i=12 / Time: 71564 us
kernel_a&b / i=13 / Time: 92011 us
kernel_a&b / i=14 / Time: 112429 us