ASYNCHRONOUS CALLS: how do we figure this out?

Hi all,
We are using an 8800 GTX. Some say it does not support asynchronous calls, but our experiments suggest that it does, and the asyncAPI sample in the SDK's projects directory runs fine.

Can anybody please clarify:
1) How to use these asynchronous calls with events? We got wrong results with this.
2) How to calculate GPU time and CPU time?
Thanks for your reply,
Dev

The 8800 GTX does support async calls, and I believe the asyncAPI demo contains timing examples. cudaEventRecord is used for measuring GPU time; the CUT_TIMER_* utilities (as in the bandwidthTest example) measure CPU time.

What the 8800 GTX does not support is transferring page-locked memory to/from the device while the device is executing a kernel.
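As a rough sketch of the two timing approaches (the kernel, sizes and names below are just placeholders, and I'm using gettimeofday on the CPU side instead of the CUT_TIMER_* cutil macros):

#include <stdio.h>
#include <sys/time.h>

__global__ void dummyKernel(float *d, int n)
    {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;
    }

int main()
    {
    int n = 1 << 20;
    float *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    // GPU time: events are recorded into the GPU's command stream,
    // so they bracket only the device-side work.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // CPU time: plain wall-clock timer on the host side.
    timeval cpu_start, cpu_end;
    gettimeofday(&cpu_start, NULL);

    cudaEventRecord(start, 0);
    dummyKernel<<<n / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);          // wait until the GPU reaches 'stop'

    gettimeofday(&cpu_end, NULL);

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);
    long cpu_us = (cpu_end.tv_sec - cpu_start.tv_sec) * 1000000L
                + (cpu_end.tv_usec - cpu_start.tv_usec);

    printf("GPU time: %f ms, CPU (wall) time: %ld us\n", gpu_ms, cpu_us);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
    }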

DenisR is correct.

Paulius

To add a little more to what Denis said:

  1. When transferring from/to page-locked memory, the cudaMemcpy function offloads the job to the GPU; the GPU DMAs the data in/out of host memory (the GPU does the transfer).

  2. In the normal (pageable) case, the cudaMemcpy path copies the data itself (the CPU is involved in the transfer).

That's the difference.

Indeed. In the case of non-page-locked memory, CUDA uses a pre-allocated DMA buffer, copies your data there, initiates a DMA transfer, synchronizes, then copies the next chunk… and so on. This is much more involved than a single fire-and-forget DMA operation, which is what happens when the memory is page-locked.

Both the 8800 and later cards support this kind of asynchronous operation; the only thing that changed with later cards is that they can overlap DMA and kernel execution.
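As a rough sketch of the difference (the buffer size, stream usage and names below are my own choices, just for illustration):

#include <stdio.h>
#include <stdlib.h>

int main()
    {
    size_t bytes = 16 * 1024 * 1024;
    float *d_buf;
    cudaMalloc((void**)&d_buf, bytes);

    // Pageable host memory: cudaMemcpy stages the data through an internal
    // DMA buffer chunk by chunk, and the copy cannot be truly asynchronous.
    float *h_pageable = (float*)malloc(bytes);
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);

    // Page-locked (pinned) host memory: the GPU can DMA directly from it,
    // so the copy is a single fire-and-forget operation and cudaMemcpyAsync
    // can return before the transfer has finished.
    float *h_pinned;
    cudaMallocHost((void**)&h_pinned, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);

    // ... CPU work could go here; note that on the 8800 GTX the copy will
    // not overlap with a running kernel, only with host code ...

    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    cudaFree(d_buf);
    return 0;
    }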

I’m having a strange experience with asynchronous kernel launches on the 8800 GTX. When I make two different kernel launches consecutively in a stream, only the last launch behaves asynchronously. For example, say I launch two kernels in a stream that take 1 s and 2 s respectively to finish, and afterwards I do some CPU computation that takes 5 s. I expect the whole thing to take 5 s (the 3 s of GPU computation running in parallel with the CPU computation). However, it actually takes 6 s: only the second kernel launch (2 s) runs concurrently with the CPU computation.

This only happens when the two kernels are different. If I use the same kernel (changing only the grid/block size), it works as expected. Any explanation?
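Roughly, the pattern I mean looks like this (kernelA, kernelB, cpu_work and the grid sizes stand in for my real code; the 1 s / 2 s / 5 s durations above are only illustrative):

#include <stdio.h>
#include <sys/time.h>

__global__ void kernelA(float *d) { /* ~1 s of GPU work */ }
__global__ void kernelB(float *d) { /* ~2 s of GPU work */ }

void cpu_work() { /* ~5 s of host-side computation */ }

int main()
    {
    float *d_data;
    cudaMalloc((void**)&d_data, 1024 * sizeof(float));

    timeval t0, t1;
    gettimeofday(&t0, NULL);

    // Both launches should return immediately and queue up on the GPU...
    kernelA<<<128, 256>>>(d_data);
    kernelB<<<128, 256>>>(d_data);

    // ...so this should overlap with all 3 s of GPU work.
    cpu_work();

    cudaThreadSynchronize();       // wait for whatever GPU work is left

    gettimeofday(&t1, NULL);
    long us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
    printf("total: %ld us (I expect ~5 s, but I see ~6 s)\n", us);

    cudaFree(d_data);
    return 0;
    }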

But when I tried to query the device attribute with the following call,

cuDeviceGetAttribute( &Value, CU_DEVICE_ATTRIBUTE_GPU_OVERLAP, cuDevice );

the returned value is 1, which means the device supposedly DOES support copying memory between host and device concurrently with kernel execution (I’m using a GeForce 8800 GTX). So, does cuDeviceGetAttribute give the wrong result?
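For reference, the full query I'm running is roughly this (driver API, device 0 assumed, error checking omitted):

#include <stdio.h>
#include <cuda.h>

int main()
    {
    CUdevice cuDevice;
    int value = 0;

    // Initialize the driver API and pick device 0.
    cuInit(0);
    cuDeviceGet(&cuDevice, 0);

    // 1 means the device reports it can overlap host<->device copies
    // with kernel execution.
    cuDeviceGetAttribute(&value, CU_DEVICE_ATTRIBUTE_GPU_OVERLAP, cuDevice);

    printf("CU_DEVICE_ATTRIBUTE_GPU_OVERLAP = %d\n", value);
    return 0;
    }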

This is apparently a known driver bug. Check out this entry here: http://forums.nvidia.com/index.php?showtop…65&#entry300365

Thanks a lot for this useful information :)

The kernels return control to your main thread very quickly, stacking themselves in a little queue on the GPU. So it should be 5 seconds if it’s the same kernel.

However, if it’s two different kernels, I believe there’s an implicit synchronization so the GPU can load the new kernel information.

Thus:

Kernel1<<<>>>()
Kernel2<<<>>>()
CPU_Computation()

is really

Kernel1<<<>>>()
cudaThreadSynchronize()   // blocks until Kernel1 finishes
Kernel2<<<>>>()
CPU_Computation()

I don’t remember whether this is solved by warming up both kernels first or not… In other words, whether doing the following would change your timing results.

// warm up
Kernel1<<<>>>
Kernel2<<<>>>
cudaThreadSynchronize()

// time results
StartTimer()
Kernel1<<<>>>
Kernel2<<<>>>
cudaThreadSynchronize()
EndTimer()

You can perform 16 kernel calls in a row before there is an implicit cudaThreadSynchronize().

Yes, 16 queue slots for G80 in my testing.

For the G92 cards the queue seems to be 24 slots deep.

Cheers,

John

Clearly I was wrong about the implicit Synchronize, then. How to explain the curious results of the poster I was replying to, though?

OK, I guess I didn’t answer tt3346’s situation explicitly. The queue depth is 16 kernels, even if two different kernels are called. I verified it with this simple Linux test program.

#include <stdio.h>
#include <sys/time.h>

__global__ void kernel_a(int *data)
    {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    int a = data[idx];
    for (int i = 0; i < 1000; i++)
        a = a * 25;
    data[idx] = a;
    }

__global__ void kernel_b(int *data)
    {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    int a = data[idx];
    for (int i = 0; i < 1000; i++)
        a = a * 50;
    data[idx] = a;
    }

int main()
    {
    int nblocks = 1000;
    int block_size = 256;

    // allocate data
    int *data;
    cudaMalloc((void**)&data, nblocks*block_size*sizeof(int));

    // warm up
    kernel_a<<<nblocks, block_size>>>(data);
    kernel_b<<<nblocks, block_size>>>(data);
    cudaThreadSynchronize();

    // print timings: all kernel_a
    timeval start;
    gettimeofday(&start, NULL);
    for (int i = 1; i < 25; i++)
        {
        kernel_a<<<nblocks, block_size>>>(data);

        timeval end;
        gettimeofday(&end, NULL);
        int diff = (end.tv_sec - start.tv_sec)*1000000 + (end.tv_usec - start.tv_usec);
        printf("kernel_a / i=%d / Time: %d us\n", i, diff);
        }

    // print timings: alternating kernel_a and kernel_b
    cudaThreadSynchronize();
    gettimeofday(&start, NULL);
    for (int i = 1; i < 15; i++)
        {
        kernel_a<<<nblocks, block_size>>>(data);
        kernel_b<<<nblocks, block_size>>>(data);

        timeval end;
        gettimeofday(&end, NULL);
        int diff = (end.tv_sec - start.tv_sec)*1000000 + (end.tv_usec - start.tv_usec);
        printf("kernel_a&b / i=%d / Time: %d us\n", i, diff);
        }

    cudaFree(data);
    return 0;
    }

And I got the following output, as expected for a queue depth of 16 (or is it 17?) kernel calls, in both test cases.

kernel_a / i=1 / Time: 16 us
kernel_a / i=2 / Time: 254 us
kernel_a / i=3 / Time: 341 us
kernel_a / i=4 / Time: 424 us
kernel_a / i=5 / Time: 493 us
kernel_a / i=6 / Time: 559 us
kernel_a / i=7 / Time: 625 us
kernel_a / i=8 / Time: 691 us
kernel_a / i=9 / Time: 757 us
kernel_a / i=10 / Time: 827 us
kernel_a / i=11 / Time: 894 us
kernel_a / i=12 / Time: 960 us
kernel_a / i=13 / Time: 1025 us
kernel_a / i=14 / Time: 1092 us
kernel_a / i=15 / Time: 1158 us
kernel_a / i=16 / Time: 1224 us
kernel_a / i=17 / Time: 1290 us
kernel_a / i=18 / Time: 10244 us
kernel_a / i=19 / Time: 20454 us
kernel_a / i=20 / Time: 30691 us
kernel_a / i=21 / Time: 40903 us
kernel_a / i=22 / Time: 51111 us
kernel_a / i=23 / Time: 61319 us
kernel_a / i=24 / Time: 71540 us
kernel_a&b / i=1 / Time: 35 us
kernel_a&b / i=2 / Time: 132 us
kernel_a&b / i=3 / Time: 224 us
kernel_a&b / i=4 / Time: 316 us
kernel_a&b / i=5 / Time: 408 us
kernel_a&b / i=6 / Time: 499 us
kernel_a&b / i=7 / Time: 591 us
kernel_a&b / i=8 / Time: 675 us
kernel_a&b / i=9 / Time: 10245 us
kernel_a&b / i=10 / Time: 30681 us
kernel_a&b / i=11 / Time: 51097 us
kernel_a&b / i=12 / Time: 71564 us
kernel_a&b / i=13 / Time: 92011 us
kernel_a&b / i=14 / Time: 112429 us