Hi All
We are using an 8800 GTX. Some say it does not support asynchronous calls, but when we experimented it appeared to support them, and the asyncAPI sample in the SDK projects runs fine.
Could anybody please clarify:
1) How to use asynchronous calls with events? We got wrong results with this.
2) How to calculate GPU time and CPU time?
Thanks for your reply
Dev
The 8800 GTX does support async calls, and I believe the asyncAPI demo contains timing examples. cudaEventRecord is used for measuring GPU time, and the CUT_TIMER_* macros (as in the bandwidthTest example) measure CPU time.
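For reference, event-based GPU timing looks roughly like this (a minimal sketch; my_kernel, grid, block, and d_data are placeholders, and error checking is omitted):

```cuda
// Sketch of GPU timing with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);          // enqueue start marker in stream 0
my_kernel<<<grid, block>>>(d_data); // asynchronous kernel launch
cudaEventRecord(stop, 0);           // enqueue stop marker after the kernel
cudaEventSynchronize(stop);         // block until the stop event has occurred

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop); // GPU time in milliseconds
printf("kernel took %f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Since the events are recorded on the GPU itself, this measures device execution time rather than the (near-zero) time the asynchronous launch takes on the CPU.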
What the 8800 GTX does not support is transferring page-locked memory to/from the device while the device is executing a kernel.
When transferring memory to/from page-locked host memory, the cudaMemcpy function offloads the job to the GPU: the GPU DMAs the data in/out of host memory (the GPU does the transfer).
With ordinary pageable memory, the cudaMemcpy function has to copy the data itself (the CPU is involved in the transfer).
Indeed. In the case of non-page-locked memory, CUDA uses a pre-allocated DMA buffer: it copies your data there, initiates a DMA transfer, synchronizes, copies the next chunk, and so on. This is much more involved than a single fire-and-forget DMA operation, which is what happens when the memory is page-locked.
Both the 8800 and later cards support this kind of asynchronous operation; what changed with later cards is that they can also overlap DMA with kernel execution.
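The page-locked path described above looks roughly like this (a sketch with error checking omitted; my_kernel, blocks, and threads are placeholders; note that on an 8800 the copy and the kernel will still run back-to-back rather than overlapping):

```cuda
// Minimal sketch of an asynchronous copy from a page-locked host buffer.
const size_t N = 1 << 20;
float *h_pinned, *d_data;
cudaMallocHost((void**)&h_pinned, N * sizeof(float)); // page-locked host buffer
cudaMalloc((void**)&d_data, N * sizeof(float));

cudaStream_t stream;
cudaStreamCreate(&stream);

// Single fire-and-forget DMA: control returns to the CPU immediately.
cudaMemcpyAsync(d_data, h_pinned, N * sizeof(float),
                cudaMemcpyHostToDevice, stream);
my_kernel<<<blocks, threads, 0, stream>>>(d_data);
cudaStreamSynchronize(stream); // wait for the copy and the kernel

cudaStreamDestroy(stream);
cudaFreeHost(h_pinned);
cudaFree(d_data);
```

With pageable memory the same cudaMemcpyAsync call would fall back to the chunked, CPU-assisted staging described above.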
I’m having a strange experience with asynchronous kernel launches on the 8800 GTX. When I make two different kernel launches consecutively on a stream, only the last launch is asynchronous. For example, say I launch two kernels in a stream that take 1 s and 2 s respectively to finish, and after that I do some CPU computation that takes 5 s. What I expect is that the whole thing takes 5 s (the 3 s of GPU computation running in parallel with the CPU computation). However, it actually takes 6 s: only the second kernel launch (2 s) runs concurrently with the CPU computation.
This only happens when the two kernels are different. If I use the same kernel (changing only the grid/block size) then it works as expected. Any explanation?
The return value is 1, meaning the device DOES support concurrently copying memory between host and device while executing a kernel (I’m using a GeForce 8800 GTX). So does the cuDeviceGetAttribute function give the wrong result?
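For anyone wanting to reproduce the query, a minimal sketch using the runtime API equivalent (the driver-API route is cuDeviceGetAttribute with CU_DEVICE_ATTRIBUTE_GPU_OVERLAP; device 0 is assumed here):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    // deviceOverlap mirrors CU_DEVICE_ATTRIBUTE_GPU_OVERLAP:
    // 1 if the device can copy memory while executing a kernel.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("deviceOverlap = %d\n", prop.deviceOverlap);
    return 0;
}
```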
The kernels return control to your main thread very quickly, stacking themselves in a little queue on the GPU. So it should be 5 seconds if it’s the same kernel.
However, if it’s two kernels, I believe there’s an implicit synchronization for the GPU to load the new Kernel information.
Thus:
Kernel1<<<>>>()
Kernel2<<<>>>()
CPU_Computation()
is really
Kernel1<<<>>>()
cudaThreadSynchronize()   // implicit: blocks until Kernel1 finishes
Kernel2<<<>>>()
CPU_Computation()
I don’t remember whether this is solved by warming up both kernels first, i.e. whether launching each kernel once beforehand changes your timing results.
OK, I guess I didn’t answer tt3346’s situation explicitly. The queue depth is 16 kernels, even when two different kernels are called. I verified it with this simple Linux test program.
#include <stdio.h>
#include <sys/time.h>

__global__ void kernel_a(int *data)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    int a = data[idx];
    for (int i = 0; i < 1000; i++)
        a = a * 25;
    data[idx] = a;
}

__global__ void kernel_b(int *data)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    int a = data[idx];
    for (int i = 0; i < 1000; i++)
        a = a * 50;
    data[idx] = a;
}

int main()
{
    int nblocks = 1000;
    int block_size = 256;

    // allocate data
    int *data;
    cudaMalloc((void**)&data, nblocks * block_size * sizeof(int));

    // warm up
    kernel_a<<<nblocks, block_size>>>(data);
    kernel_b<<<nblocks, block_size>>>(data);
    cudaThreadSynchronize();

    // print timings: all kernel_a
    timeval start;
    gettimeofday(&start, NULL);
    for (int i = 1; i < 25; i++)
    {
        kernel_a<<<nblocks, block_size>>>(data);
        timeval end;
        gettimeofday(&end, NULL);
        // elapsed time in microseconds (1 s = 1,000,000 us)
        int diff = (end.tv_sec - start.tv_sec) * 1000000 + (end.tv_usec - start.tv_usec);
        printf("kernel_a / i=%d / Time: %d us\n", i, diff);
    }

    // print timings: alternating a & b
    cudaThreadSynchronize();
    gettimeofday(&start, NULL);
    for (int i = 1; i < 15; i++)
    {
        kernel_a<<<nblocks, block_size>>>(data);
        kernel_b<<<nblocks, block_size>>>(data);
        timeval end;
        gettimeofday(&end, NULL);
        int diff = (end.tv_sec - start.tv_sec) * 1000000 + (end.tv_usec - start.tv_usec);
        printf("kernel_a&b / i=%d / Time: %d us\n", i, diff);
    }

    return 0;
}
And I got the following output, as expected for a queue depth of 16 (or should that be 17?) kernel calls in both test cases.
kernel_a / i=1 / Time: 16 us
kernel_a / i=2 / Time: 254 us
kernel_a / i=3 / Time: 341 us
kernel_a / i=4 / Time: 424 us
kernel_a / i=5 / Time: 493 us
kernel_a / i=6 / Time: 559 us
kernel_a / i=7 / Time: 625 us
kernel_a / i=8 / Time: 691 us
kernel_a / i=9 / Time: 757 us
kernel_a / i=10 / Time: 827 us
kernel_a / i=11 / Time: 894 us
kernel_a / i=12 / Time: 960 us
kernel_a / i=13 / Time: 1025 us
kernel_a / i=14 / Time: 1092 us
kernel_a / i=15 / Time: 1158 us
kernel_a / i=16 / Time: 1224 us
kernel_a / i=17 / Time: 1290 us
kernel_a / i=18 / Time: 10244 us
kernel_a / i=19 / Time: 20454 us
kernel_a / i=20 / Time: 30691 us
kernel_a / i=21 / Time: 40903 us
kernel_a / i=22 / Time: 51111 us
kernel_a / i=23 / Time: 61319 us
kernel_a / i=24 / Time: 71540 us
kernel_a&b / i=1 / Time: 35 us
kernel_a&b / i=2 / Time: 132 us
kernel_a&b / i=3 / Time: 224 us
kernel_a&b / i=4 / Time: 316 us
kernel_a&b / i=5 / Time: 408 us
kernel_a&b / i=6 / Time: 499 us
kernel_a&b / i=7 / Time: 591 us
kernel_a&b / i=8 / Time: 675 us
kernel_a&b / i=9 / Time: 10245 us
kernel_a&b / i=10 / Time: 30681 us
kernel_a&b / i=11 / Time: 51097 us
kernel_a&b / i=12 / Time: 71564 us
kernel_a&b / i=13 / Time: 92011 us
kernel_a&b / i=14 / Time: 112429 us