Can anybody explain to me why cudaMallocHost takes so much time compared to malloc?
Is it because the memory locations have to be contiguous?
What is the series of operations in a cudaMallocHost call?
Thank you for your time
Apostolis Glenis
Just to check something: Can you run two cudaMallocHost() functions in a row and time only the second call? I wonder if you are also seeing the CUDA context creation overhead in the first call.
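A minimal sketch of that experiment (the timer and buffer size are just illustrative):
#include <sys/time.h>
#include <assert.h>
#include <stdio.h>
#include <cuda_runtime.h>

static double now(void)
{
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + (double)tp.tv_usec * 1.0e-6;
}

int main(void)
{
    float *a = 0, *b = 0;
    const size_t bytes = 64 << 20; // 64 MiB per buffer

    // The first CUDA call pays for context creation on top of the allocation.
    double t0 = now();
    assert( cudaMallocHost(&a, bytes) == cudaSuccess );
    double t1 = now();

    // The second call should show only the cost of allocating and pinning.
    assert( cudaMallocHost(&b, bytes) == cudaSuccess );
    double t2 = now();

    printf("first  cudaMallocHost: %f s\n", t1 - t0);
    printf("second cudaMallocHost: %f s\n", t2 - t1);

    cudaFreeHost(a);
    cudaFreeHost(b);
    return 0;
}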
I hadn't thought of that.
Thank you.
But I would also like an answer to my other question, because according to my tests you need approximately a 1 GB dataset to overcome the overhead of cudaHostAlloc.
Apostolis
Something is seriously wrong with either your tests or your CUDA system if that is the case. My testing on 64-bit Linux running the CUDA 4.0 release version shows a single cudaMallocHost or cudaHostAlloc call taking less than a millisecond to complete. You are not seriously suggesting that it requires 1 GB of transfers to amortize 700 microseconds of host API latency, are you?
For reference, this is the code I used to benchmark the calls:
#include <sys/time.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>

double systimer(void)
{
    struct timeval tp;
    register double result = 0.0;

    if ( gettimeofday(&tp, NULL) != -1 )
    {
        result = (double)(tp.tv_sec);
        result += ((double)(tp.tv_usec)) * 1.0e-6;
    }
    return result;
}

int main(void)
{
    const size_t bsize = 2<<20; // 2 MiB buffer
    const int nbuffers = 10;
    float *buffers[nbuffers];

    (void)cudaFree(0); // Establish context

    // cudaMallocHost timing
    {
        double stime = systimer();
        for(int i=0; i<nbuffers; i++) {
            assert( cudaMallocHost(&buffers[i], bsize) == cudaSuccess );
        }
        double etime = systimer();
        fprintf(stdout, "%d cudaMallocHost calls of %ld bytes each cost %f\n",
                nbuffers, long(bsize), etime-stime);
        for(int i=0; i<nbuffers; i++) {
            assert( cudaFreeHost(buffers[i]) == cudaSuccess );
        }
    }

    // cudaHostAlloc timing
    {
        double stime = systimer();
        for(int i=0; i<nbuffers; i++) {
            assert( cudaHostAlloc(&buffers[i], bsize, cudaHostAllocDefault)
                    == cudaSuccess );
        }
        double etime = systimer();
        fprintf(stdout, "%d cudaHostAlloc calls of %ld bytes each cost %f\n",
                nbuffers, long(bsize), etime-stime);
        for(int i=0; i<nbuffers; i++) {
            assert( cudaFreeHost(buffers[i]) == cudaSuccess );
        }
    }

    cudaThreadExit();
}
cuda:~$ ./cudamalloc
10 cudaMallocHost calls of 2097152 bytes each cost 0.007455
10 cudaHostAlloc calls of 2097152 bytes each cost 0.007547
cuda:~$ ./cudamalloc
10 cudaMallocHost calls of 2097152 bytes each cost 0.007537
10 cudaHostAlloc calls of 2097152 bytes each cost 0.007319
cuda:~$ ./cudamalloc
10 cudaMallocHost calls of 2097152 bytes each cost 0.007265
10 cudaHostAlloc calls of 2097152 bytes each cost 0.007397
What I was getting at is that the first CUDA function call (whatever it is) in your code has to incur the overhead of context creation. If cudaMallocHost is the first CUDA call in your code, then it takes all the overhead, but all later calls will be quicker. That is different than cudaMallocHost having some intrinsic overhead whenever you use it.
I am not sure you understood where I was going with it.
Using the two files I am attaching, I got the following results:
time elapsed to allocate 128 MB using cudaMallocHost was 39.915905 milliseconds
time elapsed to clean 128 MB using cudaMallocHost was 48.702785 milliseconds
time elapsed to allocate 128 MB using Malloc was 0.011872 milliseconds
time elapsed to clean 128 MB using free was 0.008256 milliseconds
time elapsed to allocate 256 MB using cudaMallocHost was 79.764030 milliseconds
time elapsed to clean 256 MB using cudaMallocHost was 97.940384 milliseconds
time elapsed to allocate 256 MB using Malloc was 0.028352 milliseconds
time elapsed to clean 256 MB using free was 0.015264 milliseconds
time elapsed to allocate 512 MB using cudaMallocHost was 158.175903 milliseconds
time elapsed to clean 512 MB using cudaMallocHost was 195.915619 milliseconds
time elapsed to allocate 512 MB using Malloc was 0.027616 milliseconds
time elapsed to clean 512 MB using free was 0.014560 milliseconds
time elapsed to allocate 768 MB using cudaMallocHost was 241.382751 milliseconds
time elapsed to clean 768 MB using cudaMallocHost was 292.583832 milliseconds
time elapsed to allocate 768 MB using Malloc was 0.035904 milliseconds
time elapsed to clean 768 MB using free was 0.017696 milliseconds
time elapsed to allocate 1024 MB using cudaMallocHost was 316.563416 milliseconds
time elapsed to clean 1024 MB using cudaMallocHost was 391.105652 milliseconds
time elapsed to allocate 1024 MB using Malloc was 0.026240 milliseconds
time elapsed to clean 1024 MB using free was 0.015936 milliseconds
time elapsed to allocate 128 MB using Malloc and then pin was 41.733601 milliseconds
time elapsed to clean 128 MB using unpin and then free was 49.517345 milliseconds
time elapsed to allocate 256 MB using Malloc and then pin was 84.328445 milliseconds
time elapsed to clean 256 MB using unpin and then free was 99.042114 milliseconds
time elapsed to allocate 512 MB using Malloc and then pin was 168.566620 milliseconds
time elapsed to clean 512 MB using unpin and then free was 198.179459 milliseconds
time elapsed to allocate 768 MB using Malloc and then pin was 253.400162 milliseconds
time elapsed to clean 768 MB using unpin and then free was 296.367126 milliseconds
time elapsed to allocate 1024 MB using Malloc and then pin was 339.775055 milliseconds
time elapsed to clean 1024 MB using unpin and then free was 397.423126 milliseconds
time elapsed to allocate 128 MB using Malloc and then pin was 41.725792 milliseconds
time elapsed to clean 128 MB using unpin and then free was 49.381985 milliseconds
time elapsed to allocate 256 MB using Malloc and then pin was 84.429916 milliseconds
time elapsed to clean 256 MB using unpin and then free was 98.788963 milliseconds
time elapsed to allocate 512 MB using Malloc and then pin was 175.949432 milliseconds
time elapsed to clean 512 MB using unpin and then free was 197.737122 milliseconds
time elapsed to allocate 1024 MB using Malloc and then pin was 336.332123 milliseconds
time elapsed to clean 1024 MB using unpin and then free was 396.497070 milliseconds
I meant that cudaMallocHost takes so much more time than a plain malloc that, to win back that overhead through the faster transfers (which usually happen twice, H2D and then D2H), the dataset has to be bigger than roughly 700 MB (assuming one two-way transfer, as I said).
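To make the argument concrete, this is the back-of-the-envelope calculation I have in mind; the two bandwidth values are placeholders, substitute your own measured pageable and pinned cudaMemcpy bandwidths:
/* Breakeven estimate: how many GB must cross PCIe before faster pinned
   transfers win back the cudaMallocHost/cudaFreeHost overhead. */
#include <stdio.h>

int main(void)
{
    const double overhead_s      = 0.316 + 0.391; /* measured alloc + free cost for 1024 MB */
    const double bw_pageable_gbs = 1.5;           /* placeholder, GB/s */
    const double bw_pinned_gbs   = 6.0;           /* placeholder, GB/s */

    /* Seconds saved per GB moved when the buffer is pinned. */
    const double saving_per_gb = 1.0 / bw_pageable_gbs - 1.0 / bw_pinned_gbs;

    /* Total traffic needed to amortize the overhead; the dataset itself can be
       half this if every element crosses the bus twice (H2D then D2H). */
    printf("breakeven traffic: %.2f GB\n", overhead_s / saving_per_gb);
    return 0;
}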
Could anybody explain why the CUDA runtime takes that much time to pin a region of memory?
Thank you in advance
Apostolis
cudaHost.cu (1.14 KB)
cudaHost2.cu (869 Bytes)
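For readers without the attachments: the "malloc and then pin" path is presumably something along these lines, using cudaHostRegister/cudaHostUnregister (a sketch, not the attached code):
#include <stdlib.h>
#include <unistd.h>
#include <assert.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 128 << 20; // 128 MB, the smallest test size
    void *buf = NULL;

    // Page-aligned allocation (some CUDA versions expect page alignment
    // for cudaHostRegister).
    assert( posix_memalign(&buf, sysconf(_SC_PAGESIZE), bytes) == 0 );

    // Pin (page-lock) the existing allocation; this is where the time goes.
    assert( cudaHostRegister(buf, bytes, cudaHostRegisterDefault) == cudaSuccess );

    // ... cudaMemcpy to/from buf at pinned-transfer speed ...

    // Unpin and release.
    assert( cudaHostUnregister(buf) == cudaSuccess );
    free(buf);
    return 0;
}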
Found the measurements I was looking for:
http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_management_overhead.html
Can anybody explain to me why this is happening?
Apostolis
Thanks Tim, that was very helpful and informative.
BTW, I was under the impression that the mapping into the GPU address space was done when the runtime was asked to provide a device pointer for the pinned memory (to do zero-copy), not when the memory was allocated on the host.
It’s mapped either way so the GPU can DMA to/from the buffer. Zero-copy implies some additional requirements, though.
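For reference, the additional requirements zero-copy adds on top of a plain pinned allocation look roughly like this (a sketch assuming device 0):
#include <assert.h>
#include <cuda_runtime.h>

int main(void)
{
    // Zero-copy needs a device that can map host memory, and the
    // cudaDeviceMapHost flag must be set before the context is created.
    cudaDeviceProp prop;
    assert( cudaGetDeviceProperties(&prop, 0) == cudaSuccess );
    assert( prop.canMapHostMemory );
    assert( cudaSetDeviceFlags(cudaDeviceMapHost) == cudaSuccess );

    // Pinned allocation that is also mapped into the device address space.
    float *h_ptr = NULL, *d_ptr = NULL;
    assert( cudaHostAlloc(&h_ptr, 1 << 20, cudaHostAllocMapped) == cudaSuccess );

    // Device pointer aliasing the same physical pages; kernels can read and
    // write it directly over PCIe instead of via explicit cudaMemcpy.
    assert( cudaHostGetDevicePointer(&d_ptr, h_ptr, 0) == cudaSuccess );

    cudaFreeHost(h_ptr);
    return 0;
}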