I have been working on an application that uses pageable memory when only one device is used and pinned memory when two devices are used.
I have found that freeing pageable memory is much faster than freeing pinned memory: freeing pageable memory takes practically no time, while freeing pinned memory takes 0.228 s.
Is this behaviour normal, or am I doing something wrong?
I am using CUDA 4.0 with the 4.1 CUDA driver version. Each node in the cluster has two Tesla M2090 GPUs (compute capability 2.0). The OS is Red Hat Enterprise Linux Server release 5.3 (Tikanga) x86_64, and the PCI Express version is 2.0.
This is the source code I am using for timing:
clock_gettime( CLOCK_REALTIME, &rghstsStartFreeHost[i] );
if( memType == PINNED )
    CHECK( cudaFreeHost( rgdv[i].plhsMatrixOut ) );   /* pinned buffer */
free( rgdv[i].pghsMatrixOut );                        /* pageable buffer */
clock_gettime( CLOCK_REALTIME, &rghstsStopFreeHost[i] );
Thanks in advance.
Nobody? Well, I will work around it with a memory pool. Nevertheless, it would be nice to know how to fix it (if possible).
I don’t know the details of the implementation, but semantically memory is pinned (locked) at the page level, which is typically 4K (run ‘getconf PAGESIZE’ to check). Thus, the kernel is likely updating data structures to unlock each 4K page, which is a fairly heavyweight operation.
Freeing pageable host memory is much easier – that can be done in userspace by just adding the block to the free memory list. Depending on the allocator, freeing memory from your application need not even involve a system call.
Having said that, I wrote a quick test that allocates and then frees pageable and pinned host memory, and ran that application under strace to track the system calls being made. I did not see what I expected: there were no calls to mlock() or munlock(), only a couple of calls to mmap(). It may be that the NVIDIA kernel driver is maintaining the mapping itself.
In any case, I would expect pinned memory to have higher overhead, but I don’t have a gut feeling for how long it should take. Similarly, using cudaHostRegister()/cudaHostUnregister() to wrap a cudaMemcpy() is slower than just calling cudaMemcpy() with the pageable address (or has been when I tested it). I typically allocate my pinned memory during application start and do not free it until application shutdown.
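The allocate-once pattern described above might look something like this (a sketch only; the buffer names and structure are illustrative, not from either poster's code):

```cuda
#include <cuda_runtime.h>

/* Allocate pinned staging buffers once at startup... */
static float *g_pinnedIn  = NULL;
static float *g_pinnedOut = NULL;

void app_init(size_t elems)
{
    cudaHostAlloc((void **)&g_pinnedIn,  elems * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void **)&g_pinnedOut, elems * sizeof(float), cudaHostAllocDefault);
}

/* ...reuse them for every host<->device transfer during the run... */

void app_shutdown(void)
{
    /* Pay the unpinning cost once, at shutdown. */
    cudaFreeHost(g_pinnedIn);
    cudaFreeHost(g_pinnedOut);
}
```

This keeps the expensive pin/unpin work out of the timed portion of the application entirely.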
Thank you, Thomas, for your response. I have done what you suggested: I free memory only at the end. It would be great if an NVIDIA engineer could confirm this behaviour.