cudaFreeHost is slow on 270.41.19 (Ubuntu GNU/Linux 64-bit) and GTX580

Hi!

Between Linux driver releases 260.19.21 and 270.41.19 something seems to have changed in how memory is allocated and freed with cudaHostAlloc and cudaFreeHost. With the new driver and a GTX580 card, the time spent freeing memory increases ten-fold and the time spent allocating memory increases by a factor of two.

The doubling of the allocation time is explained by a fix for a security hole (see Secunia Research): the fix zeroes all memory allocated by cudaHostAlloc. But where does the ten-fold increase in the time to free memory come from?
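As a quick sanity check of the zero-fill claim, something like the following can be used to confirm that the memory returned by cudaHostAlloc is all zeros (the 256 MB size here is arbitrary):

#include <stdio.h>
#include <cuda_runtime.h>

/* Allocate a pinned buffer and count non-zero bytes, to confirm the
 * zero-fill behaviour of the newer driver. The size is arbitrary. */
int main(void)
{
    const size_t size = 256UL * 1024 * 1024;
    unsigned char *mem;
    size_t i, nonzero = 0;

    if (cudaHostAlloc((void **)&mem, size, 0) != cudaSuccess) {
        printf("cudaHostAlloc failed\n");
        return 1;
    }
    for (i = 0; i < size; i++)
        nonzero += (mem[i] != 0);

    printf("non-zero bytes: %zu of %zu\n", nonzero, size);
    cudaFreeHost(mem);
    return 0;
}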

I have run some experiments with different GPUs and driver versions, recording the time to allocate and free memory with cudaHostAlloc and cudaFreeHost for the four combinations of card = [GTX285, GTX580] and driver = [260.19.21, 270.41.19]. The results are as follows.

On a GTX580, allocating 8GB pinned, driver version 260.19.21:

$ ./a.out 8000000000 0
size: 8000000000
device: 0
cudaHostAlloc time: 0.510638 s.
cudaFreeHost time: 0.358069 s.

On a GTX580, allocating 8GB pinned, driver version 270.41.19:

$ ./a.out 8000000000 0
size: 8000000000
device: 0
cudaHostAlloc time: 1.442341 s.
cudaFreeHost time: 3.643809 s.

On a GTX285, allocating 8GB pinned, driver version 260.19.21:

$ ./a.out 8000000000 0
size: 8000000000
device: 0
cudaHostAlloc time: 0.784601 s.
cudaFreeHost time: 0.524070 s.

On a GTX285, allocating 8GB pinned, driver version 270.41.19:

$ ./a.out 8000000000 0
size: 8000000000
device: 0
cudaHostAlloc time: 1.565684 s.
cudaFreeHost time: 0.442333 s.

The figures above indicate that some extra work is being done on free specifically for the combination GTX580 + 270.41.19. Note that the free time for GTX285 + 270.41.19 is actually lower than for GTX285 + 260.19.21.

Does anyone know how to get rid of this overhead on the deallocation time?

The source code for a.out above is:

#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <time.h>

/* Wall-clock time in seconds. */
double getTime()
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return ts.tv_sec + ts.tv_nsec * 1.0e-9;
}

/*
 * Usage: a.out size device
 */
int main(int argc, char *argv[])
{
    char *mem;
    size_t size;
    int device;
    double start, end;

    if (argc < 3) {
        printf("usage: %s size device\n", argv[0]);
        exit(1);
    }

    size = strtoull(argv[1], NULL, 0);
    device = strtoull(argv[2], NULL, 0);

    printf("size: %zu\n", size);
    printf("device: %d\n", device);

    if (cudaSetDevice(device)) {
        printf("error 1\n");
        exit(1);
    }

    /* Dummy allocation/free to force CUDA context creation, so that it
     * is not included in the timings below. */
    {
        char *m1;
        cudaMalloc(&m1, 1);
        cudaFree(m1);
    }

    start = getTime();
    if (cudaHostAlloc(&mem, size, 0)) {
        printf("error 2\n");
        exit(1);
    }
    end = getTime();
    printf("cudaHostAlloc time: %f s.\n", end - start);

    start = getTime();
    cudaFreeHost(mem);
    end = getTime();
    printf("cudaFreeHost time: %f s.\n", end - start);

    return 0;
}
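For reference, I compile and run it roughly like this (the source file name is arbitrary; -lrt is needed for clock_gettime with older glibc):

nvcc -O2 -o a.out pinned_timing.cu -lrt
./a.out 8000000000 0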

I just tried GTX580 with drivers 275.28 and 285.03:

On a GTX580, allocating 8GB pinned, driver version 275.28:

$ ./a.out 8000000000 0
size: 8000000000
device: 0
cudaHostAlloc time: 1.447510 s.
cudaFreeHost time: 3.645830 s.

On a GTX580, allocating 8GB pinned, driver version 285.03:

$ ./a.out 8000000000 0
size: 8000000000
device: 0
cudaHostAlloc time: 1.715932 s.
cudaFreeHost time: 3.741525 s.

So, no good news there.
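In the meantime, the only mitigation I can think of is to keep cudaFreeHost off the hot path entirely: allocate one pinned buffer up front, reuse it for every transfer, and pay the slow free only once at shutdown. A minimal sketch of that idea follows (the function names and the grow-only policy are placeholders, not a fix for the driver behaviour):

#include <stddef.h>
#include <cuda_runtime.h>

/* One pinned buffer, allocated lazily and reused across transfers. */
static char *g_pinned = NULL;
static size_t g_pinned_size = 0;

/* Return a pinned buffer of at least 'size' bytes. The buffer only
 * grows, so cudaFreeHost is hit at most once per size increase. */
char *get_pinned_buffer(size_t size)
{
    if (size > g_pinned_size) {
        if (g_pinned)
            cudaFreeHost(g_pinned);
        if (cudaHostAlloc((void **)&g_pinned, size, 0) != cudaSuccess) {
            g_pinned = NULL;
            g_pinned_size = 0;
            return NULL;
        }
        g_pinned_size = size;
    }
    return g_pinned;
}

/* Pay the (slow) cudaFreeHost once, e.g. at program exit. */
void release_pinned_buffer(void)
{
    if (g_pinned) {
        cudaFreeHost(g_pinned);
        g_pinned = NULL;
        g_pinned_size = 0;
    }
}

That obviously does not explain why the free got slower, but it keeps the cost out of the timing-critical part of the application.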