cudaFreeHost is slow on 270.41.19 (Ubuntu GNU/Linux 64-bit) and GTX580

Hi!

Between Linux driver releases 260.19.21 and 270.41.19 something seems to have changed in how memory is allocated and freed with cudaHostAlloc and cudaFreeHost. With the new driver and a GTX580 card, the time spent freeing memory increases ten-fold and the time spent allocating memory increases by a factor of two.

The doubling of the allocation time is explained by a fix for a security hole (see Secunia Research): the fix zeroes all memory allocated by cudaHostAlloc. But where does the ten-fold increase in the time to free memory come from?
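As a quick sanity check of the zero-fill claim, something like the following can be used to confirm that the memory returned by cudaHostAlloc is all zeros (the 256 MB size here is arbitrary):

#include <stdio.h>
#include <cuda_runtime.h>

/* Allocate a pinned buffer and count non-zero bytes, to confirm the
 * zero-fill behaviour of the newer driver. The size is arbitrary. */
int main(void)
{
    const size_t size = 256UL * 1024 * 1024;
    unsigned char *mem;
    size_t i, nonzero = 0;

    if (cudaHostAlloc((void **)&mem, size, 0) != cudaSuccess) {
        printf("cudaHostAlloc failed\n");
        return 1;
    }
    for (i = 0; i < size; i++)
        nonzero += (mem[i] != 0);

    printf("non-zero bytes: %zu of %zu\n", nonzero, size);
    cudaFreeHost(mem);
    return 0;
}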

I have run some experiments with different GPUs and driver versions, recording the time to allocate and free memory with cudaHostAlloc and cudaFreeHost for the four combinations of card = [GTX285, GTX580] and driver = [260.19.21, 270.41.19]. The results are as follows.

On a GTX580, allocating 8GB pinned, driver version 260.19.21:

$ ./a.out 8000000000 0
size: 8000000000
device: 0
cudaHostAlloc time: 0.510638 s.
cudaFreeHost time: 0.358069 s.

On a GTX580, allocating 8GB pinned, driver version 270.41.19:

$ ./a.out 8000000000 0
size: 8000000000
device: 0
cudaHostAlloc time: 1.442341 s.
cudaFreeHost time: 3.643809 s.

On a GTX285, allocating 8GB pinned, driver version 260.19.21:

$ ./a.out 8000000000 0
size: 8000000000
device: 0
cudaHostAlloc time: 0.784601 s.
cudaFreeHost time: 0.524070 s.

On a GTX285, allocating 8GB pinned, driver version 270.41.19:

$ ./a.out 8000000000 0
size: 8000000000
device: 0
cudaHostAlloc time: 1.565684 s.
cudaFreeHost time: 0.442333 s.

The figures above indicate that some extra work is being done on free specifically for the combination GTX580 + 270.41.19. Note that the free time for GTX285 + 270.41.19 is actually lower than for GTX285 + 260.19.21.

Does anyone know how to get rid of this overhead on the deallocation time?

The source code for a.out above is:

#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <time.h>

/* Wall-clock time in seconds. */
double getTime()
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return ts.tv_sec + ts.tv_nsec * 1.0e-9;
}

/*
 * Usage: a.out size device
 */
int main(int argc, char *argv[])
{
    char *mem;
    size_t size;
    int device;
    double start, end;

    if (argc < 3) {
        printf("usage: %s size device\n", argv[0]);
        exit(1);
    }

    size = strtoull(argv[1], NULL, 0);
    device = strtoull(argv[2], NULL, 0);

    printf("size: %zu\n", size);
    printf("device: %d\n", device);

    if (cudaSetDevice(device)) {
        printf("error 1\n");
        exit(1);
    }

    /* Dummy allocation/free to force CUDA context creation, so that it
     * is not included in the timings below. */
    {
        char *m1;
        cudaMalloc(&m1, 1);
        cudaFree(m1);
    }

    start = getTime();
    if (cudaHostAlloc(&mem, size, 0)) {
        printf("error 2\n");
        exit(1);
    }
    end = getTime();
    printf("cudaHostAlloc time: %f s.\n", end - start);

    start = getTime();
    cudaFreeHost(mem);
    end = getTime();
    printf("cudaFreeHost time: %f s.\n", end - start);

    return 0;
}
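For reference, I compile and run it roughly like this (the source file name is arbitrary; -lrt is needed for clock_gettime with older glibc):

nvcc -O2 -o a.out pinned_timing.cu -lrt
./a.out 8000000000 0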

I just tried GTX580 with drivers 275.28 and 285.03:

On a GTX580, allocating 8GB pinned, driver version 275.28:

$ ./a.out 8000000000 0
size: 8000000000
device: 0
cudaHostAlloc time: 1.447510 s.
cudaFreeHost time: 3.645830 s.

On a GTX580, allocating 8GB pinned, driver version 285.03:

$ ./a.out 8000000000 0
size: 8000000000
device: 0
cudaHostAlloc time: 1.715932 s.
cudaFreeHost time: 3.741525 s.

So, no good news there.
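In the meantime, the only mitigation I can think of is to keep cudaFreeHost off the hot path entirely: allocate one pinned buffer up front, reuse it for every transfer, and pay the slow free only once at shutdown. A minimal sketch of that idea follows (the function names and the grow-only policy are placeholders, not a fix for the driver behaviour):

#include <stddef.h>
#include <cuda_runtime.h>

/* One pinned buffer, allocated lazily and reused across transfers. */
static char *g_pinned = NULL;
static size_t g_pinned_size = 0;

/* Return a pinned buffer of at least 'size' bytes. The buffer only
 * grows, so cudaFreeHost is hit at most once per size increase. */
char *get_pinned_buffer(size_t size)
{
    if (size > g_pinned_size) {
        if (g_pinned)
            cudaFreeHost(g_pinned);
        if (cudaHostAlloc((void **)&g_pinned, size, 0) != cudaSuccess) {
            g_pinned = NULL;
            g_pinned_size = 0;
            return NULL;
        }
        g_pinned_size = size;
    }
    return g_pinned;
}

/* Pay the (slow) cudaFreeHost once, e.g. at program exit. */
void release_pinned_buffer(void)
{
    if (g_pinned) {
        cudaFreeHost(g_pinned);
        g_pinned = NULL;
        g_pinned_size = 0;
    }
}

That obviously does not explain why the free got slower, but it keeps the cost out of the timing-critical part of the application.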