Memory fragmentation after allocating a small block of memory

I have the following problem with my GTX 970 (4 GB). After a few allocations, device memory becomes highly fragmented as soon as I allocate a small block (8 KB). I use the following code to reproduce the effect (compiled as 64-bit):

#include "cuda_runtime.h"

#include <stdio.h>
#include <vector>

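// Binary-searches (in KB units) for the largest block cudaMalloc will accept,
// then allocates it and keeps it so the next search only sees the remaining memory.
// Returns the block size in bytes, or 0 if nothing could be allocated.
size_t getLargestFreeBlock(std::vector<void*>& allocatedBlocks)
{
    void* block = NULL;

    size_t lastAllocableBlockSize = 0;                 // in KB
    size_t currentBlockSize = (size_t)8*1024*1024;     // start the search at 8 GB (in KB)
    size_t currentStepSize = currentBlockSize/2;

    do
    {
        cudaError_t err = cudaMalloc (&block, currentBlockSize*1024);
        if ( err == cudaSuccess )
        {
            lastAllocableBlockSize = currentBlockSize;
            cudaFree (block);
            currentBlockSize += currentStepSize;
        }
        else
        {
            cudaGetLastError();   // clear the expected allocation error from the last-error state
            currentBlockSize -= currentStepSize;
        }
        currentStepSize >>= 1;

    } while (currentStepSize>0);

    // Nothing fits any more: report 0 so the caller stops, and never
    // push an invalid pointer that would later be passed to cudaFree.
    if ( lastAllocableBlockSize == 0 ) return 0;

    if ( cudaMalloc (&block, lastAllocableBlockSize*1024) != cudaSuccess ) return 0;

    allocatedBlocks.push_back (block);

    return lastAllocableBlockSize*1024;
}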

// Repeatedly grabs the largest allocatable block until nothing more fits,
// prints each block size, then releases all blocks again.
void printFreeCudaMem()
{
    std::vector<void*> allocatedBlocks;
    size_t allocatedTotal = 0;
    printf("\nChecking memory:\n");

    do
    {
        size_t largestBlock = getLargestFreeBlock(allocatedBlocks);
        if ( largestBlock == 0 ) break;
        printf ("allocated block of %llu bytes\n", (unsigned long long)largestBlock);
        allocatedTotal += largestBlock;
    } while (true);
    printf ("---------------------------------\n");
    printf ("allocated a total of %llu bytes\n", (unsigned long long)allocatedTotal);
    for ( size_t i = 0; i < allocatedBlocks.size(); i++ )
    {
        cudaFree (allocatedBlocks[i]);
    }
}

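int main()
{
    cudaDeviceReset();

    // First allocation: 2 GiB + 87320576 bytes (the (size_t) cast keeps the product from overflowing int).
    void* buffer1 = NULL;
    if ( cudaMalloc(&buffer1, (size_t)2*1024*1024*1024+87320576) != cudaSuccess )
    {
        printf("allocation of buffer1 failed\n");
        return 1;
    }

    printFreeCudaMem();

    // Second allocation: a single small 8 KB block.
    void* buffer2 = NULL;
    if ( cudaMalloc(&buffer2, 8192) != cudaSuccess )
    {
        printf("allocation of buffer2 failed\n");
        cudaFree(buffer1);
        return 1;
    }

    printFreeCudaMem();

    cudaFree(buffer2);
    cudaFree(buffer1);
    return 0;
}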

On my system, I get the following output:

Checking memory:
allocated block of 1903413248 bytes
---------------------------------
allocated a total of 1903413248 bytes

Checking memory:
allocated block of 1902364672 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 16384 bytes
allocated block of 8192 bytes
---------------------------------
allocated a total of 1903405056 bytes

Any ideas what causes this and how it can be avoided? Or is there a bug in printFreeCudaMem()?

Bye,
Ingo

I fail to see how the output shown above indicates “highly fragmented” memory. Could you clarify, please? What operating system is this platform running?

Well, after allocating the two blocks (~2.1 GB and 8 KB), I’d expect the remaining ~1.9 GB of free memory to be allocatable as a single block, just as it was before I allocated those 8 KB. Instead, the remaining free memory can only be allocated as one huge block plus 64 small blocks.
The operating system is Windows 7 64-bit.
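
The figures from the two runs are consistent with this: the largest block shrank by exactly 1 MiB, the 63 blocks of 16 KB plus the final 8 KB block fill that 1 MiB except for the 8 KB held by buffer2, and the second total is exactly 8 KB below the first. This is only arithmetic on the printed numbers; it does not by itself show how the driver places allocations.

```latex
\begin{aligned}
1\,903\,413\,248 - 1\,902\,364\,672 &= 1\,048\,576 \text{ bytes} = 1\ \text{MiB} \\
63 \times 16\,384 + 8\,192 &= 1\,040\,384 \text{ bytes} = 1\,048\,576 - 8\,192 \\
1\,902\,364\,672 + 1\,040\,384 &= 1\,903\,405\,056 \text{ bytes} = 1\,903\,413\,248 - 8\,192
\end{aligned}
```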

Bye,
Ingo