Tesla C1060 Allocation Problem

We are running on a Boxx PSC with 4 Tesla C1060s. The OS is XP-64. We calculated the maximum memory allocation for 1 GPU to be maxMem = 4232800000 bytes. The data we are processing exceeds this, so we are partitioning it and putting the partitions on different GPUs. We are able to partition our data into chunks that are very close to the max memory: partitionMem = 4232799480 bytes. If we call 1 alloc of size partitionMem, this works fine and we’re on our way. However, if we call multiple smaller allocs on the same GPU (where the sum of the smaller allocs = partitionMem), this fails.

[codebox]// Both sizes exceed INT_MAX (2147483647), so a 32-bit int would overflow here
unsigned long long partitionMem = 4232799480ULL;
unsigned long long memSlice = partitionMem / 4;

// Assume we are doing either Full or Partial, not both, in a single program execution
// The pseudo-code below just summarizes my text description above

// Full Alloc
state = cublasAlloc(partitionMem, …); // Works fine

// Partial Allocs
for (int i = 0; i < 4; i++) {
    state = cublasAlloc(memSlice, …); // Fails
}
[/codebox]

Doing 1 large alloc works fine as a workaround (using offsets for the different matrices), but the failure seemed worth noting.
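For anyone wanting to try the same workaround, here is a minimal sketch of the offset arithmetic for carving one large allocation into equal slices. The helper names `slice_offset` and `slice_size` are hypothetical (they are not part of CUBLAS), and the sketch only computes byte offsets on the host; it assumes the caller adds each offset to the base device pointer returned by the single alloc.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of slice `index` when `total_bytes` is split into `count`
   equal slices inside one allocation. 64-bit arithmetic is used because
   the sizes in this thread do not fit in a 32-bit int. */
uint64_t slice_offset(uint64_t total_bytes, unsigned count, unsigned index)
{
    assert(count > 0 && index < count);
    return (total_bytes / count) * index;
}

/* Size in bytes of slice `index`; the last slice absorbs any remainder
   left over when total_bytes is not a multiple of count. */
uint64_t slice_size(uint64_t total_bytes, unsigned count, unsigned index)
{
    assert(count > 0 && index < count);
    uint64_t base = total_bytes / count;
    if (index == count - 1) {
        return total_bytes - base * (count - 1);
    }
    return base;
}
```

With partitionMem = 4232799480 and 4 slices, each slice is 1058199870 bytes and the slices start at byte offsets 0, 1058199870, 2116399740 and 3174599610.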

I would imagine that each alloc reserves a contiguous block of device memory. So if the first alloc of roughly half the total memory size is placed somewhere in the middle of the device memory’s address space, then the second alloc of the same size will not find a contiguous free block of the requested size, even though the total amount of free memory would be enough.
Just a guess though…

N.
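To make the fragmentation guess above concrete, here is a toy model in plain C. It is only an illustration of the idea: the `Block` type and both functions are hypothetical, the address space is an abstract range of units, and nothing here reflects how the CUDA driver actually places allocations.

```c
#include <assert.h>
#include <stddef.h>

/* One used region of a toy address space (hypothetical model). */
typedef struct {
    size_t offset; /* start of the block within the address space */
    size_t size;   /* length of the block */
} Block;

/* Largest contiguous free region in [0, capacity), given `n` used blocks
   that are sorted by offset and do not overlap. */
size_t largest_free_gap(size_t capacity, const Block *used, size_t n)
{
    size_t best = 0;
    size_t cursor = 0; /* end of the previous used block */
    for (size_t i = 0; i < n; i++) {
        assert(used[i].offset >= cursor);
        size_t gap = used[i].offset - cursor;
        if (gap > best) {
            best = gap;
        }
        cursor = used[i].offset + used[i].size;
    }
    assert(cursor <= capacity);
    if (capacity - cursor > best) {
        best = capacity - cursor;
    }
    return best;
}

/* Total free space, ignoring whether it is contiguous. */
size_t total_free(size_t capacity, const Block *used, size_t n)
{
    size_t used_sum = 0;
    for (size_t i = 0; i < n; i++) {
        used_sum += used[i].size;
    }
    return capacity - used_sum;
}
```

In this model, a 50-unit block placed at offset 25 of a 100-unit space leaves 50 units free in total, but the largest contiguous gap is only 25 units, so a second 50-unit request could not be satisfied. The same block placed at offset 0 leaves a contiguous 50-unit gap.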