We are running on a Boxx PSC with 4 Tesla C1060s. The OS is XP-64. We calculated the maximum memory allocation for 1 GPU to be maxMem = 4232800000 bytes. The data we are processing exceeds this, so we are partitioning our data and placing the partitions on different GPUs. We are able to partition our data into chunks that are very close to the maximum: partitionMem = 4232799480 bytes. A single alloc of size partitionMem works fine and we’re on our way. However, if we instead make multiple smaller allocs on the same GPU (where the sum of the smaller allocs = partitionMem), the allocation fails.
[codebox]size_t partitionMem = 4232799480; // bytes; exceeds INT_MAX, so a 32-bit int would overflow
size_t memSlice = partitionMem/4;  // bytes per partial alloc
//Assume we are only doing either Full or Partial, not both in a single program execution
//The pseudo-code below just summarizes my text description above
//Full Alloc
state = cublasAlloc(partitionMem, …); // Works fine
//Partial Allocs
for(int i = 0; i<4; i++){
state = cublasAlloc(memSlice, …); // Fails
}
[/codebox]
Doing 1 large alloc and using offsets into it for the different matrices works fine as a workaround, but the failure of the smaller allocs seems worth noting.
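For anyone wanting to try the single-alloc approach, here is a minimal sketch in plain C of how the per-matrix offsets could be computed. It only shows the offset arithmetic, not the device allocation itself. The helper names `slice_offset` and `slice_ptr` are hypothetical (they are not part of CUBLAS), and the sketch assumes equal-sized slices and a base pointer that already came from one large allocation. Sizes are kept in 64-bit unsigned integers because the byte counts above do not fit in a 32-bit `int`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of slice `index` when a buffer of
 * `total_bytes` is divided into `num_slices` equal parts.
 * 64-bit arithmetic avoids overflow for sizes above INT_MAX. */
static uint64_t slice_offset(uint64_t total_bytes, unsigned num_slices,
                             unsigned index)
{
    assert(num_slices > 0 && index < num_slices);
    uint64_t slice_bytes = total_bytes / num_slices;
    return slice_bytes * index;
}

/* Hypothetical helper: pointer to the start of slice `index` inside a
 * single large allocation beginning at `base`. */
static void *slice_ptr(void *base, uint64_t total_bytes,
                       unsigned num_slices, unsigned index)
{
    return (char *)base + slice_offset(total_bytes, num_slices, index);
}
```

With partitionMem = 4232799480 and 4 slices, each slice is 1058199870 bytes, so the four matrices would start at offsets 0, 1058199870, 2116399740, and 3174599610 from the base pointer.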