Problem with memory allocation on GTX 295 with multiple threads: out of memory when allocating more than 768 MB

Hi everyone,

I have a problem allocating global memory on a GTX 295. Following the SDK sample "simpleMultiGPU", I try
to allocate 3*256 MB on each device of the GTX 295, which has 2 GPUs and 2*896 MB of memory, and I get an
"out of memory" error. Any advice would be appreciated.
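For reference, here is a quick sanity check of the sizes involved, as a minimal sketch in plain C. The helper names (matrix_bytes, total_bytes, bytes_to_mib) are made up for illustration and are not part of the SDK, and the sketch assumes sizeof(float) is 4 bytes. It only does the arithmetic and says nothing about how much memory the driver actually leaves free on each GPU.

```c
#include <stddef.h>

/* Hypothetical helper: bytes needed for one n x n matrix of floats. */
static size_t matrix_bytes(size_t n)
{
    return n * n * sizeof(float);
}

/* Hypothetical helper: bytes needed for `count` such matrices. */
static size_t total_bytes(size_t n, size_t count)
{
    return count * matrix_bytes(n);
}

/* Hypothetical helper: convert bytes to MiB (1 MiB = 1024 * 1024 bytes). */
static double bytes_to_mib(size_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}
```

With n = 8196, as in the code comment, one matrix is 268,697,664 bytes (about 256.25 MiB) and three of them are 806,092,992 bytes (about 768.75 MiB), so slightly more than 3*256 MB per GPU. With n = 8192, one matrix would be exactly 256 MiB.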

code:

static CUT_THREADPROC GPUThread(MultiGPUData *plan){

    float* d_A;
    float* d_B;
    float* d_C;

    /* Select the GPU this thread works on */
    cutilSafeCall( cudaSetDevice(plan->device) );

    /* Allocate device memory for the matrices */
    // plan->n2 = 8196*8196, so A, B and C are about 256 MB each
    cutilSafeCall( cudaMalloc((void**)&d_A, plan->n2 * sizeof(float)) );
    cutilSafeCall( cudaMalloc((void**)&d_B, plan->n2 * sizeof(float)) );
    cutilSafeCall( cudaMalloc((void**)&d_C, plan->n2 * sizeof(float)) );

    /* Free device memory */
    cutilSafeCall( cudaFree(d_A) );
    cutilSafeCall( cudaFree(d_B) );
    cutilSafeCall( cudaFree(d_C) );
    CUT_THREADEND;

}
// Create 2 threads for the 2 devices; each one allocates 3*256 MB on its own device
int main(int argc, char **argv){

    for(i = 0; i < 2; i++){

        threadID[i] = cutStartThread((CUT_THREADROUTINE)GPUThread, (void *)(plan + i));
        sleep(1000);
    }

    cutWaitForThreads(threadID, 2);
    ...

}

Problem found and solved

Solution: add sleep(1000) after starting each thread. Could someone explain why this works?
My guess is that it needs some time to start another thread.
Thanks for any explanation.