cudaMemcpy3DAsync not fully asynchronous. Is there a maximum size for a fully asynchronous call?

Here is a piece of code I wrote to check an issue I was getting with Memcpy3DAsync.

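#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

clock_t last_time = 0;

// Print the time elapsed since the previous call and since program start, in ms.
void timestamp(const char* message) {
clock_t current_time = (clock()*1000) / CLOCKS_PER_SEC;
fprintf(stderr,"%s +%ldms (overall time=%ldms)\n",message,(long)(current_time - last_time),(long)current_time);
last_time = current_time;
}

int main() {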
cudaArray *d_volumeArray = NULL;
cudaExtent volumeSize;
volumeSize.depth = 128;
volumeSize.height = 512;
volumeSize.width = 512;

cudaPitchedPtr *pagelockedPtr = new cudaPitchedPtr;
pagelockedPtr->pitch = volumeSize.width*sizeof(unsigned short);
pagelockedPtr->xsize = volumeSize.width;
pagelockedPtr->ysize = volumeSize.height;
size_t size = volumeSize.width*volumeSize.height*volumeSize.depth*sizeof(unsigned short);
SAFE_CALL( cudaMallocHost(&(pagelockedPtr->ptr), size) );

int e = (int)sizeof(unsigned short) * 8;
cudaChannelFormatDesc volumeChannelDesc = cudaCreateChannelDesc(e, 0, 0, 0, cudaChannelFormatKindUnsigned);
SAFE_CALL( cudaMalloc3DArray(&d_volumeArray, &volumeChannelDesc, volumeSize) );
timestamp("done mallocing");

cudaMemcpy3DParms copyParams = {0};
copyParams.srcPtr   = *pagelockedPtr;
copyParams.dstArray = d_volumeArray;
copyParams.extent   = volumeSize;
copyParams.kind     = cudaMemcpyHostToDevice;

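// Issue the asynchronous copy on the default stream (stream 0).
SAFE_CALL( cudaMemcpy3DAsync(&copyParams, 0) );
timestamp("done issuing memory copy");

// Block until the copy has actually finished.
SAFE_CALL( cudaThreadSynchronize() );
timestamp("completed memory copy");
}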


Run like this, issuing the copy takes no measurable time.
However, as soon as volumeSize.depth goes above 128, the issuing call itself starts taking time.
In my case, moving to 256 makes the issuing call take 15ms, with an additional 16ms before the copy completes.

Is there any hope that in the future we might see a fully asynchronous call for larger blocks, or is this some kind of hardware limitation at a 64MB block size?
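
As a sanity check on the 64MB figure, the sizes follow directly from the extents in the code above. A minimal sketch of the arithmetic, where volume_bytes is a hypothetical helper of mine (not a CUDA call) and unsigned short is assumed to be 2 bytes:

```cpp
#include <cstddef>

// Hypothetical helper (not part of the CUDA API): bytes occupied by a
// width x height x depth volume with the given voxel size.
constexpr std::size_t volume_bytes(std::size_t width, std::size_t height,
                                   std::size_t depth, std::size_t bytes_per_voxel) {
    return width * height * depth * bytes_per_voxel;
}

// Assumes sizeof(unsigned short) == 2, as on the platform in question.
// depth = 128: the case where issuing the copy returns immediately.
static_assert(volume_bytes(512, 512, 128, sizeof(unsigned short)) == 64u * 1024 * 1024,
              "512 x 512 x 128 unsigned shorts should be 64MB");

// depth = 256: the case where issuing the copy takes about 15ms.
static_assert(volume_bytes(512, 512, 256, sizeof(unsigned short)) == 128u * 1024 * 1024,
              "512 x 512 x 256 unsigned shorts should be 128MB");
```

So the depth of 128 corresponds to exactly 64MB and the depth of 256 to 128MB. I have only tried those two depths, so the exact size at which the call stops returning immediately is not pinned down by this.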


Edit: XP64, GTX280, CUDA 2.0 beta 2, driver 177.41.