Problem with cudaMemcpy3D from device to cudaArray (test code attached)

I recently upgraded to Ubuntu 9.04 and CUDA 2.2 and found that my old code no longer works.
After digging for two days, I found that the problem occurs when cudaMemcpy3D copies from unaligned device linear memory to a cudaArray. (I'm not sure "unaligned" is the right term; I mean a grid whose dimensions are not powers of two.)
In other words,
Grid=97x87x101 Host->cudaArray OK
Grid=97x87x101 Device->cudaArray FAIL
Grid=128x256x128 Host->cudaArray OK
Grid=128x256x128 Device->cudaArray OK
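
For anyone who wants to try this without downloading the attachment, here is a minimal sketch of the Device->cudaArray case. It is not the attached test code. The helper names (rowBytes, copyLinearToArray, runCase) are made up for illustration, and it assumes a tightly packed float volume allocated with plain cudaMalloc. It only checks return codes, not the copied contents.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstddef>

// Hypothetical helper: bytes in one tightly packed row of nx floats (no padding).
static size_t rowBytes(int nx)
{
    return static_cast<size_t>(nx) * sizeof(float);
}

// Hypothetical helper: copy a tightly packed nx*ny*nz float volume that lives in
// device linear memory into a 3D cudaArray of the same size.
static cudaError_t copyLinearToArray(float* d_src, cudaArray* dst,
                                     int nx, int ny, int nz)
{
    cudaMemcpy3DParms p = {0};
    // The pitch is the row size in bytes; xsize and ysize describe the logical volume.
    p.srcPtr   = make_cudaPitchedPtr(d_src, rowBytes(nx), nx, ny);
    p.dstArray = dst;
    // When a cudaArray takes part in the copy, the extent width is in elements.
    p.extent   = make_cudaExtent(nx, ny, nz);
    p.kind     = cudaMemcpyDeviceToDevice;
    return cudaMemcpy3D(&p);
}

// Hypothetical driver: allocate both buffers, run the copy, report the status.
static cudaError_t runCase(int nx, int ny, int nz)
{
    float*     d_src = NULL;
    cudaArray* arr   = NULL;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

    cudaError_t err = cudaMalloc((void**)&d_src, rowBytes(nx) * ny * nz);
    if (err == cudaSuccess)
        err = cudaMalloc3DArray(&arr, &desc, make_cudaExtent(nx, ny, nz));
    if (err == cudaSuccess)
        err = copyLinearToArray(d_src, arr, nx, ny, nz);

    printf("Grid=%dx%dx%d Device->cudaArray: %s\n",
           nx, ny, nz, cudaGetErrorString(err));

    if (arr)   cudaFreeArray(arr);
    if (d_src) cudaFree(d_src);
    return err;
}
```

Calling runCase(97, 87, 101) and runCase(128, 256, 128) from main should show whether the two grid sizes behave differently on a given setup. For 97 floats per row the source pitch is 388 bytes, versus 512 bytes for 128 floats, so the pitch passed to make_cudaPitchedPtr may be worth a look, though I can't say that this is the cause.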

Could anyone confirm this? I wonder whether I did something wrong.
I attached my test code, which should be ready to compile with nvcc and run.

OS: Ubuntu 9.04 x64
GCC: 4.2
CUDA: 2.2
Driver: and

Any comments are appreciated. Thanks in advance!

PS: Sorry, I couldn't upload it as an attachment, so please follow the link here.