cudaMemcpy3D problems with copying from host to device

I’m working on a 3D convolution (using FFT) to smooth an image. I’m implementing a “replicate” convolution, so I need to expand the image by copying its border regions into the expanded image. To do that, I used this kind of code:

cudaMemcpy3DParms p3 = { 0 };
p3.srcPtr.ptr = h_InputImageSeq;
p3.srcPtr.pitch = ImageHeight * sizeof(float);
p3.srcPtr.xsize = ImageHeight;
p3.srcPtr.ysize = ImageWidth;
//p3.srcPos.x = 5 * sizeof(float); //offset width
p3.srcPos.x = 3 * sizeof(float); //offset width
p3.srcPos.y = 2; //offset height
p3.srcPos.z = 0; //offset length

p3.dstPtr.ptr = device_ExpandedSeqReplicate.ptr;
p3.dstPtr.pitch = device_ExpandedSeqReplicate.pitch;
p3.dstPtr.xsize = ExpandImgHeight;
p3.dstPtr.ysize = ExpandImgWidth+2;
p3.dstPos.x = 0 * sizeof(float); //offset width
p3.dstPos.y = 2; //offset height
p3.dstPos.z = 0; //offset length

p3.extent.width = 1 * sizeof(float);
p3.extent.height = 2;
p3.extent.depth = 1;
//p3.extent.height = ImageWidth;
//p3.extent.depth = ImageLength;
p3.kind = cudaMemcpyHostToDevice;
state = cudaMemcpy3D(&p3);
if (state != cudaSuccess) { // runtime API calls return cudaError_t, so compare against cudaSuccess
    return false;
}

Well, this code doesn’t do what I ultimately want; I was just experimenting to find out how it works. But whatever I set srcPos.x/srcPos.y/srcPos.z to, it always copies the same data (same start pixel). If I do the same thing but copy directly from device to device, it seems to work. Does anybody know why?

It might be easier to write a kernel that copies your borders, but cudaMemcpy3D should also work.
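
For the kernel route, here is a minimal sketch of replicate padding. It is not taken from the code above: the names clampIndex, replicatePad, launchReplicatePad and the single pad parameter are hypothetical. It assumes the source volume is already on the device as densely packed w*h*d floats (for example via a plain cudaMemcpy) and that the destination was allocated with cudaMalloc3D using an extent of (w + 2*pad) * sizeof(float) bytes by h + 2*pad rows by d + 2*pad slices.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Clamp an index into [0, n-1]. With "replicate" padding, coordinates outside
// the image read the nearest border voxel, which is exactly this clamp.
__host__ __device__ inline int clampIndex(int i, int n)
{
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

// One thread writes one voxel of the expanded volume.
// src: packed w*h*d floats in device memory (assumption).
// dst: pitched allocation from cudaMalloc3D, sized as described above (assumption).
__global__ void replicatePad(const float* src, int w, int h, int d,
                             cudaPitchedPtr dst, int pad)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= w + 2 * pad || y >= h + 2 * pad || z >= d + 2 * pad) return;

    // Map the expanded coordinate back into the source, clamped to its borders.
    int sx = clampIndex(x - pad, w);
    int sy = clampIndex(y - pad, h);
    int sz = clampIndex(z - pad, d);

    // Pitched addressing: rows are dst.pitch bytes apart,
    // slices are dst.pitch * dst.ysize bytes apart.
    char*  slice = static_cast<char*>(dst.ptr) + (size_t)z * dst.pitch * dst.ysize;
    float* row   = reinterpret_cast<float*>(slice + (size_t)y * dst.pitch);
    row[x] = src[((size_t)sz * h + sy) * w + sx];
}

// Hypothetical host wrapper: launches the kernel and reports launch errors.
cudaError_t launchReplicatePad(const float* d_src, int w, int h, int d,
                               cudaPitchedPtr d_dst, int pad)
{
    assert(w > 0 && h > 0 && d > 0 && pad >= 0);
    dim3 block(8, 8, 8);
    dim3 grid((w + 2 * pad + block.x - 1) / block.x,
              (h + 2 * pad + block.y - 1) / block.y,
              (d + 2 * pad + block.z - 1) / block.z);
    replicatePad<<<grid, block>>>(d_src, w, h, d, d_dst, pad);
    return cudaGetLastError();
}
```

This fills interior and border in one pass, so no per-face cudaMemcpy3D calls are needed. If the padding differs per axis, the single pad argument would have to become three.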