Pitch Linear Reads: how to read pitch linear memory and write to non-pitch memory

The example code from the CUDA SDK covers reading from a 2D texture reference bound to a pitch linear memory region, and looping through all the entries in a pitch linear memory region. However, I can't seem to read from a pitch linear memory region and write to a linear memory region:

#define height 50

#define width 50

....

// Parallel Kernel

// devMPPtr = memory allocated with cudaMallocPitch

// devMPtr = memory allocated with cudaMalloc

__global__ void kernel_wo_loop(float* devMPPtr, float * devMPtr, int pitch)

{

		unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;

		unsigned int idy = blockIdx.y*blockDim.y + threadIdx.y;

		devMPtr[idy*width+idx]=devMPPtr[idy*pitch+idx]+2.0f;

}

...

int main(){

...

float* devMPPtr;

float* devMPtr;

size_t pitch;

cudaMallocPitch((void**)&devMPPtr,&pitch, width * sizeof(float), height);

...

cudaMalloc((void**)&devMPtr,memsize*sizeof(float));

...

// possibly use dim3 grid(width/BLOCKSIZE,height/BLOCKSIZE), threads(BLOCKSIZE,BLOCKSIZE) ->

//kernel_wo_loop<<<grid,threads>>>(devMPPtr, devMPtr, pitch);

kernel_wo_loop<<<100,512>>>(devMPPtr, devMPtr, pitch);

// Copy back data to host

cudaMemcpy(h_out, devMPtr,memsize*sizeof(float),cudaMemcpyDeviceToHost);

...

Running the above example results in the CUDA error "unspecified launch failure", reported at the cudaMemcpy, NOT at the kernel launch. So my question is: how do I access elements of the pitch linear memory array correctly? I'm guessing it has to do with the "<<<100,512>>>", but I'm not sure how to set it up. Should I do:

dim3 grid(width/BLOCKSIZE+1,height/BLOCKSIZE+1), threads(BLOCKSIZE,BLOCKSIZE)

(I would change the 1 to a variable, but you get the idea, since 50 % 16 != 0.)

I want to avoid a texture read for now (but maybe it's not possible without one, I don't know). Does anybody know how to read from a pitch linear memory region and write to a regular memory region? I can do this fine with the for loop in the Programming Guide, but what a waste of the GPU!
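For reference, here is a minimal sketch of one way the read could work without a texture. It rests on two facts about the snippet above. First, the pitch returned by cudaMallocPitch is in bytes, so indexing a float* with idy*pitch runs past the allocation. Second, a <<<100,512>>> launch is one-dimensional, so idy is always 0 while idx goes far beyond the width. Kernel launches are asynchronous, which would explain why the failure only surfaces at the later cudaMemcpy. The names pitched_to_linear, d_pitched, d_linear, WIDTH, HEIGHT and BLOCKSIZE are made up for this illustration, and BLOCKSIZE = 16 is an assumption taken from the "50%16" remark.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define WIDTH 50
#define HEIGHT 50
#define BLOCKSIZE 16   // assumed block edge, not taken from the original code

// Reads from a pitched allocation and writes to a densely packed one.
// pitch is in BYTES, exactly as returned by cudaMallocPitch.
__global__ void pitched_to_linear(const float* src, float* dst, size_t pitch,
                                  int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // The grid is rounded up, so threads past the edge must do nothing.
    if (x >= width || y >= height) return;

    // Step to the start of row y in bytes, then index by element.
    const float* row = (const float*)((const char*)src + y * pitch);
    dst[y * width + x] = row[x] + 2.0f;
}

int main()
{
    static float h_in[WIDTH * HEIGHT], h_out[WIDTH * HEIGHT];
    for (int i = 0; i < WIDTH * HEIGHT; ++i) h_in[i] = (float)i;

    float* d_pitched = 0;
    float* d_linear = 0;
    size_t pitch = 0;
    cudaMallocPitch((void**)&d_pitched, &pitch, WIDTH * sizeof(float), HEIGHT);
    cudaMalloc((void**)&d_linear, WIDTH * HEIGHT * sizeof(float));

    // Host rows are densely packed, device rows are pitch bytes apart.
    cudaMemcpy2D(d_pitched, pitch, h_in, WIDTH * sizeof(float),
                 WIDTH * sizeof(float), HEIGHT, cudaMemcpyHostToDevice);

    // Ceiling division gives 4 x 4 blocks for 50 x 50 with 16 x 16 threads.
    dim3 threads(BLOCKSIZE, BLOCKSIZE);
    dim3 grid((WIDTH + BLOCKSIZE - 1) / BLOCKSIZE,
              (HEIGHT + BLOCKSIZE - 1) / BLOCKSIZE);
    pitched_to_linear<<<grid, threads>>>(d_pitched, d_linear, pitch, WIDTH, HEIGHT);

    // Check the launch, then wait so a kernel fault is reported here
    // rather than at the next memcpy.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_out, d_linear, WIDTH * HEIGHT * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Compare against the expected value on the host.
    int mismatches = 0;
    for (int i = 0; i < WIDTH * HEIGHT; ++i)
        if (h_out[i] != h_in[i] + 2.0f) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_pitched);
    cudaFree(d_linear);
    return mismatches != 0;
}
```

Rounding the grid up with (n + BLOCKSIZE - 1) / BLOCKSIZE avoids launching an extra block when the size divides evenly, which a plain "+1" would do. The bounds check in the kernel is what makes the rounded-up grid safe.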