The example code from the CUDA SDK covers reading from a 2D texture reference bound to a pitch linear memory region, and also just looping through all the entries of a pitch linear region, but I can't seem to read from a pitch linear memory region and write to an ordinary linear memory region:
#define height 50
#define width 50
....
// Parallel kernel
// devMPPtr = memory from cudaMallocPitch
// devMPtr  = memory from cudaMalloc
__global__ void kernel_wo_loop(float* devMPPtr, float * devMPtr, int pitch)
{
    unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int idy = blockIdx.y*blockDim.y + threadIdx.y;
    devMPtr[idy*width+idx] = devMPPtr[idy*pitch+idx] + 2.0f;
}
...
int main(){
...
float* devMPPtr;
float* devMPtr;
size_t pitch;
cudaMallocPitch((void**)&devMPPtr,&pitch, width * sizeof(float), height);
...
cudaMalloc((void**)&devMPtr,memsize*sizeof(float));
...
// possibly use dim3 grid(width/BLOCKSIZE,height/BLOCKSIZE), threads(BLOCKSIZE,BLOCKSIZE) ->
//kernel_wo_loop<<<grid,threads>>>(devMPPtr, devMPtr, pitch);
kernel_wo_loop<<<100,512>>>(devMPPtr, devMPtr, pitch);
// Copy back data to host
cudaMemcpy(h_out, devMPtr,memsize*sizeof(float),cudaMemcpyDeviceToHost);
...
Running the above example results in the CUDA error "unspecified launch failure" reported at the cudaMemcpy, NOT at the kernel launch. So my question is: how do I access elements of the pitch linear memory array correctly? I'm guessing it has to do with the "<<<100,512>>>" launch configuration, but I'm not sure how to set it up. Should I do:
dim3 grid(width/BLOCKSIZE+1,height/BLOCKSIZE+1), threads(BLOCKSIZE,BLOCKSIZE)
(I would change the 1 to a variable, but you get the idea, since 50 % 16 != 0)
I want to avoid a texture read for now (though maybe it isn't possible without one; I don't know). Does anybody know how to read from a pitch linear memory region and write to a regular memory region? I can do this fine with the for loop in the Programming Guide, but what a waste of the GPU!