problem with cudaMallocPitch and cudaMemcpy2D

Hello to All,

I am trying to do some matrix computations using cudaMallocPitch and cudaMemcpy2D.

Since I am having some trouble, I wrote a simple kernel that copies one matrix into another.

Here is the code:

[codebox]__global__ void matrixCopy(float* a, float* c, int a_pitch, int c_pitch, int width)
{
    int x = blockIdx.x*blockDim.x+threadIdx.x;
    int y = blockIdx.y*blockDim.y+threadIdx.y;

    c[y*c_pitch+x]=a[y*a_pitch+x];
}

void matrixCopyCaller(float* hA, float* hC, int width, float& compute_time)
{
    float* dA;
    size_t matA_gpu_pitch;
    float* dC;
    size_t matC_gpu_pitch;

    //int memsize=width*width*sizeof(float);
    unsigned int timer = 0;

    cudaSetDevice(0);

    cudaMallocPitch((void**)&dA,&matA_gpu_pitch, width*sizeof(float),width);
    cudaMemcpy2D(dA, matA_gpu_pitch, hA, width*sizeof(float), width*sizeof(float),width,cudaMemcpyHostToDevice);
    cudaMallocPitch((void**)&dC,&matC_gpu_pitch, width*sizeof(float),width);

    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(width/block.x,width/block.y);

    matrixMul<<<grid,block>>>(dA,dB,dC, matA_gpu_pitch, matB_gpu_pitch, matC_gpu_pitch, width);
    cudaThreadSynchronize();

    cudaMemcpy2D(hC, width*sizeof(float), dC, matC_gpu_pitch, width*sizeof(float), width, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dC);
}
[/codebox]

Could you explain to me why it does not work?

Thanks in advance

Francesco

Hi,

please be a bit more precise about WHAT is going wrong! Are there compiler errors, linker errors, or runtime errors, or do the results just look corrupt?

All the elements of the C matrix are 0.

Hi,

the name of the kernel you want to test is “matrixCopy”:

[codebox]__global__ void matrixCopy(float* a, float* c, int a_pitch, int c_pitch, int width)[/codebox]

however, your code does not call this kernel; it calls:

[codebox]matrixMul<<<grid,block>>>(dA,dB,dC, matA_gpu_pitch, matB_gpu_pitch, matC_gpu_pitch, width);[/codebox]

Could your problem be linked to this?

Maël

No, that line was the result of a copy & paste mistake… I apologize :(

This is the real code:

[codebox]#define BLOCK_SIZE 16

__global__ void matrixCopy(float* a, float* c, int a_pitch, int c_pitch, int width)
{
    int x = blockIdx.x*blockDim.x+threadIdx.x;
    int y = blockIdx.y*blockDim.y+threadIdx.y;

    c[y*c_pitch+x]=a[y*a_pitch+x];
}

void matrixCopyCaller(float* hA, float* hC, int width, float& compute_time)
{
    float* dA;
    size_t matA_gpu_pitch;
    float* dC;
    size_t matC_gpu_pitch;

    //int memsize=width*width*sizeof(float);
    unsigned int timer = 0;

    cudaSetDevice(0);

    cudaMallocPitch((void**)&dA,&matA_gpu_pitch, width*sizeof(float),width);
    cudaMemcpy2D(dA, matA_gpu_pitch, hA, width*sizeof(float), width*sizeof(float),width,cudaMemcpyHostToDevice);
    cudaMallocPitch((void**)&dC,&matC_gpu_pitch, width*sizeof(float),width);

    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(width/block.x,width/block.y);

    matrixCopy<<<grid,block>>>(dA,dC, matA_gpu_pitch, matC_gpu_pitch, width);
    cudaThreadSynchronize();

    cudaMemcpy2D(hC, width*sizeof(float), dC, matC_gpu_pitch, width*sizeof(float), width, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dC);
}
[/codebox]

Only that line is different.

The code compiles and runs, but all the elements of matrix C are 0.

Thanks a lot to everyone.

I apologize again for my wrong post, but only one line was wrong.

cheers

Francesco

Solved:

The error was that the pitch arguments passed to the kernel must be expressed in number of array elements, not in bytes. cudaMallocPitch returns the pitch in bytes, and the kernel indexes float pointers, so you must divide the pitch by sizeof(float).

Here is the correct kernel call:

[codebox]matrixCopy<<<grid,block>>>(dA,dC, matA_gpu_pitch/sizeof(float), matC_gpu_pitch/sizeof(float), width);

[/codebox]
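
For anyone hitting the same issue, the difference between the two units can be checked on the host without a GPU. The sketch below is not from the posts above: the helper names (elementByBytePitch, indexByElementPitch, pitchSchemesAgree) are hypothetical, and a std::vector<float> merely stands in for the memory that cudaMallocPitch would return. It assumes the pitch is a multiple of sizeof(float).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: pointer to element (x, y) when the pitch is given in
// BYTES, which is the unit cudaMallocPitch reports. The row offset is applied
// through a char* so that it is counted in bytes.
inline float* elementByBytePitch(void* base, std::size_t pitchBytes, int x, int y) {
    char* row = static_cast<char*>(base) + static_cast<std::size_t>(y) * pitchBytes;
    return reinterpret_cast<float*>(row) + x;
}

// Hypothetical helper: linear float index of element (x, y) when the pitch is
// given in ELEMENTS, which is what an expression like a[y*pitch+x] expects.
inline std::size_t indexByElementPitch(std::size_t pitchElems, int x, int y) {
    return static_cast<std::size_t>(y) * pitchElems + static_cast<std::size_t>(x);
}

// Writes a simulated pitched buffer using byte-pitch addressing, then reads it
// back using element-pitch indexing, and reports whether both schemes agree.
inline bool pitchSchemesAgree(int width, int height, std::size_t pitchBytes) {
    // Assumption: the pitch is a whole number of floats and covers a full row.
    if (pitchBytes % sizeof(float) != 0) return false;
    if (pitchBytes < static_cast<std::size_t>(width) * sizeof(float)) return false;

    const std::size_t pitchElems = pitchBytes / sizeof(float);
    std::vector<float> buffer(pitchElems * static_cast<std::size_t>(height), 0.0f);

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            *elementByBytePitch(buffer.data(), pitchBytes, x, y) =
                static_cast<float>(y * width + x);

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (buffer[indexByElementPitch(pitchElems, x, y)] !=
                static_cast<float>(y * width + x))
                return false;
    return true;
}
```

As an illustration, take width = 100 and an assumed pitch of 512 bytes (the real value depends on the device). Each row then spans 128 floats. Passing 512 to indexByElementPitch as if it were an element count puts the start of row 99 at float index 50688, well past the 12800 floats in the buffer, which is the kind of out-of-bounds access the original kernel call produced. An alternative to dividing by sizeof(float) is to keep the byte pitch in the kernel and do the row arithmetic through a char*, as the first helper does.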