2D matrix transfer and handling problem: help required

Hi,

I am trying to write a program that transfers a 2D matrix of size 20x100 (all elements initialized to zero) to the device. On the device I create two blocks, one to handle the first 50 columns and one for the next 50 columns, with 50 threads per block so that each thread handles one column. Within the kernel, each thread runs a for loop from row 0 to row 19 and modifies each element of its column by assigning a new number, say 10.5.

Below is the code I wrote for this using Visual Studio 2005. I am able to compile the program and run it successfully. However, when I copy the modified matrix back to the host and print it, it prints the original values (all zeros). I have been stuck on this problem for quite some time, and it would be a great help if someone could give me a solution. I am also attaching my .cu file to this post for convenience.

The code is as follows:

#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <stdlib.h>
#include <windows.h>
#include <cutil.h>

__global__ void Matrix2D(float **dA, int maxrows, int maxcols)
{
	int idx= blockDim.x*blockIdx.x+threadIdx.x;

	for(int i=0;i<maxrows;i++)
	{
		dA[i][idx]=10.5;
		printf("dA[%d][%d] = %f \n",i,idx,dA[i][idx]);
	}
	__syncthreads();
}

int main(void)

{

// THIS IS  A PROGRAM TO CREATE A 2D MATRIX IN THE DEVICE OF SIZE 20x100 AND INITIALIZE IT TO ALL ZEROS.
// CREATE A GRID OF 2 BLOCKS.
// ASSIGN FIRST 50 COLS OF THE MATRIX TO THE FIRST BLOCK
// ASSIGN THE NEXT 50 COLS OF THE MATRIX TO THE SECOND BLOCK
// CREATE 50 THREADS IN EACH BLOCK.
// USE EACH THREAD TO ASSIGN THE NUMBER 10.5 IN EACH ELEMENT OF THE MATRIX.


float A[20][100], *dA[20];
int maxrows=20;
int maxcols=100;

int size = maxcols*sizeof(float);

for (int i=0;i<maxrows;i++)
{
	cudaMalloc((void**)&dA[i], size); // for every i from 0 to 19, one 1D array is allocated in the device
	for(int j=0;j<maxcols;j++)
	{
		A[i][j]=0;
		printf("%f ",A[i][j]);
	}
	printf("\n");
	cudaMemcpy(dA[i],A[i],size,cudaMemcpyHostToDevice);
}


dim3 dimGrid(2,1);
dim3 dimBlock(50,1);

Matrix2D<<<dimGrid,dimBlock>>>(dA,maxrows,maxcols);


for (int i=0;i<maxrows;i++)
{
	cudaMemcpy(A[i],dA[i],size,cudaMemcpyDeviceToHost); // copying back the modified matrix from the device to host
			
}

for (int i=0;i<maxrows;i++)
{
	for(int j=0;j<maxcols;j++)
	{
		A[i][j]=0;
		printf("%f ",A[i][j]); // printing the modified array
	}
	printf("\n");
}

_getch();

return 0;

}
ARRAY2D_CUDA.cu (1.61 KB)

float A[20][100], *dA[20];

Matrix2D<<<dimGrid,dimBlock>>>(dA,maxrows,maxcols);

The pointer array dA is in host memory, not in device memory, so the kernel cannot dereference it.

In fact, I wonder why you don't get a segmentation fault.
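
On the missing segmentation fault: a failing kernel does not crash the host program. The CUDA runtime reports the problem through error codes, which the posted code never inspects. A minimal sketch of checking them follows. The helper name check and the kernel setValue are hypothetical, and cudaDeviceSynchronize assumes a CUDA 4.0 or later toolkit (older toolkits used cudaThreadSynchronize).

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes one element.
__global__ void setValue(float *p, float v) { p[threadIdx.x] = v; }

// Hypothetical helper: print a message if a runtime call failed, then pass the code on.
static cudaError_t check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return err;
}

int main(void)
{
    float h[4] = {0};
    float *d = NULL;

    if (check(cudaMalloc((void**)&d, sizeof(h)), "cudaMalloc") != cudaSuccess)
        return 1;

    setValue<<<1, 4>>>(d, 10.5f);
    check(cudaGetLastError(), "kernel launch");          // errors detected at launch
    check(cudaDeviceSynchronize(), "kernel execution");  // errors raised while the kernel ran

    check(cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost), "cudaMemcpy");
    cudaFree(d);

    printf("h[0] = %f\n", h[0]);
    return 0;
}
```

Checking both after the launch and after synchronization matters because a bad pointer dereference is only detected once the kernel actually executes.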

An array of pointers is not easy to handle on the device. Why not use a one-dimensional array and

compute the 2D index transformation in the kernel?
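
The one-dimensional approach suggested here can be sketched as follows. This is a minimal sketch rather than the original poster's code: the names fillFlat and fillOnDevice are hypothetical, and it assumes row-major storage, so element (row, col) sits at offset row * cols + col.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define ROWS 20
#define COLS 100

// Hypothetical kernel: one thread per column, matrix stored as one flat row-major array.
__global__ void fillFlat(float *dA, int rows, int cols, float value)
{
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    if (col >= cols) return;                 // guard against surplus threads
    for (int row = 0; row < rows; row++) {
        dA[row * cols + col] = value;        // 2D index (row, col) -> 1D offset
    }
}

// Hypothetical host helper: copy the matrix to the device, fill it, copy it back.
static cudaError_t fillOnDevice(float *A, int rows, int cols, float value)
{
    float *dA = NULL;
    size_t bytes = (size_t)rows * cols * sizeof(float);

    cudaError_t err = cudaMalloc((void**)&dA, bytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 50;                               // 50 columns per block, as in the question
        int blocks = (cols + threads - 1) / threads;    // enough blocks to cover every column
        fillFlat<<<blocks, threads>>>(dA, rows, cols, value);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(A, dA, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    return err;
}

int main(void)
{
    static float A[ROWS][COLS];              // static storage, so zero-initialized

    cudaError_t err = fillOnDevice(&A[0][0], ROWS, COLS, 10.5f);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("A[0][0] = %f, A[%d][%d] = %f\n", A[0][0], ROWS - 1, COLS - 1, A[ROWS - 1][COLS - 1]);
    return 0;
}
```

With a single allocation there is no pointer table to keep in sync between host and device, and the whole matrix moves in one cudaMemcpy each way.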

Huh!!

Nice problem, and an even better solution.

All you need is a fresh mind before writing any program,

and, more importantly, when using copy and paste, buddy.

Now, coming to the solution: just look at your code and see what is written above the printf statement for the array A.

Found it!!

So always avoid copy and paste: typing the code out improves your typing speed and spares you such troubles.

Hi,

I have actually allocated dA on the device using the following piece of code,

and only then called the kernel function:

for (int i=0;i<maxrows;i++)
{
	cudaMalloc((void**)&dA[i], size); // for every i from 0 to 19, one 1D array is allocated in the device
	for(int j=0;j<maxcols;j++)
	{
		A[i][j]=0;
		printf("%f ",A[i][j]);
	}
	printf("\n");
	cudaMemcpy(dA[i],A[i],size,cudaMemcpyHostToDevice);
}

Hi,

Thanks for pointing out the error!!

However, I am well aware that printf is not used in kernel code. I had copied and pasted the code I used while working in emulation mode!!

Anyway, assuming this error is not there, is there any other place where I am going wrong?

You declare

float *dA[20];

which means dA is an array of pointers: each element dA[i] is a pointer.

Then you allocate device memory with

cudaMalloc((void**)&dA[i], size);

so each dA[i] holds the address of a block of device memory. This is O.K.

However, you pass dA itself to the kernel:

Matrix2D<<<dimGrid,dimBlock>>>(dA,maxrows,maxcols);

Now the driver copies only the value of the pointer dA, that is, the host address of the array, into the kernel's parameter space. The 20 row pointers stored in that array are not copied to the device.

Suppose dA[0] = 0x00000000, dA[1] = 0x100, dA[2] = 0x200, …

In your kernel, dA[1] is then read through that host address, which is not valid on the device,

so dA[1][idx] does not resolve to 0x100 + 4 * idx, which is what you intended.

You need to allocate a pointer array that resides in device memory, for example:

float **dA_hat;

cudaMalloc((void**)&dA_hat, 20*sizeof(float*));

cudaMemcpy(dA_hat, dA, 20*sizeof(float*), cudaMemcpyHostToDevice);

Matrix2D<<<dimGrid,dimBlock>>>(dA_hat,maxrows,maxcols);

Hi,

I tried doing that too, but again I am not able to modify the contents of my matrix. I am pasting the code modified as per your suggestions. Kindly let me know if I am making any mistakes. Thanks a lot in advance!

#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <stdlib.h>
#include <windows.h>
#include <cutil.h>

__global__ void Matrix2D(float **dA_hat, int maxrows, int maxcols)
{
	int idx= blockDim.x*blockIdx.x+threadIdx.x;

	for(int i=0;i<maxrows;i++)
	{
		dA_hat[i][idx]=10.5;
	}
	__syncthreads();
}

int main(void)
{
	// THIS IS  A PROGRAM TO CREATE A 2D MATRIX IN THE DEVICE OF SIZE 20x100 AND INITIALIZE IT TO ALL ZEROS.
	// CREATE A GRID OF 2 BLOCKS.
	// ASSIGN FIRST 50 COLS OF THE MATRIX TO THE FIRST BLOCK
	// ASSIGN THE NEXT 50 COLS OF THE MATRIX TO THE SECOND BLOCK
	// CREATE 50 THREADS IN EACH BLOCK.
	// USE EACH THREAD TO ASSIGN THE NUMBER 10.5 IN EACH ELEMENT OF THE MATRIX.

	float A[20][100],**dA_hat;
	int maxrows=20;
	int maxcols=100;

	int size_hat = maxrows *sizeof(float*);
	cudaMalloc((void**)&dA_hat, size_hat);

	for (int i=0;i<maxrows;i++)
	{
		for(int j=0;j<maxcols;j++)
		{
			A[i][j]=0;
		}
	}
	printf("initial values of A matrix set to zeros\n");

	cudaMemcpy(dA_hat,A,size_hat,cudaMemcpyHostToDevice);

	dim3 dimGrid(2,1);
	dim3 dimBlock(50,1);

	Matrix2D<<<dimGrid,dimBlock>>>(dA_hat,maxrows,maxcols);

	cudaMemcpy(A,dA_hat,size_hat,cudaMemcpyDeviceToHost);

	for (int i=0;i<maxrows;i++)
	{
		for(int j=0;j<maxcols;j++)
		{
			printf("%f ",A[i][j]); // printing the modified array
		}
		printf("\n");
	}

	_getch();

	return 0;
}

Try the following code:

#include <stdio.h>
#include <cuda_runtime.h>

// Each thread owns one column and writes every row of that column.
__global__ void Matrix2D(float **dA, int maxrows, int maxcols)
{
	int idx= blockDim.x*blockIdx.x+threadIdx.x;

	for(int i=0;i<maxrows;i++){
		if ( idx < maxcols ){  // 100 threads are launched, but only maxcols columns exist
			dA[i][idx]= i + idx;
		}
	}
}

int main(void)
{
	float A[20][100], *dA[20];
	int maxrows=10;
	int maxcols=10;
	int size = maxcols*sizeof(float);

	for (int i=0;i<maxrows;i++){
		cudaMalloc((void**)&dA[i], size); // for every i from 0 to maxrows-1, one 1D row is allocated on the device
		for(int j=0;j<maxcols;j++){
			A[i][j]=0;
		}
		cudaMemcpy(dA[i],A[i],size,cudaMemcpyHostToDevice);
	}

	// dA holds device addresses but lives in host memory,
	// so the pointer table itself must also be copied to the device
	float **dA_hat;
	cudaMalloc((void**)&dA_hat, maxrows * sizeof(float*) );
	cudaMemcpy(dA_hat,dA, maxrows * sizeof(float*), cudaMemcpyHostToDevice);

	dim3 dimGrid(2,1);
	dim3 dimBlock(50,1);

	Matrix2D<<<dimGrid,dimBlock>>>(dA_hat,maxrows,maxcols);

	for (int i=0;i<maxrows;i++){
		cudaMemcpy(A[i],dA[i],size,cudaMemcpyDeviceToHost); // copying each modified row back from the device to the host
	}

	for (int i=0;i<maxrows;i++){
		for(int j=0;j<maxcols;j++){
			printf("A[%d][%d] = %f \n", i, j , A[i][j]); // printing the modified array
		}
		printf("\n");
	}

	// release the device rows and the device pointer table
	for (int i=0;i<maxrows;i++){
		cudaFree(dA[i]);
	}
	cudaFree(dA_hat);

	return 0;
}