2D matrix transfer and handling problem: help required

Hi,

I am trying to write a program that transfers a 2D matrix of size 20x100 (all elements initialized to zero) to the device. On the device I launch two blocks, one for the first 50 columns and one for the next 50, with 50 threads per block so that each thread handles one column. Inside the kernel, each thread runs a for loop over rows 0 to 19 and assigns a new value, say 10.5, to every element in its column. Below is the code I wrote for this using Visual Studio 2005. The program compiles and runs without errors. However, when I copy the modified matrix back to the host and print it, it prints the original values (all zeros). I have been stuck on this problem for quite some time. It would be a great help if someone could give me a solution. I am also attaching my .cu file to this post for convenience.

The code is as follows,

#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <stdlib.h>
#include <windows.h>
#include <cutil.h>

__global__ void Matrix2D(float **dA, int maxrows, int maxcols)
{

int idx= blockDim.x*blockIdx.x+threadIdx.x;

for(int i=0;i<maxrows;i++)
{
dA[i][idx]=10.5;
printf("dA[%d][%d] = %f \n",i,idx,dA[i][idx]);
}
__syncthreads();

}

int main(void)

{

// THIS IS  A PROGRAM TO CREATE A 2D MATRIX IN THE DEVICE OF SIZE 20x100 AND INITIALIZE IT TO ALL ZEROS.
// CREATE A GRID OF 2 BLOCKS.
// ASSIGN FIRST 50 COLS OF THE MATRIX TO THE FIRST BLOCK
// ASSIGN THE NEXT 50 COLS OF THE MATRIX TO THE SECOND BLOCK
// CREATE 50 THREADS IN EACH BLOCK.
// USE EACH THREAD TO ASSIGN THE NUMBER 10.5 IN EACH ELEMENT OF THE MATRIX.


float A[20][100], *dA[20];
int maxrows=20;
int maxcols=100;

int size = maxcols*sizeof(float);

for (int i=0;i<maxrows;i++)
{
	cudaMalloc((void**)&dA[i], size); // for every i from 0 to 19, one 1D array is allocated in the device
	for(int j=0;j<maxcols;j++)
	{
		A[i][j]=0;
		printf("%f ",A[i][j]);
	}
	printf("\n");
	cudaMemcpy(dA[i],A[i],size,cudaMemcpyHostToDevice);
}


dim3 dimGrid(2,1);
dim3 dimBlock(50,1);

Matrix2D<<<dimGrid,dimBlock>>>(dA,maxrows,maxcols);


for (int i=0;i<maxrows;i++)
{
	cudaMemcpy(A[i],dA[i],size,cudaMemcpyDeviceToHost); // copying back the modified matrix from the device to host
			
}

for (int i=0;i<maxrows;i++)
{
for(int j=0;j<maxcols;j++)
{
A[i][j]=0;
printf("%f ",A[i][j]); // printing the modified array
}
printf("\n");
}

_getch();

return 0;

}
ARRAY2D_CUDA.cu (1.61 KB)

float A[20][100], *dA[20];

Matrix2D<<<dimGrid,dimBlock>>>(dA,maxrows,maxcols);

The pointer array dA is in host memory, not in device memory.

In fact, I wonder why you don't get a segmentation fault.

Pointer arrays are not easy to handle. Why not use a one-dimensional array and compute the 2-D index transformation in the kernel?
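
A minimal sketch of that one-dimensional approach, assuming the matrix is stored row-major in a single cudaMalloc allocation. The names Matrix1D and flatIndex are made up for this illustration and are not from the original post; error checking is left out to keep it short.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: maps (row, col) of a row-major matrix to a 1-D offset.
__host__ __device__ inline int flatIndex(int row, int col, int maxcols)
{
    return row * maxcols + col;
}

// One thread per column; each thread walks down the rows of its column.
__global__ void Matrix1D(float *dA, int maxrows, int maxcols)
{
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    if (col < maxcols) {
        for (int row = 0; row < maxrows; row++) {
            dA[flatIndex(row, col, maxcols)] = 10.5f;
        }
    }
}

int main(void)
{
    const int maxrows = 20, maxcols = 100;
    const size_t bytes = maxrows * maxcols * sizeof(float);

    static float A[maxrows][maxcols];   // static storage, so it starts as all zeros
    float *dA = NULL;                   // a single device pointer, no pointer array

    cudaMalloc((void**)&dA, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);

    Matrix1D<<<2, 50>>>(dA, maxrows, maxcols);

    // A host 2-D array is contiguous, so one copy brings the whole matrix back.
    cudaMemcpy(A, dA, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);

    printf("A[0][0] = %f, A[19][99] = %f\n", A[0][0], A[19][99]);
    return 0;
}
```

Because only one device pointer is passed to the kernel, there is no pointer table to keep in sync between host and device, and the transfer in each direction is a single cudaMemcpy.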

Huh!! :turned:

Nice Problem and even better solution :shifty:

All you need to do is have always a fresh mind before making any program.

And more important when using Copy Paste buddy :shifty:

Now coming to the solution: just look at your code and see what's written above the printf statement of the array A :shifty:

Found !!

So always avoid Copy Paste, as that increases your typing speed and reduces such burdens :rolleyes:

Hi,

I have actually allocated dA in the device using the following piece of code, and only then have I called the kernel function:

for (int i=0;i<maxrows;i++)
{
	cudaMalloc((void**)&dA[i], size); // for every i from 0 to 19, one 1D array is allocated in the device
	for(int j=0;j<maxcols;j++)
	{
		A[i][j]=0;
		printf("%f ",A[i][j]);
	}
	printf("\n");
	cudaMemcpy(dA[i],A[i],size,cudaMemcpyHostToDevice);
}

Hi,

Thanks for pointing out the error !!

However, I am very much aware that printf is not used in kernel code… I actually copied and pasted the code I had used while working in emulator mode !!

Anyway, leaving that error aside, is there anywhere else I am going wrong?

You declare

float *dA[20];

which means dA is a pointer array: each element dA[i] is a pointer.

Then you allocate device memory with

cudaMalloc((void**)&dA[i], size);

so each dA[i] holds the address of a device allocation. This is O.K.

However, you pass dA to the kernel:

Matrix2D<<<dimGrid,dimBlock>>>(dA,maxrows,maxcols);

The driver copies the kernel arguments to the device, but for dA that is only the pointer value itself, i.e. the host address of the array. The 20 pointers stored in the array stay in host memory.

Suppose dA[0] = 0x00000000, dA[1] = 0x100, dA[2] = 0x200, …

In your kernel, dA[i] has to be read from the array before dA[i][idx]=10.5 can be executed, and that read goes through a host address that the device cannot dereference.

So dA[1][idx] never resolves to the intended address 0x100 + 4 * idx.

You need to allocate a pointer array residing in device memory, for example

float **dA_hat;

cudaMalloc((void**)&dA_hat, 20*sizeof(float*));

cudaMemcpy(dA_hat, dA, 20*sizeof(float*), cudaMemcpyHostToDevice);

Matrix2D<<<dimGrid,dimBlock>>>(dA_hat,maxrows,maxcols);

Hi,

I tried doing that too, but I am still not able to modify the contents of my matrix. I am pasting the code as modified per your suggestions. Kindly let me know if I am making any mistakes. Thanks a lot in advance !!

#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <stdlib.h>
#include <windows.h>
#include <cutil.h>

__global__ void Matrix2D(float **dA_hat, int maxrows, int maxcols)
{
	int idx= blockDim.x*blockIdx.x+threadIdx.x;

	for(int i=0;i<maxrows;i++)
	{
		dA_hat[i][idx]=10.5;
	}
	__syncthreads();
}

int main(void)
{
	// THIS IS  A PROGRAM TO CREATE A 2D MATRIX IN THE DEVICE OF SIZE 20x100 AND INITIALIZE IT TO ALL ZEROS.
	// CREATE A GRID OF 2 BLOCKS.
	// ASSIGN FIRST 50 COLS OF THE MATRIX TO THE FIRST BLOCK
	// ASSIGN THE NEXT 50 COLS OF THE MATRIX TO THE SECOND BLOCK
	// CREATE 50 THREADS IN EACH BLOCK.
	// USE EACH THREAD TO ASSIGN THE NUMBER 10.5 IN EACH ELEMENT OF THE MATRIX.

	float A[20][100],**dA_hat;
	int maxrows=20;
	int maxcols=100;

	int size_hat = maxrows *sizeof(float*);

	cudaMalloc((void**)&dA_hat, size_hat);

	for (int i=0;i<maxrows;i++)
	{
		for(int j=0;j<maxcols;j++)
		{
			A[i][j]=0;
		}
	}

	printf("initial values of A matrix set to zeros\n");

	cudaMemcpy(dA_hat,A,size_hat,cudaMemcpyHostToDevice);

	dim3 dimGrid(2,1);
	dim3 dimBlock(50,1);

	Matrix2D<<<dimGrid,dimBlock>>>(dA_hat,maxrows,maxcols);

	cudaMemcpy(A,dA_hat,size_hat,cudaMemcpyDeviceToHost);

	for (int i=0;i<maxrows;i++)
	{
		for(int j=0;j<maxcols;j++)
		{
			printf("%f ",A[i][j]); // printing the modified array
		}
		printf("\n");
	}

	_getch();

	return 0;
}

Try the following code:

#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <stdlib.h>
#include <cutil.h>

__global__ void Matrix2D(float **dA, int maxrows, int maxcols)
{
	int idx= blockDim.x*blockIdx.x+threadIdx.x;

	for(int i=0;i<maxrows;i++){
		if ( idx < maxcols ){
			dA[i][idx]= i + idx;
		}
		//printf("dA[%d][%d] = %f \n",i,idx,dA[i][idx]);
	}
	//	__syncthreads();
}

int main(void)
{
	float A[20][100], *dA[20];
	int maxrows=10;
	int maxcols=10;

	int size = maxcols*sizeof(float);

	for (int i=0;i<maxrows;i++){
		cudaMalloc((void**)&dA[i], size); // for every row i, one 1D array is allocated in the device
		for(int j=0;j<maxcols;j++){
			A[i][j]=0;
			//printf("%f ",A[i][j]);
		}
		printf("\n");
		cudaMemcpy(dA[i],A[i],size,cudaMemcpyHostToDevice);
	}

	// dA_hat is a copy of the pointer array dA that resides in device memory
	float **dA_hat;
	cudaMalloc((void**)&dA_hat, maxrows * sizeof(float*) );
	cudaMemcpy(dA_hat,dA, maxrows * sizeof(float*), cudaMemcpyHostToDevice);

	dim3 dimGrid(2,1);
	dim3 dimBlock(50,1);

	Matrix2D<<<dimGrid,dimBlock>>>(dA_hat,maxrows,maxcols);

	for (int i=0;i<maxrows;i++){
		cudaMemcpy(A[i],dA[i],size,cudaMemcpyDeviceToHost); // copying back the modified matrix from the device to host
	}

	for (int i=0;i<maxrows;i++){
		for(int j=0;j<maxcols;j++){
			printf("A[%d][%d] = %f \n", i, j , A[i][j]); // printing the modified array
		}
		printf("\n");
	}

	return 0;
}
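
None of the listings in this thread check the return values of the CUDA calls, which may be part of why the earlier versions appeared to run fine while the matrix stayed unchanged. Below is a rough sketch of the same device pointer-table approach with error checks and cleanup added. CUDA_CHECK is a hypothetical macro written for this illustration, not part of the CUDA toolkit, and cudaDeviceSynchronize assumes a reasonably recent toolkit (older ones used cudaThreadSynchronize).

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with file/line if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

__global__ void Matrix2D(float **dA, int maxrows, int maxcols)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < maxcols) {
        for (int i = 0; i < maxrows; i++) {
            dA[i][idx] = 10.5f;
        }
    }
}

int main(void)
{
    const int maxrows = 20, maxcols = 100;
    float *dA[maxrows];        // host array holding the device row pointers
    float **dA_hat = NULL;     // copy of that pointer array in device memory
    float row0[maxcols];       // host buffer for reading back one row

    for (int i = 0; i < maxrows; i++) {
        CUDA_CHECK(cudaMalloc((void**)&dA[i], maxcols * sizeof(float)));
    }
    CUDA_CHECK(cudaMalloc((void**)&dA_hat, maxrows * sizeof(float*)));
    CUDA_CHECK(cudaMemcpy(dA_hat, dA, maxrows * sizeof(float*),
                          cudaMemcpyHostToDevice));

    Matrix2D<<<2, 50>>>(dA_hat, maxrows, maxcols);
    CUDA_CHECK(cudaGetLastError());        // reports errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());   // reports errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(row0, dA[0], maxcols * sizeof(float),
                          cudaMemcpyDeviceToHost));
    printf("A[0][0] = %f\n", row0[0]);

    for (int i = 0; i < maxrows; i++) {
        CUDA_CHECK(cudaFree(dA[i]));
    }
    CUDA_CHECK(cudaFree(dA_hat));
    return 0;
}
```

With checks like these in place, passing a host pointer array to the kernel would be expected to surface as a reported error after the launch rather than as silently unchanged data.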