2D matrix transfer and handling problem Help required

sabkalyan · July 12, 2010, 4:19am

Hi,

I am trying to write a program that transfers a 2D matrix of size 20x100 (all elements initialized to zero) to the device. On the device I create two blocks, one for the first 50 columns and one for the next 50, with 50 threads per block so that each thread handles one column. Inside the kernel, each thread runs a for loop over rows 0 to 19 and assigns a new value, say 10.5, to every element of its column. The code below was written with Visual Studio 2005. It compiles and runs, but when I copy the modified matrix back to the host and print it, it prints the original values (all zeros). I have been stuck on this problem for quite some time, and it would be a great help if someone could suggest a solution. I am also attaching my .cu file to this post for convenience.

The code is as follows,

#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <stdlib.h>
#include <windows.h>
#include <cutil.h>

__global__ void Matrix2D(float **dA, int maxrows, int maxcols)
{

int idx= blockDim.x*blockIdx.x+threadIdx.x;

for(int i=0;i<maxrows;i++)
{
dA[i][idx]=10.5;
printf("dA[%d][%d] = %f \n",i,idx,dA[i][idx]);
}
__syncthreads();

}

int main(void)

{

// THIS IS  A PROGRAM TO CREATE A 2D MATRIX IN THE DEVICE OF SIZE 20x100 AND INITIALIZE IT TO ALL ZEROS.
// CREATE A GRID OF 2 BLOCKS.
// ASSIGN FIRST 50 COLS OF THE MATRIX TO THE FIRST BLOCK
// ASSIGN THE NEXT 50 COLS OF THE MATRIX TO THE SECOND BLOCK
// CREATE 50 THREADS IN EACH BLOCK.
// USE EACH THREAD TO ASSIGN THE NUMBER 10.5 IN EACH ELEMENT OF THE MATRIX.


float A[20][100], *dA[20];
int maxrows=20;
int maxcols=100;

int size = maxcols*sizeof(float);

for (int i=0;i<maxrows;i++)
{
	cudaMalloc((void**)&dA[i], size); // for every i from 0 to 19, one 1D array is allocated in the device
	for(int j=0;j<maxcols;j++)
	{
		A[i][j]=0;
		printf("%f ",A[i][j]);
	}
	printf("\n");
	cudaMemcpy(dA[i],A[i],size,cudaMemcpyHostToDevice);
}


dim3 dimGrid(2,1);
dim3 dimBlock(50,1);

Matrix2D<<<dimGrid,dimBlock>>>(dA,maxrows,maxcols);


for (int i=0;i<maxrows;i++)
{
	cudaMemcpy(A[i],dA[i],size,cudaMemcpyDeviceToHost); // copying back the modified matrix from the device to host
			
}

for (int i=0;i<maxrows;i++)
{
for(int j=0;j<maxcols;j++)
{
A[i][j]=0;
printf("%f ",A[i][j]); // printing the modified array
}
printf("\n");
}

_getch();

return 0;

}
ARRAY2D_CUDA.cu (1.61 KB)

LSChien · July 12, 2010, 6:45am

float A[20][100], *dA[20];

Matrix2D<<<dimGrid,dimBlock>>>(dA,maxrows,maxcols);

The pointer array dA is in host memory, not in device memory.

In fact, I wonder why you don't get a segmentation fault.

A pointer array is not easy to handle. Why not use a one-dimensional array and compute the 2-D index transformation in the kernel?
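
A minimal sketch of the one-dimensional approach suggested above, assuming row-major storage. The names flat_index and fill_flat are hypothetical and not from the original post; the 2x50 launch configuration and the value 10.5 are taken from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: row-major offset of element (row, col)
// in a matrix that has ncols columns.
__host__ __device__ inline int flat_index(int row, int col, int ncols)
{
    return row * ncols + col;
}

// One thread per column; each thread walks down its own column.
__global__ void fill_flat(float *dA, int maxrows, int maxcols, float value)
{
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    if (col >= maxcols) return;              // guard against surplus threads
    for (int row = 0; row < maxrows; ++row)
        dA[flat_index(row, col, maxcols)] = value;
}

int main(void)
{
    const int maxrows = 20, maxcols = 100;
    static float A[maxrows * maxcols];       // static storage, so it starts as all zeros
    const size_t bytes = sizeof(A);

    // A single device buffer holds the whole matrix.
    float *dA = NULL;
    if (cudaMalloc((void **)&dA, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);

    fill_flat<<<2, 50>>>(dA, maxrows, maxcols, 10.5f);

    // This blocking cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(A, dA, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("A[0][0] = %f, A[19][99] = %f\n",
           A[flat_index(0, 0, maxcols)], A[flat_index(19, 99, maxcols)]);

    cudaFree(dA);
    return 0;
}
```

With a single buffer there is only one device pointer to pass, so the question of where a pointer array lives does not arise, and the whole matrix moves in one cudaMemcpy per direction.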

Sumit_Kumar · July 12, 2010, 7:17am


Huh!!

Nice problem, and an even better solution.

All you need is a fresh mind before writing any program, and even more so when you copy and paste, buddy.

Now, coming to the solution: just look at your code and see what is written right above the printf statement for array A.

Found it?

So always avoid copy and paste: typing the code yourself improves your typing speed and spares you such troubles :rolleyes:

sabkalyan · July 12, 2010, 10:56am


Hi,

I have actually allocated dA on the device using the following piece of code, and only then called the kernel function:

for (int i=0;i<maxrows;i++)

{

cudaMalloc((void**)&dA[i], size); // for every i from 0 to 19, one 1D array is allocated in the device

for(int j=0;j<maxcols;j++)

{

A[i][j]=0;

printf("%f ",A[i][j]);

}

printf("\n");

cudaMemcpy(dA[i],A[i],size,cudaMemcpyHostToDevice);

}

sabkalyan · July 12, 2010, 10:59am

Hi,

Thanks for pointing out the error!

I am well aware that printf cannot be used in kernel code. I had copied and pasted the code I used while working in emulator mode.

Assuming that error is fixed, is there any other place where I am going wrong?

LSChien · July 13, 2010, 4:05am

You declare

float *dA[20];

which means dA is an array of 20 pointers that lives in host memory; each element dA[i] is a pointer.

Then you allocate device memory with

cudaMalloc((void**)&dA[i], size);

so each dA[i] holds the address of a device buffer. This is O.K.

However, you pass dA itself to the kernel:

Matrix2D<<<dimGrid,dimBlock>>>(dA,maxrows,maxcols);

The driver copies only the kernel argument, that is, the value of the pointer dA (the host address of the array), to the device. The 20 row pointers stored in the array are not copied.

Suppose dA[0] = 0x00000000, dA[1] = 0x100, dA[2] = 0x200, …

In your kernel, dA[1][idx] = (dA[1])[idx] should resolve to the address 0x100 + 4 * idx.

But to evaluate dA[1] the kernel has to read the pointer array through a host address, which is not valid on the device, so it never sees 0x100 and the write does not land in the row you allocated.

You need to allocate a pointer array residing in device memory, for example

float **dA_hat;

cudaMalloc((void**)&dA_hat, 20*sizeof(float*));

cudaMemcpy(dA_hat, dA, 20*sizeof(float*), cudaMemcpyHostToDevice);

Matrix2D<<<dimGrid,dimBlock>>>(dA_hat,maxrows,maxcols);

sabkalyan · July 13, 2010, 5:39am

Hi,

I tried that too, but I am still not able to modify the contents of my matrix. I am pasting the code as modified per your suggestions. Kindly let me know if I am making any mistakes. Thanks a lot in advance!

#include <stdio.h>
#include <conio.h>
#include <math.h>
#include <stdlib.h>
#include <windows.h>
#include <cutil.h>

__global__ void Matrix2D(float **dA_hat, int maxrows, int maxcols)

{

int idx= blockDim.x*blockIdx.x+threadIdx.x;

for(int i=0;i<maxrows;i++)

{

dA_hat[i][idx]=10.5;

}

__syncthreads();

}

int main(void)

{

// THIS IS  A PROGRAM TO CREATE A 2D MATRIX IN THE DEVICE OF SIZE 20x100 AND INITIALIZE IT TO ALL ZEROS.

// CREATE A GRID OF 2 BLOCKS.

// ASSIGN FIRST 50 COLS OF THE MATRIX TO THE FIRST BLOCK

// ASSIGN THE NEXT 50 COLS OF THE MATRIX TO THE SECOND BLOCK

// CREATE 50 THREADS IN EACH BLOCK.

// USE EACH THREAD TO ASSIGN THE NUMBER 10.5 IN EACH ELEMENT OF THE MATRIX.

float A[20][100],**dA_hat;

int maxrows=20;

int maxcols=100;

    int size_hat = maxrows *sizeof(float*);

cudaMalloc((void**)&dA_hat, size_hat);

for (int i=0;i<maxrows;i++)

{

	

	for(int j=0;j<maxcols;j++)

	{

		A[i][j]=0;

	}

	

}

printf("initial values of A matrix set to zeros\n");

cudaMemcpy(dA_hat,A,size_hat,cudaMemcpyHostToDevice);

dim3 dimGrid(2,1);

dim3 dimBlock(50,1);

Matrix2D<<<dimGrid,dimBlock>>>(dA_hat,maxrows,maxcols);



cudaMemcpy(A,dA_hat,size_hat,cudaMemcpyDeviceToHost);

for (int i=0;i<maxrows;i++)

{

	for(int j=0;j<maxcols;j++)

	{

		printf("%f ",A[i][j]); // printing the modified array

	}

	printf("\n");

}

_getch();

return 0;

}


LSChien · July 13, 2010, 6:13am

Try the following code:

#include <stdio.h>
#include <stdlib.h>

// One thread per column; each thread writes every row of its column.
__global__ void Matrix2D(float **dA, int maxrows, int maxcols)
{
	int idx = blockDim.x*blockIdx.x + threadIdx.x;

	for (int i = 0; i < maxrows; i++) {
		if (idx < maxcols) {	// threads beyond the last column do nothing
			dA[i][idx] = i + idx;
		}
		//printf("dA[%d][%d] = %f \n",i,idx,dA[i][idx]);
	}
}

int main(void)
{
	float A[20][100], *dA[20];
	int maxrows = 10;
	int maxcols = 10;
	int size = maxcols*sizeof(float);

	for (int i = 0; i < maxrows; i++) {
		cudaMalloc((void**)&dA[i], size); // for every row i, one 1D array is allocated on the device
		for (int j = 0; j < maxcols; j++) {
			A[i][j] = 0;
		}
		cudaMemcpy(dA[i], A[i], size, cudaMemcpyHostToDevice);
	}

	// dA itself lives in host memory, so copy the row pointers
	// into a pointer array that resides in device memory.
	float **dA_hat;
	cudaMalloc((void**)&dA_hat, maxrows * sizeof(float*));
	cudaMemcpy(dA_hat, dA, maxrows * sizeof(float*), cudaMemcpyHostToDevice);

	dim3 dimGrid(2,1);
	dim3 dimBlock(50,1);

	Matrix2D<<<dimGrid,dimBlock>>>(dA_hat, maxrows, maxcols);

	for (int i = 0; i < maxrows; i++) {
		cudaMemcpy(A[i], dA[i], size, cudaMemcpyDeviceToHost); // copy the modified rows back to the host
	}

	for (int i = 0; i < maxrows; i++) {
		for (int j = 0; j < maxcols; j++) {
			printf("A[%d][%d] = %f \n", i, j, A[i][j]); // print the modified array
		}
		printf("\n");
	}

	// release the device-side pointer array and the row buffers
	cudaFree(dA_hat);
	for (int i = 0; i < maxrows; i++) {
		cudaFree(dA[i]);
	}

	return 0;
}
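
None of the programs in this thread check the return codes of the CUDA calls, which may be part of why the first version printed zeros instead of reporting a problem. The following is a rough sketch, not a definitive pattern, of the same device pointer array setup with error checks added. The helper check is hypothetical, and cudaDeviceSynchronize is assumed to be available (it comes from toolkits newer than the one used in this thread, where the equivalent call was cudaThreadSynchronize).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: reports a failed CUDA call and returns false.
static bool check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

__global__ void Matrix2D(float **dA, int maxrows, int maxcols)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx >= maxcols) return;              // guard against surplus threads
    for (int i = 0; i < maxrows; i++)
        dA[i][idx] = 10.5f;
}

int main(void)
{
    const int maxrows = 20, maxcols = 100;
    const size_t rowBytes = maxcols * sizeof(float);
    float *dA[maxrows];                      // host array holding device row pointers
    float **dA_hat = NULL;                   // device copy of that pointer array

    for (int i = 0; i < maxrows; i++) {
        if (!check(cudaMalloc((void **)&dA[i], rowBytes), "cudaMalloc row")) return 1;
        if (!check(cudaMemset(dA[i], 0, rowBytes), "cudaMemset row")) return 1;
    }
    if (!check(cudaMalloc((void **)&dA_hat, sizeof(dA)), "cudaMalloc table")) return 1;
    if (!check(cudaMemcpy(dA_hat, dA, sizeof(dA), cudaMemcpyHostToDevice),
               "cudaMemcpy table")) return 1;

    Matrix2D<<<2, 50>>>(dA_hat, maxrows, maxcols);
    // Launch-time problems show up here; faults during execution
    // are reported by the synchronizing call that follows.
    if (!check(cudaGetLastError(), "kernel launch")) return 1;
    if (!check(cudaDeviceSynchronize(), "kernel execution")) return 1;

    float row0[maxcols];
    if (!check(cudaMemcpy(row0, dA[0], rowBytes, cudaMemcpyDeviceToHost),
               "cudaMemcpy row 0")) return 1;
    printf("A[0][0] = %f\n", row0[0]);

    cudaFree(dA_hat);
    for (int i = 0; i < maxrows; i++) cudaFree(dA[i]);
    return 0;
}
```

If the host array dA were passed to the kernel in place of dA_hat, I would expect the synchronizing call to return an error on current hardware rather than succeed quietly, though the exact behavior depends on the device and driver.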
