NVIDIA Developer Forums

Retrieve array columns


kelson October 24, 2010, 7:43am 1

I’m trying to retrieve each column of a 2D array into a 1D shared memory array. The program produces correct results, but analysis shows that this step takes too long and slows down the overall execution of my application. I would appreciate any help with this problem.

Here is what I did:

__device__ void getCols(float *_iarray, int row, int col)
{
	__shared__ float tmpCol[BLOCK_SIZE];
	int tid = threadIdx.x;
	//distance (in elements) between two rows handled by the same thread
	int stride = blockDim.x * col;
	int tmpV;
	int size = row * col;

	for(int j = 0; j < col; j++)
	{
		//(re)initialize tmpCol for the next column
		tmpCol[tid] = 0;
		__syncthreads();

		//get the jth col: each thread accumulates rows tid, tid + blockDim.x, ...
		for(int k = 0; k < size; k += stride)
		{
			tmpV = tid * col + j + k;
			if(tmpV < size){
				tmpCol[tid] += _iarray[tmpV];
			}
		}
		//do something with tmpCol;
	}
}

Lev December 9, 2010, 1:26am 3

Transpose your array. It may be better to ask in the CUDA programming forum. Or have you resolved the problem by now?
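To illustrate the transpose suggestion, here is a minimal sketch, not code from the original post. It assumes the array is stored row-major, as the indexing `tid * col + j` in getCols implies. After the transpose, each original column is a contiguous row, so neighbouring threads can read neighbouring addresses. The names `transposeTiled`, `transposeOnGpu` and `TILE_DIM` are made up for this example, and error checking on the CUDA calls is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical tile width (not from the original post). A 32 x 32 block needs
// 1024 threads per block, i.e. compute capability 2.0 or later.
#define TILE_DIM 32

// Transposes a row-major rows x cols matrix: out[c * rows + r] = in[r * cols + c].
__global__ void transposeTiled(const float *in, float *out, int rows, int cols)
{
    // The extra column of padding avoids shared memory bank conflicts
    // when the tile is read column-wise below.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int c = blockIdx.x * TILE_DIM + threadIdx.x;  // column in the input
    int r = blockIdx.y * TILE_DIM + threadIdx.y;  // row in the input
    if (r < rows && c < cols)
        tile[threadIdx.y][threadIdx.x] = in[r * cols + c];  // consecutive threads read consecutive addresses

    __syncthreads();  // reached by every thread in the block

    // Swap the block indices so that the write is contiguous as well.
    int tc = blockIdx.y * TILE_DIM + threadIdx.x;  // column in the output (input row)
    int tr = blockIdx.x * TILE_DIM + threadIdx.y;  // row in the output (input column)
    if (tr < cols && tc < rows)
        out[tr * rows + tc] = tile[threadIdx.x][threadIdx.y];
}

// Host helper (assumption for this sketch): copies the matrix to the GPU,
// transposes it there and returns the cols x rows result.
std::vector<float> transposeOnGpu(const std::vector<float> &h_in, int rows, int cols)
{
    size_t bytes = h_in.size() * sizeof(float);
    float *d_in = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE_DIM, TILE_DIM);
    dim3 grid((cols + TILE_DIM - 1) / TILE_DIM, (rows + TILE_DIM - 1) / TILE_DIM);
    transposeTiled<<<grid, block>>>(d_in, d_out, rows, cols);

    std::vector<float> h_out(h_in.size());
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // also waits for the kernel
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With the transposed array, column j of the original starts at offset `j * row` and its elements are adjacent in memory, so a loop like the one in getCols could index `j * row + tid + k` with `k` stepping by `blockDim.x`. Whether the one-off transpose pays for itself depends on how often the columns are read.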
