Desperate help needed! Sum rows of matrix

xtiger1 · April 22, 2011, 3:50am

Hi All, I need some help here desperately.

I wrote the following kernel:

extern "C"

__global__ void sumRok(double* matrix, double *output, unsigned int rows, unsigned int cols){

	__shared__ double tile[WARP_SIZE][2*WARP_SIZE+1];

	int rIndex = blockIdx.x * BLOCK_SIZE + threadIdx.x;

	int cIndex = blockIdx.y * WARP_SIZE + threadIdx.y;

	

	tile[threadIdx.y][threadIdx.x] = 0;

	tile[threadIdx.y][threadIdx.x + (WARP_SIZE +1)] = 0;

	bool alto = true;

	if (cIndex < cols)

		for (int i = 0; i < BLOCK_SIZE; i+=WARP_SIZE){

			if (rIndex+i < rows) {

				if (alto)

					tile[threadIdx.y][threadIdx.x] += matrix[rIndex+i+rows*cIndex];

				else tile[threadIdx.y][threadIdx.x + (WARP_SIZE +1)] += matrix[rIndex+i+rows*cIndex];

				alto = !alto;

			} else break;

		}

	__syncthreads();

	

	

	unsigned int bound = WARP_SIZE +1;

	while (bound > 0){

		tile[threadIdx.y][threadIdx.x] += tile[threadIdx.y][threadIdx.x+bound];

		bound >>= 1;

	}

	if (threadIdx.x == 0) output[cIndex * REQ(rows, BLOCK_SIZE) + blockIdx.x] = tile[threadIdx.y][0];

}

The purpose of this kernel is to sum the rows of a matrix. There are a few things that I ought to explain:

in order to speed up, this kernel reads and adds (BLOCK_SIZE/WARP_SIZE) rows into shared memory of 2*WARP_SIZE tiles;

The shared memory is actually 2*WARP_SIZE+1 in width, with the added 1 acting as padding to avoid bank conflict;

The threads then reduce 2*WARP_SIZE doubles to a single value. Here no syncthreads is used because (I think) the threads are always synchronized in the same warp (each warp handles a row in the shared memory).

Since a lot of replies said that it is hard to understand the kernel, so here is some explanation:

suppose my warp size is 32, and I set the block size to 512. To run this kernel, I always use 32*32 blocks so that it matches the warp size.

first, a 32*65 shared memory is allocated

then each thread reads 16 entries of the matrix (32 rows apart – this read spans over 512 rows with 32 rows of threads) and add them to the shared memory. The 0th thread in the 0th row adds entries alternatively in the 0th space in the 0th row of shared memory and the 0+32+1 = 33th space. Notice the matrix is in column major format, hence to add values in the same column of the original, I need to add entries in the same row in the shared mem.

After all the read is done, I have a shared memory tile of 32*65, with the middle column acts as padding to prevent bank conflict. Each used entry in the shared memory is already a sum of 8 entries in the given matrix.

Now each warp performs reduction (i suppose no sync needed) to add 64 values into 1, stored in the first col of the shared mem. The reason 64 is chosen is to use every of the 32 threads in the first iteration of reduction (hence the alto adding scheme).

The first col of shared mem is now output.

I then tried to run the kernel with the following:

unsigned int row = 100;

	unsigned int col = 20;

	double* matrixHost;

	matrixHost = (double*) malloc(row*col*sizeof(double));

	double* matrixDevice;

	cudaMalloc(&matrixDevice, row*col*sizeof(double));

	for (int i = 0; i < row*col; i++)

		matrixHost[i] = .01*i;

	cudaMemcpy(matrixDevice, matrixHost, row*col*sizeof(double), cudaMemcpyHostToDevice);

	double* resultHost;

	resultHost = (double*)malloc(col*sizeof(double));

	double* resultDevice;

	cudaMalloc(&resultDevice, col*sizeof(double));

	sumRok<<<REQ(col, WARP_SIZE), REQ(row, BLOCK_SIZE)>>>(matrixDevice, resultDevice, row, col);

	cudaMemcpy(resultHost, resultDevice, col*sizeof(double), cudaMemcpyDeviceToHost);

	printArray(resultHost, col);

	int hold; cin >> hold;

It gives me junk. The compiler is used with arch=SM_20. win7 x64, sdk 3.2

What am I doing wrong here?

UPDATE: I took the advice and started using printf. it seems that the reading stage is fine, but then I added the following:

...

        unsigned int bound = WARP_SIZE +1;

	while (bound > 0){

		tile[threadIdx.y][threadIdx.x] += tile[threadIdx.y][threadIdx.x+bound];

		

		printf("bound=%d tile(%d,%d)=%f\n",bound, threadIdx.y, threadIdx.x,tile[threadIdx.y][threadIdx.x]);

		bound >>= 1;

	}

...

It gave me this:

bound=33 tile(0,0)=1.920000

bound=16 tile(-1,0)=131074.014116

bound=8 tile(-1,0)=-17592198672381.984000

bound=4 tile(-1,0)=-18691711096573.984000

bound=2 tile(-1,0)=-36283909900029.984000

bound=1 tile(-1,0)=149344859335460940000000.000000

-6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27

744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+

066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066,

-6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066,

Notice that tile(-1, 0). What the hell is going on?

avidday · April 22, 2011, 12:00pm

Try declaring tile as volatile.

xtiger1 · April 22, 2011, 12:05pm

I tried but it still gives junk. What does volatile do?

avidday · April 22, 2011, 12:28pm

The volatile tells the compiler that explicit load and store instructions must be generated for each access of the variable. This prevents any optimizations which would result in an intermediate result being held in register (which breaks implicit synchronization between threads in a warp).

xtiger1 · April 22, 2011, 1:16pm

Thanks Avidday, is there anything else you can think of? Anything you can see within the code?

If anyone can help me on this one, it d be greatly appreciated. I am running against a deadline and I am dead stuck

eyalhir74 · April 22, 2011, 1:19pm

Hi,

tile[threadIdx.y][threadIdx.x + (WARP_SIZE +1)] = 0;

....

tile[threadIdx.y][threadIdx.x] += matrix[rIndex+i+rows*cIndex];

I think this looks a bit weird. You don’t reset tile[threadIdx.y][threadIdx.x] but then use it.

try to reset like this

tile[threadIdx.y][threadIdx.x + (WARP_SIZE +1)] = 0;

tile[threadIdx.y][threadIdx.x] = 0;

....

xtiger1 · April 22, 2011, 1:28pm

Hi,
tile[threadIdx.y][threadIdx.x + (WARP_SIZE +1)] = 0;

....

tile[threadIdx.y][threadIdx.x] += matrix[rIndex+i+rows*cIndex];
I think this looks a bit weird. You don’t reset tile[threadIdx.y][threadIdx.x] but then use it.

try to reset like this
tile[threadIdx.y][threadIdx.x + (WARP_SIZE +1)] = 0;

tile[threadIdx.y][threadIdx.x] = 0;

....

Didn’t I reset it before I reset the tile[threadIdx.y][threadIdx.x + (WARP_SIZE +1)] ?

eyalhir74 · April 22, 2011, 1:39pm

Sorry :( my bad… lack of sleep I guess… :)

xtiger1 · April 22, 2011, 8:02pm

anyone? Please??

cudesnick · April 23, 2011, 4:45am

Your kernel is too complex for me to understand, so I won’t comment on it.

Neither do I understand what you mean by malloc’ing arrays in the host memory and passing the pointers to the kernel as device global pointers. If this is indeed what you’re doing, then I would suggest that you familiarize yourself with the differences between host and device memory in CUDA, and specifically CUDA memory management functions, such as cudaMalloc, cudaMemcpy, etc.

Also, error condition must be checked after every CUDA operation on the host. This is not a matter of style, but rather a survival technique. You don’t seem to do this after the kernel launch.

xtiger1 · April 23, 2011, 12:57pm

O! You are right, my main method is wrong. How stupid. But, this is not the source of the problem; it gave me junk even before I wrote a main for this file. I was using jcuda and it gave me junk so I decided to debug it in C.

I have updated the main and, it still didn;t get me the right answer.

Since I am noob, how do I check error conditions? is there any other debugging tips you could spare me?

LSChien · April 23, 2011, 1:23pm

if you have a Fermi card, just use printf in kernel to debug.
make sure every step does what you want.

sorry, I still cannot understand your method, so cannot give you comments.

xtiger1 · April 23, 2011, 1:51pm

Thanks, i will try that. I also added some explanation in the main post, if you care to read it again.

How do I printf for multiple threads? It seems to only printf for the first thread.

cudesnick · April 23, 2011, 6:17pm

I would look into examples, that came with CUDA SDK: those normally feature error checks.

LSChien · April 24, 2011, 12:44am

please check section B.14 in the CUDA programming Guide 3.2 or 4.0

eyalhir74 · April 24, 2011, 7:05pm

Hi - is this issue still relevant? which card and CUDA version are you using?
The -1 as threadIdx.x (or y) is obviously indicating some sort of a problem in your code.
If your WARP_SIZE is 32 so as you say you allocate a 32*65 array of doubles in shared memory - this might indicate
a overflow in shared memory.
On pre-fermi (or per version 3.x I think) this issue was not checked.
Try to reduce your WARP_SIZE to 4 or 8 maybe (so that the total shared memory is at least less than 16K - remember
to multiple WARP_SIZE * ( 2 * WARP_SIZE + 1 ) * sizeof( double) ).

hope that helps,
eyal

xtiger1 · April 24, 2011, 8:08pm

Thanks. I was using 3.2 with a GTX460. After switching to 4.0, the problem seems to go away. However, as I am using jcuda, I don’t think it is compatible with 4.0 yet… that’s probably irelevant.

I will try reducing WARP_SIZE for 3.2(I think it does have a check, I have encountered compilation error of using too much before)

Topic		Replies	Views
matrix multiply reduction CUDA Programming and Performance	41	35548	January 15, 2011
help with kernel synchronization? CUDA Programming and Performance	22	13900	August 26, 2010
Warp shuffle instruction not working as expected CUDA Programming and Performance	7	816	September 6, 2023
Using Shared Memory in CUDA C/C++ Technical Blog	36	1987	October 8, 2020
CUDA Warp Synchronization Problem CUDA Programming and Performance	5	2157	February 27, 2011
Efficient use of shared memory CUDA Programming and Performance	29	4459	December 2, 2019
Problem with correct branching within a warp CUDA Programming and Performance	23	15639	May 28, 2009
Matrix Multiplication In CUDA CUDA Programming and Performance	6	2540	May 11, 2015
why result varied based on different number of threads per block? CUDA Programming and Performance	8	1940	March 1, 2011
Matrix Multiplication Garbage value :( CUDA Programming and Performance	10	3405	July 25, 2009

Desperate help needed! Sum rows of matrix

Related topics