Desperate help needed! Sum rows of matrix

Hi All, I need some help here desperately.

I wrote the following kernel:

extern "C"

__global__ void sumRok(double* matrix, double *output, unsigned int rows, unsigned int cols){

	__shared__ double tile[WARP_SIZE][2*WARP_SIZE+1];

	int rIndex = blockIdx.x * BLOCK_SIZE + threadIdx.x;

	int cIndex = blockIdx.y * WARP_SIZE + threadIdx.y;

	

	tile[threadIdx.y][threadIdx.x] = 0;

	tile[threadIdx.y][threadIdx.x + (WARP_SIZE +1)] = 0;

	bool alto = true;

	if (cIndex < cols)

		for (int i = 0; i < BLOCK_SIZE; i+=WARP_SIZE){

			if (rIndex+i < rows) {

				if (alto)

					tile[threadIdx.y][threadIdx.x] += matrix[rIndex+i+rows*cIndex];

				else tile[threadIdx.y][threadIdx.x + (WARP_SIZE +1)] += matrix[rIndex+i+rows*cIndex];

				alto = !alto;

			} else break;

		}

	__syncthreads();

	

	

	unsigned int bound = WARP_SIZE +1;

	while (bound > 0){

		tile[threadIdx.y][threadIdx.x] += tile[threadIdx.y][threadIdx.x+bound];

		bound >>= 1;

	}

	if (threadIdx.x == 0) output[cIndex * REQ(rows, BLOCK_SIZE) + blockIdx.x] = tile[threadIdx.y][0];

}

The purpose of this kernel is to sum the rows of a matrix. There are a few things that I ought to explain:

in order to speed up, this kernel reads and adds (BLOCK_SIZE/WARP_SIZE) rows into shared memory of 2*WARP_SIZE tiles;

The shared memory is actually 2*WARP_SIZE+1 in width, with the added 1 acting as padding to avoid bank conflict;

The threads then reduce 2*WARP_SIZE doubles to a single value. Here no syncthreads is used because (I think) the threads are always synchronized in the same warp (each warp handles a row in the shared memory).

Since a lot of replies said that it is hard to understand the kernel, so here is some explanation:

suppose my warp size is 32, and I set the block size to 512. To run this kernel, I always use 32*32 blocks so that it matches the warp size.

first, a 32*65 shared memory is allocated

then each thread reads 16 entries of the matrix (32 rows apart – this read spans over 512 rows with 32 rows of threads) and add them to the shared memory. The 0th thread in the 0th row adds entries alternatively in the 0th space in the 0th row of shared memory and the 0+32+1 = 33th space. Notice the matrix is in column major format, hence to add values in the same column of the original, I need to add entries in the same row in the shared mem.

After all the read is done, I have a shared memory tile of 32*65, with the middle column acts as padding to prevent bank conflict. Each used entry in the shared memory is already a sum of 8 entries in the given matrix.

Now each warp performs reduction (i suppose no sync needed) to add 64 values into 1, stored in the first col of the shared mem. The reason 64 is chosen is to use every of the 32 threads in the first iteration of reduction (hence the alto adding scheme).

The first col of shared mem is now output.

I then tried to run the kernel with the following:

unsigned int row = 100;

	unsigned int col = 20;

	double* matrixHost;

	matrixHost = (double*) malloc(row*col*sizeof(double));

	double* matrixDevice;

	cudaMalloc(&matrixDevice, row*col*sizeof(double));

	for (int i = 0; i < row*col; i++)

		matrixHost[i] = .01*i;

	cudaMemcpy(matrixDevice, matrixHost, row*col*sizeof(double), cudaMemcpyHostToDevice);

	double* resultHost;

	resultHost = (double*)malloc(col*sizeof(double));

	double* resultDevice;

	cudaMalloc(&resultDevice, col*sizeof(double));

	sumRok<<<REQ(col, WARP_SIZE), REQ(row, BLOCK_SIZE)>>>(matrixDevice, resultDevice, row, col);

	cudaMemcpy(resultHost, resultDevice, col*sizeof(double), cudaMemcpyDeviceToHost);

	printArray(resultHost, col);

	int hold; cin >> hold;

It gives me junk. The compiler is used with arch=SM_20. win7 x64, sdk 3.2

What am I doing wrong here?

UPDATE: I took the advice and started using printf. it seems that the reading stage is fine, but then I added the following:

...

        unsigned int bound = WARP_SIZE +1;

	while (bound > 0){

		tile[threadIdx.y][threadIdx.x] += tile[threadIdx.y][threadIdx.x+bound];

		

		printf("bound=%d tile(%d,%d)=%f\n",bound, threadIdx.y, threadIdx.x,tile[threadIdx.y][threadIdx.x]);

		bound >>= 1;

	}

...

It gave me this:

bound=33 tile(0,0)=1.920000

bound=16 tile(-1,0)=131074.014116

bound=8 tile(-1,0)=-17592198672381.984000

bound=4 tile(-1,0)=-18691711096573.984000

bound=2 tile(-1,0)=-36283909900029.984000

bound=1 tile(-1,0)=149344859335460940000000.000000

-6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27

744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+

066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066,

-6.27744e+066, -6.27744e+066, -6.27744e+066, -6.27744e+066,

Notice that tile(-1, 0). What the hell is going on?

Try declaring tile as volatile.

I tried but it still gives junk. What does volatile do?

The volatile tells the compiler that explicit load and store instructions must be generated for each access of the variable. This prevents any optimizations which would result in an intermediate result being held in register (which breaks implicit synchronization between threads in a warp).

Thanks Avidday, is there anything else you can think of? Anything you can see within the code?

If anyone can help me on this one, it d be greatly appreciated. I am running against a deadline and I am dead stuck

Hi,

tile[threadIdx.y][threadIdx.x + (WARP_SIZE +1)] = 0;

....

tile[threadIdx.y][threadIdx.x] += matrix[rIndex+i+rows*cIndex];

I think this looks a bit weird. You don’t reset tile[threadIdx.y][threadIdx.x] but then use it.

try to reset like this

tile[threadIdx.y][threadIdx.x + (WARP_SIZE +1)] = 0;

tile[threadIdx.y][threadIdx.x] = 0;

....

Didn’t I reset it before I reset the tile[threadIdx.y][threadIdx.x + (WARP_SIZE +1)] ?

Sorry :( my bad… lack of sleep I guess… :)

anyone? Please??

Your kernel is too complex for me to understand, so I won’t comment on it.

Neither do I understand what you mean by malloc’ing arrays in the host memory and passing the pointers to the kernel as device global pointers. If this is indeed what you’re doing, then I would suggest that you familiarize yourself with the differences between host and device memory in CUDA, and specifically CUDA memory management functions, such as cudaMalloc, cudaMemcpy, etc.

Also, error condition must be checked after every CUDA operation on the host. This is not a matter of style, but rather a survival technique. You don’t seem to do this after the kernel launch.

O! You are right, my main method is wrong. How stupid. But, this is not the source of the problem; it gave me junk even before I wrote a main for this file. I was using jcuda and it gave me junk so I decided to debug it in C.

I have updated the main and, it still didn;t get me the right answer.

Since I am noob, how do I check error conditions? is there any other debugging tips you could spare me?

if you have a Fermi card, just use printf in kernel to debug.
make sure every step does what you want.

sorry, I still cannot understand your method, so cannot give you comments.

Thanks, i will try that. I also added some explanation in the main post, if you care to read it again.

How do I printf for multiple threads? It seems to only printf for the first thread.

I would look into examples, that came with CUDA SDK: those normally feature error checks.

please check section B.14 in the CUDA programming Guide 3.2 or 4.0

Hi - is this issue still relevant? which card and CUDA version are you using?
The -1 as threadIdx.x (or y) is obviously indicating some sort of a problem in your code.
If your WARP_SIZE is 32 so as you say you allocate a 32*65 array of doubles in shared memory - this might indicate
a overflow in shared memory.
On pre-fermi (or per version 3.x I think) this issue was not checked.
Try to reduce your WARP_SIZE to 4 or 8 maybe (so that the total shared memory is at least less than 16K - remember
to multiple WARP_SIZE * ( 2 * WARP_SIZE + 1 ) * sizeof( double) ).

hope that helps,
eyal

Thanks. I was using 3.2 with a GTX460. After switching to 4.0, the problem seems to go away. However, as I am using jcuda, I don’t think it is compatible with 4.0 yet… that’s probably irelevant.

I will try reducing WARP_SIZE for 3.2(I think it does have a check, I have encountered compilation error of using too much before)