Matrix multiply reduction

I am trying to multiply two arrays x & y of height 150 with a matrix of width 1313 and height 300. After multiplying the rows of the matrix with x and y, I need to sum each of the columns, resulting in a 1313 by 1 vector.

After going through the reduction and matrix multiplication examples in the SDK, I tried doing this:

recon_reduce<<< num_blocks, threads>>>(mtx_imd, xsh_d, ysh_d, fac_d); //num_blocks = 83, threads = 16, 83*16=1328

__global__ void recon_reduce(float* im, float* x, float* y, float* out)
{
	int tid = threadIdx.x + blockIdx.x*els_per_block; //els_per_block = 150
	float val;
	const int iters = 150;
	const int nact = 1313;

#pragma unroll
	for(int i = 0; i < iters; i++)
	{
		val += (im[2*nact*(tid+i)] * x[i]) + (im[(2*(tid+i)+1)*nact] * y[i]);
	}

	if(threadIdx.x == 0)
	{
		out[blockIdx.x] = val;
	}
}

Two questions:

  1. When I debug the kernel, the variable val has the correct summed value of all the columns in the matrix. But when the execution reaches the line “out[blockIdx.x] = val;” I see the following error:
warning: single stepping a divergent thread.

Single stepping the warp until the divergent thread becomes active.  If the thread never becomes active, this operation may not complete.

Program received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
  2. Currently, I am using only 1 thread to do the multiply-add (the profiler shows that this takes 225 us). If I use all 16 threads for the task, do I need to instantiate 16 arrays to store the intermediate summed values?

Suggestions? Thanks in advance.

Whether that code works or not (and I doubt it actually does what you think it does), by using one thread per block you are limiting yourself to 1/32 of the peak throughput per warp, which is a huge efficiency loss. Also, if the intention is to have each block compute a single inner product of a matrix row with the stacked vectors, why are you only launching 83 blocks? Shouldn’t there be 1313 blocks?

To answer the second question, how to use multiple threads per block depends on the storage order of the matrix. If it is row-major, then you can divide the inner product over a number of threads and then use a shared memory reduction to compute the final dot product. If the matrix is in column-major order, then memory coalescing means you should probably use one thread for each dot product.
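
To make the row-major option concrete, a minimal (untested) sketch of that shared memory reduction could look like the following, with one block per matrix row. The kernel name, the template parameter and the width argument are placeholders, not anything from your code:

// Hypothetical sketch: one block per matrix row; the block's threads
// accumulate strided partial sums of the dot product, then combine
// them with a shared memory tree reduction.
template <int THREADS>
__global__ void dot_rows(const float* m, const float* v, float* out, int width)
{
	__shared__ float buf[THREADS];
	const float* row = m + blockIdx.x * width;

	// Each thread accumulates a strided partial sum over the row.
	float val = 0.0f;
	for(int i = threadIdx.x; i < width; i += THREADS)
		val += row[i] * v[i];
	buf[threadIdx.x] = val;
	__syncthreads();

	// Tree reduction in shared memory.
	for(int s = THREADS/2; s > 0; s >>= 1)
	{
		if(threadIdx.x < s)
			buf[threadIdx.x] += buf[threadIdx.x + s];
		__syncthreads();
	}

	if(threadIdx.x == 0)
		out[blockIdx.x] = buf[0];
}

That would be launched as something like dot_rows<64><<<nrows, 64>>>(m, v, out, width); the reduction at the end is the same pattern as the reduction example in the SDK.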

You can read a bit of brainstorming on this sort of gemv type operation in this thread if you are interested.

@avidday: the variable val stores the sum for one column, so I should be using 1313 blocks. Thanks for that pointer.

Since it’s column-major I am using 1 thread per product. (but doesn’t that waste time as well?)

I did see your thread on gemv and I have a question regarding the profile info. How did you get the global memory read throughput counter enabled? I cannot see that counter mentioned in the Compute Visual Profiler documentation.

Also, how do I write the value from ‘val’ to ‘out[blockIdx.x]’?

I see “Program received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.”

Your tid calculations are still broken, so you will need to fix that before anything has a hope of working. I am pretty confident that your indexing into the matrix is also broken.

Why would it waste time, if you use a sensible number of threads per block (hint: 1 isn’t sensible, but 32 or 64 is)? You will need to adjust the number of blocks accordingly.

Those profiling results were done a while ago with an older version of the toolkit. It might be that it has changed in more recent versions of the profiler, but I don’t have access to either hardware or documentation right now to check.

Changed the tid indexing to realize a column-major calculation:

recon_reduce<<< 1313, threads>>>(mtx_imd, xsh_d, ysh_d, fac_d); //const int threads = 1;

__global__ void recon_reduce(float* im, float* x, float* y, float* out)
{
	__shared__ float val1[threads];
	int tid = threadIdx.x + blockIdx.x*blockDim.x;
	float val;
	const int iters = 150;
	const int nact = 1313;

#pragma unroll
	for(int i = 0; i < iters; i++)
	{
		val += (im[2*nact*(tid+i)] * x[i]) + (im[(2*(tid+i)+1)*nact] * y[i]);
	}

	out[blockIdx.x] = val;
}

In the CUDA debugger, I check the value of variable ‘val’ and it sums up all the columns of the matrix correctly. The error occurs when I try to write the value of ‘val’ to ‘out’

My debug output:

[Launch of CUDA Kernel 0 (recon_reduce) on Device 0]

[Switching to CUDA Kernel 0 (<<<(0,0),(0,0,0)>>>)]

Breakpoint 1, recon_reduce<<<(1313,1),(1,1,1)>>> (im=0xfc00000000, x=0xfc001a0000, y=0xfc001a0400, out=0x2b86c83b8000)

    at recon_mtx_150_kernel.cu:50

50		int tid = threadIdx.x + blockIdx.x*blockDim.x;

(cuda-gdb) n

56			for(int i = 0; i < iters; i++)

(cuda-gdb) n

60				val += (im[2*nact*(tid+i)] * x[i]) + (im[(2*(tid+i)+1)*nact] * y[i]);

(cuda-gdb) n

56			for(int i = 0; i < iters; i++)

(cuda-gdb) p val

$2 = 7362.4541 // correct value

(cuda-gdb) step 298

56			for(int i = 0; i < iters; i++)

(cuda-gdb) p val

$6 = 170347776 // correct value

(cuda-gdb) n

67			out[blockIdx.x] = val;

(cuda-gdb) n

Program received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.

recon_reduce<<<(1313,1),(1,1,1)>>> (im=0xfc00000000, x=0xfc001a0000, y=0xfc001a0400, out=0x2b86c83b8000)

    at recon_mtx_150_kernel.cu:92

92	}

Also, is it possible to use multiple threads when doing a column-major calculation? I initially thought that the 1313 columns could be divided into 83 blocks with 16 threads each; this way each block would calculate 16 columns in parallel. Is this correct?

There are still two glaring errors in that code. You never initialise val before you use it, and you are not writing the result into the correct index in out.

Look at the value of out compared to the other device pointers. I would guess you are either passing a host pointer as out, or using something that isn’t allocated. In any case the error is in the calling host code.
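
For comparison, the host-side setup for out should look something like this minimal sketch with error checking (the buffer name fac_d and the size are taken from your launch code; the rest is illustrative):

// Hypothetical host-side setup: allocate the output buffer on the
// device and check the result, so a host pointer is never passed
// to the kernel as 'out' by accident.
float *fac_d = NULL;
cudaError_t err = cudaMalloc((void**)&fac_d, 1313 * sizeof(float));
if(err != cudaSuccess)
{
	fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
	return 1;
}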

Of course. But the warp size is 32 in CUDA. You should always use a round multiple of the warp size for the number of threads per block. 16 threads per block leaves 50% of the computational resources per warp idle.

@avidday: Thanks for the hints.

Rectified a memory allocation error :(

Changed kernel to run 32 threads:

const int threads = 32;
const int num_blocks = (nac+threads-1)/threads; //nac = 1313

recon_reduce<<< num_blocks, threads>>>(mtx_imd, xsh_d, ysh_d, fac_d); //42 blocks, 32 threads

__global__ void recon_reduce(float* im, float* x, float* y, float* out)
{
	int tid = threadIdx.x + blockIdx.x*blockDim.x;
	float val = 0;
	const int iters = 150;
	const int nact = 1313;

	if(tid >= nact) return; //guard: 42*32 = 1344 threads, but only 1313 columns

#pragma unroll
	for(int i = 0; i < iters; i++)
	{
		val += (im[2*nact*i+tid] * x[i]) + (im[(2*i+1)*nact+tid] * y[i]);
	}

	out[tid] = val;
}

The timed execution of this kernel using ‘cudaEventRecord’ results in 0.43130 ms, and the kernel profile shows GPU Time = 265.856 us and CPU Time = 290 us. Also, I see 177 errors when comparing the GPU results with the CPU results.
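
For reference, the timing uses the standard cudaEvent idiom, roughly like this sketch:

// Time the kernel with CUDA events (elapsed time is reported in ms).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
recon_reduce<<< num_blocks, threads>>>(mtx_imd, xsh_d, ysh_d, fac_d);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);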

Any suggestions regarding reducing the execution time and eliminating the errors?

Thanks in advance :)

When you say “errors” do you really mean “minor discrepancies which can be explained by the nature of single precision floating point arithmetic”, or something else?

As for performance, there are some things that might speed it up. Right now the kernel has more integer operations per dot product than floating point ones, so there is probably some low hanging fruit there. Also you might want to investigate using shared memory to buffer the x and y vectors. On older hardware it probably should be faster; on Fermi the cache structure might get you most of the way there automagically.
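
For example, a hypothetical variant of your kernel that stages x and y in shared memory might look like this (an untested sketch, reusing the sizes and indexing from your code):

// Sketch: cooperatively load x and y into shared memory once per
// block, then reuse them from on-chip storage in the inner loop.
__global__ void recon_reduce_smem(const float* im, const float* x,
                                  const float* y, float* out)
{
	const int iters = 150;
	const int nact = 1313;
	__shared__ float xs[iters];
	__shared__ float ys[iters];

	// Strided cooperative load of the two vectors.
	for(int i = threadIdx.x; i < iters; i += blockDim.x)
	{
		xs[i] = x[i];
		ys[i] = y[i];
	}
	__syncthreads();

	int tid = threadIdx.x + blockIdx.x*blockDim.x;
	if(tid >= nact) return; // same bounds guard as before

	float val = 0.0f;
	for(int i = 0; i < iters; i++)
	{
		val += (im[2*nact*i+tid] * xs[i]) + (im[(2*i+1)*nact+tid] * ys[i]);
	}

	out[tid] = val;
}

Whether this wins anything over the L1 cache on a C2050 is something you would have to measure.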

Ultimately the PTX will tell you more than I can about how good a job you and the compiler are doing, but this is a small problem without a lot of FLOPs in it, so don’t expect miracles.

@avidday: the errors are accuracy differences in the last two digits of the values, which may be normal. Thanks for the shared memory suggestion, will look into that.

The timing 0.43130 ms is obtained on a Fermi C2050. Right now I am computing the reconstruction matrix for only 150 images (300x1313). Eventually, I need to scale to 1800 images. In that case, I will have (3600x1313), which will be a bigger problem size for the C2050.
Questions and observations:

  1. I was looking at the 25% occupancy rule mentioned in Section 4.3 of the CUDA C Best Practices Guide. It suggests running at least 192 threads on a Tesla C1060 and 384 threads on a Fermi C2050. However, for 150 images, the computation time increases when I use more than 32 threads.
  2. In the MY_GEMV example, you used global static kernels; does that improve performance?
  3. Also, when you say that this problem has more integer operations per dot product, can you elaborate?

Thanks in advance.
