Matrix multiply reduction

I am trying to multiply two arrays x & y of height 150 with a matrix of width 1313 and height 300. After multiplying the rows of the matrix with x and y, I need to sum each of the columns, resulting in a 1313 by 1 vector.

After going through the reduction and matrix multiplication examples in the SDK, I tried doing this:

recon_reduce<<< num_blocks, threads>>>(mtx_imd, xsh_d, ysh_d, fac_d); //num_blocks = 83, threads = 16, 83*16=1328

__global__ void recon_reduce(float* im, float* x, float* y, float* out)
{
	int tid = threadIdx.x + blockIdx.x*els_per_block; //els_per_block = 150
	float val;
	const int iters = 150;
	const int nact = 1313;

#pragma unroll
	for(int i = 0; i < iters; i++)
	{
		val += (im[2*nact*(tid+i)] * x[i]) + (im[(2*(tid+i)+1)*nact] * y[i]);
	}

	if(threadIdx.x == 0)
	{
		out[blockIdx.x] = val;
	}
}

Two questions:

  1. When I debug the kernel, the variable val has the correct summed value of all the columns in the matrix. But when the execution reaches the line “out[blockIdx.x] = val;” I see the following error:
warning: single stepping a divergent thread.

Single stepping the warp until the divergent thread becomes active.  If the thread never becomes active, this operation may not complete.

Program received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
  2. Currently, I am using only 1 thread to do the multiply-add (the profiler shows that this takes 225 us). If I use all 16 threads for the task, do I need to instantiate 16 arrays to store the intermediate summed values?

Suggestions? Thanks in advance.

Whether that code works or not (and I doubt it actually does what you think it does), by using one thread per block you are limiting yourself to 1/32 of the peak throughput per warp, which is a huge efficiency loss. Also, if the intention is to have each block compute a single inner product of a matrix row with the stacked vectors, why are you only launching 83 blocks? Shouldn’t there be 1313 blocks?

To answer the second question, how to use multiple threads per block depends on the storage order of the matrix. If it is row-major, then you can divide the inner product over a number of threads and then use a shared memory reduction to compute the final dot product. If the matrix is in column-major order, then memory coalescing means you should probably use one thread for each dot product.
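
To make the row-major option concrete, a minimal (untested) sketch of that shared memory reduction could look like the following, with one block per matrix row. The kernel name, the template parameter and the width argument are placeholders, not anything from your code:

// Hypothetical sketch: one block per matrix row; the block's threads
// accumulate strided partial sums of the dot product, then combine
// them with a shared memory tree reduction.
template <int THREADS>
__global__ void dot_rows(const float* m, const float* v, float* out, int width)
{
	__shared__ float buf[THREADS];
	const float* row = m + blockIdx.x * width;

	// Each thread accumulates a strided partial sum over the row.
	float val = 0.0f;
	for(int i = threadIdx.x; i < width; i += THREADS)
		val += row[i] * v[i];
	buf[threadIdx.x] = val;
	__syncthreads();

	// Tree reduction in shared memory.
	for(int s = THREADS/2; s > 0; s >>= 1)
	{
		if(threadIdx.x < s)
			buf[threadIdx.x] += buf[threadIdx.x + s];
		__syncthreads();
	}

	if(threadIdx.x == 0)
		out[blockIdx.x] = buf[0];
}

That would be launched as something like dot_rows<64><<<nrows, 64>>>(m, v, out, width); the reduction at the end is the same pattern as the reduction example in the SDK.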

You can read a bit of brainstorming on this sort of gemv type operation in this thread if you are interested.

@avidday: the variable val stores the sum for one column, so I should be using 1313 blocks. Thanks for that pointer.

Since it’s column-major I am using 1 thread per product. (but doesn’t that waste time as well?)

I did see your thread on gemv and I have a question regarding the profile info. How did you get the global memory read throughput counter enabled? I cannot see that counter mentioned in the Compute Visual Profiler documentation.

Also, how do I write the value from ‘val’ to ‘out[blockIdx.x]’?

I see “Program received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.”

Your tid calculations are still broken, so you will need to fix that before anything has a hope of working. I am pretty confident that your indexing into the matrix is also broken.

Why would it waste time, if you use a sensible number of threads per block (hint: 1 isn’t sensible, but 32 or 64 is)? You will need to adjust the number of blocks accordingly.

Those profiling results were done a while ago with an older version of the toolkit. It might be that it has changed in more recent versions of the profiler, but I don’t have access to either hardware or documentation right now to check.

Changed the tid indexing to realize a column-major calculation:

recon_reduce<<< 1313, threads>>>(mtx_imd, xsh_d, ysh_d, fac_d); //const int threads = 1;

__global__ void recon_reduce(float* im, float* x, float* y, float* out)
{
	__shared__ float val1[threads];
	int tid = threadIdx.x + blockIdx.x*blockDim.x;
	float val;
	const int iters = 150;
	const int nact = 1313;

#pragma unroll
	for(int i = 0; i < iters; i++)
	{
		val += (im[2*nact*(tid+i)] * x[i]) + (im[(2*(tid+i)+1)*nact] * y[i]);
	}

	out[blockIdx.x] = val;
}

In the CUDA debugger, I check the value of variable ‘val’ and it sums up all the columns of the matrix correctly. The error occurs when I try to write the value of ‘val’ to ‘out’

My debug output:

[Launch of CUDA Kernel 0 (recon_reduce) on Device 0]

[Switching to CUDA Kernel 0 (<<<(0,0),(0,0,0)>>>)]

Breakpoint 1, recon_reduce<<<(1313,1),(1,1,1)>>> (im=0xfc00000000, x=0xfc001a0000, y=0xfc001a0400, out=0x2b86c83b8000)

    at recon_mtx_150_kernel.cu:50

50		int tid = threadIdx.x + blockIdx.x*blockDim.x;

(cuda-gdb) n

56			for(int i = 0; i < iters; i++)

(cuda-gdb) n

60				val += (im[2*nact*(tid+i)] * x[i]) + (im[(2*(tid+i)+1)*nact] * y[i]);

(cuda-gdb) n

56			for(int i = 0; i < iters; i++)

(cuda-gdb) p val

$2 = 7362.4541 // correct value

(cuda-gdb) step 298

56			for(int i = 0; i < iters; i++)

(cuda-gdb) p val

$6 = 170347776 // correct value

(cuda-gdb) n

67			out[blockIdx.x] = val;

(cuda-gdb) n

Program received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.

recon_reduce<<<(1313,1),(1,1,1)>>> (im=0xfc00000000, x=0xfc001a0000, y=0xfc001a0400, out=0x2b86c83b8000)

    at recon_mtx_150_kernel.cu:92

92	}

Also, is it possible to use multiple threads when doing a column-major calculation? I initially thought that the 1313 columns could be divided into 83 blocks with 16 threads each; this way each block would calculate 16 columns in parallel. Is this correct?

There are still two glaring errors in that code. You never initialise val before you use it, and you are not writing the result into the correct index in out.

Look at the value of out compared to the other device pointers. I would guess you are either passing a host pointer as out, or using something that isn’t allocated. In any case the error is in the calling host code.
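
For comparison, the host-side setup for out should look something like this minimal sketch with error checking (the buffer name fac_d and the size are taken from your launch code; the rest is illustrative):

// Hypothetical host-side setup: allocate the output buffer on the
// device and check the result, so a host pointer is never passed
// to the kernel as 'out' by accident.
float *fac_d = NULL;
cudaError_t err = cudaMalloc((void**)&fac_d, 1313 * sizeof(float));
if(err != cudaSuccess)
{
	fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
	return 1;
}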

Of course. But the warp size is 32 in CUDA. You should always use a round multiple of the warp size for the number of threads per block. 16 threads per block leaves 50% of the computational resources per warp idle.

@avidday: Thanks for the hints.

Rectified a memory allocation error :(

Changed kernel to run 32 threads:

const int threads = 32;
const int num_blocks = (nac+threads-1)/threads; //nac = 1313

recon_reduce<<< num_blocks, threads>>>(mtx_imd, xsh_d, ysh_d, fac_d); //42 blocks, 32 threads

__global__ void recon_reduce(float* im, float* x, float* y, float* out)
{
	int tid = threadIdx.x + blockIdx.x*blockDim.x;
	float val = 0;
	const int iters = 150;
	const int nact = 1313;

	if(tid >= nact) return; //guard: 42*32 = 1344 threads, but only 1313 columns

#pragma unroll
	for(int i = 0; i < iters; i++)
	{
		val += (im[2*nact*i+tid] * x[i]) + (im[(2*i+1)*nact+tid] * y[i]);
	}

	out[tid] = val;
}

The timed execution of this kernel using ‘cudaEventRecord’ results in 0.43130 ms, and the kernel profile shows GPU Time = 265.856 us and CPU Time = 290 us. Also, I see 177 errors when comparing the GPU results with the CPU results.
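
For reference, the timing uses the standard cudaEvent idiom, roughly like this sketch:

// Time the kernel with CUDA events (elapsed time is reported in ms).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
recon_reduce<<< num_blocks, threads>>>(mtx_imd, xsh_d, ysh_d, fac_d);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);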

Any suggestions regarding reducing the execution time and eliminating the errors?

Thanks in advance :)

When you say “errors” do you really mean “minor discrepancies which can be explained by the nature of single precision floating point arithmetic”, or something else?

As for performance, there are some things that might speed it up. Right now the kernel has more integer operations per dot product than floating point ones, so there is probably some low hanging fruit there. Also you might want to investigate using shared memory to buffer the x and y vectors. On older hardware it probably should be faster; on Fermi the cache structure might get you most of the way there automagically.
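
For example, a hypothetical variant of your kernel that stages x and y in shared memory might look like this (an untested sketch, reusing the sizes and indexing from your code):

// Sketch: cooperatively load x and y into shared memory once per
// block, then reuse them from on-chip storage in the inner loop.
__global__ void recon_reduce_smem(const float* im, const float* x,
                                  const float* y, float* out)
{
	const int iters = 150;
	const int nact = 1313;
	__shared__ float xs[iters];
	__shared__ float ys[iters];

	// Strided cooperative load of the two vectors.
	for(int i = threadIdx.x; i < iters; i += blockDim.x)
	{
		xs[i] = x[i];
		ys[i] = y[i];
	}
	__syncthreads();

	int tid = threadIdx.x + blockIdx.x*blockDim.x;
	if(tid >= nact) return; // same bounds guard as before

	float val = 0.0f;
	for(int i = 0; i < iters; i++)
	{
		val += (im[2*nact*i+tid] * xs[i]) + (im[(2*i+1)*nact+tid] * ys[i]);
	}

	out[tid] = val;
}

Whether this wins anything over the L1 cache on a C2050 is something you would have to measure.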

Ultimately the PTX will tell you more than I can about how good a job you and the compiler are doing, but this is a small problem without a lot of FLOPs in it, so don’t expect miracles.

@avidday: the errors are accuracy differences in the last two digits of the values, which may be normal. Thanks for the shared memory suggestion, will look into that.

The timing 0.43130 ms is obtained on a Fermi C2050. Right now I am computing the reconstruction matrix for only 150 images (300x1313). Eventually, I need to scale to 1800 images. In that case, I will have (3600x1313), which will be a bigger problem size for the C2050.
Questions and observations:

  1. I was looking at the 25% occupancy rule mentioned in Section 4.3 of the CUDA C Best Practices Guide. It suggests running at least 192 threads on a Tesla C1060 and 384 threads on a Fermi C2050. However, for 150 images, the computation time increases when I use more than 32 threads.
  2. In the MY_GEMV example, you used global static kernels; does that improve performance?
  3. Also, when you say that this problem has more integer operations per dot product, can you elaborate?

Thanks in advance.
