Matrix multiplication with constant cache

We developed a version of matrix multiplication that exploits the bandwidth of constant memory.
The constant cache can reach about 2.4 TB/s if it is accessed with a favorable pattern: broadcast and sequential.
The algorithm computes A*B=C, where A and B are the input matrices and C is the output matrix.
We partition A into multiple 128x16 tiles and place these tiles in constant memory.
Because of the constant memory size limitation, we use only four 128x16 constant buffers and run 4 kernels concurrently.
We also optimized the code for specific input sizes (the width of B is 4K, 5K, …, 8K) (mm_const_xK).
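As a rough illustration of the idea (hypothetical names, not the actual mm_const code): a 128x16 float tile of A occupies 128*16*4 = 8 KB, so four such tiles use 32 KB of the 64 KB constant memory on a GTX 480. A kernel can read the tile with a broadcast pattern, since all threads in a warp that read the same constant address are served by one cache access:

```cuda
// Hypothetical sketch: one 128x16 tile of A kept in constant memory.
// Each thread accumulates one element of a 128-row slice of C = A * B.
#define TILE_ROWS 128
#define TILE_COLS 16

__constant__ float tileA[TILE_ROWS * TILE_COLS];  // 8 KB of the 64 KB budget

__global__ void mm_const_tile(const float *B, float *C, int widthB)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of B and C
    int row = blockIdx.y;                             // row within the A tile
    if (col >= widthB) return;

    float acc = 0.0f;
    // Every thread in the block reads the same tileA element on each
    // iteration, so the constant cache broadcasts it in a single access,
    // while the B reads are sequential and coalesced.
    for (int k = 0; k < TILE_COLS; ++k)
        acc += tileA[row * TILE_COLS + k] * B[k * widthB + col];

    // B points at the 16-row slab matching this tile; accumulate the
    // partial product into C across tile launches.
    C[row * widthB + col] += acc;
}
```

Host code would load each tile with cudaMemcpyToSymbol(tileA, ...) before the corresponding launch.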

To try the code, put mm_const into “NVIDIA_GPU_Computing_SDK/C/src/”, compile it with “make clean; make maxregisters=32”, and run “…/…/bin/linux/release/mm_const”.

Some results on a GTX 480 (CUBLAS 3.2 is a big improvement over CUBLAS 3.1):
A (128, 1024), B (131072, 128), C (131072, 1024): cublas3.2 (739 GFLOPS), const (1013 GFLOPS)
A (1024, 1024), B (131072, 1024), C (131072, 1024): cublas3.2 (726 GFLOPS), const (952 GFLOPS)
A (7168, 7168), B (7168, 7168), C (7168, 7168): cublas3.2 (823 GFLOPS), const (904 GFLOPS)

The source code, along with the slides from GTC 2010, can also be found on our Source-to-Source GPGPU Compiler website (http://code.google.com/p/gpgpucompiler/).
mm_const_release.zip (10.7 KB)

What about double-precision performance?

We didn’t implement a double-precision version.

You can watch the actual presentation here.

Hi Yangyi…,

Congratulations on this good work and also on the GPGPUCompiler work!

A short look at the code suggests to me that the overlap of memory copies and computation is the primary reason for the speedup, no?
Very smart one, though! Am I right?

Thanks,
Best Regards,
Sarnath

Thanks for your interest.

The main idea is to use the bandwidth of constant memory, but constant memory has some limitations. Overlapping memory copies with computation removes the size limitation of constant memory: you can send data to the small constant memory used by one kernel while another kernel is executing.
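A minimal sketch of that overlap, under the assumption of two constant buffers and two streams (hypothetical names; the real mm_const code uses four buffers): the asynchronous copy into one buffer can proceed while the kernel reading the other buffer is still running:

```cuda
// Hypothetical sketch of double-buffered constant memory with streams.
__constant__ float bufA0[128 * 16];
__constant__ float bufA1[128 * 16];

// Hypothetical tile kernel (declaration only); 'which' selects the buffer.
__global__ void mm_const_tile(int which, const float *B, float *C, int widthB);

void launch_tiles(const float *h_tiles,  // must be pinned (cudaMallocHost)
                  int numTiles, const float *dB, float *dC, int widthB)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    const size_t tileBytes = 128 * 16 * sizeof(float);
    dim3 block(256);
    dim3 grid((widthB + 255) / 256, 128);

    for (int t = 0; t < numTiles; ++t) {
        int i = t & 1;  // alternate between the two constant buffers
        // The async copy into buffer i overlaps the kernel issued in the
        // previous iteration, which reads the other buffer in the other
        // stream; overlap requires pinned host memory.
        cudaMemcpyToSymbolAsync(i ? bufA1 : bufA0,
                                h_tiles + (size_t)t * 128 * 16, tileBytes,
                                0, cudaMemcpyHostToDevice, s[i]);
        mm_const_tile<<<grid, block, 0, s[i]>>>(i, dB, dC, widthB);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```

Issuing the copy and the dependent kernel in the same stream keeps their ordering correct without explicit events.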

Hi Yang,

I don’t deny that fact. But I am just pointing out that the speedup may not be entirely due to “constant memory bandwidth”; it could have something to do with the “overlap” of memory copies and kernel execution. So it would be useful to do a shared memory implementation using “streams” and then profile which one fares better, just to make a fair claim. It is up to your team to decide on this.

Nonetheless, it is a smart idea and it is giving speedups. People should go and implement it in CUBLAS for higher matrix sizes.
Good work!

Best Regards,
Sarnath

Good point. If we want to make a full version, we will need to do this.

Yi
