Matrix multiplication with constant cache

We developed a version of matrix multiplication that exploits the bandwidth of constant memory.
The constant cache can reach about 2.4 TB/s if it is accessed with a favorable pattern: broadcast and sequential.
The algorithm computes A*B=C, where A and B are the input matrices and C is the output matrix.
We partition A into multiple 128x16 tiles and place these tiles in constant memory.
Because of the constant memory size limitation, we use only four 128x16 constant buffers and run 4 kernels concurrently.
We also optimized the code for specific input sizes (the width of B is 4K, 5K, …, 8K) (mm_const_xK).
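As a rough illustration of the idea (hypothetical names, not the actual mm_const code): a 128x16 float tile of A occupies 128*16*4 = 8 KB, so four such tiles use 32 KB of the 64 KB constant memory on a GTX 480. A kernel can read the tile with a broadcast pattern, since all threads in a warp that read the same constant address are served by one cache access:

```cuda
// Hypothetical sketch: one 128x16 tile of A kept in constant memory.
// Each thread accumulates one element of a 128-row slice of C = A * B.
#define TILE_ROWS 128
#define TILE_COLS 16

__constant__ float tileA[TILE_ROWS * TILE_COLS];  // 8 KB of the 64 KB budget

__global__ void mm_const_tile(const float *B, float *C, int widthB)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of B and C
    int row = blockIdx.y;                             // row within the A tile
    if (col >= widthB) return;

    float acc = 0.0f;
    // Every thread in the block reads the same tileA element on each
    // iteration, so the constant cache broadcasts it in a single access,
    // while the B reads are sequential and coalesced.
    for (int k = 0; k < TILE_COLS; ++k)
        acc += tileA[row * TILE_COLS + k] * B[k * widthB + col];

    // B points at the 16-row slab matching this tile; accumulate the
    // partial product into C across tile launches.
    C[row * widthB + col] += acc;
}
```

Host code would load each tile with cudaMemcpyToSymbol(tileA, ...) before the corresponding launch.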

To try the code, put mm_const into “NVIDIA_GPU_Computing_SDK/C/src/”, compile it with “make clean; make maxregisters=32”, and run “…/…/bin/linux/release/mm_const”.

Some results on a GTX 480 (CUBLAS 3.2 is a big improvement over CUBLAS 3.1):
A (128, 1024), B (131072, 128), C (131072, 1024): cublas3.2 (739 GFLOPS), const (1013 GFLOPS)
A (1024, 1024), B (131072, 1024), C (131072, 1024): cublas3.2 (726 GFLOPS), const (952 GFLOPS)
A (7168, 7168), B (7168, 7168), C (7168, 7168): cublas3.2 (823 GFLOPS), const (904 GFLOPS)

The source code, along with the slides from GTC 2010, can also be found on our Source-to-Source GPGPU Compiler website (http://code.google.com/p/gpgpucompiler/).
mm_const_release.zip (10.7 KB)

What about double-precision performance?

We didn’t implement a double-precision version.

You can watch the actual presentation here.

Hi Yangyi…,

Congratulations on this good work and also on the GPGPUCompiler work!

A short look at the code suggests to me that the overlap of memory copies and computation is the primary reason for the speedup, no?
Very smart one, though! Am I right?

Thanks,
Best Regards,
Sarnath

Thanks for your interest.

The main idea is to use the bandwidth of constant memory, but constant memory has some limitations. Overlapping memory copies with computation removes the size limitation of constant memory: you can send data to the small constant memory used by one kernel while another kernel is executing.
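A minimal sketch of that overlap, under the assumption of two constant buffers and two streams (hypothetical names; the real mm_const code uses four buffers): the asynchronous copy into one buffer can proceed while the kernel reading the other buffer is still running:

```cuda
// Hypothetical sketch of double-buffered constant memory with streams.
__constant__ float bufA0[128 * 16];
__constant__ float bufA1[128 * 16];

// Hypothetical tile kernel (declaration only); 'which' selects the buffer.
__global__ void mm_const_tile(int which, const float *B, float *C, int widthB);

void launch_tiles(const float *h_tiles,  // must be pinned (cudaMallocHost)
                  int numTiles, const float *dB, float *dC, int widthB)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    const size_t tileBytes = 128 * 16 * sizeof(float);
    dim3 block(256);
    dim3 grid((widthB + 255) / 256, 128);

    for (int t = 0; t < numTiles; ++t) {
        int i = t & 1;  // alternate between the two constant buffers
        // The async copy into buffer i overlaps the kernel issued in the
        // previous iteration, which reads the other buffer in the other
        // stream; overlap requires pinned host memory.
        cudaMemcpyToSymbolAsync(i ? bufA1 : bufA0,
                                h_tiles + (size_t)t * 128 * 16, tileBytes,
                                0, cudaMemcpyHostToDevice, s[i]);
        mm_const_tile<<<grid, block, 0, s[i]>>>(i, dB, dC, widthB);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```

Issuing the copy and the dependent kernel in the same stream keeps their ordering correct without explicit events.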

Hi Yang,

I don’t deny that fact. But I am just pointing out that the speedup may not be entirely due to “constant memory bandwidth”; it could have something to do with the “overlap” of memory copies and kernel execution. So it would be useful to do a shared memory implementation using “streams” and then profile which one fares better, just to make a fair claim. It is up to your team to decide on this.

Nonetheless, it is a smart idea and it is giving speedups. People should go and implement it in CUBLAS for higher matrix sizes.
Good work!

Best Regards,
Sarnath

Good point. If we want to make a full version, we will need to do this.

Yi
