Newbie question about cublas

Hi,
I am currently using Intel tools (MKL and TBB) and I am considering moving to CUDA since the performance isn't good enough.
My app mostly does a lot of BLAS calls on small data structures (matrices of up to ~100x100). With Intel's tools I use single-threaded MKL calls and run them from multiple threads with TBB; this utilizes the CPU better, because the default threaded version of MKL isn't helpful when the data structures are this small.
Is there any way to do something similar with CUBLAS? Since I am new to CUDA I am not really familiar with the terminology, but I guess this would mean running multiple streams/kernels and having every stream/kernel issue multiple BLAS calls (I may be wrong, of course).
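To give a concrete idea, the CPU side currently looks roughly like this (a simplified sketch; the actual calls and names in my code differ):

    #include <tbb/parallel_for.h>
    #include <mkl.h>

    // Many independent small BLAS calls, parallelized over the calls with TBB.
    // MKL itself runs single-threaded per call (sequential MKL or MKL_NUM_THREADS=1).
    void run_all(int count, int n, float **A, float **B, float **C)
    {
        tbb::parallel_for(0, count, [=](int i) {
            cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n, 1.0f, A[i], n, B[i], n, 0.0f, C[i], n);
        });
    }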

Thanks.

CUBLAS probably won't be useful for you if your matrices are only 100x100. The PCI-e bus transfer overhead will dominate the total compute time and any speed-up over the CPU will be lost (a 100x100 single-precision matrix is only ~40 KB, so the per-transfer latency alone can exceed the arithmetic time). Also, the CUBLAS API is a host-side API; it can't be called from inside compute kernels running on the device.

Thanks.
My app is iterative, and I plan to keep the matrices in device memory, so PCI transfer time should not be a problem.
The host-API thing is a problem, though… I guess this means I have to implement the BLAS functions myself.

If you can keep the data on the device, then you have a better chance.

CUBLAS is highly optimized, so reimplementing its routines would be a waste of time unless you have very specific data that lets you take shortcuts.
CUBLAS is a host API, but the data (vectors, matrices) are expected to be device pointers. So once you have pushed your data to the GPU, you can call CUBLAS routines successively without shuffling data back and forth between GPU and CPU.

To get the most out of small data sets, you should definitely try to run concurrent kernels (using streams) on independent data (available only on the Fermi architecture) to maximize your GPU occupancy.
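Something along these lines, using the legacy CUBLAS API (just a sketch; sizes, names and error checking are placeholders):

    #include <cublas.h>

    // Push the matrix and vector to the GPU once, then call CUBLAS repeatedly on the
    // device pointers; nothing crosses PCI-e again until the final copy-back.
    void iterate(int n, const float *h_A, const float *h_x, float *h_y, int iters)
    {
        float *d_A, *d_x, *d_y;
        cublasInit();
        cublasAlloc(n * n, sizeof(float), (void **)&d_A);
        cublasAlloc(n,     sizeof(float), (void **)&d_x);
        cublasAlloc(n,     sizeof(float), (void **)&d_y);
        cublasSetMatrix(n, n, sizeof(float), h_A, n, d_A, n);
        cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);

        for (int i = 0; i < iters; ++i)
            cublasSgemv('N', n, n, 1.0f, d_A, n, d_x, 1, 0.0f, d_y, 1);  // y = A*x, all on the device

        cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);  // copy the result back once at the end
        cublasFree(d_A); cublasFree(d_x); cublasFree(d_y);
        cublasShutdown();
    }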

Oh, I thought that CUBLAS doesn't support streams.
I'll elaborate a little about what I need.
I am running a genetic algorithm which optimizes a certain objective function. I have about 400 agents, and each one runs 10,000-50,000 evaluations of the objective function. The objective function is stateful (each invocation depends on all the previous ones).
The objective function has 3 stages: stages 1a-1c can run concurrently, stage 2 must run after stages 1a-1c, and stage 3 must run after stage 2.
Each stage performs a dot product and 4 matrix-vector products, and sums the results of all 5 calls. The right-hand vector operands of all the calls are the same.
Matrix and vector sizes vary from 25x25 to 100x100.

Here is what I thought of doing (a rough launch sketch follows the list):

  • For each sample, run all agents concurrently
  • Write 2 kernels, or use CUBLAS functions: one for the matrix-vector product and one for the dot product
  • Call 15 concurrent kernels at the beginning: one for each of the actions in stages 1a-1c
  • Wait for all kernels to end
  • Call 5 concurrent kernels which run stage 2
  • Wait for all kernels to finish
  • Call 5 concurrent kernels which run stage 3
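Roughly, this is the launch structure I have in mind (just a sketch; the kernel name, grid/block sizes, and operand arrays are placeholders):

    #include <cuda_runtime.h>

    // A matrix-vector kernel, either hand-written or replaced by a CUBLAS call.
    __global__ void matvec_kernel(const float *A, const float *x, float *y, int n);

    void run_sample(float **d_A, float **d_x, float **d_y, int n)
    {
        cudaStream_t streams[15];
        for (int i = 0; i < 15; ++i)
            cudaStreamCreate(&streams[i]);

        // Stages 1a-1c: 15 independent launches, one per stream
        for (int i = 0; i < 15; ++i)
            matvec_kernel<<<n, 64, 0, streams[i]>>>(d_A[i], d_x[i], d_y[i], n);
        cudaThreadSynchronize();            // wait for all of stages 1a-1c

        // Stage 2: 5 launches that depend on the stage-1 results
        for (int i = 0; i < 5; ++i)
            matvec_kernel<<<n, 64, 0, streams[i]>>>(d_A[15 + i], d_x[15 + i], d_y[15 + i], n);
        cudaThreadSynchronize();

        // Stage 3: 5 launches that depend on the stage-2 results
        for (int i = 0; i < 5; ++i)
            matvec_kernel<<<n, 64, 0, streams[i]>>>(d_A[20 + i], d_x[20 + i], d_y[20 + i], n);
        cudaThreadSynchronize();

        for (int i = 0; i < 15; ++i)
            cudaStreamDestroy(streams[i]);
    }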

Regarding the kernel design itself, the only thing I could find was this:
http://www.google.com/url?sa=t&source=web&cd=1&sqi=2&ved=0CBYQFjAA&url=http%3A%2F%2Fwww.worldscinet.com%2Fppl%2Fmkt%2Fpreserved-docs%2F1804%2FS0129626408003545.pdf&rct=j&q=cuda%20matrix%20vector%20multiplication&ei=Ev3oTO2aEsjtsgbPqfWbCw&usg=AFQjCNGycTBTbWZGXKwNJmraDxEkGJpZ4w&cad=rja

This is totally unreadable… I also applied as a registered developer to try to look at the CUBLAS source code.
I am also not sure which memory optimizations I should use. The only thing I am fairly sure about is placing the right-hand vector operands in constant memory, since all the participating threads read them. I am also not sure about block sizes.
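As a first attempt at the matrix-vector kernel itself, I am thinking of something like this (a sketch under my own assumptions: row-major storage, a power-of-two block size, and at most 100 elements in the shared vector):

    // Right-hand vector shared by all products; 100 floats assumed as the upper bound.
    __constant__ float c_x[100];

    // One block per matrix row; each thread accumulates part of the row's dot product
    // with c_x, then a shared-memory reduction combines the partial sums.
    // blockDim.x must be a power of two for the reduction loop below.
    __global__ void matvec_const(const float *A, int n, float *y)
    {
        extern __shared__ float partial[];
        int row = blockIdx.x;
        float sum = 0.0f;
        for (int j = threadIdx.x; j < n; j += blockDim.x)
            sum += A[row * n + j] * c_x[j];
        partial[threadIdx.x] = sum;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            y[row] = partial[0];
    }

    // Host side: copy the shared vector once, then launch with one block per row, e.g.
    //   cudaMemcpyToSymbol(c_x, h_x, n * sizeof(float));
    //   matvec_const<<<n, 64, 64 * sizeof(float)>>>(d_A, n, d_y);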

Maybe you want to use cublasSetKernelStream, which has been available since CUBLAS 3.1.
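For example, something like this (legacy CUBLAS API; device pointers set up elsewhere, error checking omitted):

    #include <cuda_runtime.h>
    #include <cublas.h>

    // Two independent matrix-vector products issued into two streams; on Fermi they can
    // execute concurrently. d_A1, d_A2, d_x, d_y1, d_y2 are device pointers.
    void two_concurrent_gemvs(int n, const float *d_A1, const float *d_A2,
                              const float *d_x, float *d_y1, float *d_y2)
    {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        cublasSetKernelStream(s1);   // subsequent CUBLAS launches go into s1
        cublasSgemv('N', n, n, 1.0f, d_A1, n, d_x, 1, 0.0f, d_y1, 1);

        cublasSetKernelStream(s2);   // switch to s2 so the second gemv can overlap the first
        cublasSgemv('N', n, n, 1.0f, d_A2, n, d_x, 1, 0.0f, d_y2, 1);

        cudaThreadSynchronize();     // wait for both before using d_y1 / d_y2

        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }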