Newbie question about cublas

Hi,
I am currently using Intel tools (MKL and TBB) and I am considering moving to CUDA since the performance isn't good enough.
My app mostly does a lot of BLAS calls on small data structures (matrices of up to ~100x100). With Intel's tools I use single-threaded MKL calls and run them from multiple threads with TBB; this utilizes the CPU better, because the default threaded version of MKL isn't helpful when the data structures are this small.
Is there any way to do something similar with CUBLAS? Since I am new to CUDA I am not really familiar with the terminology, but I guess this would mean running multiple streams/kernels and having every stream/kernel issue multiple BLAS calls (I may be wrong, of course).
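To give a concrete idea, the CPU side currently looks roughly like this (a simplified sketch; the actual calls and names in my code differ):

    #include <tbb/parallel_for.h>
    #include <mkl.h>

    // Many independent small BLAS calls, parallelized over the calls with TBB.
    // MKL itself runs single-threaded per call (sequential MKL or MKL_NUM_THREADS=1).
    void run_all(int count, int n, float **A, float **B, float **C)
    {
        tbb::parallel_for(0, count, [=](int i) {
            cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n, 1.0f, A[i], n, B[i], n, 0.0f, C[i], n);
        });
    }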

Thanks.

CUBLAS probably won't be useful for you if your matrices are only 100x100. The PCI-e bus transfer overhead will dominate the total compute time and any speed-up over the CPU will be lost (a 100x100 single-precision matrix is only ~40 KB, so the per-transfer latency alone can exceed the arithmetic time). Also, the CUBLAS API is a host-side API; it can't be called from inside compute kernels running on the device.

Thanks.
My app is iterative, and I plan to keep the matrices in device memory, so PCI transfer time should not be a problem.
The host-API thing is a problem, though… I guess this means I have to implement the BLAS functions myself.

If you can keep the data on the device, then you have a better chance.

CUBLAS is highly optimized, so reimplementing its routines would be a waste of time unless you have very specific data that lets you take shortcuts.
CUBLAS is a host API, but the data (vectors, matrices) are expected to be device pointers. So once you have pushed your data to the GPU, you can call CUBLAS routines successively without shuffling data back and forth between GPU and CPU.

To get the most out of small data sets, you should definitely try to run concurrent kernels (using streams) on independent data (available only on the Fermi architecture) to maximize your GPU occupancy.
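Something along these lines, using the legacy CUBLAS API (just a sketch; sizes, names and error checking are placeholders):

    #include <cublas.h>

    // Push the matrix and vector to the GPU once, then call CUBLAS repeatedly on the
    // device pointers; nothing crosses PCI-e again until the final copy-back.
    void iterate(int n, const float *h_A, const float *h_x, float *h_y, int iters)
    {
        float *d_A, *d_x, *d_y;
        cublasInit();
        cublasAlloc(n * n, sizeof(float), (void **)&d_A);
        cublasAlloc(n,     sizeof(float), (void **)&d_x);
        cublasAlloc(n,     sizeof(float), (void **)&d_y);
        cublasSetMatrix(n, n, sizeof(float), h_A, n, d_A, n);
        cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);

        for (int i = 0; i < iters; ++i)
            cublasSgemv('N', n, n, 1.0f, d_A, n, d_x, 1, 0.0f, d_y, 1);  // y = A*x, all on the device

        cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);  // copy the result back once at the end
        cublasFree(d_A); cublasFree(d_x); cublasFree(d_y);
        cublasShutdown();
    }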

Oh, I thought that CUBLAS doesn't support streams.
I'll elaborate a little about what I need.
I am running a genetic algorithm which optimizes a certain objective function. I have about 400 agents, and each one runs 10,000-50,000 evaluations of the objective function. The objective function is stateful (each invocation depends on all the previous ones).
The objective function has 3 stages: stages 1a-1c can run concurrently, stage 2 must run after stages 1a-1c, and stage 3 must run after stage 2.
Each stage performs a dot product and 4 matrix-vector products, and sums the results of all 5 calls. The right-hand vector operands of all the calls are the same.
Matrix and vector sizes vary from 25x25 to 100x100.

Here is what I thought of doing (a rough launch sketch follows the list):

  • For each sample, run all agents concurrently
  • Write 2 kernels, or use CUBLAS functions: one for the matrix-vector product and one for the dot product
  • Call 15 concurrent kernels at the beginning: one for each of the actions in stages 1a-1c
  • Wait for all kernels to end
  • Call 5 concurrent kernels which run stage 2
  • Wait for all kernels to finish
  • Call 5 concurrent kernels which run stage 3
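Roughly, this is the launch structure I have in mind (just a sketch; the kernel name, grid/block sizes, and operand arrays are placeholders):

    #include <cuda_runtime.h>

    // A matrix-vector kernel, either hand-written or replaced by a CUBLAS call.
    __global__ void matvec_kernel(const float *A, const float *x, float *y, int n);

    void run_sample(float **d_A, float **d_x, float **d_y, int n)
    {
        cudaStream_t streams[15];
        for (int i = 0; i < 15; ++i)
            cudaStreamCreate(&streams[i]);

        // Stages 1a-1c: 15 independent launches, one per stream
        for (int i = 0; i < 15; ++i)
            matvec_kernel<<<n, 64, 0, streams[i]>>>(d_A[i], d_x[i], d_y[i], n);
        cudaThreadSynchronize();            // wait for all of stages 1a-1c

        // Stage 2: 5 launches that depend on the stage-1 results
        for (int i = 0; i < 5; ++i)
            matvec_kernel<<<n, 64, 0, streams[i]>>>(d_A[15 + i], d_x[15 + i], d_y[15 + i], n);
        cudaThreadSynchronize();

        // Stage 3: 5 launches that depend on the stage-2 results
        for (int i = 0; i < 5; ++i)
            matvec_kernel<<<n, 64, 0, streams[i]>>>(d_A[20 + i], d_x[20 + i], d_y[20 + i], n);
        cudaThreadSynchronize();

        for (int i = 0; i < 15; ++i)
            cudaStreamDestroy(streams[i]);
    }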

Regarding the kernel design itself, the only thing I could find was this:
http://www.google.com/url?sa=t&source=web&cd=1&sqi=2&ved=0CBYQFjAA&url=http%3A%2F%2Fwww.worldscinet.com%2Fppl%2Fmkt%2Fpreserved-docs%2F1804%2FS0129626408003545.pdf&rct=j&q=cuda%20matrix%20vector%20multiplication&ei=Ev3oTO2aEsjtsgbPqfWbCw&usg=AFQjCNGycTBTbWZGXKwNJmraDxEkGJpZ4w&cad=rja

This is totally unreadable… I also applied as a registered developer to try to look at the CUBLAS source code.
I am also not sure which memory optimizations I should use. The only thing I am fairly sure about is placing the right-hand vector operands in constant memory, since all the participating threads read them. I am also not sure about block sizes.
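As a first attempt at the matrix-vector kernel itself, I am thinking of something like this (a sketch under my own assumptions: row-major storage, a power-of-two block size, and at most 100 elements in the shared vector):

    // Right-hand vector shared by all products; 100 floats assumed as the upper bound.
    __constant__ float c_x[100];

    // One block per matrix row; each thread accumulates part of the row's dot product
    // with c_x, then a shared-memory reduction combines the partial sums.
    // blockDim.x must be a power of two for the reduction loop below.
    __global__ void matvec_const(const float *A, int n, float *y)
    {
        extern __shared__ float partial[];
        int row = blockIdx.x;
        float sum = 0.0f;
        for (int j = threadIdx.x; j < n; j += blockDim.x)
            sum += A[row * n + j] * c_x[j];
        partial[threadIdx.x] = sum;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            y[row] = partial[0];
    }

    // Host side: copy the shared vector once, then launch with one block per row, e.g.
    //   cudaMemcpyToSymbol(c_x, h_x, n * sizeof(float));
    //   matvec_const<<<n, 64, 64 * sizeof(float)>>>(d_A, n, d_y);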

Maybe you want to use cublasSetKernelStream, which has been available since CUBLAS 3.1.
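For example, something like this (legacy CUBLAS API; device pointers set up elsewhere, error checking omitted):

    #include <cuda_runtime.h>
    #include <cublas.h>

    // Two independent matrix-vector products issued into two streams; on Fermi they can
    // execute concurrently. d_A1, d_A2, d_x, d_y1, d_y2 are device pointers.
    void two_concurrent_gemvs(int n, const float *d_A1, const float *d_A2,
                              const float *d_x, float *d_y1, float *d_y2)
    {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        cublasSetKernelStream(s1);   // subsequent CUBLAS launches go into s1
        cublasSgemv('N', n, n, 1.0f, d_A1, n, d_x, 1, 0.0f, d_y1, 1);

        cublasSetKernelStream(s2);   // switch to s2 so the second gemv can overlap the first
        cublasSgemv('N', n, n, 1.0f, d_A2, n, d_x, 1, 0.0f, d_y2, 1);

        cudaThreadSynchronize();     // wait for both before using d_y1 / d_y2

        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }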