Is CUDA better than GLSLang? I need to know more...

Which matrix-vector multiplication implementation were you using? Have you tried the CUBLAS routine cublasSgemv? Also, which GPU were you using?
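
For reference, a call through CUBLAS looks roughly like this (an untested sketch; the dimensions and host data are placeholders):

    #include "cublas.h"

    /* Sketch: y = alpha*A*x + beta*y, with A stored column-major. */
    int main(void)
    {
        const int m = 16, n = 16;        /* placeholder dimensions */
        float A[16 * 16], x[16], y[16];  /* fill with your data */
        float *dA, *dx, *dy;

        cublasInit();
        cublasAlloc(m * n, sizeof(float), (void **)&dA);
        cublasAlloc(n, sizeof(float), (void **)&dx);
        cublasAlloc(m, sizeof(float), (void **)&dy);

        cublasSetMatrix(m, n, sizeof(float), A, m, dA, m);
        cublasSetVector(n, sizeof(float), x, 1, dx, 1);

        /* 'n' = no transpose; lda = m for a dense column-major matrix */
        cublasSgemv('n', m, n, 1.0f, dA, m, dx, 1, 0.0f, dy, 1);

        cublasGetVector(m, sizeof(float), dy, 1, y, 1);

        cublasFree(dA); cublasFree(dx); cublasFree(dy);
        cublasShutdown();
        return 0;
    }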

Paulius

Yes indeed, I make heavy use of GLSLang’s ability to multiply mat4 * vec4. As far as I know, these operations are highly optimized, because 4-component data is exactly what 3D graphics needs (xyzw or rgba).

In my current research, for example, I multiply a triangular 16x16 matrix by a 16-element vector using mat4 submatrices and vec4 vectors.

Isn’t CUDA specialized for this sort of multiplication?

Thanks,

Ema.

Not really. The G80 is a scalar processor, which means there is no speedup from using the GLSL vector types instead of scalars. CUDA doesn’t currently even expose any vector math operations, although there seems to be room for them in the PTX ISA as well as in CUDA itself.
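
For example, your 16x16 triangular product would be written in CUDA with plain scalar arithmetic, one thread per output row. A rough, untested sketch (the row-major layout and the names are my own assumptions):

    #define N 16

    /* One thread per row of the lower-triangular, row-major matrix A.
       Each thread accumulates a scalar dot product; packing the data
       into float4s would not speed this up on a scalar device like G80. */
    __global__ void trimv16(const float *A, const float *x, float *y)
    {
        int row = threadIdx.x;
        if (row < N) {
            float sum = 0.0f;
            for (int col = 0; col <= row; ++col)  /* triangularity: skip the zeros */
                sum += A[row * N + col] * x[col];
            y[row] = sum;
        }
    }

    /* launch: trimv16<<<1, N>>>(dA, dx, dy); */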

That said, things may be different in future devices, just as they differed across previous generations of NVIDIA GPUs.

HTH.

/Pyry

Hi

I’ve not used cublasSgemv. I’ve developed a backprojection application for tomographic reconstruction on an 8800 GTS board. Here is a comparable published work: http://www.iop.org/EJ/abstract/0031-9155/52/12/006/

Backprojection is very well suited to the graphics pipeline and thus benefits from the built-in hardware. Our CUDA version is just a simple solution that uses the texture cache and has no special optimizations (no shared memory, for instance), so it could be improved further.
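
The texture pattern is essentially the following (a stripped-down, untested sketch; a real backprojector computes the detector coordinates per projection angle, which I omit here):

    /* 2D texture bound to the projection data; fetches go through the
       texture cache, which rewards the 2D locality of backprojection. */
    texture<float, 2, cudaReadModeElementType> projTex;

    __global__ void backprojectSlice(float *slice, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        /* placeholder mapping from voxel (x, y) to detector coordinates */
        float u = (float)x + 0.5f;
        float v = (float)y + 0.5f;

        slice[y * width + x] += tex2D(projTex, u, v);
    }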

Also, I recently moved from the 0.8 to the 1.0 SDK and driver, and the execution time has now doubled (100 ms vs. 50 ms previously). Do you know of a problem with the 1.0 version?

Dominique

Can you post your timing results? What’s the type of the entries (byte, int, float)? I’d be interested in taking a look at CUDA code for matrix-vector multiply (it’d be great if you could provide a .cu file with the required kernel that I could call from my app). You can send these to me in a private message if you don’t want to post them publicly.

Assuming texture fetches have good 2D locality, using texture memory should be pretty efficient as it is cached.

I am willing to bet that the CUBLAS implementation would be quite fast. I haven’t tried matrix-vector, but matrix-matrix is blazing fast (it exceeds 120 GFLOPS in some cases).

Not sure about the doubling in time. How are you timing your code (how many reps are being averaged)? v1.0 comes with a new driver, so if there’s such a regression it needs to be looked at.
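
For what it’s worth, here is a generic way to average over reps with CUDA events (an untested sketch; the kernel and launch configuration are placeholders):

    #include <stdio.h>
    #include <cuda_runtime.h>

    void timeKernel(void)
    {
        const int REPS = 100;
        cudaEvent_t start, stop;
        float ms = 0.0f;

        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        for (int i = 0; i < REPS; ++i) {
            /* myKernel<<<grid, block>>>(...);  <- your kernel launch here */
        }
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);  /* wait for all launches to finish */

        cudaEventElapsedTime(&ms, start, stop);
        printf("average kernel time: %f ms\n", ms / REPS);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }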

Paulius

Hi

The matrix-vector part of my application is not particularly slow. What I mean is that with the graphics pipeline we don’t have to program it at all: it is already available as built-in hardware in the rasterization stage, so we benefit from it for free. My code is embedded in the whole tomographic application, which is large and can’t be used alone. I will try CUBLAS to see the difference!

Our previous timing was only 1 rep (50 ms). That was 200 GFLOPS (with good 2D locality)! I will go back to 0.8 to extract more precise timings with the profiler, compare with other small codes, and post them soon.

Dominique

I’ve compared the timings between 0.8 and 1.0. For small examples like those provided in the SDK, there is no significant difference. For a big example like our tomographic application, the timings are in fact better with 1.0: 127 ms with 0.8 and 101 ms with 1.0 (from the profiler). Here are the results for the 1.0 SDK:

method=[ memcopy ] gputime=[ 15696.353 ]
method=[ memcopy ] gputime=[ 43227.715 ]
method=[ memcopy ] gputime=[ 160139.531 ]
method=[ transformKernel ] gputime=[ 101717.156 ] cputime=[ 102129.461 ] occupancy=[ 0.417 ]
method=[ memcopy ] gputime=[ 19264.354 ]

Here are the results with the 0.8 SDK:

method=[ memcopy ] gputime=[ 19855.744 ]
method=[ memcopy ] gputime=[ 47608.961 ]
method=[ memcopy ] gputime=[ 165386.063 ]
method=[ transformKernel ] gputime=[ 126922.500 ] cputime=[ 127196.891 ] occupancy=[ 0.500 ]
method=[ memcopy ] gputime=[ 19321.664 ]

Even the occupancy is different!

Is the compiler better in 1.0???

Concerning the memcopy from host to device: I copy 30 MB in 160 ms, which is about 190 MB/s. That is not very fast; it could be 10 times better, no???

Dominique

It is possible that the newer compiler optimized register usage, which in turn allowed more thread blocks to run concurrently, thus improving occupancy.
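
One way to check is to have ptxas print per-kernel register counts under each toolkit (a standard nvcc option; the file name is a placeholder):

    nvcc --ptxas-options=-v -c transform.cu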

Yes, you should be able to achieve much higher rates, even without pinned memory. Look at the bandwidth sample in the SDK; you may find hints there.
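
The core of such a measurement looks roughly like this (an untested sketch; the transfer size is up to you):

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Time a host-to-device copy from pinned (page-locked) host memory. */
    void measurePinnedH2D(size_t bytes)
    {
        void *host, *dev;
        cudaEvent_t start, stop;
        float ms = 0.0f;

        cudaMallocHost(&host, bytes);  /* pinned: enables fast DMA transfers */
        cudaMalloc(&dev, bytes);

        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);

        printf("%.1f MB/s\n", (bytes / (1024.0 * 1024.0)) / (ms / 1000.0));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFreeHost(host);
        cudaFree(dev);
    }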

Paulius

Thanks!

I’ve tried the SDK bandwidthTest. The best I’ve obtained is 200 MB/s for pinned memory and 185 MB/s for pageable memory (on Windows XP)!

Do you get better results?

Dominique

Which motherboard and PCIe slot are you using? The bandwidth sample should be able to achieve over 3 GB/s with pinned memory. Some people got pretty much to the maximum PCIe throughput:
bandwidth thread

Paulius

Thanks for the hint.

I’ve tried it, but with the same result of 200 MB/s (on an ASUS A8N-SLI board with a PCIe 16x slot, an AMD Athlon 64, and Windows XP). Then I changed slots, since the board has two PCIe 16x slots, and now I get 3 GB/s!!! It’s incredible: there are two slots and only one is really usable!

Thanks

Dominique