Alternatives to handle CUDA! Which alternative is the best?

Please take a look at two sample projects in the CUDA SDK: matrixmul and matrixmul_drv.
Both projects seem to use the same configuration, but the second one runs faster than the first.
I would appreciate it if you could tell me why this happens. :)


We haven’t timed these tests for various matrix sizes because this code is intended as a learning example, not as a demonstration of high-performance matrix-matrix multiply.

That said, for small matrix sizes (such as the default), it makes sense that a larger portion of the overall time is CPU overhead. That overhead is a bit smaller with the driver API (matrixMulDrv) than with the runtime API (matrixMul), because the runtime API is a layer on top of the driver API.

For large matrix sizes, the performance should be very similar between the two because the kernels are identical in both samples.
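To make the reasoning concrete, here is a rough cost model in C++. It is only a sketch: it assumes total time is a fixed host-side overhead plus the kernel time, and every number you feed it is hypothetical rather than a measurement of either sample. The function names are made up for illustration.

```cpp
#include <cassert>

// Fraction of total time spent in fixed host-side (CPU/API) overhead.
// overheadUs: per-run host overhead in microseconds (assumed constant).
// kernelUs:   kernel execution time in microseconds for a given matrix size.
double overheadFraction(double overheadUs, double kernelUs) {
    assert(overheadUs >= 0.0 && kernelUs >= 0.0 && overheadUs + kernelUs > 0.0);
    return overheadUs / (overheadUs + kernelUs);
}

// Relative difference in total time between two APIs that run the same
// kernel and differ only in host overhead (A is the slower API).
double relativeGap(double overheadAUs, double overheadBUs, double kernelUs) {
    double totalA = overheadAUs + kernelUs;
    double totalB = overheadBUs + kernelUs;
    assert(totalA > 0.0);
    return (totalA - totalB) / totalA;
}
```

With the same pair of overheads, relativeGap shrinks as kernelUs grows, which is why the two samples should converge for large matrices. To check this on real hardware you would still have to time both samples yourself.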


Thank you very much! BTW I had thought about that, but is there some sort of proof or benchmark for this claim? Anyway, for large matrix multiplication and linear algebra, do you recommend the CUBLAS functions instead of other alternatives?

I just want to point out a subtle mistake in matrixmul_drv.cpp, line 200.
If cutFindFilePath() returns a null pointer, control is correctly transferred to the ‘Error’ label, but the ‘status’ variable still holds ‘CUDA_SUCCESS’, which is inconsistent.
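Here is a minimal standalone sketch of the pattern I mean. It is not the SDK code: the Status enum, findFilePath(), and the file name are hypothetical stand-ins for the driver API result type and cutFindFilePath(), so it compiles without CUDA.

```cpp
// Illustrative stand-in for the driver API result type (not the real enum).
enum Status { STATUS_SUCCESS = 0, STATUS_FILE_NOT_FOUND = 1 };

// Hypothetical stand-in for cutFindFilePath(): nullptr means "not found".
const char* findFilePath(bool exists) {
    return exists ? "kernel.bin" : nullptr;
}

// Problem pattern: jumps to Error without updating status,
// so the caller sees success even though the file was missing.
Status initBuggy(bool fileExists) {
    Status status = STATUS_SUCCESS;
    const char* path = findFilePath(fileExists);
    if (path == nullptr) goto Error;
    // ... module loading would go here ...
    return STATUS_SUCCESS;
Error:
    // ... cleanup would go here ...
    return status;
}

// One possible fix: record a failure code before jumping to Error.
Status initFixed(bool fileExists) {
    Status status = STATUS_SUCCESS;
    const char* path = findFilePath(fileExists);
    if (path == nullptr) {
        status = STATUS_FILE_NOT_FOUND;
        goto Error;
    }
    // ... module loading would go here ...
    return STATUS_SUCCESS;
Error:
    // ... cleanup would go here ...
    return status;
}
```

In the buggy version a missing file is reported as success. Setting status before the jump is one way to keep the return value consistent with the path taken.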