Well, the work I have been doing was originally prototyped in MATLAB and Python. I find MATLAB easier than Python, and the speed difference compared to C/CUDA is huge.

I understand that Python is the trendy language now, but for my work seconds do matter. Some applications that took 8 hours in Python now take a couple of minutes in C/CUDA.

Let's compare some C/C++ CUDA code with the Python equivalent. My implementation of the Floyd-Warshall all-pairs shortest path algorithm, which is O(n^3), is a good test case because it runs the outer loop on the CPU and the rest on a GTX 680.
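To show the structure I mean, here is a minimal CPU-only sketch (not the repo's actual code): the outer k loop stays serial, which is what runs host-side, while every (i, j) update inside a given k is independent of the others, which is exactly the part the CUDA version can farm out to the GPU.

```cpp
#include <vector>
#include <algorithm>
#include <climits>

// INF/2 guards against integer overflow when adding two "infinite" edges.
const int INF = INT_MAX / 2;

// dist is an n x n adjacency matrix in row-major order, updated in place.
void floyd_warshall(std::vector<int>& dist, int n) {
    for (int k = 0; k < n; ++k) {          // serial outer loop (host side)
        for (int i = 0; i < n; ++i) {      // these two loops are the
            for (int j = 0; j < n; ++j) {  // embarrassingly parallel part
                dist[i * n + j] = std::min(dist[i * n + j],
                                           dist[i * n + k] + dist[k * n + j]);
            }
        }
    }
}
```

On the GPU, each (i, j) pair becomes one thread of a 2D kernel launch, and the host just loops over k, launching the kernel once per iteration.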

This implementation has a running time of 163 seconds for a dense random 10,000 x 10,000 adjacency matrix, which includes all memory allocation and copy times. The CPU version in C++ takes about 3,700 seconds, and I can only imagine how long the Python version would take.

This is a really simple algorithm, and here is the source code:

https://github.com/OlegKonings/CUDA_Floyd_Warshall_/blob/master/WikiGraphCuda/WikiGraphCuda/WGCmain.cu

It also stores the optimal paths, so the memory requirements are large, but it should work with any decent GPU.
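For anyone wondering why storing the paths costs so much memory: one common way to do it (a sketch, not necessarily exactly what the repo does) is a second n x n next-hop matrix alongside the distance matrix, updated whenever a shorter path is found. For a 10,000-vertex graph that is another 100 million ints.

```cpp
#include <vector>
#include <climits>

const int INF = INT_MAX / 2;

// next[i*n+j] holds the vertex that follows i on the best-known path to j.
void floyd_warshall_paths(std::vector<int>& dist, std::vector<int>& next, int n) {
    for (int i = 0; i < n; ++i)            // direct edge: j follows i;
        for (int j = 0; j < n; ++j)        // -1 marks "no path known"
            next[i * n + j] = (dist[i * n + j] < INF) ? j : -1;
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                if (dist[i * n + k] + dist[k * n + j] < dist[i * n + j]) {
                    dist[i * n + j] = dist[i * n + k] + dist[k * n + j];
                    next[i * n + j] = next[i * n + k];  // route toward k first
                }
}

// Walk the next-hop matrix to recover the vertex sequence from u to v.
std::vector<int> get_path(const std::vector<int>& next, int n, int u, int v) {
    std::vector<int> path;
    if (next[u * n + v] == -1) return path;  // v unreachable from u
    path.push_back(u);
    while (u != v) {
        u = next[u * n + v];
        path.push_back(u);
    }
    return path;
}
```

The next-hop update is just one extra assignment inside the relaxation branch, so it adds almost nothing to the running time; the cost is purely memory.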

If anybody can even get close to the 163-second C++/CUDA time, I will be amazed. This is not even the best example, just a good simple test. Please post your Python CPU results and your PyCUDA results.

The linear algebra stuff shows an even bigger time difference.

Maybe PyCUDA is just as fast, but I doubt it.