Threading in CUDA

Hello,

I have implemented SHA-256 and MD5 on CUDA for my project. My analysis shows that the GPU is faster than the CPU even for serial algorithms. But my professor does not understand where in the algorithm I am getting the higher performance compared to the CPU. As I increase the maximum threads per block, performance increases.
Since the algorithm has dependencies, how can I find where the parallelism is coming from? Or what would be a good answer to his question?

I suggested that the memory bandwidth / bus width and the number of ALUs could explain the speedup for the logical and arithmetic operations, but he is still not happy with that answer.

Thank you.
Manuj

How do you have a serial algorithm in CUDA?

It’s a parallel algorithm with serial steps… Like in MD5, we have a dependency between the rounds, but we can run each round in parallel…

That’s what I thought. So, wouldn’t it make sense that if steps in the serial code are run in parallel (on the GPU), it would run faster than on one CPU? I guess I don’t understand your prof’s position.
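
To make "serial steps run in parallel" concrete, here is a minimal sketch. It assumes the parallelism comes from hashing many independent messages at once, one per thread, which the posts above do not actually confirm. The names toy_hash, hash_many and count_mismatches are made up for illustration, and toy_hash is a stand-in that only mimics the round-to-round dependency; it is not MD5 or SHA-256.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for a hash: 64 rounds, each one needs the state
// produced by the previous round. NOT MD5 or SHA-256.
__host__ __device__ inline uint32_t toy_hash(uint32_t message)
{
    uint32_t state = 0x12345678u ^ message;
    for (int round = 0; round < 64; ++round) {
        // Serial chain: round r cannot start before round r-1 has finished.
        uint32_t rotated = (state << 5) | (state >> 27);
        state = rotated + (state ^ (uint32_t)round) * 2654435761u;
    }
    return state;
}

// One thread per message. Inside a thread the rounds stay serial;
// the parallelism is across the independent messages.
__global__ void hash_many(const uint32_t* messages, uint32_t* digests, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        digests[i] = toy_hash(messages[i]);
    }
}

// Hashes n messages on the GPU and compares each digest with the CPU result.
// Returns the number of mismatches, or -1 if a CUDA call failed.
int count_mismatches(int n, int threads_per_block)
{
    std::vector<uint32_t> messages(n), digests(n);
    for (int i = 0; i < n; ++i) {
        messages[i] = (uint32_t)i;
    }

    size_t bytes = (size_t)n * sizeof(uint32_t);
    uint32_t* d_messages = nullptr;
    uint32_t* d_digests = nullptr;
    if (cudaMalloc(&d_messages, bytes) != cudaSuccess) {
        return -1;
    }
    if (cudaMalloc(&d_digests, bytes) != cudaSuccess) {
        cudaFree(d_messages);
        return -1;
    }

    cudaError_t err = cudaMemcpy(d_messages, messages.data(), bytes,
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        // Round the block count up so every message gets a thread.
        int blocks = (n + threads_per_block - 1) / threads_per_block;
        hash_many<<<blocks, threads_per_block>>>(d_messages, d_digests, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) {
        // This copy also waits for the kernel to finish.
        err = cudaMemcpy(digests.data(), d_digests, bytes,
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_messages);
    cudaFree(d_digests);
    if (err != cudaSuccess) {
        return -1;
    }

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (digests[i] != toy_hash(messages[i])) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Calling count_mismatches(1 << 20, 256) from a main function should return 0 if the GPU and CPU agree. No single chain gets shorter on the GPU; the speedup in a setup like this would come from running many chains at the same time. How the timing changes with threads_per_block depends on the hardware, so it is worth measuring rather than assuming.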