Threading in CUDA


I have implemented SHA-256 and MD5 on CUDA for my project. My analysis shows that the GPU is faster than the CPU even for serial algorithms. But my professor does not understand where in the algorithm I am getting the higher performance compared to the CPU. As I increase the maximum number of threads per block, performance increases.
Since the algorithm has dependencies, how can I find out where the parallelism comes from? Or what would be a good answer to his question?

I suggested that the memory bandwidth / bus width and the number of ALUs could explain the speedup for the logical and arithmetic operations, but he is still not happy with that answer.

Thank you.

How do you have a serial algorithm in CUDA?

It’s a parallel algorithm with serial steps… For example, in MD5 there is a dependency between rounds, but we can run each round in parallel…

That’s what I thought. So, wouldn’t it make sense that if steps in the serial code are run in parallel (on the GPU), it would run faster than on one CPU? I guess I don’t understand your prof’s position.
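
Here is a minimal sketch of one way to show where the parallelism could be coming from. It assumes each GPU thread hashes its own independent message, which may not be how your kernel is organized. The names toy_chain, hash_many, and gpu_matches_cpu are made up for illustration, and toy_chain is only a stand-in with the same kind of step-to-step dependency, not MD5 or SHA-256.

```cuda
#include <cstdio>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for one hash computation: 64 steps, each depending on
// the result of the previous step, so a single chain cannot be split up.
// This is NOT MD5 or SHA-256; it only mimics their round-to-round dependency.
__host__ __device__ inline uint32_t toy_chain(uint32_t seed) {
    uint32_t state = seed;
    for (uint32_t i = 0; i < 64; ++i) {
        state ^= i;
        state = (state << 5) | (state >> 27);  // rotate left by 5 bits
        state = state * 2654435761u + 1u;      // mix; wraps modulo 2^32
    }
    return state;
}

// One thread per independent message: the chain inside each thread is serial,
// but different threads work on different messages at the same time.
__global__ void hash_many(const uint32_t* in, uint32_t* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = toy_chain(in[idx]);
}

// Runs the kernel on n inputs and checks every result against a CPU loop.
bool gpu_matches_cpu(int n) {
    std::vector<uint32_t> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = 0x9E3779B9u * (uint32_t)(i + 1);

    size_t bytes = (size_t)n * sizeof(uint32_t);
    uint32_t *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                          // threads per block (tunable)
    int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
    hash_many<<<blocks, threads>>>(d_in, d_out, n);

    // This copy waits for the kernel and reports any launch or runtime error.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != toy_chain(h_in[i])) return false;
    return true;
}

int main() {
    bool ok = gpu_matches_cpu(1 << 20);
    printf("GPU results match CPU: %s\n", ok ? "yes" : "no");
    return ok ? 0 : 1;
}
```

If your code has this shape, the answer for your professor would be that no single hash gets faster. The GPU just finishes many independent hashes in the same time window, and more threads keep more of its ALUs busy.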