I have implemented SHA-256 and MD5 in CUDA for my project. My measurements show that the GPU is faster than the CPU even for these serial algorithms. However, my professor does not understand where in the algorithm the higher performance compared to the CPU comes from. As I increase the number of threads per block, performance increases.
Since the algorithm has data dependencies between its steps, how can I find out where the parallelism is coming from? Or what would be a good answer to his question?
I suggested that the memory bandwidth (bus width) and the number of ALUs available for the logical and arithmetic operations could be the reasons, but he is still not happy with that answer.
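
To make the question concrete, here is a minimal sketch of one layout that would fit these observations, assuming each thread hashes its own independent message. Everything in it is illustrative: `toy_chain` is a made-up stand-in for a round function (it is not SHA-256 or MD5), and `toy_hash_kernel` and `toy_hash_gpu` are hypothetical names. Error checking is left out for brevity.

```cuda
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Toy stand-in for a hash round function (NOT SHA-256 or MD5).
// Each round consumes the previous round's state, so the rounds of
// one message form a serial dependency chain.
__host__ __device__ inline uint32_t toy_chain(uint32_t seed, int rounds) {
    uint32_t state = seed;
    for (int r = 0; r < rounds; ++r) {
        uint32_t rotated = (state << 5) | (state >> 27);  // rotate left by 5
        state = rotated ^ (state + 0x9E3779B9u + (uint32_t)r);
    }
    return state;
}

// Assumed layout: one thread per message. The work inside a thread is
// serial; the only parallelism is across independent messages.
__global__ void toy_hash_kernel(const uint32_t* in, uint32_t* out,
                                int n, int rounds) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = toy_chain(in[i], rounds);
    }
}

// Hypothetical host wrapper: copies the messages to the device, launches
// one thread per message, and copies the digests back.
std::vector<uint32_t> toy_hash_gpu(const std::vector<uint32_t>& msgs,
                                   int rounds, int threads_per_block) {
    int n = (int)msgs.size();
    std::vector<uint32_t> result(n);
    if (n == 0) return result;

    size_t bytes = (size_t)n * sizeof(uint32_t);
    uint32_t* d_in = nullptr;
    uint32_t* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, msgs.data(), bytes, cudaMemcpyHostToDevice);

    // Round up so every message gets a thread.
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    toy_hash_kernel<<<blocks, threads_per_block>>>(d_in, d_out, n, rounds);

    // This blocking copy on the default stream waits for the kernel.
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

If the real kernels are laid out like this, the dependencies inside one hash are untouched and the speedup would come from computing many independent hashes at the same time. In that case, the effect of threads per block is plausibly a matter of how many threads the hardware can keep resident to hide instruction latency, not of any parallelism inside the algorithm itself.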