I ran some OpenCL code on several Nvidia GPUs. Since the energy consumption was lower than expected, I suspect the program does not push the hardware hard enough.
I would like to evaluate the OpenCL implementation before porting it to CUDA.
Is there a way to get information about the memory bandwidth used and the number of active blocks and threads?
Is there a way of analysing the kernel with regard to possible optimizations?
How about after porting it to CUDA?