Custom profiling


I am trying to devise a custom embedded profiling scheme for my application.
To do that, I measure the clock cycles taken by certain portions of the code, and
I want to compare the CPU and GPU cycle counts.

What I have:
I measure CPU clock cycles using the rdtscp asm instruction and GPU clock cycles using clock64. For stable benchmarking, I fixed the CPU clock by disabling Turbo Boost and SpeedStep, and fixed the GPU frequency using nvidia-smi as presented here:

Now I need to get the CPU and GPU frequency to be able to normalize and compare results.
I was planning to use cudaDevAttrClockRate to get the GPU clock, but it
always returns the same result (1.8 GHz) regardless of the settings used in
nvidia-smi, even though I do see changes in the application's total time.
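
To make the normalization I have in mind concrete, here is a minimal sketch. The helper name `cycles_to_ns` is hypothetical, and the frequency passed in is assumed to be the rate at which the corresponding counter (rdtscp or clock64) actually ticks, which is exactly the value I do not yet know how to obtain reliably for the GPU.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: convert a raw cycle count into nanoseconds.
// freq_hz is assumed to be the frequency the counter really ran at
// during the measurement; a wrong value skews every comparison.
double cycles_to_ns(std::uint64_t cycles, double freq_hz) {
    assert(freq_hz > 0.0 && "frequency must be positive");
    return static_cast<double>(cycles) * 1e9 / freq_hz;
}

// Hypothetical helper: ratio of GPU time to CPU time for two regions,
// each measured in its own clock domain.
double gpu_to_cpu_time_ratio(std::uint64_t gpu_cycles, double gpu_hz,
                             std::uint64_t cpu_cycles, double cpu_hz) {
    return cycles_to_ns(gpu_cycles, gpu_hz) / cycles_to_ns(cpu_cycles, cpu_hz);
}
```

With both cycle counts converted to the same time unit, the CPU and GPU measurements become directly comparable, provided the two frequencies are correct.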

I read about the different clocks (graphics clock and shader clock), but it is
not clear to me which one I should use to compare with the CPU results, or how
to get it.

Can anyone tell me how I should tackle this?
Is there another way to get the frequency of the GPU?