I am looking for a way to measure device memory accesses and, more generally, specific parts of my kernel code. I have found some ways to do it, such as creating CPU timers or GPU timers and trimming my kernel down to just the parts I want to measure. However, there should be a more precise method that works inside the kernel itself.
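For reference, this is roughly what I mean by the GPU-timer approach, as a minimal sketch using CUDA events; myKernel, its launch configuration and the buffer size are just placeholders, not my real code:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the kernel I actually want to time.
__global__ void myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f;   // stand-in for the real work
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    myKernel<<<n / 256, 256>>>(d_data);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}

This only gives me the time for the whole launch, which is why I want something finer-grained inside the kernel.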
I do not have experience with Nexus because I use Linux, but with the CUDA Visual Profiler you can get an estimated throughput, which is not bad. Still, I am looking for the most accurate option.
You can use the clock() function, which returns a clock_t. It reads the internal cycle counter on the chip, so you can measure time intervals inside the kernel. Check out page 114 of the CUDA 2.3 programming guide (page 122 of the PDF).
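Here is a rough sketch of how clock() can be used, along the lines of the "clock" sample in the CUDA SDK; timedKernel and the arithmetic inside it are placeholders of mine, and each block writes its start and end cycle counts to global memory:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void timedKernel(const float *in, float *out, clock_t *timer)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x == 0)
        timer[blockIdx.x] = clock();              // cycle counter at start of block

    out[i] = in[i] * in[i];                       // the region being measured

    __syncthreads();                              // wait for the whole block to finish
    if (threadIdx.x == 0)
        timer[blockIdx.x + gridDim.x] = clock();  // cycle counter at end of block
}

int main()
{
    const int blocks = 64, threads = 256, n = blocks * threads;
    float *d_in, *d_out;
    clock_t *d_timer, h_timer[2 * blocks];

    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_timer, 2 * blocks * sizeof(clock_t));

    timedKernel<<<blocks, threads>>>(d_in, d_out, d_timer);
    cudaMemcpy(h_timer, d_timer, 2 * blocks * sizeof(clock_t),
               cudaMemcpyDeviceToHost);

    // cycles spent by block 0; divide by the shader clock rate to get seconds
    printf("block 0: %ld cycles\n", (long)(h_timer[blocks] - h_timer[0]));

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_timer);
    return 0;
}

The counter is per multiprocessor, so only differences taken within the same block are meaningful, and you have to divide by the shader clock rate to convert cycles to wall-clock time.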
Well, if you want to collect timing results inside the kernel without the profiler (e.g. at runtime, on a customer's machine), clock() is the only way to go.
You need to be a little cautious about assuming that the PTX code reflects reality. PTX is only an intermediate representation of what really happens on the GPU; there is still scope for further optimization and reordering of instructions at runtime. I am fairly sure that someone (probably Sylvain Collange) has previously demonstrated that runtime reordering of clock() calls can happen in the final instruction stream that hits the silicon.
Sure! You are right that the PTX is only an approximation of the final code. However, I have checked the results and they make sense, and I have also checked the decuda output, which is closer to the final code than the PTX is.
Anyway, can you point me to the paper where that is demonstrated?