Floow
March 28, 2012, 8:34am
1
Hello,
I am optimizing my CUDA application, and after using the Visual Profiler I have some questions.
I use __syncthreads() inside the kernel, typically between each function I call.
Is it important to have these __syncthreads() and cudaThreadSynchronize() calls, or can I avoid them? What do they cost in performance?
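As a rough illustration of when the barrier matters: __syncthreads() is needed when a thread reads shared (or global) memory that another thread of the same block has written, and it can usually be dropped between steps where each thread only touches its own data. The sketch below is a made-up example, not code from this application; the names reverseTile and reverseTileDemo are hypothetical. It reverses each block-sized tile through shared memory, which is a case where the barrier cannot be removed. Note also that cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize(), and that a blocking cudaMemcpy issued after the kernel already waits for it.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: reverses each block-sized tile of `data` in place.
__global__ void reverseTile(int *data)
{
    extern __shared__ int tile[];            // one int per thread in the block
    int local  = threadIdx.x;
    int global = blockIdx.x * blockDim.x + local;

    tile[local] = data[global];
    // Required: the next line reads an element written by a DIFFERENT thread.
    __syncthreads();
    data[global] = tile[blockDim.x - 1 - local];
}

// Hypothetical host helper: runs the kernel and checks the result.
bool reverseTileDemo()
{
    const int n = 256, block = 64;           // n is a multiple of block
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    reverseTile<<<n / block, block, block * sizeof(int)>>>(d);

    // The blocking copy waits for the kernel, so no extra
    // device-wide synchronization call is needed here.
    cudaError_t err = cudaMemcpy(h.data(), d, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        int expected = (i / block) * block + (block - 1 - i % block);
        if (h[i] != expected) return false;
    }
    return true;
}
```

Calling reverseTileDemo() from main should return true on a working device. Each unnecessary barrier stalls the whole block until its slowest warp arrives, so removing the ones that do not guard shared data is usually worth trying.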
I also get warnings from the Visual Profiler: ‘Low Memcpy/Compute Overlap’ and ‘Low Memcpy throughput’.
Do I have to use cudaMemcpyAsync to fix this? Maybe like this:
cudaMemsetAsync(devPtr, value, count, stream1);
Kernel<<<gridSize, blockSize>>>();
cudaMemcpyAsync(dst, src, count, cudaMemcpyKind, stream1);
cudaMemcpyAsync(dst, src, count, cudaMemcpyKind, stream2);
But can I be sure that all the data the kernel uses are correct? I mean, will the memset be finished before the kernel uses the data?
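For reference, operations issued to the same stream run in issue order, so one way to make the ordering explicit is to put the memset, the kernel, and the copy back in one stream. The sketch below assumes that approach; the names addOne and streamOrderDemo are hypothetical and the kernel is a placeholder. It also uses pinned host memory (cudaMallocHost), since copies to pageable memory generally cannot overlap with computation.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: adds 1 to every element.
__global__ void addOne(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical host helper: memset -> kernel -> copy, all in one stream.
bool streamOrderDemo()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);
    int *d = nullptr, *h = nullptr;
    cudaStream_t stream;

    if (cudaStreamCreate(&stream) != cudaSuccess) return false;
    if (cudaMalloc(&d, bytes) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return false;
    }
    // Pinned host buffer, so the device-to-host copy can be truly asynchronous.
    if (cudaMallocHost(&h, bytes) != cudaSuccess) {
        cudaFree(d);
        cudaStreamDestroy(stream);
        return false;
    }

    // All three operations go to the SAME stream, so they execute in this order.
    cudaMemsetAsync(d, 0, bytes, stream);
    addOne<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);

    // The host must wait for the stream before reading h.
    bool ok = (cudaStreamSynchronize(stream) == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 1);

    cudaFreeHost(h);
    cudaFree(d);
    cudaStreamDestroy(stream);
    return ok;
}
```

Every element should come back as 1, which shows the kernel saw the zeroed buffer. To get actual copy/compute overlap, the work would have to be split into chunks spread over two or more such streams; a single stream only guarantees the order.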
And is it possible to see the execution time of each function inside the kernel, at least for one thread?
I have a lot of questions, and I hope you will be able to understand and help me.
Dext
Floow
March 30, 2012, 12:24pm
2
How can I see the details of kernel execution? I would like to know the execution time of each function called in the kernel.
Thanks
tera
April 2, 2012, 10:21am
3
I don’t know of any tool that does this automatically, but you can manually instrument your kernel to gather execution times of relevant parts using clock() or clock64().
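A minimal sketch of that kind of manual instrumentation, assuming two made-up device functions (stageA and stageB are hypothetical stand-ins for the real per-thread work) and recording the cycle counts of thread 0 only:

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-ins for the functions called inside the real kernel.
__device__ float stageA(float x)
{
    for (int i = 0; i < 100; ++i) x = x * 1.0001f + 0.5f;
    return x;
}

__device__ float stageB(float x)
{
    for (int i = 0; i < 100; ++i) x = sqrtf(x + 1.0f);
    return x;
}

__global__ void timedKernel(float *out, long long *cycles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    long long t0 = clock64();
    float x = stageA((float)i);
    long long t1 = clock64();
    x = stageB(x);
    long long t2 = clock64();

    out[i] = x;                    // keep the result live so the work is not optimized away
    if (i == 0) {                  // record the timings of one thread only
        cycles[0] = t1 - t0;       // cycles spent around stageA
        cycles[1] = t2 - t1;       // cycles spent around stageB
    }
}

// Hypothetical host helper: launches the kernel and fetches the two counts.
bool timedKernelDemo(long long cyclesOut[2])
{
    const int n = 1024;
    float *dOut = nullptr;
    long long *dCycles = nullptr;
    if (cudaMalloc(&dOut, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&dCycles, 2 * sizeof(long long)) != cudaSuccess) {
        cudaFree(dOut);
        return false;
    }

    timedKernel<<<n / 256, 256>>>(dOut, dCycles);
    cudaError_t err = cudaMemcpy(cyclesOut, dCycles, 2 * sizeof(long long),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dCycles);
    cudaFree(dOut);
    return err == cudaSuccess;
}
```

The values are in clock cycles of the multiprocessor the thread ran on, and they include time the warp spent waiting while other warps were scheduled. The compiler may also move instructions across the clock64() reads, so the numbers are better treated as rough relative costs than as exact timings.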
Floow
April 2, 2012, 11:58am
4
Yes, I did that, but I discussed it with some people, and they think it is stupid that the profiler reports only the total kernel execution time. I agree, because it is difficult to optimize and find bottlenecks with only a “global” execution time.
We have to implement the timing measurements manually.
I have one stupid question: what is the Visual Profiler used for? Computing bandwidth and that kind of thing?
Here is the complete guide on what it is for.
In a nutshell, it is a tool to profile your application; one use is to see how resources are being used and to optimize accordingly.
Floow
April 10, 2012, 10:05am
6
Thanks, I will look at this when I have time.