Kernel runs much faster when being profiled with Visual Profiler

I am loading and executing a PTX compute_20 kernel using the CUDA v5.5 driver API on a Quadro 4000 card with driver version 331.65 (complete system information is listed below).

When this is done under the control of the Visual Profiler, kernel execution is significantly faster than when the application is run directly.

Results are correct in both cases.

I am at a loss to explain the performance difference. The kernel performs many reads from a 2D texture and writes to a 2D surface. The surface is shared with OpenGL.

This is similar to a question posted on Stack Exchange a few months ago (item 16555528/why-does-cuda-code-run-so-much-faster-in-nvidia-visual-profiler).

Regards,
-Jim TenBrink
jtenbrink@esri.com


NVIDIA System Information report created on: 01/02/2014 15:17:40
System name: NSN2

[Display]
Operating System: Windows Server 2008 R2 Standard, 64-bit (Service Pack 1)
DirectX version: 11.0
GPU processor: Quadro 4000
Driver version: 331.65
Direct3D API version: 11
Direct3D feature level: 11_0
CUDA Cores: 256
Core clock: 475 MHz
Shader clock: 950 MHz
Memory data rate: 2808 MHz
Memory interface: 256-bit
Memory bandwidth: 89.86 GB/s
Total available graphics memory: 7934 MB
Dedicated video memory: 2048 MB GDDR5
System video memory: 0 MB
Shared system memory: 5886 MB
Video BIOS version: 70.00.37.00.03
IRQ: Not used
Bus: PCI Express x16 Gen2
Device Id: 10DE 06DD 078010DE
Part Number: 1031 0500
GPU processor: Quadro 4000
Driver version: 331.65
Direct3D API version: 11
Direct3D feature level: 11_0
CUDA Cores: 256
Core clock: 475 MHz
Shader clock: 950 MHz
Memory data rate: 2808 MHz
Memory interface: 256-bit
Memory bandwidth: 89.86 GB/s
Total available graphics memory: 7934 MB
Dedicated video memory: 2048 MB GDDR5
System video memory: 0 MB
Shared system memory: 5886 MB
Video BIOS version: 70.00.2F.00.12
IRQ: Not used
Bus: PCI Express x16 Gen2
Device Id: 10DE 06DD 078010DE
Part Number: 1031 0500

[Components]

easyDaemonAPIU64.DLL 1.14.17.0 NVIDIA Update Components
WLMerger.exe 1.14.17.0 NVIDIA Update Components
daemonu.exe 1.14.17.0 NVIDIA Update Components
ComUpdatus.exe 1.14.17.0 NVIDIA Update Components
NvUpdtr.dll 1.14.17.0 NVIDIA Update Components
NvUpdt.dll 1.14.17.0 NVIDIA Update Components
nvui.dll 8.17.13.3165 NVIDIA User Experience Driver Component
nvxdsync.exe 8.17.13.3165 NVIDIA User Experience Driver Component
nvxdplcy.dll 8.17.13.3165 NVIDIA User Experience Driver Component
nvxdbat.dll 8.17.13.3165 NVIDIA User Experience Driver Component
nvxdapix.dll 8.17.13.3165 NVIDIA User Experience Driver Component
NVCPL.DLL 8.17.13.3165 NVIDIA User Experience Driver Component
nvCplUIR.dll 7.5.780.0 NVIDIA Control Panel
nvCplUI.exe 7.5.780.0 NVIDIA Control Panel
nvWSSR.dll 6.14.13.3165 NVIDIA Workstation Server
nvWSS.dll 6.14.13.3165 NVIDIA Workstation Server
nvViTvSR.dll 6.14.13.3165 NVIDIA Video Server
nvViTvS.dll 6.14.13.3165 NVIDIA Video Server
nvDispSR.dll 6.14.13.3165 NVIDIA Display Server
NVMCTRAY.DLL 8.17.13.3165 NVIDIA Media Center Library
nvDispS.dll 6.14.13.3165 NVIDIA Display Server
NVCUDA.DLL 8.17.13.3165 NVIDIA CUDA 6.0.1 driver
nvGameSR.dll 6.14.13.3165 NVIDIA 3D Settings Server
nvGameS.dll 6.14.13.3165 NVIDIA 3D Settings Server

Follow-on note:

I’ve changed the power management mode in the NVIDIA Control Panel to “Prefer maximum performance”, but this doesn’t seem to improve the performance of the kernel when it is not run under profiler control.
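
For anyone trying to reproduce the comparison, timing each launch separately on the host can help show whether the slowdown outside the profiler is limited to the first few launches or persists in steady state. The following is only a minimal sketch: `launch_and_sync` is a hypothetical callable, assumed to wrap `cuLaunchKernel` followed by `cuCtxSynchronize`, and the helper names `time_runs` and `median` are illustrative, not part of any CUDA API.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Times each of `runs` calls separately and returns the durations in ms.
// `launch_and_sync` is a hypothetical stand-in for the real work, assumed to
// launch the kernel and block until it has finished.
std::vector<double> time_runs(const std::function<void()>& launch_and_sync,
                              int runs) {
    std::vector<double> ms;
    ms.reserve(runs > 0 ? runs : 0);
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        launch_and_sync();
        auto t1 = std::chrono::steady_clock::now();
        ms.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}

// Median of a non-empty sample (upper median for even sizes).
// Takes the vector by value because nth_element reorders it.
double median(std::vector<double> v) {
    assert(!v.empty());
    std::nth_element(v.begin(), v.begin() + v.size() / 2, v.end());
    return v[v.size() / 2];
}
```

Comparing the first few entries returned by `time_runs` against the median of the rest, both with and without the profiler attached, would indicate whether the gap is a warm-up effect or a steady-state one.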

-jt

Follow-on note #2:

The OpenCL version of the app shows no performance problems when run directly.

-jt

Hi jt,

I have the same problem. Did you figure it out?

One of the NVIDIA engineers has responded on another thread, which sounds like it's the same issue.

Please check the response in the thread "Performance is much better when profling with NSight than when running production code" (CUDA Programming and Performance, NVIDIA Developer Forums), and reply to let us know whether the solution works for you.
Thanks,