Why is my cudaStream code slower on a Titan V than on a P4000?

Hi there,

I executed my CUDA code on a Titan V (CUDA 9.2) and a P4000 (CUDA 10.1). The program mainly uses cudaStream. The performance really surprises me: the execution time on the Titan V (~13.5 s) is much slower than on the P4000 (~4 s). The nvprof logs are below.
Thanks for the help.

Titan V

==29799== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   20.63%  10.7626s         1  10.7626s  10.7626s  10.7626s  dissociation_excitedwater_b1a1(float*, float*, float*, int*, int*)
                   20.62%  10.7613s         1  10.7613s  10.7613s  10.7613s  dissociation_excitedwater_rd(float*, float*, float*, int*, int*)
                   20.59%  10.7459s         1  10.7459s  10.7459s  10.7459s  dissociation_dissociativewater(float*, float*, float*, int*, int*)
                   17.15%  8.94757s         1  8.94757s  8.94757s  8.94757s  dissociation_excitedwater_a1b1(float*, float*, float*, int*, int*)
                   16.41%  8.56250s         1  8.56250s  8.56250s  8.56250s  dissociation_ionizedwater(float*, float*, float*, int*, int*)
                    4.58%  2.39087s         1  2.39087s  2.39087s  2.39087s  thermalisation_subexelectrons(float*, float*, float*, float*, int*, int*, int*)
                    0.01%  7.7554ms         1  7.7554ms  7.7554ms  7.7554ms  setupcuseed(void)
                    0.00%  620.70us        10  62.069us  1.5360us  106.43us  [CUDA memcpy DtoH]
                    0.00%  560.60us        46  12.187us  1.1200us  96.543us  [CUDA memcpy HtoD]
                    0.00%  17.312us         1  17.312us  17.312us  17.312us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__copy_if::CopyIfAgent<thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::cuda_cub::__copy_if::no_stencil_tag_*, thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::detail::unary_negate<first_element_equal_255>, int, int*>, thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::cuda_cub::__copy_if::no_stencil_tag_*, thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::detail::unary_negate<first_element_equal_255>, int, int*, thrust::cuda_cub::cub::ScanTileState<int, bool=1>, 
unsigned long>(thrust::device_ptr<int>, thrust::detail::normal_iterator<thrust::device_ptr<int>>, float, thrust::device_ptr<float>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type)
                    0.00%  4.7990us         2  2.3990us  2.2720us  2.5270us  [CUDA memcpy HtoA]
                    0.00%  1.7600us         1  1.7600us  1.7600us  1.7600us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__copy_if::InitAgent<thrust::cuda_cub::cub::ScanTileState<int, bool=1>, int*, int>, thrust::cuda_cub::cub::ScanTileState<int, bool=1>, unsigned long, int*>(bool=1, thrust::cuda_cub::cub::ScanTileState<int, bool=1>, int*)
      API calls:   78.00%  10.9146s         5  2.18292s  6.0800us  10.9145s  cudaDeviceSynchronize
                   19.55%  2.73574s         2  1.36787s  4.4130us  2.73573s  cudaStreamSynchronize
                    2.29%  320.17ms         8  40.022ms  5.0950us  318.86ms  cudaMalloc
                    0.06%  7.7562ms         1  7.7562ms  7.7562ms  7.7562ms  cudaThreadSynchronize
                    0.04%  6.2802ms        10  628.02us  29.930us  1.6997ms  cudaMemcpyAsync
                    0.01%  1.9322ms       192  10.063us     367ns  414.20us  cuDeviceGetAttribute
                    0.01%  1.8378ms         9  204.20us  22.964us  1.4845ms  cudaLaunchKernel
                    0.01%  1.0573ms         1  1.0573ms  1.0573ms  1.0573ms  cudaDeviceReset
                    0.01%  852.25us        10  85.225us  23.150us  130.74us  cudaMemcpy
                    0.01%  707.79us         8  88.474us  1.3040us  207.03us  cudaFree
                    0.00%  558.08us         2  279.04us  263.68us  294.40us  cuDeviceTotalMem
                    0.00%  311.59us        36  8.6550us  6.4330us  56.632us  cudaMemcpyToSymbol
                    0.00%  166.86us         2  83.431us  81.136us  85.726us  cuDeviceGetName
                    0.00%  120.54us         2  60.272us  3.1980us  117.35us  cudaMallocArray
                    0.00%  101.77us         2  50.883us  8.4390us  93.328us  cudaFreeArray
                    0.00%  38.540us         5  7.7080us  4.9520us  17.549us  cudaStreamCreate
                    0.00%  30.546us         5  6.1090us  4.2140us  11.923us  cudaStreamDestroy
                    0.00%  19.932us         2  9.9660us  6.6230us  13.309us  cudaFuncGetAttributes
                    0.00%  15.558us         2  7.7790us  5.5440us  10.014us  cudaMemcpyToArray
                    0.00%  13.933us         1  13.933us  13.933us  13.933us  cudaSetDevice
                    0.00%  5.6580us         2  2.8290us  1.2760us  4.3820us  cudaBindTextureToArray
                    0.00%  5.3770us         2  2.6880us  1.1480us  4.2290us  cudaUnbindTexture
                    0.00%  5.2600us         2  2.6300us  1.9520us  3.3080us  cuDeviceGetPCIBusId
                    0.00%  4.2110us         4  1.0520us     460ns  2.0690us  cudaGetDevice
                    0.00%  3.7030us         4     925ns     441ns  1.6860us  cudaDeviceGetAttribute
                    0.00%  3.4030us         3  1.1340us     396ns  2.0830us  cuDeviceGetCount
                    0.00%  3.1960us         4     799ns     419ns  1.4720us  cuDeviceGet
                    0.00%  1.4940us         3     498ns     270ns     825ns  cudaGetLastError
                    0.00%     989ns         4     247ns     181ns     352ns  cudaPeekAtLastError
                    0.00%     929ns         2     464ns     341ns     588ns  cudaGetChannelDesc

P4000

==29339== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   25.39%  1.92013s         1  1.92013s  1.92013s  1.92013s  dissociation_ionizedwater(float*, float*, float*, int*, int*)
                   19.83%  1.50023s         1  1.50023s  1.50023s  1.50023s  dissociation_excitedwater_b1a1(float*, float*, float*, int*, int*)
                   17.68%  1.33765s         1  1.33765s  1.33765s  1.33765s  dissociation_excitedwater_rd(float*, float*, float*, int*, int*)
                   13.16%  995.54ms         1  995.54ms  995.54ms  995.54ms  dissociation_excitedwater_a1b1(float*, float*, float*, int*, int*)
                   12.01%  908.45ms         1  908.45ms  908.45ms  908.45ms  dissociation_dissociativewater(float*, float*, float*, int*, int*)
                   11.55%  873.78ms         1  873.78ms  873.78ms  873.78ms  thermalisation_subexelectrons(float*, float*, float*, float*, int*, int*, int*)
                    0.34%  25.885ms         1  25.885ms  25.885ms  25.885ms  setupcuseed(void)
                    0.01%  1.1205ms        10  112.05us     768ns  179.49us  [CUDA memcpy DtoH]
                    0.01%  958.00us        46  20.825us     608ns  174.40us  [CUDA memcpy HtoD]
                    0.00%  47.073us         1  47.073us  47.073us  47.073us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__copy_if::CopyIfAgent<thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::cuda_cub::__copy_if::no_stencil_tag_*, thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::detail::unary_negate<first_element_equal_255>, int, int*>, thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::cuda_cub::__copy_if::no_stencil_tag_*, thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::detail::unary_negate<first_element_equal_255>, int, int*, thrust::cuda_cub::cub::ScanTileState<int, bool=1>, 
unsigned long>(thrust::device_ptr<int>, thrust::detail::normal_iterator<thrust::device_ptr<int>>, float, thrust::device_ptr<float>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type)
                    0.00%  3.9360us         2  1.9680us  1.9520us  1.9840us  [CUDA memcpy HtoA]
                    0.00%  1.0880us         1  1.0880us  1.0880us  1.0880us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__copy_if::InitAgent<thrust::cuda_cub::cub::ScanTileState<int, bool=1>, int*, int>, thrust::cuda_cub::cub::ScanTileState<int, bool=1>, unsigned long, int*>(bool=1, thrust::cuda_cub::cub::ScanTileState<int, bool=1>, int*)
      API calls:   73.19%  2.85564s         5  571.13ms  1.2560us  2.85558s  cudaDeviceSynchronize
                   23.24%  906.68ms         2  453.34ms  1.9820us  906.67ms  cudaStreamSynchronize
                    2.71%  105.71ms         8  13.213ms  5.2270us  105.31ms  cudaMalloc
                    0.66%  25.888ms         1  25.888ms  25.888ms  25.888ms  cudaThreadSynchronize
                    0.07%  2.7327ms        10  273.27us  15.910us  580.26us  cudaMemcpyAsync
                    0.07%  2.5458ms         9  282.87us  8.8240us  2.3514ms  cudaLaunchKernel
                    0.03%  1.1119ms        10  111.19us  52.750us  214.12us  cudaMemcpy
                    0.01%  367.12us         8  45.889us  1.9290us  171.02us  cudaFree
                    0.01%  305.01us         1  305.01us  305.01us  305.01us  cudaDeviceReset
                    0.01%  231.17us        36  6.4210us  4.4840us  40.333us  cudaMemcpyToSymbol
                    0.00%  148.84us         1  148.84us  148.84us  148.84us  cuDeviceTotalMem
                    0.00%  104.93us        97  1.0810us     100ns  44.627us  cuDeviceGetAttribute
                    0.00%  92.328us         2  46.164us  3.6840us  88.644us  cudaMallocArray
                    0.00%  76.192us         2  38.096us  6.1380us  70.054us  cudaFreeArray
                    0.00%  47.226us         1  47.226us  47.226us  47.226us  cuDeviceGetName
                    0.00%  15.776us         2  7.8880us  5.1180us  10.658us  cudaMemcpyToArray
                    0.00%  13.963us         5  2.7920us  1.5340us  7.0260us  cudaStreamDestroy
                    0.00%  12.460us         5  2.4920us  1.0860us  7.9290us  cudaStreamCreate
                    0.00%  8.6190us         2  4.3090us  1.9910us  6.6280us  cudaFuncGetAttributes
                    0.00%  4.7870us         2  2.3930us  1.4890us  3.2980us  cudaBindTextureToArray
                    0.00%  3.5200us         2  1.7600us     695ns  2.8250us  cudaUnbindTexture
                    0.00%  2.8140us         1  2.8140us  2.8140us  2.8140us  cudaSetDevice
                    0.00%  2.4750us         1  2.4750us  2.4750us  2.4750us  cuDeviceGetPCIBusId
                    0.00%  1.7220us         4     430ns     226ns     966ns  cudaGetDevice
                    0.00%  1.4440us         4     361ns     204ns     764ns  cudaDeviceGetAttribute
                    0.00%  1.1200us         3     373ns     113ns     852ns  cuDeviceGetCount
                    0.00%  1.0260us         2     513ns     114ns     912ns  cuDeviceGet
                    0.00%     961ns         2     480ns     380ns     581ns  cudaGetChannelDesc
                    0.00%     440ns         4     110ns      80ns     159ns  cudaPeekAtLastError
                    0.00%     196ns         1     196ns     196ns     196ns  cuDeviceGetUuid
                    0.00%     146ns         1     146ns     146ns     146ns  cudaGetLastError

The following assumes that both runs involved identical workloads.

The huge difference in the runtimes of your dissociation kernels suggests that you were running a debug build on the Titan V. Double-check your build settings, also with regard to the target architecture.
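One quick way to tell: in a typical nvcc invocation (file and output names here are just placeholders), the `-G` flag is what marks a device-debug build, and it disables device-code optimization:

```shell
# Debug build: -G disables device-code optimization (kernels often run ~10x slower)
nvcc -G -g main.cu -o disso

# Release build: optimized device code, compiled for the card's actual architecture
nvcc -O3 -arch=sm_70 main.cu -o disso
```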

The difference in CUDA API execution times suggests the two GPUs are in two different host systems, with the P4000 in the host that has the higher single-thread performance.

Thanks for the answer.
I recompiled my code, this time with:
Titan V
nvcc main.cu -arch=sm_70 -rdc=true -lcudadevrt -o disso

P4000
nvcc main.cu -arch=sm_61 -rdc=true -lcudadevrt -o disso

I executed both programs with
sudo nice -n -20 ./disso

The GPU execution times are ~13 s for the Titan V and ~2.5 s for the P4000.
The new logs follow.
Thanks.
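(For reference, a single fat binary carrying native code for both cards could also be built with `-gencode`; a sketch using the same source and flags as above:)

```shell
# One binary with native code for both Pascal (sm_61) and Volta (sm_70)
nvcc main.cu -gencode arch=compute_61,code=sm_61 \
             -gencode arch=compute_70,code=sm_70 \
             -rdc=true -lcudadevrt -o disso
```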

Titan V

==18726== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   19.71%  9.98526s         1  9.98526s  9.98526s  9.98526s  dissociation_excitedwater_b1a1(float*, float*, float*, int*, int*)
                   19.70%  9.98491s         1  9.98491s  9.98491s  9.98491s  dissociation_excitedwater_rd(float*, float*, float*, int*, int*)
                   19.66%  9.96174s         1  9.96174s  9.96174s  9.96174s  dissociation_dissociativewater(float*, float*, float*, int*, int*)
                   18.20%  9.22393s         1  9.22393s  9.22393s  9.22393s  dissociation_excitedwater_a1b1(float*, float*, float*, int*, int*)
                   18.17%  9.20595s         1  9.20595s  9.20595s  9.20595s  dissociation_ionizedwater(float*, float*, float*, int*, int*)
                    4.54%  2.30293s         1  2.30293s  2.30293s  2.30293s  thermalisation_subexelectrons(float*, float*, float*, float*, int*, int*, int*)
                    0.02%  7.7629ms         1  7.7629ms  7.7629ms  7.7629ms  setupcuseed(void)
                    0.00%  588.12us        10  58.812us  1.4400us  97.983us  [CUDA memcpy DtoH]
                    0.00%  562.14us        46  12.220us  1.1190us  96.640us  [CUDA memcpy HtoD]
                    0.00%  17.792us         1  17.792us  17.792us  17.792us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__copy_if::CopyIfAgent<thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::cuda_cub::__copy_if::no_stencil_tag_*, thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::detail::unary_negate<first_element_equal_255>, int, int*>, thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::cuda_cub::__copy_if::no_stencil_tag_*, thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::detail::unary_negate<first_element_equal_255>, int, int*, thrust::cuda_cub::cub::ScanTileState<int, bool=1>, 
unsigned long>(thrust::device_ptr<int>, thrust::detail::normal_iterator<thrust::device_ptr<int>>, float, thrust::device_ptr<float>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type)
                    0.00%  5.1200us         2  2.5600us  2.3360us  2.7840us  [CUDA memcpy HtoA]
                    0.00%  1.7600us         1  1.7600us  1.7600us  1.7600us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__copy_if::InitAgent<thrust::cuda_cub::cub::ScanTileState<int, bool=1>, int*, int>, thrust::cuda_cub::cub::ScanTileState<int, bool=1>, unsigned long, int*>(bool=1, thrust::cuda_cub::cub::ScanTileState<int, bool=1>, int*)
      API calls:   78.62%  10.0553s         5  2.01106s  5.3730us  10.0552s  cudaDeviceSynchronize
                   18.45%  2.35906s         2  1.17953s  3.3350us  2.35906s  cudaStreamSynchronize
                    2.75%  351.88ms         8  43.985ms  5.8540us  350.52ms  cudaMalloc
                    0.06%  7.7635ms         1  7.7635ms  7.7635ms  7.7635ms  cudaThreadSynchronize
                    0.05%  6.0832ms        10  608.32us  24.186us  1.4035ms  cudaMemcpyAsync
                    0.03%  3.8888ms         9  432.09us  15.713us  3.6557ms  cudaLaunchKernel
                    0.01%  1.6233ms       192  8.4540us     193ns  530.34us  cuDeviceGetAttribute
                    0.01%  852.96us        10  85.296us  24.175us  129.13us  cudaMemcpy
                    0.01%  748.30us         1  748.30us  748.30us  748.30us  cudaDeviceReset
                    0.01%  681.99us         8  85.248us  1.4130us  174.15us  cudaFree
                    0.00%  369.86us         2  184.93us  178.08us  191.78us  cuDeviceTotalMem
                    0.00%  327.43us        36  9.0950us  6.4800us  57.332us  cudaMemcpyToSymbol
                    0.00%  209.21us         2  104.60us  3.2460us  205.96us  cudaMallocArray
                    0.00%  142.23us         2  71.116us  52.551us  89.682us  cuDeviceGetName
                    0.00%  118.65us         2  59.322us  8.6490us  110.00us  cudaFreeArray
                    0.00%  45.158us         5  9.0310us  5.6080us  21.846us  cudaStreamCreate
                    0.00%  30.984us         5  6.1960us  4.3440us  13.330us  cudaStreamDestroy
                    0.00%  15.096us         2  7.5480us  5.1720us  9.9240us  cudaFuncGetAttributes
                    0.00%  15.080us         2  7.5400us  5.2040us  9.8760us  cudaMemcpyToArray
                    0.00%  9.4240us         1  9.4240us  9.4240us  9.4240us  cudaSetDevice
                    0.00%  5.2200us         2  2.6100us  1.4120us  3.8080us  cudaBindTextureToArray
                    0.00%  4.7870us         2  2.3930us  1.8710us  2.9160us  cuDeviceGetPCIBusId
                    0.00%  3.7730us         2  1.8860us     660ns  3.1130us  cudaUnbindTexture
                    0.00%  3.0660us         4     766ns     410ns  1.4650us  cudaGetDevice
                    0.00%  2.6300us         4     657ns     397ns  1.0350us  cudaDeviceGetAttribute
                    0.00%  1.5850us         4     396ns     232ns     609ns  cuDeviceGet
                    0.00%  1.4950us         3     498ns     249ns     817ns  cuDeviceGetCount
                    0.00%  1.3180us         3     439ns     220ns     739ns  cudaGetLastError
                    0.00%  1.0500us         4     262ns     180ns     417ns  cudaPeekAtLastError
                    0.00%     878ns         2     439ns     331ns     547ns  cudaGetChannelDesc

P4000

==13208== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   28.39%  1.15618s         1  1.15618s  1.15618s  1.15618s  dissociation_ionizedwater(float*, float*, float*, int*, int*)
                   18.39%  748.87ms         1  748.87ms  748.87ms  748.87ms  dissociation_excitedwater_b1a1(float*, float*, float*, int*, int*)
                   16.78%  683.45ms         1  683.45ms  683.45ms  683.45ms  dissociation_excitedwater_rd(float*, float*, float*, int*, int*)
                   12.24%  498.51ms         1  498.51ms  498.51ms  498.51ms  dissociation_excitedwater_a1b1(float*, float*, float*, int*, int*)
                   12.09%  492.21ms         1  492.21ms  492.21ms  492.21ms  thermalisation_subexelectrons(float*, float*, float*, float*, int*, int*, int*)
                   11.42%  465.14ms         1  465.14ms  465.14ms  465.14ms  dissociation_dissociativewater(float*, float*, float*, int*, int*)
                    0.63%  25.841ms         1  25.841ms  25.841ms  25.841ms  setupcuseed(void)
                    0.03%  1.0857ms        10  108.57us     736ns  171.59us  [CUDA memcpy DtoH]
                    0.02%  965.03us        46  20.978us     640ns  176.35us  [CUDA memcpy HtoD]
                    0.00%  53.664us         1  53.664us  53.664us  53.664us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__copy_if::CopyIfAgent<thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::cuda_cub::__copy_if::no_stencil_tag_*, thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::detail::unary_negate<first_element_equal_255>, int, int*>, thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::cuda_cub::__copy_if::no_stencil_tag_*, thrust::zip_iterator<thrust::tuple<thrust::detail::normal_iterator<thrust::device_ptr<int>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>>, thrust::detail::unary_negate<first_element_equal_255>, int, int*, thrust::cuda_cub::cub::ScanTileState<int, bool=1>, 
unsigned long>(thrust::device_ptr<int>, thrust::detail::normal_iterator<thrust::device_ptr<int>>, float, thrust::device_ptr<float>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::detail::normal_iterator<thrust::device_ptr<float>>, thrust::null_type)
                    0.00%  4.2560us         2  2.1280us  2.0480us  2.2080us  [CUDA memcpy HtoA]
                    0.00%  1.0880us         1  1.0880us  1.0880us  1.0880us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__copy_if::InitAgent<thrust::cuda_cub::cub::ScanTileState<int, bool=1>, int*, int>, thrust::cuda_cub::cub::ScanTileState<int, bool=1>, unsigned long, int*>(bool=1, thrust::cuda_cub::cub::ScanTileState<int, bool=1>, int*)
      API calls:   71.64%  1.69803s         5  339.61ms  1.3060us  1.69797s  cudaDeviceSynchronize
                   22.39%  530.80ms         2  265.40ms  2.0240us  530.80ms  cudaStreamSynchronize
                    4.55%  107.86ms         8  13.482ms  5.2050us  107.45ms  cudaMalloc
                    1.09%  25.845ms         1  25.845ms  25.845ms  25.845ms  cudaThreadSynchronize
                    0.12%  2.7501ms        10  275.01us  16.407us  572.43us  cudaMemcpyAsync
                    0.11%  2.5503ms         9  283.36us  9.3530us  2.3500ms  cudaLaunchKernel
                    0.05%  1.1415ms        10  114.15us  57.187us  213.04us  cudaMemcpy
                    0.01%  328.95us         8  41.118us  1.7780us  129.74us  cudaFree
                    0.01%  261.57us         1  261.57us  261.57us  261.57us  cudaDeviceReset
                    0.01%  245.38us        36  6.8150us  4.5160us  49.536us  cudaMemcpyToSymbol
                    0.01%  155.81us         1  155.81us  155.81us  155.81us  cuDeviceTotalMem
                    0.00%  104.92us        97  1.0810us      98ns  44.604us  cuDeviceGetAttribute
                    0.00%  97.157us         2  48.578us  3.7980us  93.359us  cudaMallocArray
                    0.00%  77.097us         2  38.548us  6.6540us  70.443us  cudaFreeArray
                    0.00%  26.781us         1  26.781us  26.781us  26.781us  cuDeviceGetName
                    0.00%  16.503us         2  8.2510us  5.1190us  11.384us  cudaMemcpyToArray
                    0.00%  15.346us         5  3.0690us  1.0890us  10.664us  cudaStreamCreate
                    0.00%  14.243us         5  2.8480us  1.5860us  7.5360us  cudaStreamDestroy
                    0.00%  8.7160us         2  4.3580us  1.9240us  6.7920us  cudaFuncGetAttributes
                    0.00%  5.3440us         2  2.6720us  1.6420us  3.7020us  cudaBindTextureToArray
                    0.00%  3.3900us         2  1.6950us     678ns  2.7120us  cudaUnbindTexture
                    0.00%  3.0230us         1  3.0230us  3.0230us  3.0230us  cuDeviceGetPCIBusId
                    0.00%  2.6780us         1  2.6780us  2.6780us  2.6780us  cudaSetDevice
                    0.00%  1.8190us         4     454ns     232ns  1.0240us  cudaGetDevice
                    0.00%  1.5360us         4     384ns     224ns     844ns  cudaDeviceGetAttribute
                    0.00%  1.3240us         3     441ns      93ns  1.0710us  cuDeviceGetCount
                    0.00%     993ns         2     496ns     396ns     597ns  cudaGetChannelDesc
                    0.00%     565ns         2     282ns     108ns     457ns  cuDeviceGet
                    0.00%     457ns         4     114ns      82ns     147ns  cudaPeekAtLastError
                    0.00%     189ns         1     189ns     189ns     189ns  cuDeviceGetUuid
                    0.00%     172ns         1     172ns     172ns     172ns  cudaGetLastError
  1. Are these cards in the same system?
  2. Did you do any warmup runs before timing the kernels?
  1. Nope. The Titan V platform is
    > Intel(R) Core™ i9-7900X CPU @ 3.30GHz
    > with 64 GB memory (total 62, used 1, free 25, shared 0, buff/cache 35, available 59 by free -g)
    the P4000 platform is
    > Intel(R) Core™ i7-7700 CPU @ 3.60GHz
    > with 16 GB memory (total 15, used 1, free 1, shared 0, buff/cache 11, available 12)
  2. The program on the Titan V was repeated 10 times; the execution times in seconds are
    10.647979
    10.856031
    10.202501
    12.586870
    11.040109
    12.059604
    13.879885
    12.238625
    12.532560
    12.490000
    while it takes only 2 to 3 seconds on the P4000.
    By the way, the kernel behavior is very different on the two cards.
    Titan V
    https://imgur.com/0u0tsEY
    P4000
    https://imgur.com/euTe6j0

Thanks

Perhaps you should upgrade to CUDA 10.1 on the Titan V system.

I would also check the Titan V system for thermal issues. If the Titan V is overheating and throttling, that could produce the behavior you are seeing.
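To rule out thermal or power throttling, it helps to log temperature and SM clock continuously while the kernels run, rather than checking once; a sketch using standard `nvidia-smi` query fields, sampled once per second:

```shell
# Log timestamp, temperature, SM clock, and power draw every second during the run
nvidia-smi --query-gpu=timestamp,temperature.gpu,clocks.sm,power.draw \
           --format=csv -l 1
```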

I checked with nvidia-smi; the temperature seems fine.
Is it OK to upgrade to CUDA 10? The recommended CUDA version for the Titan V is CUDA 9, isn't it?
https://www.nvidia.com/Download/driverResults.aspx/128000/en-us

Thanks.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 396.37    CUDA Version: ERR!        |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:17:00.0 Off |                  N/A |
| 33%   48C    P0    30W / 250W |     11MiB / 12066MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN V             Off  | 00000000:65:00.0 Off |                  N/A |
| 35%   51C    P2    45W / 250W |   1289MiB / 12065MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      2527      C   ./disso                                     1278MiB |
+-----------------------------------------------------------------------------+

Did you verify that the application does in fact consume the exact same input data and produce the same output data on both platforms?

Have you run the app with cuda-memcheck to make sure it reports no race conditions, accesses out of bounds, etc?

Does the app contain data-dependent iteration constructs (e.g. some kind of converging computation)? If so, have you checked whether the number of iterations matches between the platforms?
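For the race-condition check specifically, note that plain `cuda-memcheck` only catches memory errors; the other checks are separate tools:

```shell
cuda-memcheck ./disso                    # default tool: out-of-bounds and misaligned accesses
cuda-memcheck --tool racecheck ./disso   # shared-memory data-race detection
cuda-memcheck --tool synccheck ./disso   # invalid synchronization (e.g. divergent __syncthreads)
cuda-memcheck --tool initcheck ./disso   # reads of uninitialized device global memory
```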

The inputs were exactly the same on the two platforms. The results are reasonable but cannot be identical, because it is a Monte Carlo program.
cuda-memcheck reports no errors.
I repeated the runs over 100 times on both platforms and got the same performance difference.

Thanks.