PyTorch's cpu() function call takes a lot of time on Jetson Xavier

Hello, I have been working on semantic segmentation for my project, using the Resnet18dilated-c1_deepsub network. Everything that runs on the GPU works great, but when I move the data from GPU to CPU memory using the cpu() function, it takes a long time. Without cpu() the whole pipeline runs at 30 fps, but after calling cpu() it drops to 4 fps. Can you help me find the exact cause of this problem?
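For reference, this is roughly how I am measuring the transfer (a simplified sketch; the model and input here are illustrative stand-ins for my actual pipeline):

import time
import torch

# Illustrative placeholders for the real network and camera frame.
model = torch.nn.Conv2d(3, 150, 3, padding=1).cuda().eval()
frame = torch.rand(1, 3, 480, 640, device='cuda')

with torch.no_grad():
    scores = model(frame)

start = time.perf_counter()
pred = scores.cpu()  # the call that appears slow; note it also waits for any queued GPU kernels
print(time.perf_counter() - start)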

Hi,

Would you mind providing the following information?

1. How did you install the PyTorch package? Is it from this topic?

2. What data type do you use: torch.FloatTensor, torch.HalfTensor, …? (A quick way to check is sketched after this list.)
https://pytorch.org/docs/stable/tensors.html

3. Have you fixed the device clocks to the maximum?

sudo jetson_clocks

4. Could you execute the application with nvprof and share the data with us?

nvprof python [test.py]
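For item 2, the dtype can be checked directly on the tensor, for example (an illustrative snippet):

import torch

x = torch.rand(1, 3, 480, 640).cuda()
print(x.dtype)         # torch.float32
print(x.half().dtype)  # torch.float16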

Thanks.

  1. Yes, I used the wheel file to install PyTorch (v1.1.0) on Python 3.6.
  2. The data type is float32.
  3. Yes, all clocks are maxed out.
  4. Here you go:
    ==16528== NVPROF is profiling process 16528, command: …/env/bin/python3 sample_Resnet18.py
    ==16528== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
    Loading weights for net_encoder
    Loading weights for net_decoder

# Time taken to transfer the output tensor from GPU to CPU for each image (seconds)
0.06400306800787803
0.06884447298943996
0.0890714859997388
0.0840118610067293
0.08288904999790248
0.09249643399380147
0.08620240799791645
0.08774027999606915
0.08000665900181048
0.08611033800116275
0.08701102600025479

==16528== Profiling application: …/env/bin/python3 sample_Resnet18.py
==16528== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 47.22% 569.46ms 44 12.942ms 579.11us 17.275ms volta_scudnn_128x64_relu_small_nn_v1
20.18% 243.36ms 44 5.5308ms 4.3574ms 8.9647ms volta_scudnn_128x128_relu_small_nn_v1
18.21% 219.63ms 121 1.8152ms 607.11us 8.1826ms volta_scudnn_winograd_128x128_ldg1_ldg4_relu_tile148t_nt_v1
3.96% 47.786ms 264 181.01us 50.115us 1.2749ms void cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1>(float, cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1>, cudnnTensorStruct, float const , float, cudnnTensorStruct, float, cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1> const , cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1> const , cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1> const , cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1> const , cudnn::detail::bn_fw_inf_1C11_kernel_new<float, float, bool=1, int=1>)
3.92% 47.325ms 220 215.11us 48.451us 1.8522ms ZN2at6native18elementwise_kernelILi512ELi1EZNS0_17gpu_binary_kernelIZNS0_21threshold_kernel_implIfEEvRNS_14TensorIteratorET_S6_EUlffE_EEvS5_RKS6_EUliE_EEviT1
1.39% 16.795ms 33 508.95us 95.942us 982.59us volta_scudnn_128x64_relu_interior_nn_v1
1.13% 13.681ms 88 155.47us 62.628us 669.32us ZN2at6native18elementwise_kernelILi512ELi1EZNS0_17gpu_binary_kernelIZNS0_15add_kernel_implIfEEvRNS_14TensorIteratorEN3c106ScalarEEUlffE_EEvS5_RKT_EUliE_EEviT1
1.06% 12.733ms 11 1.1575ms 735.57us 1.3059ms void MaxPoolForward<float, float>(int, float const , int, int, int, int, int, int, int, int, int, int, int, int, int, int, float, long*)
0.81% 9.7477ms 241 40.446us 256ns 1.1619ms [CUDA memcpy HtoD]
0.71% 8.5912ms 22 390.51us 60.740us 721.36us void cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, float const , int, float, cudnn::detail::implicit_convolve_sgemm<float, float, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>, kernel_conv_params, int, float, float, int, float, float, int, int)
0.45% 5.4838ms 11 498.53us 492.64us 503.68us void at::native::_GLOBAL__N__53_tmpxft_000062a1_00000000_8_SoftMax_compute_72_cpp1_ii_a3310042::cunn_SpatialSoftMaxForward<float, float, float, at::native::_GLOBAL__N__53_tmpxft_000062a1_00000000_8_SoftMax_compute_72_cpp1_ii_a3310042::SoftMaxForwardEpilogue>(float*, float*, unsigned int, unsigned int, unsigned int)
0.26% 3.1639ms 11 287.63us 283.83us 293.68us volta_scudnn_128x128_relu_interior_nn_v1
0.25% 2.9911ms 11 271.92us 269.62us 275.09us void caffe_gpu_interp2_kernel<float, float>(int, float, float, bool, THCDeviceTensor<float, int=4, int, DefaultPtrTraits>, THCDeviceTensor<float, int=4, int, DefaultPtrTraits>)
0.22% 2.5994ms 121 21.483us 6.2090us 73.061us void cudnn::winograd::generateWinogradTilesKernel<int=0, float, float>(cudnn::winograd::GenerateWinogradTilesParams<float, float>)
0.14% 1.6343ms 11 148.57us 146.83us 149.74us void kernelTransformReduceOuterDimIndex<float, long, MaxValuePair<float, long>>(float*, long*, float*, unsigned int, unsigned int, unsigned int, thrust::pair<float, long>, float)
0.02% 299.00us 11 27.181us 26.626us 27.874us ZN2at4cuda74_GLOBAL__N__50_tmpxft_00005d0f_00000000_8_Copy_compute_72_cpp1_ii_dd3fb9a321kernelPointwiseApply2IZN74_GLOBAL__N__50_tmpxft_00005d0f_00000000_8_Copy_compute_72_cpp1_ii_dd3fb9a36CopyOpIhlE5applyERNS_6TensorERKS6_EUlRhRKlE_hljLi1ELi1ELi1EEEvNS0_6detail10TensorInfoIT0_T2_EENSF_IT1_SH_EESH_T
0.02% 256.88us 77 3.3360us 2.1760us 4.6730us cudnn::cask::computeOffsetsKernel(cudnn::cask::ComputeOffsetsParams)
0.02% 182.70us 55 3.3210us 2.6560us 4.4480us cudnn::gemm::computeOffsetsKernel(cudnn::gemm::ComputeOffsetsParams)
0.01% 115.91us 11 10.537us 10.208us 10.849us [CUDA memcpy DtoH]
0.01% 85.540us 11 7.7760us 7.2960us 8.0320us void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=0, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=0>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
0.00% 11.361us 15 757ns 320ns 1.6320us [CUDA memset]
API calls: 57.64% 6.15056s 55 111.83ms 13.601us 6.10470s cudaMalloc
32.64% 3.48290s 8 435.36ms 36.866us 3.48263s cudaStreamCreateWithFlags
8.65% 922.84ms 251 3.6766ms 19.713us 91.896ms cudaMemcpyAsync
0.60% 63.532ms 1166 54.487us 28.225us 305.78us cudaLaunchKernel
0.21% 22.934ms 10725 2.1380us 1.5360us 100.93us cudaGetDevice
0.16% 17.473ms 5750 3.0380us 1.8880us 150.73us cudaSetDevice
0.03% 3.3354ms 539 6.1880us 4.0960us 212.40us cudaEventRecord
0.03% 2.9811ms 240 12.421us 5.7280us 143.40us cudaStreamSynchronize
0.01% 1.2251ms 15 81.674us 32.161us 183.56us cudaMemsetAsync
0.01% 786.99us 968 813ns 288ns 24.801us cudaGetLastError
0.01% 629.28us 22 28.603us 22.209us 50.946us cudaBindTexture
0.00% 333.07us 189 1.7620us 832ns 69.987us cuDeviceGetAttribute
0.00% 253.65us 1 253.65us 253.65us 253.65us cudaHostAlloc
0.00% 204.68us 4 51.170us 36.226us 65.667us cudaStreamCreateWithPriority
0.00% 184.68us 30 6.1560us 3.0400us 26.786us cudaFuncSetAttribute
0.00% 168.01us 22 7.6360us 4.4800us 37.185us cudaUnbindTexture
0.00% 152.23us 11 13.839us 9.2800us 37.602us cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
0.00% 96.613us 1 96.613us 96.613us 96.613us cudaGetDeviceProperties
0.00% 83.524us 28 2.9830us 2.4960us 6.8810us cudaEventCreateWithFlags
0.00% 63.331us 1 63.331us 63.331us 63.331us cudaMemcpy
0.00% 46.276us 27 1.7130us 1.5360us 3.9050us cudaDeviceGetAttribute
0.00% 29.506us 14 2.1070us 480ns 12.289us cudaGetDeviceCount
0.00% 25.249us 2 12.624us 7.2960us 17.953us cuDeviceTotalMem
0.00% 11.681us 4 2.9200us 1.4730us 4.6400us cuDeviceGetCount
0.00% 7.6160us 1 7.6160us 7.6160us 7.6160us cudaHostGetDevicePointer
0.00% 7.0080us 2 3.5040us 3.2640us 3.7440us cudaFree
0.00% 5.5370us 1 5.5370us 5.5370us 5.5370us cudaDeviceGetStreamPriorityRange
0.00% 5.3770us 2 2.6880us 1.8250us 3.5520us cuDeviceGetName
0.00% 5.1210us 2 2.5600us 2.0810us 3.0400us cuDeviceGetUuid
0.00% 4.3200us 3 1.4400us 1.1520us 1.9520us cuDeviceGet
0.00% 3.7440us 1 3.7440us 3.7440us 3.7440us cuInit
0.00% 1.8880us 1 1.8880us 1.8880us 1.8880us cuDriverGetVersion

Hi,

Based on your log, cudaMemcpyAsync occupies a lot of time:

8.65% 922.84ms 251 3.6766ms 19.713us 91.896ms cudaMemcpyAsync

We are not sure whether the copy comes from the model input or the output.
Would you mind turning off the xxx.cpu() call and running nvprof again?
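One thing to keep in mind: CUDA kernels run asynchronously, so a call like xxx.cpu() also waits for all previously queued GPU work before it copies. To measure only the device-to-host copy, synchronize first (a sketch; the tensor here is an illustrative stand-in for your model output):

import time
import torch

gpu_tensor = torch.rand(1, 150, 480, 640, device='cuda')  # stand-in for the model output
torch.cuda.synchronize()        # finish all queued GPU work first
start = time.perf_counter()
host_tensor = gpu_tensor.cpu()  # now this times mainly the device-to-host copy itself
print(time.perf_counter() - start)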

By the way, would you also run the following CUDA samples and share the results with us? (Typical build/run commands are sketched after the list.)
0_Simple/simpleMultiCopy
1_Utilities/bandwidthTest
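
Assuming the default JetPack install location (your path may differ), they can usually be built and run like this:

cd /usr/local/cuda/samples/1_Utilities/bandwidthTest && sudo make && ./bandwidthTest
cd /usr/local/cuda/samples/0_Simple/simpleMultiCopy && sudo make && ./simpleMultiCopy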

Thanks.

I switched to HalfTensor in PyTorch and I'm getting these results.
…WITH .cpu()

==13132== NVPROF is profiling process 13132, command: …/env/bin/python sample_Resnet18.py
==13132== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
Loading weights for net_encoder
Loading weights for net_decoder

3.3215057069996874
0.038701985999978206
0.03595212000027459
0.032525050999993255
0.036481886000274244
0.03343103799988967
0.036070784000003187
0.03377045300021564
0.03551198900004238
0.03397779599981732
0.035660191000260966
#######
==13132== Profiling application: …/env/bin/python sample_Resnet18.py
==13132== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 51.51% 166.63ms 187 891.06us 114.57us 4.0394ms volta_fp16_s884cudnn_fp16_256x64_ldg8_relu_f2f_exp_small_nhwc2nchw_tn_v1
9.90% 32.025ms 451 71.008us 7.1040us 424.13us void nchwToNhwcKernel<__half, __half, float, bool=1, bool=0>(int, int, int, int, __half const , __half, float, float)
9.29% 30.057ms 220 136.62us 36.867us 1.0942ms ZN2at6native18elementwise_kernelILi512ELi1EZNS0_17gpu_binary_kernelIZNS0_21threshold_kernel_implIN3c104HalfEEEvRNS_14TensorIteratorET_S8_EUlS5_S5_E_EEvS7_RKS8_EUliE_EEviT1
9.13% 29.543ms 264 111.91us 33.059us 1.0472ms void at::native::batch_norm_transform_input_kernel<c10::Half, float, bool=0, int>(at::PackedTensorAccessor<c10::Half, unsigned long=3, at::RestrictPtrTraits, int>, int, at::native::batch_norm_transform_input_kernel<c10::Half<std::conditional<bool=0, float, at::PackedTensorAccessor>::type, unsigned long=1, c10::Half, at::RestrictPtrTraits>, float, bool=0, int>, std::conditional<bool=0, float, at::PackedTensorAccessor>::type, at::native::batch_norm_transform_input_kernel<c10::Half<at::PackedTensorAccessor, unsigned long=1, c10::Half, at::RestrictPtrTraits>, float, bool=0, int>, at::native::batch_norm_transform_input_kernel<c10::Half<std::conditional<bool=0, float, at::PackedTensorAccessor>::type, unsigned long=1, c10::Half, at::RestrictPtrTraits>, float, bool=0, int>, std::conditional)
7.52% 24.323ms 22 1.1056ms 565.90us 1.9540ms volta_fp16_s884cudnn_fp16_128x128_ldg8_relu_f2f_exp_small_nhwc2nchw_tn_v1
3.24% 10.471ms 11 951.93us 588.88us 1.2502ms void MaxPoolForward<c10::Half, float>(int, c10::Half const , int, int, int, int, int, int, int, int, int, int, int, int, int, int, c10::Half, long*)
2.43% 7.8739ms 88 89.476us 34.882us 738.07us ZN2at6native18elementwise_kernelILi512ELi1EZNS0_17gpu_binary_kernelIZNS0_15add_kernel_implIN3c104HalfEEEvRNS_14TensorIteratorENS4_6ScalarEEUlS5_S5_E_EEvS7_RKT_EUliE_EEviT1
1.53% 4.9516ms 33 150.05us 47.876us 447.49us volta_fp16_s884cudnn_fp16_256x64_ldg8_relu_f2f_exp_interior_nhwc2nchw_tn_v1
1.51% 4.8779ms 22 221.72us 50.148us 399.74us void cudnn::detail::implicit_convolve_sgemm<__half, __half, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, __half const , int, __half, cudnn::detail::implicit_convolve_sgemm<__half, __half, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>, kernel_conv_params, int, float, float, int, __half, __half, int, int)
1.08% 3.4847ms 241 14.459us 288ns 570.89us [CUDA memcpy HtoD]
0.91% 2.9365ms 11 266.96us 261.20us 271.73us void at::native::_GLOBAL__N__53_tmpxft_000062a1_00000000_8_SoftMax_compute_72_cpp1_ii_a3310042::cunn_SpatialSoftMaxForward<c10::Half, float, c10::Half, at::native::_GLOBAL__N__53_tmpxft_000062a1_00000000_8_SoftMax_compute_72_cpp1_ii_a3310042::SoftMaxForwardEpilogue>(c10::Half*, c10::Half*, unsigned int, unsigned int, unsigned int)
0.63% 2.0242ms 11 184.01us 182.38us 186.48us void caffe_gpu_interp2_kernel<c10::Half, float>(int, float, float, bool, THCDeviceTensor<c10::Half, int=4, int, DefaultPtrTraits>, THCDeviceTensor<c10::Half, int=4, int, DefaultPtrTraits>)
0.54% 1.7459ms 11 158.72us 156.59us 160.46us volta_fp16_s884cudnn_fp16_256x128_ldg8_relu_filter1x1_stg8_interior_nchw_nn_v1
0.31% 1.0105ms 11 91.861us 90.663us 93.799us void kernelTransformReduceOuterDimIndex<c10::Half, long, MaxValuePair<c10::Half, long>>(c10::Half*, long*, c10::Half*, unsigned int, unsigned int, unsigned int, thrust::pair<c10::Half, long>, c10::Half)
0.19% 615.76us 11 55.978us 55.748us 56.644us [CUDA memcpy DtoH]
0.18% 594.96us 176 3.3800us 1.7280us 8.6410us cudnn::gemm::computeOffsetsKernel(cudnn::gemm::ComputeOffsetsParams)
0.07% 213.61us 77 2.7740us 1.9520us 3.7760us cudnn::cask::computeOffsetsKernel(cudnn::cask::ComputeOffsetsParams)
0.02% 71.717us 11 6.5190us 6.1440us 7.1690us void op_generic_tensor_kernel<int=2, __half, float, __half, int=256, cudnnGenericOp_t=0, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=0>(cudnnTensorStruct, __half*, cudnnTensorStruct, __half const *, cudnnTensorStruct, __half const *, float, float, float, float, dimArray, reducedDivisorArray)
0.00% 10.240us 15 682ns 320ns 1.5360us [CUDA memset]
API calls: 61.38% 5.53360s 37 149.56ms 12.385us 5.51501s cudaMalloc
36.27% 3.26963s 8 408.70ms 32.195us 3.26939s cudaStreamCreateWithFlags
0.99% 89.061ms 251 354.83us 19.426us 8.8781ms cudaMemcpyAsync
0.86% 77.784ms 1606 48.433us 27.234us 252.34us cudaLaunchKernel
0.24% 21.202ms 9867 2.1480us 1.5040us 163.72us cudaGetDevice
0.15% 13.756ms 4925 2.7930us 1.6960us 74.310us cudaSetDevice
0.03% 3.0962ms 240 12.900us 5.6960us 266.74us cudaStreamSynchronize
0.02% 2.1844ms 275 7.9430us 5.0240us 86.950us cudaEventRecord
0.01% 1.3009ms 1914 679ns 256ns 26.594us cudaGetLastError
0.01% 992.52us 15 66.168us 17.153us 105.54us cudaMemsetAsync
0.01% 710.87us 22 32.312us 23.073us 137.16us cudaBindTexture
0.00% 438.62us 189 2.3200us 1.0560us 65.893us cuDeviceGetAttribute
0.00% 258.16us 1 258.16us 258.16us 258.16us cudaHostAlloc
0.00% 155.34us 28 5.5470us 2.5600us 27.938us cudaEventCreateWithFlags
0.00% 153.77us 4 38.442us 32.002us 53.540us cudaStreamCreateWithPriority
0.00% 153.41us 1 153.41us 153.41us 153.41us cudaGetDeviceProperties
0.00% 149.55us 30 4.9840us 2.9760us 28.834us cudaFuncSetAttribute
0.00% 123.02us 11 11.183us 9.3770us 22.433us cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
0.00% 115.63us 22 5.2550us 4.5120us 7.2330us cudaUnbindTexture
0.00% 80.262us 1 80.262us 80.262us 80.262us cudaMemcpy
0.00% 79.430us 27 2.9410us 1.5680us 26.562us cudaDeviceGetAttribute
0.00% 28.097us 14 2.0060us 544ns 11.937us cudaGetDeviceCount
0.00% 24.450us 2 12.225us 8.6410us 15.809us cuDeviceTotalMem
0.00% 20.288us 22 922ns 448ns 3.1680us cudaCreateChannelDesc
0.00% 10.976us 4 2.7440us 1.5680us 4.0640us cuDeviceGetCount
0.00% 7.1370us 2 3.5680us 3.1680us 3.9690us cudaFree
0.00% 6.8480us 3 2.2820us 1.4720us 3.7760us cuDeviceGet
0.00% 6.1130us 1 6.1130us 6.1130us 6.1130us cudaHostGetDevicePointer
0.00% 5.6320us 2 2.8160us 2.2080us 3.4240us cuDeviceGetUuid
0.00% 4.9280us 2 2.4640us 2.3680us 2.5600us cuDeviceGetName
0.00% 4.7370us 1 4.7370us 4.7370us 4.7370us cudaDeviceGetStreamPriorityRange
0.00% 3.7440us 1 3.7440us 3.7440us 3.7440us cuInit
0.00% 2.4640us 1 2.4640us 2.4640us 2.4640us cuDriverGetVersion

…WITHOUT .cpu()

==14128== NVPROF is profiling process 14128, command: …/env/bin/python sample_Resnet18.py
==14128== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
Loading weights for net_encoder
Loading weights for net_decoder
3.1410531430001356
0.04112113099972703
0.033087531000091985
0.03420883299986599
0.03461205199982942
0.03569376699988425
0.03458936300012283
0.03629312500015658
0.03395003700006782
0.036653397000009136
0.03463275700005397
==14128== Profiling application: …/env/bin/python sample_Resnet18.py
==14128== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 50.25% 158.18ms 187 845.88us 114.50us 3.1467ms volta_fp16_s884cudnn_fp16_256x64_ldg8_relu_f2f_exp_small_nhwc2nchw_tn_v1
10.63% 33.451ms 451 74.169us 7.1690us 696.75us void nchwToNhwcKernel<__half, __half, float, bool=1, bool=0>(int, int, int, int, __half const , __half, float, float)
10.20% 32.109ms 220 145.95us 36.547us 1.2659ms ZN2at6native18elementwise_kernelILi512ELi1EZNS0_17gpu_binary_kernelIZNS0_21threshold_kernel_implIN3c104HalfEEEvRNS_14TensorIteratorET_S8_EUlS5_S5_E_EEvS7_RKS8_EUliE_EEviT1
8.84% 27.841ms 264 105.46us 33.602us 954.56us void at::native::batch_norm_transform_input_kernel<c10::Half, float, bool=0, int>(at::PackedTensorAccessor<c10::Half, unsigned long=3, at::RestrictPtrTraits, int>, int, at::native::batch_norm_transform_input_kernel<c10::Half<std::conditional<bool=0, float, at::PackedTensorAccessor>::type, unsigned long=1, c10::Half, at::RestrictPtrTraits>, float, bool=0, int>, std::conditional<bool=0, float, at::PackedTensorAccessor>::type, at::native::batch_norm_transform_input_kernel<c10::Half<at::PackedTensorAccessor, unsigned long=1, c10::Half, at::RestrictPtrTraits>, float, bool=0, int>, at::native::batch_norm_transform_input_kernel<c10::Half<std::conditional<bool=0, float, at::PackedTensorAccessor>::type, unsigned long=1, c10::Half, at::RestrictPtrTraits>, float, bool=0, int>, std::conditional)
7.90% 24.861ms 22 1.1300ms 564.52us 2.2058ms volta_fp16_s884cudnn_fp16_128x128_ldg8_relu_f2f_exp_small_nhwc2nchw_tn_v1
2.87% 9.0469ms 11 822.45us 586.41us 1.1703ms void MaxPoolForward<c10::Half, float>(int, c10::Half const , int, int, int, int, int, int, int, int, int, int, int, int, int, int, c10::Half, long*)
2.18% 6.8539ms 88 77.885us 35.202us 144.30us ZN2at6native18elementwise_kernelILi512ELi1EZNS0_17gpu_binary_kernelIZNS0_15add_kernel_implIN3c104HalfEEEvRNS_14TensorIteratorENS4_6ScalarEEUlS5_S5_E_EEvS7_RKT_EUliE_EEviT1
1.59% 4.9927ms 33 151.29us 47.523us 733.55us volta_fp16_s884cudnn_fp16_256x64_ldg8_relu_f2f_exp_interior_nhwc2nchw_tn_v1
1.55% 4.8688ms 22 221.31us 49.571us 397.82us void cudnn::detail::implicit_convolve_sgemm<__half, __half, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>(int, int, int, __half const , int, __half, cudnn::detail::implicit_convolve_sgemm<__half, __half, int=1024, int=5, int=5, int=3, int=3, int=3, int=1, bool=1, bool=0, bool=1>, kernel_conv_params, int, float, float, int, __half, __half, int, int)
1.08% 3.3948ms 241 14.086us 288ns 584.65us [CUDA memcpy HtoD]
0.92% 2.8976ms 11 263.42us 258.35us 271.38us void at::native::_GLOBAL__N__53_tmpxft_000062a1_00000000_8_SoftMax_compute_72_cpp1_ii_a3310042::cunn_SpatialSoftMaxForward<c10::Half, float, c10::Half, at::native::_GLOBAL__N__53_tmpxft_000062a1_00000000_8_SoftMax_compute_72_cpp1_ii_a3310042::SoftMaxForwardEpilogue>(c10::Half*, c10::Half*, unsigned int, unsigned int, unsigned int)
0.65% 2.0357ms 11 185.07us 182.32us 186.12us void caffe_gpu_interp2_kernel<c10::Half, float>(int, float, float, bool, THCDeviceTensor<c10::Half, int=4, int, DefaultPtrTraits>, THCDeviceTensor<c10::Half, int=4, int, DefaultPtrTraits>)
0.55% 1.7460ms 11 158.73us 157.32us 160.39us volta_fp16_s884cudnn_fp16_256x128_ldg8_relu_filter1x1_stg8_interior_nchw_nn_v1
0.32% 1.0080ms 11 91.633us 89.542us 93.317us void kernelTransformReduceOuterDimIndex<c10::Half, long, MaxValuePair<c10::Half, long>>(c10::Half*, long*, c10::Half*, unsigned int, unsigned int, unsigned int, thrust::pair<c10::Half, long>, c10::Half)
0.20% 627.98us 11 57.088us 55.715us 66.501us [CUDA memcpy DtoH]
0.18% 574.37us 176 3.2630us 1.6960us 8.2570us cudnn::gemm::computeOffsetsKernel(cudnn::gemm::ComputeOffsetsParams)
0.07% 209.77us 77 2.7240us 1.9200us 3.7120us cudnn::cask::computeOffsetsKernel(cudnn::cask::ComputeOffsetsParams)
0.02% 69.764us 11 6.3420us 6.0490us 6.8490us void op_generic_tensor_kernel<int=2, __half, float, __half, int=256, cudnnGenericOp_t=0, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=0>(cudnnTensorStruct, __half*, cudnnTensorStruct, __half const *, cudnnTensorStruct, __half const *, float, float, float, float, dimArray, reducedDivisorArray)
0.00% 10.144us 15 676ns 288ns 1.6000us [CUDA memset]
API calls: 63.56% 5.74995s 37 155.40ms 12.897us 5.73074s cudaMalloc
34.16% 3.09045s 8 386.31ms 32.546us 3.09021s cudaStreamCreateWithFlags
0.93% 83.765ms 1606 52.157us 28.385us 272.88us cudaLaunchKernel
0.88% 79.891ms 251 318.29us 20.321us 7.9057ms cudaMemcpyAsync
0.23% 21.167ms 9867 2.1450us 1.5040us 145.03us cudaGetDevice
0.13% 11.911ms 4925 2.4180us 1.6000us 65.060us cudaSetDevice
0.03% 2.6668ms 240 11.111us 5.6640us 70.980us cudaStreamSynchronize
0.02% 2.1764ms 275 7.9140us 5.1520us 101.54us cudaEventRecord
0.01% 1.2593ms 1914 657ns 256ns 24.033us cudaGetLastError
0.01% 1.1157ms 15 74.377us 32.641us 119.01us cudaMemsetAsync
0.01% 587.07us 22 26.684us 22.209us 49.091us cudaBindTexture
0.00% 377.88us 189 1.9990us 928ns 77.572us cuDeviceGetAttribute
0.00% 259.02us 1 259.02us 259.02us 259.02us cudaHostAlloc
0.00% 151.66us 28 5.4160us 2.5280us 25.825us cudaEventCreateWithFlags
0.00% 143.72us 4 35.929us 30.625us 44.994us cudaStreamCreateWithPriority
0.00% 134.15us 1 134.15us 134.15us 134.15us cudaGetDeviceProperties
0.00% 124.97us 22 5.6800us 4.3520us 16.897us cudaUnbindTexture
0.00% 124.14us 30 4.1370us 2.8800us 15.969us cudaFuncSetAttribute
0.00% 121.93us 11 11.084us 9.7610us 12.865us cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
0.00% 81.540us 1 81.540us 81.540us 81.540us cudaMemcpy
0.00% 66.754us 27 2.4720us 1.5360us 22.593us cudaDeviceGetAttribute
0.00% 47.619us 14 3.4010us 416ns 24.737us cudaGetDeviceCount
0.00% 23.681us 2 11.840us 6.3690us 17.312us cuDeviceTotalMem
0.00% 12.482us 22 567ns 448ns 673ns cudaCreateChannelDesc
0.00% 11.328us 4 2.8320us 1.5360us 4.0960us cuDeviceGetCount
0.00% 6.1760us 2 3.0880us 1.5040us 4.6720us cuDeviceGetUuid
0.00% 6.1440us 2 3.0720us 2.8800us 3.2640us cudaFree
0.00% 5.9840us 1 5.9840us 5.9840us 5.9840us cudaHostGetDevicePointer
0.00% 5.6640us 3 1.8880us 1.8560us 1.9200us cuDeviceGet
0.00% 5.0570us 1 5.0570us 5.0570us 5.0570us cudaDeviceGetStreamPriorityRange
0.00% 4.3530us 2 2.1760us 2.0480us 2.3050us cuDeviceGetName
0.00% 3.6480us 1 3.6480us 3.6480us 3.6480us cuInit
0.00% 2.2720us 1 2.2720us 2.2720us 2.2720us cuDriverGetVersion

Hi,

There is no big difference between the two sets of profiling data.
Is performance still degraded in half mode with cpu()?

Thanks.

Hello,

I am experiencing the same issue. Is there any update or workaround?

Thanks

I improved the performance by decreasing the precision in PyTorch: load both your input image and your trained model as half tensors, along the lines of the sketch below.
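A minimal sketch of what I mean (the model and image here are illustrative placeholders for your own):

import torch

# Illustrative placeholders for the trained network and the input frame.
model = torch.nn.Conv2d(3, 150, 3, padding=1).cuda().half().eval()  # fp16 weights
image = torch.rand(1, 3, 480, 640).cuda().half()                    # fp16 input

with torch.no_grad():
    out = model(image)

# fp16 halves the bytes moved in the device-to-host copy;
# cast back to float on the CPU side if later code needs fp32.
pred = out.cpu().float()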

Thanks agarwal, but I don't think our project can afford to lose precision.

Isn't there any quicker transfer method?