Use ffmpeg with CPU filter after decoding and filtering with GPU

I’m transcoding home videos with ffmpeg for device compatibility and trying to make the best use of my GPU. I am struggling to add a CPU-based filter to the chain. The following GPU-only command works fine:

ffmpeg -y -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -i original.mp4 -vf "scale_npp=format=yuv420p,transpose_npp=clock,scale_npp=1280:720:interp_algo=super:force_original_aspect_ratio=decrease" -c:a copy -c:v h264_nvenc -preset p6 -rc-lookahead 20 -b:v 5M out.mp4

But as soon as I add hwdownload followed by a CPU-based filter, it fails with some very generic errors (as is typical when trying to use a CPU-based filter in a hardware-accelerated context):

ffmpeg -y -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -i original.mp4 -vf "scale_npp=format=yuv420p,transpose_npp=clock,scale_npp=1280:720:interp_algo=super:force_original_aspect_ratio=decrease,hwdownload,pad=width=1280:height=720,hwupload" -c:a copy -c:v h264_nvenc -preset p6 -rc-lookahead 20 -b:v 5M out.mp4

Error:

[hwdownload @ 0x555853f82a40] Input frame is not the in the configured hwframe context.
Error while filtering: Invalid argument
Failed to inject frame into filter network: Invalid argument
Error while processing the decoded data for stream #0:2
Conversion failed!

I have seen the recommended commands for mixing GPU and CPU processing documented here: https://developer.nvidia.com/blog/nvidia-ffmpeg-transcoding-guide/ but just about every example I’ve seen involves some CPU decoding or filtering at the beginning of the chain. Here I only want to pad right before encoding, so it seems like a shame to have the frames on the GPU to scale and transpose, then download them for padding, only to turn around and re-upload them for encoding. Is that the right way to do this?
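One variant that might be worth trying, as an untested sketch rather than a confirmed fix: the ffmpeg documentation for hwdownload notes that an explicit format filter may be needed immediately after it, and h264_nvenc accepts yuv420p frames from system memory, so the trailing hwupload can be dropped. The choice of format=yuv420p is an assumption based on the first scale_npp converting to that format, and whether this avoids the hwframe context error above is not verified. The snippet only assembles and prints the command so the filter chain can be inspected before running it.

```shell
# Untested sketch: same GPU stage as the working command above.
gpu_stage="scale_npp=format=yuv420p,transpose_npp=clock,scale_npp=1280:720:interp_algo=super:force_original_aspect_ratio=decrease"

# Download to system memory and name the software format explicitly.
# yuv420p is an assumption: it matches what the first scale_npp converts to.
cpu_stage="hwdownload,format=yuv420p,pad=width=1280:height=720"

# No trailing hwupload: h264_nvenc can take yuv420p frames from system memory.
vf="${gpu_stage},${cpu_stage}"

# Print the full command for inspection; paste the output into a shell to run it.
printf 'ffmpeg -y -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -i original.mp4 -vf "%s" -c:a copy -c:v h264_nvenc -preset p6 -rc-lookahead 20 -b:v 5M out.mp4\n' "$vf"
```

If the frames do need to go back to the GPU for a further GPU filter, appending hwupload_cuda to the chain is the CUDA-specific upload filter, at the cost of the extra copy.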

Any help would be much appreciated.