RTX 2070 running extremely slow on low grid size compared to 1070

Hello all,

I have been struggling with the following issue for a few days now. My knowledge of CUDA is basic.

My program launches kernels over several one-dimensional grids to do what is essentially a 2D fluid dynamics computation.

I am able to complete a 512x512 grid for 6,000 cycles in 15 seconds on a 1070. I have a customer with a 2070 who completes the same simulation in 4 minutes, where it should take around 7 seconds. What makes this more confusing is that when the grid is scaled up to 8196x8196, his 2070 performs twice as fast as my 1070, as expected. The slowdown consistently correlates with grid size. We are both on 64-bit Windows 10.
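For reference, here is the back-of-envelope launch geometry at each grid size (a rough sketch; the 256-thread block size matches my setup, but the SM counts are assumptions: 15 SMs on a GTX 1070, 36 on an RTX 2070):

```python
# Back-of-envelope: how many blocks each grid size produces, and how many
# of them land on each SM on average. SM counts below are assumptions.
BLOCK_SIZE = 256   # threads per block (my current setting)
SMS_1070 = 15      # GTX 1070 (Pascal), assumed
SMS_2070 = 36      # RTX 2070 (Turing), assumed

def blocks_for(nx, ny, block=BLOCK_SIZE):
    """Number of 1D blocks needed to cover an nx-by-ny grid of cells."""
    cells = nx * ny
    return (cells + block - 1) // block  # ceiling division

for nx in (512, 8196):
    b = blocks_for(nx, nx)
    print(f"{nx}x{nx}: {b} blocks "
          f"(~{b // SMS_1070} per SM on 1070, ~{b // SMS_2070} per SM on 2070)")
```

So the small grid launches only ~1024 blocks per kernel, while the large one launches hundreds of thousands; both cards have plenty of blocks per SM either way, which is part of why the small-grid slowdown puzzles me.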

I am compiling to a .DLL and calling the CUDA functions from Python once per cycle. To cover as many GPUs as possible, I am compiling for every compute capability from 3.5 up:

"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\bin\nvcc.exe" -gencode=arch=compute_35,code="sm_35,compute_35" -gencode=arch=compute_37,code="sm_37,compute_37" -gencode=arch=compute_50,code="sm_50,compute_50" -gencode=arch=compute_52,code="sm_52,compute_52" -gencode=arch=compute_53,code="sm_53,compute_53" -gencode=arch=compute_60,code="sm_60,compute_60" -gencode=arch=compute_61,code="sm_61,compute_61" -gencode=arch=compute_62,code="sm_62,compute_62" -gencode=arch=compute_70,code="sm_70,compute_70" -gencode=arch=compute_72,code="sm_72,compute_72" -gencode=arch=compute_75,code="sm_75,compute_75" -gencode=arch=compute_80,code="sm_80,compute_80" -gencode=arch=compute_86,code="sm_86,compute_86" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64" -x cu -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static --library=cudart -DCODE_ANALYSIS -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_WINDLL -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdx64\Release\vc142.pdb /FS /MD " -o x64\Release\gpu_compute.cu.obj

I have tried lowering the block size from 256 to 128, but that did not seem to help.

Could this have something to do with the "maximum number of resident threads per multiprocessor" dropping from 2048 to 1024 on the 2070?

I am definitely in over my head…

Appreciate any help or advice on this.