[Bug][Vulkan 1.2 beta drivers] Memory copies to the GPU_Local CPU_coherent heap are 1000x slower than normal

On the latest Vulkan 1.2 beta drivers, 443.09.
MSI GT72VR GRE Dominator Pro laptop. i7-6700HQ + GTX 1070.

While developing my Vulkan toy engine, I found an extreme performance regression in uniform data uploads.
The uniform data is about 45 kilobytes, uploaded every frame, and the upload takes 10 milliseconds when done with “memcpy” on Visual Studio 2019.

Looking into the issue, the Vulkan Memory Allocator (VMA) library is selecting the 3rd heap, the one that is 256 megabytes and is DEVICE_LOCAL, HOST_VISIBLE and HOST_COHERENT. According to the documentation of the VMA library and to NVIDIA presentations, this heap is ideal for uploading frequently changing uniforms.
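For anyone trying to reproduce this, here is a rough sketch of how a memory type with those three flags can be looked up. It is not VMA's actual selection logic (VMA also weighs preferred flags and usage); it is just the plain "first type that has all required bits" search. The flag constants use the values of the matching VkMemoryPropertyFlagBits entries, while `MemoryType`, `exampleTypes()` and `findMemoryType()` are hypothetical stand-ins so that the snippet builds without the Vulkan headers. The table in `exampleTypes()` is made up for illustration and is not a dump from the laptop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same numeric values as the corresponding VkMemoryPropertyFlagBits entries.
constexpr std::uint32_t kDeviceLocal  = 0x1; // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr std::uint32_t kHostVisible  = 0x2; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr std::uint32_t kHostCoherent = 0x4; // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
constexpr std::uint32_t kHostCached   = 0x8; // VK_MEMORY_PROPERTY_HOST_CACHED_BIT

// Hypothetical stand-in for VkMemoryType.
struct MemoryType {
    std::uint32_t propertyFlags;
    std::uint32_t heapIndex;
};

// Made-up memory type table, for illustration only.
std::vector<MemoryType> exampleTypes() {
    return {
        {kDeviceLocal, 0},                                // VRAM, not mappable
        {kHostVisible | kHostCoherent, 1},                // system memory
        {kHostVisible | kHostCoherent | kHostCached, 1},  // cached system memory
        {kDeviceLocal | kHostVisible | kHostCoherent, 2}, // small mappable VRAM heap
    };
}

// Returns the index of the first memory type that is allowed by `typeBits`
// (as in VkMemoryRequirements::memoryTypeBits) and has every bit of
// `required` set, or -1 if there is none.
int findMemoryType(const std::vector<MemoryType>& types,
                   std::uint32_t typeBits, std::uint32_t required) {
    assert(types.size() <= 32); // Vulkan exposes at most 32 memory types
    for (std::size_t i = 0; i < types.size(); ++i) {
        const bool allowed = (typeBits & (1u << i)) != 0;
        const bool matches = (types[i].propertyFlags & required) == required;
        if (allowed && matches) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

With the example table, asking for `kDeviceLocal | kHostVisible | kHostCoherent` returns the last entry, which lives in heap index 2, i.e. the 3rd heap.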

I found that every single memory write has a large constant cost. The Visual Studio 2019 implementation of memcpy uses the instruction “rep movs” to copy the data; this instruction can be used directly through the intrinsic __movsb().

When comparing against AVX memory copies (more data per instruction), I found a linear difference in speed. This can be tested immediately by using the other instructions of the movs family.
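A minimal, portable sketch of this kind of comparison is below. It does not use the MSVC intrinsics; instead it assumes that one volatile store per loop iteration is a fair stand-in for "one write of N bytes", so the store width can be picked through the template parameter. `copyWithStoreWidth` and `timeMicroseconds` are hypothetical helper names. In the real test the destination would be the pointer returned by mapping the Vulkan allocation; with ordinary heap memory as the destination the slowdown is not expected to show up.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies `bytes` bytes using one store of sizeof(T) per loop iteration.
// The volatile destination keeps the compiler from merging the stores into
// wider ones or replacing the loop with a memcpy call.
// Assumes both pointers are aligned for T and `bytes` is a multiple of sizeof(T).
template <typename T>
void copyWithStoreWidth(void* dst, const void* src, std::size_t bytes) {
    assert(bytes % sizeof(T) == 0);
    volatile T* d = static_cast<volatile T*>(dst);
    const T* s = static_cast<const T*>(src);
    const std::size_t count = bytes / sizeof(T);
    for (std::size_t i = 0; i < count; ++i) {
        d[i] = s[i];
    }
}

// Runs `f` once and returns the elapsed wall-clock time in microseconds.
template <typename F>
double timeMicroseconds(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}
```

Usage would look like `timeMicroseconds([&] { copyWithStoreWidth<std::uint8_t>(mapped, data, size); })` versus the same call with `std::uint64_t`, where `mapped`, `data` and `size` are whatever the engine already has for the uniform upload.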

When using __movsb(), which copies one byte at a time, it takes 9 milliseconds to copy 45 kilobytes of memory. When using __movsq() instead, which copies 8 bytes at a time, it takes 1 millisecond to copy the same amount of data. 8x more data per instruction = speedup of exactly 8 times. When using AVX and copying more data at once, the same linear improvement can be seen.
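To put a number on the "constant cost per write", the timings above can be turned into an average time per store. The sketch below assumes that 45 kilobytes means 45 * 1024 = 46080 bytes and that the 9 ms and 1 ms figures are rounded measurements; `nsPerStore` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: "45 kilobytes" is taken as 45 * 1024 bytes.
constexpr std::size_t kUploadBytes = 45 * 1024;

// Average time per store in nanoseconds, given the total copy time in
// milliseconds and the number of bytes written by each store.
inline double nsPerStore(double totalMs, std::size_t bytes, std::size_t storeWidth) {
    assert(storeWidth != 0 && bytes >= storeWidth);
    const double stores = static_cast<double>(bytes / storeWidth);
    return totalMs * 1.0e6 / stores;
}

// nsPerStore(9.0, kUploadBytes, 1) is about 195 ns (46080 one-byte stores in 9 ms).
// nsPerStore(1.0, kUploadBytes, 8) is about 174 ns (5760 eight-byte stores in 1 ms).
```

Both cases come out at a bit under 0.2 microseconds per store, whatever the store width, which is what a fixed per-write cost would look like.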

When copying into the CPU memory heap (no DEVICE_LOCAL flag), the same copy takes 10 microseconds, with little variation depending on the method used to copy the memory.

This only happens on the laptop model described above. When running the same program on a desktop PC with a dedicated RTX 2080, the speed is close to the expected values, without this dramatic performance hit.