GTX 650 - Vulkan rendering is slower than OpenGL

I implemented draw call batching with secondary command buffers as an example of Vulkan’s benefits over OpenGL. However, the OpenGL version is way faster (130 fps vs. 40 fps).

Code here: https://www.dropbox.com/s/nz4ggrqgipqqt7t/FOR_NVIDIA.zip?dl=0

The relevant samples are 71_DrawBatching and 71_CompareToGL. Due to the lack of tools I can’t determine why Vulkan is so much slower… I appreciate any help.

Another thought: the driver seems to be very “permissive”, as it didn’t warn me about the relation between VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS and vkCmdExecuteCommands, while the standard explicitly states that the latter is the only valid “draw” command when beginning the render pass with this flag.

(ps.: I have the latest Vulkan driver)
(ps.: in the 71_DrawBatching sample you can use the A key to stop animating and the D key to show the debug camera)

I just took a quick look at that code and it seems you don’t copy your buffers to device local memory. All of your buffers are created with the host visible bit and are therefore mapped into the address space of the host, and that kills performance.

So stage your buffers first and then test performance again:

  • Create staging buffers with host visibility
  • Copy your data there (vertices, indices, storage, etc.)
  • Create your buffers used in the rendering and shaders with a device local memory type
  • Copy data from the staging buffers to these
  • Delete the staging buffers

This is one of the things that OpenGL did implicitly, which is no longer the case with Vulkan.

I’m trying to do this, but I’m having a pretty hard time with buffer memory barriers.

Imagine the following:

  • vkCmdCopy + buffer barrier for prototypes[0] <— when doing only this, everything is ok
  • vkCmdCopy + buffer barrier for prototypes[1] <— I get a black screen and computer might even freeze to death

Another thing:

  • vkCmdCopy + buffer barrier for debugmesh <— I get

vkCmdBindVertexBuffers(): Cannot read invalid memory 0x3c, please fill the memory before using.
vkCmdBindIndexBuffer(): Cannot read invalid memory 0x40, please fill the memory before using.

Which is totally incomprehensible, unless it is related to debugmesh being used in the second subpass…

Code of syncing:

void VulkanBuffer::Synchronize(VkCommandBuffer commandbuffer)
{
    if( stagingbuffer )
    {
        VkBufferCopy region;

        region.srcOffset    = 0;
        region.dstOffset    = 0;
        region.size         = originalsize; // memreqs.size didn't work either

        vkCmdCopyBuffer(commandbuffer, stagingbuffer, buffer, 1, &region);
        VulkanBufferAccessTransferBarrier(commandbuffer, buffer, VK_ACCESS_TRANSFER_WRITE_BIT, VK_ACCESS_SHADER_READ_BIT);

        if( !(exflags & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT) )
        {
            vkDestroyBuffer(driverinfo.device, stagingbuffer, 0);
            vkFreeMemory(driverinfo.device, stagingmemory, 0);

            stagingbuffer = 0;
            stagingmemory = 0;
        }
    }
}

And

void VulkanBufferAccessTransferBarrier(VkCommandBuffer commandbuffer, VkBuffer buffer, VkAccessFlags oldflags, VkAccessFlags newflags)
{
    VkBufferMemoryBarrier barrier = {};

    barrier.sType                = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    barrier.pNext                = NULL;
    barrier.buffer               = buffer;
    barrier.srcAccessMask        = oldflags;
    barrier.dstAccessMask        = newflags;
    barrier.srcQueueFamilyIndex  = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex  = VK_QUEUE_FAMILY_IGNORED;
    barrier.offset               = 0;
    barrier.size                 = VK_WHOLE_SIZE;

    // VK_PIPELINE_STAGE_TRANSFER_BIT didn't work either
    vkCmdPipelineBarrier(commandbuffer, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, 0, 0, NULL, 1, &barrier, 0, NULL);
}

Solved (still slower [80 vs. 160 fps], but acceptable).
I will post the corrected code in case anyone else gets stuck with this problem.

In a nutshell: no buffer barrier is needed, and the staging buffer must be kept alive until the copy has finished executing (use a separate command buffer and vkQueueWaitIdle()).

Here are a few things I noticed that will impact perf. Note that I haven’t run the code (can’t do that here at work):

  • It seems that you at least partially rebuild your command buffers on each frame. If that's the case, then that's one of your main perf problems. The whole idea of command buffers is to only (re)create them when required and then submit them multiple times. As you're already using secondary CBs for your tiles, try to only recreate them if the actual visibility has changed, e.g. when the camera moves.
  • Disable validation. The layers will cost performance.
  • Create fence only once. Same as for the CB, no need to recreate the fence each frame.

Thanks for your reply, but:

  1. I regen the cmds only when their visibility changes (if you press the A key then no cmdbuff should be generated while the fps is still low)

  2. in Release mode validation is disabled and I get like 5 fps more

  3. fence: according to the spec it should be fast to create/destroy it, but ok, I will do that (ps.: brought another 1-2 fps)

Updated code: https://www.dropbox.com/s/oj28rb0sia8bml6/FOR_NVIDIA_2.zip?dl=0

ps.: I added a define to disable secondary command buffers; still slower than GL
ps.: I have this perfmeasure thing, which measures specific points in the code, and I can tell that command buffer generation is not the bottleneck. Example output after running for 17 seconds:

Update tiles:    0.175106 s (1759)
Encode:          0.0239156 s (1759)
Wait for fence:  16.7513 s (1759)
Present:         0.597658 s (1759)
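
To put those numbers in per-frame terms, here is a minimal sketch that only redoes the arithmetic on the four rows quoted above. The struct and function names (Measurement, AverageMs, Share, PrintBreakdown) are made up for illustration and are not part of the posted code; the values are copied from the output as-is.

```cpp
#include <cassert>
#include <cstdio>

// One row of the perfmeasure output: label, accumulated seconds, sample count.
// (Hypothetical type, only used for this calculation.)
struct Measurement {
    const char* label;
    double      totalSeconds;
    int         samples;
};

// Values copied from the output quoted in the post.
const Measurement kRows[] = {
    { "Update tiles",   0.175106,  1759 },
    { "Encode",         0.0239156, 1759 },
    { "Wait for fence", 16.7513,   1759 },
    { "Present",        0.597658,  1759 },
};
const int kRowCount = sizeof(kRows) / sizeof(kRows[0]);

// Average cost of one sample (one frame here) in milliseconds.
double AverageMs(const Measurement& m)
{
    assert(m.samples > 0);
    return m.totalSeconds / m.samples * 1000.0;
}

// Fraction (0..1) of the summed measured time spent in row `index`.
double Share(const Measurement* rows, int count, int index)
{
    assert(index >= 0 && index < count);

    double sum = 0.0;
    for (int i = 0; i < count; ++i)
        sum += rows[i].totalSeconds;

    return rows[index].totalSeconds / sum;
}

// Prints one line per row: average ms per frame and share of the total.
void PrintBreakdown()
{
    for (int i = 0; i < kRowCount; ++i) {
        std::printf("%-15s %8.4f ms/frame  %5.1f%%\n",
                    kRows[i].label,
                    AverageMs(kRows[i]),
                    Share(kRows, kRowCount, i) * 100.0);
    }
}
```

Calling PrintBreakdown() shows the fence wait at roughly 9.5 ms per frame, about 95% of the measured time, while encoding stays far below a tenth of a millisecond. That suggests the CPU is mostly idle waiting for the GPU to finish, so the cost is more likely in what the GPU executes (or in how the frame is synchronized) than in command buffer generation.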