GTX 650 - Vulkan rendering is slower than OpenGL

Asylum2014 · June 10, 2016, 4:20pm

I implemented draw call batching with secondary command buffers in order to give an example of Vulkan’s benefits over OpenGL. However, OpenGL is much faster (130 fps vs. 40 fps).

Code here: (Dropbox link; the file has since been deleted)

The relevant samples are 71_DrawBatching and 71_CompareToGL. Due to the lack of tools I can’t determine why Vulkan is so much slower… I'd appreciate any help.

Another thought: the driver seems to be very “permissive”, as it didn’t warn me about the relation between VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS and vkCmdExecuteCommands, while the spec explicitly states that the latter is the only valid “draw” command when beginning the render pass with this flag.

(ps.: I have the latest Vulkan driver)
(ps.: in the 71_DrawBatching sample you can use the A key to stop animating and the D key to show the debug camera)

SaschaWillems · June 11, 2016, 11:41am

I just took a quick look at that code and it seems you don’t copy your buffers to device local memory. All of your buffers are created with the host visible bit and are therefore mapped into the address space of the host, and that kills performance.

So stage your buffers first and then test performance again:

1. Create staging buffers with host visibility
2. Copy your data there (vertices, indices, storage, etc.)
3. Create your buffers used in the rendering and shaders with a device local memory type
4. Copy data from the staging buffers to these
5. Delete the staging buffers

This is one of the things that OpenGL did implicitly, which is no longer the case with Vulkan.

Asylum2014 · June 13, 2016, 12:49pm

I’m trying to do this, but I’m having a pretty hard time with buffer memory barriers.

Imagine the following:

vkCmdCopy + buffer barrier for prototypes[0] <— when doing only this, everything is ok
vkCmdCopy + buffer barrier for prototypes[1] <— I get a black screen and computer might even freeze to death

Another thing:

vkCmdCopy + buffer barrier for debugmesh <— I get

vkCmdBindVertexBuffers(): Cannot read invalid memory 0x3c, please fill the memory before using.
vkCmdBindIndexBuffer(): Cannot read invalid memory 0x40, please fill the memory before using.

Which is totally incomprehensible unless it is related to debugmesh being used in the second subpass…

Code of syncing:

void VulkanBuffer::Synchronize(VkCommandBuffer commandbuffer)
{
    if( stagingbuffer )
    {
        VkBufferCopy region;

        region.srcOffset    = 0;
        region.dstOffset    = 0;
        region.size         = originalsize; // memreqs.size didn't work either

        vkCmdCopyBuffer(commandbuffer, stagingbuffer, buffer, 1, &region);
        VulkanBufferAccessTransferBarrier(commandbuffer, buffer, VK_ACCESS_TRANSFER_WRITE_BIT, VK_ACCESS_SHADER_READ_BIT);

        if( !(exflags & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT) )
        {
            vkDestroyBuffer(driverinfo.device, stagingbuffer, 0);
            vkFreeMemory(driverinfo.device, stagingmemory, 0);

            stagingbuffer = 0;
            stagingmemory = 0;
        }
    }
}

And

void VulkanBufferAccessTransferBarrier(VkCommandBuffer commandbuffer, VkBuffer buffer, VkAccessFlags oldflags, VkAccessFlags newflags)
{
    VkBufferMemoryBarrier barrier = {};

    barrier.sType                = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    barrier.pNext                = NULL;
    barrier.buffer               = buffer;
    barrier.srcAccessMask        = oldflags;
    barrier.dstAccessMask        = newflags;
    barrier.srcQueueFamilyIndex  = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex  = VK_QUEUE_FAMILY_IGNORED;
    barrier.offset               = 0;
    barrier.size                 = VK_WHOLE_SIZE;

    // VK_PIPELINE_STAGE_TRANSFER_BIT didn't work either
    vkCmdPipelineBarrier(commandbuffer, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, 0, 0, NULL, 1, &barrier, 0, NULL);
}

Asylum2014 · June 14, 2016, 2:37pm

Solved (still slower [80 vs. 160 fps], but acceptable).
I will post the corrected code in case anyone else gets stuck with this problem.

In a nutshell: no buffer barrier needed; staging buffer must be kept until copy finishes (use a separate command buffer and vkQueueWaitIdle()).

SaschaWillems · June 15, 2016, 6:42am

Here are a few things that I noticed that will impact perf. Note that I haven’t run the code (can’t do that here at work):

1. It seems that you at least partially rebuild your command buffers on each frame. If that's the case then that's one of your main perf problems. The whole idea of command buffers is to only (re)create them if required and then do multiple submits. As you're already using secondary CBs for your tiles, try to only recreate them if the actual visibility has changed, e.g. when changing the camera.
2. Disable validation. The layers will cost performance.
3. Create the fence only once. Same as for the CBs, no need to recreate the fence each frame.

Asylum2014 · June 15, 2016, 7:53am

Thanks for your reply, but:

1. I regenerate the command buffers only when their visibility changes (if you press the A key, no cmdbuff should be generated, yet the fps is still low)
2. In Release mode validation is disabled, and I get about 5 fps more
3. Fence: according to the spec it should be fast to create/destroy it, but ok, I will do that (ps.: brought another 1-2 fps)

Updated code: https://www.dropbox.com/s/oj28rb0sia8bml6/FOR_NVIDIA_2.zip?dl=0

ps.: I added a define to disable secondary command buffers; still slower than GL
ps.: I have this perfmeasure thing, which measures specific points in the code, and I can tell that cmdbuff generation is not the bottleneck. Example output after running for 17 seconds:

Update tiles:    0.175106 s (1759)
Encode:          0.0239156 s (1759)
Wait for fence:  16.7513 s (1759)
Present:         0.597658 s (1759)
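
To make the numbers above easier to read, here is a minimal sketch that converts each row into an average per-frame cost and a share of the measured time. It assumes each row is an accumulated time in seconds followed by a sample count in parentheses; the names Sample, perCallMs, shareOfTotal, postedSamples and printBreakdown are hypothetical and are not part of the posted project.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// One row of the profiler output: label, accumulated seconds, sample count.
// (Hypothetical type; the layout is an assumption about the posted output.)
struct Sample {
    std::string label;
    double totalSeconds;
    int count;
};

// Average cost of one sample, in milliseconds.
double perCallMs(const Sample& s) {
    return s.count > 0 ? s.totalSeconds * 1000.0 / s.count : 0.0;
}

// Fraction (0..1) of the summed measured time that sample s accounts for.
double shareOfTotal(const Sample& s, const std::vector<Sample>& all) {
    double sum = 0.0;
    for (const Sample& x : all) sum += x.totalSeconds;
    return sum > 0.0 ? s.totalSeconds / sum : 0.0;
}

// The four measurements quoted in the post.
std::vector<Sample> postedSamples() {
    return {
        {"Update tiles",   0.175106,  1759},
        {"Encode",         0.0239156, 1759},
        {"Wait for fence", 16.7513,   1759},
        {"Present",        0.597658,  1759},
    };
}

// Print one line per row: average ms per frame and percentage of the total.
void printBreakdown(const std::vector<Sample>& all) {
    for (const Sample& s : all) {
        std::printf("%-16s %8.3f ms/frame  %5.1f%%\n",
                    s.label.c_str(), perCallMs(s), 100.0 * shareOfTotal(s, all));
    }
}
```

Calling printBreakdown(postedSamples()) shows the fence wait at roughly 9.5 ms per frame, about 95% of the measured time, while encoding stays well under 0.1 ms per frame. That is consistent with the CPU mostly waiting on the GPU rather than being held up by command buffer generation.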

laurentduroisin · October 23, 2025, 8:45pm

I have the same problem on my NVIDIA GTX 1660 Super: Vulkan is slow even after creating pipelines in advance, recording secondary command buffers on multiple threads, double buffering, and using dynamic descriptor sets to reduce the number of copies and parallelize the copies to the staging buffer and the GPU buffer.

laurentduroisin · November 4, 2025, 4:54pm

I’ve found why Vulkan was slower than OpenGL; now I get the same FPS with Vulkan and OpenGL.
