Vulkan compute shaders vs. CUDA

According to my tests, using local on-chip shared memory doesn’t seem to bring any performance benefit in Vulkan compute shaders on Nvidia GPUs.

I have written a test shader that demonstrates this behavior: it runs ~30x slower (15 ms vs. 0.5 ms) with Vulkan on Nvidia than with CUDA, or with Vulkan on other manufacturers’ GPUs. I used a GeForce RTX 2080 Ti, driver 496.49, Vulkan 1.2.189.2, and CUDA 11.1. The test shaders are below.

Vulkan:

#version 450
#extension GL_EXT_control_flow_attributes : enable

#define LOCAL_SIZE_X 128
#define LOCAL_SIZE_Y 1
#define LOCAL_SIZE_Z 1

layout(local_size_x = LOCAL_SIZE_X, local_size_y = LOCAL_SIZE_Y, local_size_z = LOCAL_SIZE_Z) in;

layout(std430, set = 0, binding = 0) readonly buffer InTensor
{
	float data[];
} inTensor;

layout(std430, set = 0, binding = 1) readonly buffer WeightsTensor
{
	float data[];
} weights;

layout(std430, set = 0, binding = 2) readonly buffer BiasesTensor
{
	float data[];
} biases;

layout(std430, set = 0, binding = 3) writeonly buffer OutTensor
{
	float data[];
} outTensor;

layout(set = 0, binding = 4) uniform Uniforms
{
	int inChannelCount;
	int outChannelCount;
	int inH;
	int inW;
	int outH;
	int outW;
	ivec2 padding;
} uniforms;

shared float cache[LOCAL_SIZE_X];
shared float outCache[LOCAL_SIZE_X];
shared float weightCache[LOCAL_SIZE_X];

void main()
{
	ivec3 gIds = ivec3(gl_GlobalInvocationID.xyz);
	ivec3 lIds = ivec3(gl_LocalInvocationID.xyz);

	int inChannels = uniforms.inChannelCount;
	int outChannels = uniforms.outChannelCount;
	int inSliceSize = uniforms.inH * uniforms.inW;

	int nIdx = int(gl_WorkGroupID.x);
	int x = gIds.y;
	int y = gIds.z;
	int inIdx = nIdx * inChannels * inSliceSize + lIds.x * inSliceSize + y * uniforms.inW + x;
	cache[lIds.x] = inTensor.data[inIdx];
	weightCache[lIds.x] = weights.data[lIds.x];
	barrier();

	int halfIdx = lIds.x / 64;

	float outVal = 0;
	int startC = halfIdx * 64;
	[[unroll]] for (int inCidx = startC; inCidx < (startC + 64); ++inCidx)
	{
		int weightIdx = (lIds.x - startC) * inChannels + inCidx;
		outVal += cache[inCidx] * weightCache[inCidx];
	}
	outCache[lIds.x] = outVal;
	barrier();
	if (halfIdx > 0) return;

	int outSliceSize = uniforms.outH * uniforms.outW;
	int outIdx = nIdx * outChannels * outSliceSize + lIds.x * outSliceSize + y * uniforms.outW + x;
	outTensor.data[outIdx] = outCache[lIds.x] + outCache[lIds.x + 64] + biases.data[lIds.x];
}
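
The shader above expects one workgroup per (batch, x, y) output position, with the 128 invocations in a group covering the channel dimension. A minimal dispatch sketch, assuming the tensor dimensions used in the CUDA version below (N = 2, W = 180, H = 108) and a placeholder command buffer handle:

// gl_WorkGroupID.x selects the batch item, .y the x position, .z the y position.
// cmdBuffer is a placeholder for the command buffer the test app records into.
vkCmdDispatch(cmdBuffer, 2, 180, 108);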

CUDA:

#define LOCAL_SIZE_X 128
#define LOCAL_SIZE_Y 1
#define LOCAL_SIZE_Z 1

static constexpr int N = 2;
static constexpr int C_in = 128;
static constexpr int C_out = 64;
static constexpr int H = 108;
static constexpr int W = 180;

__global__ void convolution(const float* inTensor, const float* weights, const float* biases, float* outTensor)
{
	__shared__ float cache[LOCAL_SIZE_X];
	__shared__ float outCache[LOCAL_SIZE_X];
	__shared__ float weightCache[LOCAL_SIZE_X];

	// Global thread indices
	size_t gy = blockIdx.y * blockDim.y + threadIdx.y;
	size_t gz = blockIdx.z * blockDim.z + threadIdx.z;
	// Local thread indices
	size_t lx = threadIdx.x;

	int inSliceSize = H * W;

	int nIdx = int(blockIdx.x);
	int x = gy;
	int y = gz;
	int inIdx = nIdx * C_in * inSliceSize + lx * inSliceSize + y * W + x;
	cache[lx] = inTensor[inIdx];
	weightCache[lx] = weights[lx];
	__syncthreads();

	int halfIdx = lx / 64;

	float outVal = 0;
	int startC = halfIdx * 64;
	for (int inCidx = startC; inCidx < (startC + 64); ++inCidx)
	{
		int weightIdx = (lx - startC) * C_in + inCidx;
		outVal += cache[inCidx] * weightCache[inCidx];
	}
	outCache[lx] = outVal;
	__syncthreads();
	if (halfIdx > 0) return;

	int outSliceSize = H * W;
	int outIdx = nIdx * C_out * outSliceSize + lx * outSliceSize + y * W + x;
	outTensor[outIdx] = outCache[lx] + outCache[lx + 64] + biases[lx];
}
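
A minimal launch sketch for the kernel above, assuming one 128-thread block per (batch, x, y) position as implied by the indexing; d_in, d_weights, d_biases, and d_out are placeholder device pointers, and allocation and host/device copies are omitted:

dim3 block(LOCAL_SIZE_X, LOCAL_SIZE_Y, LOCAL_SIZE_Z);   // 128 x 1 x 1 threads per block
dim3 grid(N, W, H);                                     // one block per (batch, x, y) position
convolution<<<grid, block>>>(d_in, d_weights, d_biases, d_out);
cudaDeviceSynchronize();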

Hello @markus.kivioja84 and welcome to the NVIDIA developer forums!

Thank you for bringing this to our attention!

I will pass it to our internal developers. If I can share any further information, I will follow up here.

I hope these findings do not impede your project!

One more question, @markus.kivioja84: could you tell me which operating system you are using for your tests?

Thanks!

Hi @MarkusHoHo, thank you for the quick response!

The operating system is Windows 10.


Hi again @markus.kivioja84,

It has been a while and I would like to ask for some more information. So far we have had trouble reproducing the behavior you described using just the provided shaders.

Would it be possible for you to share with us the small application that shows this performance difference on your system? That would help a lot.

Thanks!

Hi @MarkusHoHo,

I have attached a zip package that contains the source code for runnable CUDA and Vulkan programs.

When built and run, they should both print the GPU time spent and the output of the computation so you can verify that the two versions match.

For example, these are the output prints I got:

CUDA:

GPU time: 0.5873 ms.
Output: 129, 129, ..., 129, 129

Vulkan:

GPU time: 19.3922 ms.
Output: 129, 129, ..., 129, 129

Thank you!
nvidia_perf_tests.zip (21.3 KB)

Thank you!

I will forward this to our engineers.

Hi again,

I have now received some feedback on your sample app, and one thing that was pointed out is that you do not allocate GPU memory for your buffers. You use VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, which allocates system memory. It is very likely that this causes the performance difference.

Reading your original post where you mention shared memory, I take it that you indeed want to use on-chip, meaning GPU, memory. For that you should look into using VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT instead.
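
In case it is useful, picking a device-local memory type usually looks roughly like the sketch below; physicalDevice, device, and buffer stand in for whatever handles your sample app already creates, and error handling is omitted:

// Query which memory types this buffer can use and how much memory it needs.
VkMemoryRequirements memReq;
vkGetBufferMemoryRequirements(device, buffer, &memReq);

VkPhysicalDeviceMemoryProperties memProps;
vkGetPhysicalDeviceMemoryProperties(physicalDevice, &memProps);

// Find a memory type that is both allowed for this buffer and device-local (VRAM).
uint32_t typeIndex = UINT32_MAX;
for (uint32_t i = 0; i < memProps.memoryTypeCount; ++i)
{
	if ((memReq.memoryTypeBits & (1u << i)) &&
	    (memProps.memoryTypes[i].propertyFlags & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT))
	{
		typeIndex = i;
		break;
	}
}

VkMemoryAllocateInfo allocInfo = {};
allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.allocationSize = memReq.size;
allocInfo.memoryTypeIndex = typeIndex;

VkDeviceMemory memory;
vkAllocateMemory(device, &allocInfo, nullptr, &memory);
vkBindBufferMemory(device, buffer, memory, 0);

Since device-local memory is generally not host-visible, the input data then has to be uploaded through a host-visible staging buffer, e.g. with vkCmdCopyBuffer.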

I hope this helps!

Hi and thank you for the response!

The compute shader code explicitly copies the data to the on-chip memory, so I thought it shouldn’t matter where the source data is stored. After the copy cache[lx] = inTensor[inIdx]; the data should be in the same place with both allocation types (VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT and VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT).

But I guess the reason for the performance difference is then the overhead of reading from CPU memory into the on-chip memory, which completely hides any benefit of using the on-chip memory after that.

That would be my assumption as well.

I asked for further clarification, but I will probably not receive feedback before Christmas. Hopefully you can continue your project regardless?

Thank you again for bringing this up! I really appreciate your interest in this topic.