Vulkan compute shaders vs. CUDA

According to my tests, using local on-chip shared memory doesn’t seem to bring any performance benefit in Vulkan compute shaders on Nvidia GPUs.

I have written a test shader that demonstrates this behavior: it runs ~30x slower (15 ms vs. 0.5 ms) with Vulkan on Nvidia than with CUDA, or with Vulkan on other manufacturers’ GPUs. I used a GeForce RTX 2080 Ti, driver 496.49, Vulkan 1.2.189.2, and CUDA 11.1. The test shaders are below.

Vulkan:

#version 450
#extension GL_EXT_control_flow_attributes : enable

#define LOCAL_SIZE_X 128
#define LOCAL_SIZE_Y 1
#define LOCAL_SIZE_Z 1

layout(local_size_x = LOCAL_SIZE_X, local_size_y = LOCAL_SIZE_Y, local_size_z = LOCAL_SIZE_Z) in;

layout(std430, set = 0, binding = 0) readonly buffer InTensor
{
	float data[];
} inTensor;

layout(std430, set = 0, binding = 1) readonly buffer WeightsTensor
{
	float data[];
} weights;

layout(std430, set = 0, binding = 2) readonly buffer BiasesTensor
{
	float data[];
} biases;

layout(std430, set = 0, binding = 3) writeonly buffer OutTensor
{
	float data[];
} outTensor;

layout(set = 0, binding = 4) uniform Uniforms
{
	int inChannelCount;
	int outChannelCount;
	int inH;
	int inW;
	int outH;
	int outW;
	ivec2 padding;
} uniforms;

shared float cache[LOCAL_SIZE_X];
shared float outCache[LOCAL_SIZE_X];
shared float weightCache[LOCAL_SIZE_X];

void main()
{
	ivec3 gIds = ivec3(gl_GlobalInvocationID.xyz);
	ivec3 lIds = ivec3(gl_LocalInvocationID.xyz);

	int inChannels = uniforms.inChannelCount;
	int outChannels = uniforms.outChannelCount;
	int inSliceSize = uniforms.inH * uniforms.inW;

	int nIdx = int(gl_WorkGroupID.x);
	int x = gIds.y;
	int y = gIds.z;
	int inIdx = nIdx * inChannels * inSliceSize + lIds.x * inSliceSize + y * uniforms.inW + x;
	cache[lIds.x] = inTensor.data[inIdx];
	weightCache[lIds.x] = weights.data[lIds.x];
	barrier();

	int halfIdx = lIds.x / 64;

	float outVal = 0;
	int startC = halfIdx * 64;
	[[unroll]] for (int inCidx = startC; inCidx < (startC + 64); ++inCidx)
	{
		int weightIdx = (lIds.x - startC) * inChannels + inCidx;
		outVal += cache[inCidx] * weightCache[inCidx];
	}
	outCache[lIds.x] = outVal;
	barrier();
	if (halfIdx > 0) return;

	int outSliceSize = uniforms.outH * uniforms.outW;
	int outIdx = nIdx * outChannels * outSliceSize + lIds.x * outSliceSize + y * uniforms.outW + x;
	outTensor.data[outIdx] = outCache[lIds.x] + outCache[lIds.x + 64] + biases.data[lIds.x];
}
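
The shader above expects one workgroup per (batch, x, y) output position, with the 128 invocations in a group covering the channel dimension. A minimal dispatch sketch, assuming the tensor dimensions used in the CUDA version below (N = 2, W = 180, H = 108) and a placeholder command buffer handle:

// gl_WorkGroupID.x selects the batch item, .y the x position, .z the y position.
// cmdBuffer is a placeholder for the command buffer the test app records into.
vkCmdDispatch(cmdBuffer, 2, 180, 108);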

CUDA:

#define LOCAL_SIZE_X 128
#define LOCAL_SIZE_Y 1
#define LOCAL_SIZE_Z 1

static constexpr int N = 2;
static constexpr int C_in = 128;
static constexpr int C_out = 64;
static constexpr int H = 108;
static constexpr int W = 180;

__global__ void convolution(const float* inTensor, const float* weights, const float* biases, float* outTensor)
{
	__shared__ float cache[LOCAL_SIZE_X];
	__shared__ float outCache[LOCAL_SIZE_X];
	__shared__ float weightCache[LOCAL_SIZE_X];

	// Global thread indices
	size_t gy = blockIdx.y * blockDim.y + threadIdx.y;
	size_t gz = blockIdx.z * blockDim.z + threadIdx.z;
	// Local thread indices
	size_t lx = threadIdx.x;

	int inSliceSize = H * W;

	int nIdx = int(blockIdx.x);
	int x = gy;
	int y = gz;
	int inIdx = nIdx * C_in * inSliceSize + lx * inSliceSize + y * W + x;
	cache[lx] = inTensor[inIdx];
	weightCache[lx] = weights[lx];
	__syncthreads();

	int halfIdx = lx / 64;

	float outVal = 0;
	int startC = halfIdx * 64;
	for (int inCidx = startC; inCidx < (startC + 64); ++inCidx)
	{
		int weightIdx = (lx - startC) * C_in + inCidx;
		outVal += cache[inCidx] * weightCache[inCidx];
	}
	outCache[lx] = outVal;
	__syncthreads();
	if (halfIdx > 0) return;

	int outSliceSize = H * W;
	int outIdx = nIdx * C_out * outSliceSize + lx * outSliceSize + y * W + x;
	outTensor[outIdx] = outCache[lx] + outCache[lx + 64] + biases[lx];
}
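
A minimal launch sketch for the kernel above, assuming one 128-thread block per (batch, x, y) position as implied by the indexing; d_in, d_weights, d_biases, and d_out are placeholder device pointers, and allocation and host/device copies are omitted:

dim3 block(LOCAL_SIZE_X, LOCAL_SIZE_Y, LOCAL_SIZE_Z);   // 128 x 1 x 1 threads per block
dim3 grid(N, W, H);                                     // one block per (batch, x, y) position
convolution<<<grid, block>>>(d_in, d_weights, d_biases, d_out);
cudaDeviceSynchronize();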

Hello @markus.kivioja84 and welcome to the NVIDIA developer forums!

Thank you for bringing this to our attention!

I will pass it to our internal developers. If I can share any further information, I will follow up here.

I hope these findings do not impede your project!

One more question, @markus.kivioja84: could you tell me which operating system you are using for your tests?

Thanks!

Hi @MarkusHoHo, thank you for the quick response!

The operating system is Windows 10.


Hi again @markus.kivioja84,

It has been a while and I would like to ask for some more information. So far we have had trouble reproducing the behavior you described using just the provided shaders.

Would it be possible for you to share with us the small application that shows this performance difference on your system? That would help a lot.

Thanks!

Hi @MarkusHoHo,

I have attached a zip package that contains the source code for runnable CUDA and Vulkan programs.

When built and run, they should both print the GPU time spent and the output of the computation so you can verify that the two versions match.

For example, these are the output prints I got:

CUDA:

GPU time: 0.5873 ms.
Output: 129, 129, ..., 129, 129

Vulkan:

GPU time: 19.3922 ms.
Output: 129, 129, ..., 129, 129

Thank you!
nvidia_perf_tests.zip (21.3 KB)

Thank you!

I will forward this to our engineers.

Hi again,

I have now received some feedback on your sample app, and one thing that was pointed out is that you do not allocate GPU memory for your buffers. You use VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, which allocates system memory. It is very likely that this causes the performance difference.

Reading your original post where you mention shared memory, I take it that you indeed want to use on-chip, meaning GPU, memory. For that you should look into using VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT instead.
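
In case it is useful, picking a device-local memory type usually looks roughly like the sketch below; physicalDevice, device, and buffer stand in for whatever handles your sample app already creates, and error handling is omitted:

// Query which memory types this buffer can use and how much memory it needs.
VkMemoryRequirements memReq;
vkGetBufferMemoryRequirements(device, buffer, &memReq);

VkPhysicalDeviceMemoryProperties memProps;
vkGetPhysicalDeviceMemoryProperties(physicalDevice, &memProps);

// Find a memory type that is both allowed for this buffer and device-local (VRAM).
uint32_t typeIndex = UINT32_MAX;
for (uint32_t i = 0; i < memProps.memoryTypeCount; ++i)
{
	if ((memReq.memoryTypeBits & (1u << i)) &&
	    (memProps.memoryTypes[i].propertyFlags & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT))
	{
		typeIndex = i;
		break;
	}
}

VkMemoryAllocateInfo allocInfo = {};
allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.allocationSize = memReq.size;
allocInfo.memoryTypeIndex = typeIndex;

VkDeviceMemory memory;
vkAllocateMemory(device, &allocInfo, nullptr, &memory);
vkBindBufferMemory(device, buffer, memory, 0);

Since device-local memory is generally not host-visible, the input data then has to be uploaded through a host-visible staging buffer, e.g. with vkCmdCopyBuffer.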

I hope this helps!

Hi and thank you for the response!

The compute shader code explicitly copies the data to the on-chip memory, so I thought it shouldn’t matter where the source data is stored. After the copy cache[lx] = inTensor[inIdx]; the data should be in the same place with both allocation types (VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT and VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT).

But I guess the reason for the performance difference is then the overhead of reading from CPU memory into the on-chip memory, which completely hides any benefit of using the on-chip memory after that.

That would be my assumption as well.

I asked for further clarification, but I will probably not receive feedback before Christmas. Hopefully you can continue your project regardless?

Thank you again for bringing this up! I really appreciate your interest in this topic.