OpenGL Compute Shader unusually slow

sivanovski.dev · July 8, 2022, 11:02am

Hello.

I’m writing a pathtracer using an OpenGL compute shader which writes color values to a texture, and then I draw that texture on a fullscreen quad using a different shader program. My loop consists of the following:

1. Clear the color buffer.
2. glUseProgram for the compute shader program, then glDispatchCompute followed by glMemoryBarrier.
3. glUseProgram for the fullscreen quad shader program, then glDrawArrays.

That’s all.

After some simplistic profiling, 99% of the GPU time is spent in the compute shader, which is more or less expected, with an average frame time of ~64ms. I tested the same program on an AMD laptop, which is significantly lower spec than my main machine (see below), and got ~30ms per frame. This was completely unexpected, and I can’t put my finger on why this is the case.

More details: I use a single texture as the drawing target. The compute shader accesses it by binding it to an image unit (image2D in the shader), and the fullscreen quad shader program just reads it through a sampler2D in the fragment shader and outputs its value. I have one uvec2 uniform, which shouldn’t be a problem, and 5 SSBOs, each holding a dynamic array (four of custom GLSL structs, one of plain uints). Here is a snippet showing the struct alignment:

// SSBO helper structs

struct Triangle
{
	vec4 v0v1;   // v0.x, v0.y, v0.z, v1.x
	vec4 v1v2;   // v1.y, v1.z, v2.x, v2.y
	vec4 v2norm; // v2.z, n.x, n.y, n.z
	vec4 e1e2;   // e1.x, e1.y, e1.z, e2.x
	vec4 e2matX; // e2.y, e2.z, mat_index, empty
};

struct Sphere
{
	vec4 sphere_data; // o.x, o.y, o.z, radius
	uvec4 mat_index;
};

struct Material
{
	vec4 type_diffuse;  // mat_type, diff.x, diff.y, diff.z
	vec4 specular_spec; // spec.x, spec.y, spec.z, n_spec
	vec4 Le;            // Le.x, Le.y, Le.z, empty
};

struct AABB
{
	vec4 data1; // bmin.xyz, bmax.x
	vec4 data2; // bmax.yz
};

struct BVHNode
{
	AABB node_AABB;
	uvec4 data; // left/first_tri, num_tris
};

// SSBOs

layout(std430, binding = 1) readonly buffer SpheresSSBO
{
	Sphere spheres[];
} spheres_ssbo;

layout(std430, binding = 2) readonly buffer ModelTrisSSBO
{
	Triangle triangles[];
} model_tris_ssbo;

layout(std430, binding = 3) readonly buffer ModelLightTrisSSBO
{
	uint light_tri_indices[];
};

layout(std430, binding = 4) readonly buffer MaterialsSSBO
{
	Material materials[];
} materials_ssbo;

layout(std430, binding = 5) readonly buffer BVHSSBO
{
	BVHNode bvh_nodes[];
} bvh_ssbo;
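
A host-side sanity check for this layout, as a minimal sketch. It assumes the buffers are filled from C++ and that the host structs mirror the GLSL ones member for member; the helper types Vec4 and UVec4 are hypothetical names, not part of the original code. Under std430, a vec4 or uvec4 is 16 bytes with 16-byte alignment, so every struct above should come out as a multiple of 16 bytes with no hidden padding:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical host-side stand-ins for GLSL vec4 / uvec4 (16 bytes, 16-byte aligned).
struct alignas(16) Vec4  { float x, y, z, w; };
struct alignas(16) UVec4 { std::uint32_t x, y, z, w; };

// Mirrors of the GLSL structs, same member order and names.
struct Triangle { Vec4 v0v1, v1v2, v2norm, e1e2, e2matX; };
struct Sphere   { Vec4 sphere_data; UVec4 mat_index; };
struct Material { Vec4 type_diffuse, specular_spec, Le; };
struct AABB     { Vec4 data1, data2; };
struct BVHNode  { AABB node_AABB; UVec4 data; };

// Expected std430 sizes, which are also the array strides inside each SSBO.
static_assert(sizeof(Triangle) == 80, "Triangle: 5 x vec4");
static_assert(sizeof(Sphere)   == 32, "Sphere: vec4 + uvec4");
static_assert(sizeof(Material) == 48, "Material: 3 x vec4");
static_assert(sizeof(AABB)     == 32, "AABB: 2 x vec4");
static_assert(sizeof(BVHNode)  == 48, "BVHNode: AABB + uvec4");
static_assert(offsetof(BVHNode, data) == 32, "data follows the 32-byte AABB");
static_assert(alignof(BVHNode) == 16, "struct alignment is that of its vec4 members");
```

If one of these assertions fails to compile, the host and shader disagree about the stride and the shader would read misaligned elements. That is a correctness problem rather than a performance one, but it is cheap to rule out.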

I could optimize some of these to be UBOs and therefore not be in global memory (like an SSBO might be), but not all. The rest of the compute shader is normal raytracing math. I use imageLoad at the beginning of main() to average out frames, and imageStore to store the value to the image2D. There is little to no difference if I omit the imageLoad.

What could be the cause of such a slowdown? It’s 2x slower than on my AMD laptop, which is pretty low end.

Specs:
OS: Windows 10 64bit
GPU: GTX 1050 (2GB)
Driver Version: 516.59
OpenGL: 4.6 Core

MarkusHoHo · July 11, 2022, 1:34pm

Hello @sivanovski.dev ,

would you mind sharing the specs of your “low end” AMD laptop as well for comparison?

And ideally share the app for people to more easily reproduce the behavior just like you experience it?

Just based on the above shader code it is really difficult to say anything about performance, whether it is expected or whether there might be some (not so obvious) optimization issues. Lots of things like mem alignment or how the shader compiler behaves can influence this. Even the way you generate your rays might have an impact.

Did you look at Nsight to try some more in-depth profiling?

Sorry if I can’t be of more help right away.

Thanks!

sivanovski.dev · July 11, 2022, 3:57pm

I found the issue just before I saw your reply. I had skipped over a part of my shader that I thought was irrelevant: during the initial port from C++ to GLSL, I used an array of uints with a length of 200. This was somehow causing the slowdown, probably because it was being dumped to global memory? After reducing the number of elements to 10-30 I got an instant speedup, and the shader now takes ~14ms. Register counts didn’t change.
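
For anyone sizing a similar array, here is a minimal C++ sketch of one way to measure how many entries are actually needed. It rests on an assumption the post does not confirm: that the uint array is an explicit BVH traversal stack. The Node layout (an interior node stores its left child index, with the right child at left + 1, and a leaf has num_tris > 0) and the function name max_stack_entries are hypothetical, loosely modeled on the "left/first_tri, num_tris" comment in BVHNode.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical flat BVH node: for an interior node, left_or_first is the left
// child index and the right child sits at left_or_first + 1; a leaf has num_tris > 0.
struct Node {
    std::uint32_t left_or_first;
    std::uint32_t num_tris;
};

// Visits every node with an explicit stack and returns the largest number of
// entries the stack held at once. A real ray visits a subset of the nodes, so
// this is an upper bound for a traversal that pushes both children.
std::uint32_t max_stack_entries(const std::vector<Node>& nodes) {
    if (nodes.empty()) return 0;
    std::vector<std::uint32_t> stack;
    stack.push_back(0); // root
    std::uint32_t peak = 1;
    while (!stack.empty()) {
        const std::uint32_t idx = stack.back();
        stack.pop_back();
        assert(idx < nodes.size());
        const Node& n = nodes[idx];
        if (n.num_tris > 0) continue; // leaf: nothing to push
        stack.push_back(n.left_or_first);
        stack.push_back(n.left_or_first + 1);
        const std::uint32_t size = static_cast<std::uint32_t>(stack.size());
        if (size > peak) peak = size;
    }
    return peak;
}
```

Running this once on the CPU after building the BVH gives a concrete bound for the shader-side array. For a reasonably balanced binary tree the result grows with the tree depth rather than with the node count, which would be consistent with 10-30 entries being enough.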

