Hello.
I’m writing a pathtracer using an OpenGL compute shader which writes color values to a texture, and then I draw that texture on a fullscreen quad using a different shader program. My loop consists of the following:
- Clearing the color buffer
- glUseProgram for the compute shader program, glDispatchCompute followed by glMemoryBarrier
- glUseProgram for the fullscreen quad shader program, glDrawArrays
That’s all.
After some simplistic profiling, 99% of the GPU time is spent in the compute shader, which is more or less expected, with an average frame time of ~64ms. I tested the same program on an AMD laptop, which is significantly lower spec (see below), and got ~30ms per frame. This was completely unexpected, and I can’t put my finger on why this is the case.
More details: I use a single texture as the drawing texture. The compute shader accesses it through an image unit (image2D in the shader), and the fullscreen quad shader program just reads it through a sampler2D in the fragment shader and outputs its value. I have one uvec2 uniform, which shouldn’t be a problem, and 5 SSBOs, each containing a dynamic array (four hold custom GLSL structs, one holds uint indices). Here is a snippet showing the struct layout:
// SSBO helper structs
struct Triangle
{
    vec4 v0v1;    // v0.x, v0.y, v0.z, v1.x
    vec4 v1v2;    // v1.y, v1.z, v2.x, v2.y
    vec4 v2norm;  // v2.z, n.x, n.y, n.z
    vec4 e1e2;    // e1.x, e1.y, e1.z, e2.x
    vec4 e2matX;  // e2.y, e2.z, mat_index, empty
};
struct Sphere
{
    vec4 sphere_data;  // o.x, o.y, o.z, radius
    uvec4 mat_index;
};
struct Material
{
    vec4 type_diffuse;   // mat_type, diff.x, diff.y, diff.z
    vec4 specular_spec;  // spec.x, spec.y, spec.z, n_spec
    vec4 Le;             // Le.x, Le.y, Le.z, empty
};
struct AABB
{
    vec4 data1;  // bmin.xyz, bmax.x
    vec4 data2;  // bmax.yz
};
struct BVHNode
{
    AABB node_AABB;
    uvec4 data;  // left/first_tri, num_tris
};
// SSBOs
layout(std430, binding = 1) readonly buffer SpheresSSBO
{
    Sphere spheres[];
} spheres_ssbo;
layout(std430, binding = 2) readonly buffer ModelTrisSSBO
{
    Triangle triangles[];
} model_tris_ssbo;
layout(std430, binding = 3) readonly buffer ModelLightTrisSSBO
{
    uint light_tri_indices[];
};
layout(std430, binding = 4) readonly buffer MaterialsSSBO
{
    Material materials[];
} materials_ssbo;
layout(std430, binding = 5) readonly buffer BVHSSBO
{
    BVHNode bvh_nodes[];
} bvh_ssbo;
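For reference, here is a minimal sketch of how host-side C++ structs could mirror the GLSL structs above and check that their sizes match what std430 expects (every member is a 16-byte vec4/uvec4, so there is no hidden padding). The `Vec4`/`UVec4` types and the `buffer_bytes` helper are hypothetical names made up for this illustration; they are not taken from my actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side stand-ins for GLSL vec4 / uvec4 (16 bytes, 16-byte aligned).
struct alignas(16) Vec4  { float x, y, z, w; };
struct alignas(16) UVec4 { std::uint32_t x, y, z, w; };

// Mirrors of the shader structs, member for member.
struct Triangle { Vec4 v0v1, v1v2, v2norm, e1e2, e2matX; };
struct Sphere   { Vec4 sphere_data; UVec4 mat_index; };
struct Material { Vec4 type_diffuse, specular_spec, Le; };
struct AABB     { Vec4 data1, data2; };
struct BVHNode  { AABB node_AABB; UVec4 data; };

// Under std430 each struct is a whole number of 16-byte slots.
static_assert(sizeof(Triangle) == 80, "Triangle should be 5 vec4s");
static_assert(sizeof(Sphere)   == 32, "Sphere should be 2 vec4s");
static_assert(sizeof(Material) == 48, "Material should be 3 vec4s");
static_assert(sizeof(AABB)     == 32, "AABB should be 2 vec4s");
static_assert(sizeof(BVHNode)  == 48, "BVHNode should be AABB + uvec4");
static_assert(offsetof(BVHNode, data) == 32, "data follows the AABB");

// Hypothetical helper: byte size to allocate for an SSBO holding `count` elements of T.
template <typename T>
std::size_t buffer_bytes(std::size_t count) {
    assert(count <= static_cast<std::size_t>(-1) / sizeof(T));  // guard against overflow
    return count * sizeof(T);
}
```

If any of the static_asserts fired, the CPU and GPU would disagree about the array stride, which would show up as corrupted geometry rather than as a slowdown.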
I could probably turn some of these into UBOs so that they don’t live in global memory (as an SSBO might), but not all of them. The rest of the compute shader is normal raytracing math. I use imageLoad at the beginning of main() to average out frames, and imageStore to store the value to the image2D. There is little to no difference if I omit the imageLoad.
What could be causing such a slowdown? It’s about 2x slower than on my AMD laptop, which is pretty low end.
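To make the frame averaging concrete, here is a CPU-side sketch of one common way to do it: a running mean that blends the previously stored pixel (what imageLoad returns) with the new sample, and whose result would then go to imageStore. This assumes a plain running mean and a zero-based frame counter; the `Color` alias and the `accumulate` function are hypothetical names for illustration, and the actual shader may weight frames differently.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Color = std::array<float, 3>;  // hypothetical RGB triple

// Running mean: after this call the result is the average of frame_index + 1 samples.
// `previous` plays the role of the imageLoad value, the return value of the imageStore value.
inline Color accumulate(const Color& previous, const Color& sample, std::uint32_t frame_index) {
    const float weight = 1.0f / static_cast<float>(frame_index + 1);
    Color out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = previous[i] + (sample[i] - previous[i]) * weight;
    }
    return out;
}
```

With frame_index equal to 0 the weight is 1, so the stored value is simply replaced by the first sample and the image does not need to be cleared beforehand.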
Specs:
OS: Windows 10 64bit
GPU: GTX 1050 (2GB)
Driver Version: 516.59
OpenGL: 4.6 Core