I’m learning how to use OptiX on Fedora 35 Linux with CUDA 11.5, OptiX 7.4, GPU driver 495.29.05, and an RTX 3060.
I have a program that displays some simple cubes and I am looking at a profiling run using nv-nsight-cu.
When I look at the Details page of the report, it says I have low compute throughput (50%), memory throughput (30.94%), and DRAM throughput (25.07%), and it suggests I look at the Scheduler and Warp State statistics as well as the Source Counters section.
The Source Counters section mentions several uncoalesced global accesses; one has a ratio of 2.0 and the others have a ratio of 1.0, where I think 1.0 is OK. The case with the 2.0 ratio is a simple assignment to a uchar4 array element that is aligned to 4 bytes:
params.image[idx.y * params.image_width + idx.x] = charResult;
Reading the documentation, it seems my code is setting the alignment correctly to 4 bytes, so I’m wondering what I’m missing.
The Occupancy section of the report says theoretical occupancy is limited to 33% by the required registers and by shared memory.
Do I have any control over shared memory with OptiX? I don’t believe I am declaring any myself.
How do I limit the number of registers I use? Should I keep the scope between where I set a variable and where I use it as small as possible, to minimize the variable’s lifetime?
I tried compiling to PTX with the nvcc -O3 optimization level and then setting OPTIX_COMPILE_OPTIMIZATION_LEVEL_3 and OPTIX_COMPILE_DEBUG_LEVEL_MODERATE, but that didn’t seem to change anything.
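For reference, here is roughly how I set up the module compile options (a sketch from memory, not verbatim; OPTIX_CHECK is my error-checking macro, and the field names are from the OptiX 7.4 API — I noticed maxRegisterCount there, which may relate to my register question):

```cpp
// Sketch of my module creation (from memory, not my exact code).
OptixModuleCompileOptions moduleCompileOptions = {};
moduleCompileOptions.maxRegisterCount = OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT; // is this the knob for registers?
moduleCompileOptions.optLevel         = OPTIX_COMPILE_OPTIMIZATION_LEVEL_3;
moduleCompileOptions.debugLevel       = OPTIX_COMPILE_DEBUG_LEVEL_MODERATE;

OPTIX_CHECK(optixModuleCreateFromPTX(context,
                                     &moduleCompileOptions,
                                     &pipelineCompileOptions,
                                     ptx.c_str(), ptx.size(),
                                     log, &logSize,
                                     &module));
```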
My raygen code, which includes the uchar4 array write, follows:
static __forceinline__ __device__ void computeRay(uint3 idx, uint3 dim, float3 &origin, float3 &direction) {
    const float3 U = params.camU;
    const float3 V = params.camV;
    const float3 W = params.camW;
    // Map the launch index to normalized device coordinates in [-1, 1].
    const float2 d = 2.0f * make_float2(static_cast<float>(idx.x) / static_cast<float>(dim.x),
                                        static_cast<float>(idx.y) / static_cast<float>(dim.y)) - 1.0f;
    origin = params.camEye;
    direction = normalize(d.x * U + d.y * V + W);
}

extern "C" __global__ void __raygen__rg() {
    // Map our launch index to a screen location and create a ray from the
    // camera location through the screen.
    float3 rayOrigin;
    float3 rayDirection;

    // Look up our location within the launch grid.
    const uint3 idx = optixGetLaunchIndex();
    const uint3 dim = optixGetLaunchDimensions();
    computeRay(idx, dim, rayOrigin, rayDirection);

    // Trace the ray against our scene hierarchy.
    unsigned int depth = 0;
    unsigned int p0;
    unsigned int p1;
    unsigned int p2;
    optixTrace(params.handle, rayOrigin, rayDirection,
               0.0f,                     // Min intersection distance
               100.0f,                   // Max intersection distance
               0.0f,                     // rayTime -- used for motion blur
               OptixVisibilityMask(255), // Specify always visible
               OPTIX_RAY_FLAG_NONE,      // No ray flags
               NORMAL_RAY,               // SBT offset -- see SBT discussion
               NUM_RAY_TYPES,            // SBT stride -- see SBT discussion
               NORMAL_RAY,               // missSBTIndex -- see SBT discussion
               depth, p0, p1, p2);       // Payload registers, written by closest-hit/miss

    float4 result;
    result.x = __uint_as_float(p0);
    result.y = __uint_as_float(p1);
    result.z = __uint_as_float(p2);
    result.w = 1.0f;

    // Record the result in our output raster.
    uchar4 charResult = make_color(result);
    params.image[idx.y * params.image_width + idx.x] = charResult;
}
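For completeness, my launch-params struct looks roughly like this (reconstructed here from the field accesses above; the actual declaration lives in a header shared between host and device code):

```cpp
// Sketch of my launch params; field names match the accesses in the
// raygen code above.
struct Params {
    uchar4*                image;            // output raster, one uchar4 per pixel
    unsigned int           image_width;      // row width in pixels
    float3                 camEye;           // camera position
    float3                 camU, camV, camW; // camera basis vectors
    OptixTraversableHandle handle;           // top-level acceleration structure
};

extern "C" __constant__ Params params;
```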
Thanks, Dave