Hi,
after updating the graphics driver to version 340.58, I discovered some problems with OpenCL on Nvidia’s Tesla C1060.
It turned out that indexing for arrays as kernel parameters placed in constant address space strangely does not start at 0 but at -1.
I created a sample program to verify this. It uses the jogamp library for accessing OpenCL from Java.
import java.io.IOException;
import java.io.InputStream;
import java.nio.FloatBuffer;
import java.nio.IntBuffer;
import com.jogamp.opencl.CLBuffer;
import com.jogamp.opencl.CLCommandQueue;
import com.jogamp.opencl.CLContext;
import com.jogamp.opencl.CLDevice;
import com.jogamp.opencl.CLKernel;
import com.jogamp.opencl.CLMemory;
import com.jogamp.opencl.CLProgram;
public class OpenCLConstantSpaceTester {
public static void main(String[] args) {
int width = 256;
int height = 256;
int depth = 256;
float[] volume = new float[width * height * depth];
int[] idxFactors = { width*height, width };
CLContext clContext = OpenCLUtil.createContext();
CLDevice clDevice = clContext.getMaxFlopsDevice();
CLBuffer<IntBuffer> clIdxFactors = clContext.createIntBuffer(idxFactors.length, CLMemory.Mem.READ_ONLY);
clIdxFactors.getBuffer().put(idxFactors);
clIdxFactors.getBuffer().rewind();
CLBuffer<FloatBuffer> clVolume = clContext.createFloatBuffer(volume.length, CLMemory.Mem.WRITE_ONLY);
clVolume.getBuffer().put(volume);
clVolume.getBuffer().rewind();
// Build OpenCL program and create kernel
CLProgram program;
try {
InputStream input = OpenCLConstantSpaceTester.class.getResourceAsStream("constantTest.cl");
program = clContext.createProgram(input);
program.build();
} catch (IOException e) {
throw new RuntimeException("The kernel file could not be loaded.");
}
CLCommandQueue queue = clDevice.createCommandQueue();
queue.putWriteBuffer(clIdxFactors, true).finish();
CLKernel clKernel = program.createCLKernel("constantTest");
clKernel.putArg(width)
.putArg(height)
.putArg(depth)
.putArg(clIdxFactors)
.putArg(clVolume);
int maxGroupSize = (int) Math.pow(clKernel.getWorkGroupSize(clDevice), 1.0/3);
if ((maxGroupSize + 1) * (maxGroupSize + 1) * (maxGroupSize + 1) <= clKernel.getWorkGroupSize(clDevice)) {
maxGroupSize += 1;
}
int groupSize = Math.min(maxGroupSize, 8);
queue.put3DRangeKernel(clKernel, 0, 0, 0, OpenCLUtil.roundUp(groupSize, width), OpenCLUtil.roundUp(groupSize, height), OpenCLUtil.roundUp(groupSize, depth), groupSize, groupSize, groupSize);
queue.flush();
queue.finish();
queue.putReadBuffer(clVolume, true).finish();
clVolume.getBuffer().rewind();
clVolume.getBuffer().get(volume);
clVolume.getBuffer().rewind();
System.out.println("Result should be " + idxFactors[0] + ", " + idxFactors[1]);
System.out.println(volume[0]);
System.out.println(volume[1]);
clKernel.release();
queue.release();
clVolume.release();
clIdxFactors.release();
program.release();
}
}
And the kernel from constantTest.cl:
__kernel void constantTest(
int width,
int height,
int depth,
__constant int* idxFactors,
__global float* volume)
{
int x = get_global_id(0);
int y = get_global_id(1);
int z = get_global_id(2);
if (x >= width || y >= height || z >= depth) {
return;
}
unsigned long idx = z * idxFactors[0] + y * idxFactors[1] + x;
// Accessing volume[idx] crashes, so let's have a look at the values of idxFactors
volume[0] = (float) idxFactors[0];
volume[1] = (float) idxFactors[1];
}
Basically, an array idxFactors with idxFactors[0] = 65536 and idxFactors[1] = 256 is copied to the GPU. The GPU just stores idxFactors[0] in the volume array at position 0 and idxFactors[1] at position 1. Then the volume array is copied back into host’s memory.
So I’d expect volume[0] = idxFactors[0] = 65536 and volume[1] = idxFactors[1] = 256.
However, the actual outcome shows volume[0] = 256.0 and volume[1] = 1.06387091E9
Decreasing the indexes of idxFactors by one leads to the desired output. So I changed the last two lines of the kernel to
volume[0] = (float) idxFactors[-1];
volume[1] = (float) idxFactors[0];
and got volume[0] = 65536.0 and volume[1] = 256.0
On another computer the expected results are received without decreasing the starting index to -1.
So why does indexing for arrays in constant address space start at -1?
Could this be a bug in the graphics driver?