Indexing of arrays in constant address space in OpenCL starts at -1 on Tesla C1060

explicat · December 4, 2014, 4:27pm

Hi,

after updating the graphics driver to version 340.58, I discovered some problems with OpenCL on Nvidia’s Tesla C1060.

It turned out that indexing for arrays as kernel parameters placed in constant address space strangely does not start at 0 but at -1.

I created a sample program to verify this. It uses the jogamp library for accessing OpenCL from Java.

import java.io.IOException;
import java.io.InputStream;
import java.nio.FloatBuffer;
import java.nio.IntBuffer;

import com.jogamp.opencl.CLBuffer;
import com.jogamp.opencl.CLCommandQueue;
import com.jogamp.opencl.CLContext;
import com.jogamp.opencl.CLDevice;
import com.jogamp.opencl.CLKernel;
import com.jogamp.opencl.CLMemory;
import com.jogamp.opencl.CLProgram;

public class OpenCLConstantSpaceTester {

	
	public static void main(String[] args) {
		
		int width = 256;
		int height = 256;
		int depth = 256;
		float[] volume = new float[width * height * depth];
		int[] idxFactors = { width*height, width };
		
		CLContext clContext = OpenCLUtil.createContext();
		CLDevice clDevice = clContext.getMaxFlopsDevice();
		
		CLBuffer<IntBuffer> clIdxFactors = clContext.createIntBuffer(idxFactors.length, CLMemory.Mem.READ_ONLY);
		clIdxFactors.getBuffer().put(idxFactors);
		clIdxFactors.getBuffer().rewind();
		
		CLBuffer<FloatBuffer> clVolume = clContext.createFloatBuffer(volume.length, CLMemory.Mem.WRITE_ONLY);
		clVolume.getBuffer().put(volume);
		clVolume.getBuffer().rewind();
		
		// Build OpenCL program and create kernel
		CLProgram program;
		try {
			InputStream input = OpenCLConstantSpaceTester.class.getResourceAsStream("constantTest.cl");
			program = clContext.createProgram(input);
			program.build();
		} catch (IOException e) {
			throw new RuntimeException("The kernel file could not be loaded.");
		}
		
		
		CLCommandQueue queue = clDevice.createCommandQueue();
		queue.putWriteBuffer(clIdxFactors, true).finish();
		
		CLKernel clKernel = program.createCLKernel("constantTest");
		clKernel.putArg(width)
			.putArg(height)
			.putArg(depth)
			.putArg(clIdxFactors)
			.putArg(clVolume);
		
		int maxGroupSize = (int) Math.pow(clKernel.getWorkGroupSize(clDevice), 1.0/3);
		if ((maxGroupSize + 1) * (maxGroupSize + 1) * (maxGroupSize + 1) <= clKernel.getWorkGroupSize(clDevice)) {
			maxGroupSize += 1;
		}
		
		int groupSize = Math.min(maxGroupSize, 8);
		queue.put3DRangeKernel(clKernel, 0, 0, 0, OpenCLUtil.roundUp(groupSize, width), OpenCLUtil.roundUp(groupSize, height), OpenCLUtil.roundUp(groupSize, depth), groupSize, groupSize, groupSize);
		queue.flush();
		queue.finish();
		
		queue.putReadBuffer(clVolume, true).finish();
		clVolume.getBuffer().rewind();
		clVolume.getBuffer().get(volume);
		clVolume.getBuffer().rewind();
		
		System.out.println("Result should be " + idxFactors[0] + ", " + idxFactors[1]);
		System.out.println(volume[0]);
		System.out.println(volume[1]);
		
		clKernel.release();
		queue.release();
		clVolume.release();
		clIdxFactors.release();
		program.release();
	}
}

And the kernel from constantTest.cl:

__kernel void constantTest(
	int width,
	int height,
	int depth,
	__constant int* idxFactors,
	__global float* volume)
{
	int x = get_global_id(0);
	int y = get_global_id(1);
	int z = get_global_id(2);

	if (x >= width || y >= height || z >= depth) {
		return;
	}

	unsigned long idx = z * idxFactors[0] + y * idxFactors[1] + x;

	// Accessing volume[idx] crashes, so let's have a look at the values of idxFactors
	volume[0] = (float) idxFactors[0];
	volume[1] = (float) idxFactors[1];
}

Basically, an array idxFactors with idxFactors[0] = 65536 and idxFactors[1] = 256 is copied to the GPU. The GPU just stores idxFactors[0] in the volume array at position 0 and idxFactors[1] at position 1. Then the volume array is copied back into host’s memory.

So I’d expect volume[0] = idxFactors[0] = 65536 and volume[1] = idxFactors[1] = 256.
However, the actual outcome shows volume[0] = 256.0 and volume[1] = 1.06387091E9

Decreasing the indexes of idxFactors by one leads to the desired output. So I changed the last two lines of the kernel to

volume[0] = (float) idxFactors[-1];
volume[1] = (float) idxFactors[0];

and got volume[0] = 65536.0 and volume[1] = 256.0

On another computer the expected results are received without decreasing the starting index to -1.

So why does indexing for arrays in constant address space start at -1?
Could this be a bug in the graphics driver?

njuffa · December 4, 2014, 7:34pm

I would suggest filing a bug through the form linked from the registered developer website. Note that the most recent drivers dropped support for sm_1x devices, but by my (possibly incorrect) understanding r340 drivers are supposed to retain support for sm_1x platforms for another year or so. Please refer to this deprecation notice:

[url]Error | NVIDIA

I guess the question is whether driver 340.51 belongs to the r340 family (the numerics would suggest it).

Robert_Crovella · December 4, 2014, 7:57pm

I believe the referenced 340.58 is actually a linux driver. However the deprecation notice is equally (although not explicitly) applicable to linux support, and yes, 340.58 is an r340 branch driver. r340, whether windows or linux, is the last driver branch to officially support the devices listed in the (windows) deprecation notice. This is true for all functionality, and is not limited to compute functions. Drivers beyond r340 will not officially support the listed devices for any use/purpose.