clEnqueueWriteBuffer returns CL_OUT_OF_RESOURCES

I get this problem only when I call clEnqueueWriteBuffer after running the first kernel.

The following code works fine with the ATI SDK on an ATI card.

I checked the OpenCL specification and couldn’t find any information about the CL_OUT_OF_RESOURCES error code for clEnqueueWriteBuffer.

Please help. Thanks.

Here is my first kernel:

float4 mulMat4Vec4(const float16 mat, const float4 vec)
{
	float4 ret;
	//float4 c1 = mat.s0123 * vec.s0;
	//float4 c2 = mat.s4567 * vec.s1;
	//float4 c3 = mat.s89ab * vec.s2;
	//float4 c4 = mat.scdef * vec.s3;
	//ret.s0 = c1.s0 + c2.s0 + c3.s0 + c4.s0;
	//ret.s1 = c1.s1 + c2.s1 + c3.s1 + c4.s1;
	//ret.s2 = c1.s2 + c2.s2 + c3.s2 + c4.s2;
	//ret.s3 = c1.s3 + c2.s3 + c3.s3 + c4.s3;
	ret.s0 = mat.s0 * vec.s0 + mat.s4 * vec.s1 + mat.s8 * vec.s2 + mat.sc * vec.s3;
	ret.s1 = mat.s1 * vec.s0 + mat.s5 * vec.s1 + mat.s9 * vec.s2 + mat.sd * vec.s3;
	ret.s2 = mat.s2 * vec.s0 + mat.s6 * vec.s1 + mat.sa * vec.s2 + mat.se * vec.s3;
	ret.s3 = mat.s3 * vec.s0 + mat.s7 * vec.s1 + mat.sb * vec.s2 + mat.sf * vec.s3;
	return ret;
}

__kernel void transformKernel(__global RTObject* objects,
                              const uint numPrimitives,
                              __global RTPrimitive* primitives,
                              __global RTPrimitive* outPrimitives)
{
	size_t i = get_global_id(0);
	if(i >= numPrimitives)
	{
		return;
	}

	//Transform vertices
	uint objectID = primitives[i].objectIndex;
	RTVertex vertexA = primitives[i].vertexA;
	RTVertex vertexB = primitives[i].vertexB;
	RTVertex vertexC = primitives[i].vertexC;
	float16 matrix = objects[objectID].transformation;

	//Positions
	outPrimitives[i].vertexA.position = mulMat4Vec4(matrix, vertexA.position);
	outPrimitives[i].vertexB.position = mulMat4Vec4(matrix, vertexB.position);
	outPrimitives[i].vertexC.position = mulMat4Vec4(matrix, vertexC.position);

	//Normals
	outPrimitives[i].vertexA.normal = mulMat4Vec3(matrix, vertexA.normal);
	outPrimitives[i].vertexB.normal = mulMat4Vec3(matrix, vertexB.normal);
	outPrimitives[i].vertexC.normal = mulMat4Vec3(matrix, vertexC.normal);

	//object index
	outPrimitives[i].objectIndex = objectID;
	outPrimitives[i].primitiveType = primitives[i].primitiveType;
}
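For cross-checking the kernel’s output on the host, a plain C++ equivalent of mulMat4Vec4 can be handy. This is only a sketch: it assumes the float16 holds the matrix in column-major order (which is what the component indices in the kernel suggest), and the names Mat4, Vec4 and mulMat4Vec4Host are made up for this example.

```cpp
#include <array>
#include <cassert>

// Hypothetical host-side stand-ins for float16 and float4.
using Mat4 = std::array<float, 16>; // column-major: element (row r, column c) is m[4*c + r]
using Vec4 = std::array<float, 4>;

// Same arithmetic as the kernel's mulMat4Vec4:
// ret[r] = m[r]*v[0] + m[4+r]*v[1] + m[8+r]*v[2] + m[12+r]*v[3]
Vec4 mulMat4Vec4Host(const Mat4& m, const Vec4& v)
{
    Vec4 ret{}; // zero-initialized accumulator
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            ret[r] += m[4 * c + r] * v[c];
    return ret;
}
```

When comparing against GPU results, a small tolerance is safer than exact equality, since the device compiler may fuse multiplies and adds.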

And here is the code that has the problem:

//enqueue the first kernel
	cl_uint iParam = 0;
	//cl_int, not cl_uint: OpenCL error codes such as CL_OUT_OF_RESOURCES are negative
	cl_int err = clSetKernelArg(m_knlTransform, iParam++, sizeof(cl_mem), &m_memObjects);
	err = clSetKernelArg(m_knlTransform, iParam++, sizeof(cl_uint), &m_iNumPrimitives);
	err = clSetKernelArg(m_knlTransform, iParam++, sizeof(cl_mem), &m_primitiveBuffer);
	err = clSetKernelArg(m_knlTransform, iParam++, sizeof(cl_mem), &m_memPrimitives);

	workDim = 1;
	globalSize[0] = (m_iNumPrimitives/128+1)*128;
	workSize[0] = 128;
	workDone = m_pDevice->EnqueueNDRangeKernel(m_knlTransform, workDim, globalSize, workSize);
	clWaitForEvents(1, &workDone);
	clReleaseEvent(workDone);

	err = clEnqueueWriteBuffer(m_pDevice->GetCommandQueue(), m_memCamera, CL_TRUE, 0, camSize, &m_rtCam, 0, NULL, NULL);
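A side note on the global size: (m_iNumPrimitives/128+1)*128 always yields a multiple of 128, but it adds a whole extra work-group when the count is already a multiple of 128. That is harmless here because of the i >= numPrimitives guard in the kernel. A slightly tighter version might look like the sketch below; roundUpToMultiple is a made-up helper name, not part of OpenCL.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: smallest multiple of groupSize that is >= count.
// groupSize must be non-zero.
std::size_t roundUpToMultiple(std::size_t count, std::size_t groupSize)
{
    return ((count + groupSize - 1) / groupSize) * groupSize;
}
```

With this, globalSize[0] = roundUpToMultiple(m_iNumPrimitives, workSize[0]). Note that a count of 0 gives a global size of 0, which older OpenCL versions reject, so the launch may need to be skipped in that case.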

This is the code that creates the buffers:

	m_primitiveBuffer = clCreateBuffer(m_pDevice->GetContext(), CL_MEM_READ_WRITE, 1024*sizeof(RTPrimitive), NULL, &err);
	m_memPrimitives = clCreateBuffer(m_pDevice->GetContext(), CL_MEM_READ_WRITE, 1024*sizeof(RTPrimitive), NULL, &err);

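In reality it is the kernel launched by clEnqueueNDRangeKernel that produces that error. Kernel launches are asynchronous, so a failure during execution tends to show up in a later call on the same queue. Try putting clFinish() between EnqueueNDRangeKernel and EnqueueWriteBuffer; you’ll see clFinish returning it then. Many people are fighting similar problems :( Maybe try fewer than 128 work items per group.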

I checked the return codes of clFlush and clFinish between EnqueueNDRangeKernel and EnqueueWriteBuffer, but both return CL_SUCCESS for every work-group size I tried (64 to 512) :(

As I’ve said in several other places on this forum, I sometimes get CL_OUT_OF_RESOURCES on really simple and short kernels that run OK alone but return this error if I run any other kernel before them. I had code similar to yours and have NO explanation at all for what I changed to make it run OK :o Sometimes my kernels even hang the OS. My config is Win 2003 R2 x64 with a GF 8800 GT; my thesis advisor has Win7 x64 with only a GF 8400GS, and my code works well for him. I’m becoming desperate :( CL_OUT_OF_RESOURCES doesn’t tell you anything. I could post my whole project here, but it’s way too big for anyone to dig into :(

I don’t see the declaration of this variable (m_memCamera), so I don’t know whether you made a mistake with it or not ^^’

struct RTCamera
{
	cl_float4	position;
	cl_float4	direction;
	cl_float4	up;
	cl_float4	right;
	cl_uint2	screenSize;
	cl_float	fNear;
};

m_memCamera = clCreateBuffer(m_pDevice->GetContext(), CL_MEM_READ_ONLY, sizeof(RTCamera), NULL, &err);

Here is the definition of m_memCamera. Did I do anything wrong?

Another thing I found, after working around the problem by doing the first kernel’s job in C++ instead, is that the Nvidia SDK tends to optimize the memory layout of my program.

vertexA of the following structure can’t be addressed directly in CL. I need to add padding variables so that it is aligned to 32 bytes :( I’m feeling so frustrated with the current OpenCL, especially when I have to make it work on both ATI and Nvidia.

struct RTPrimitive
{
	cl_uint		primitiveType;	//Type of primitive 0 = ignore, 1 = triangle, 2 = sphere, ...
	RTVertex	vertexA;
	RTVertex	vertexB;
	RTVertex	vertexC;
	cl_uint		objectIndex;	//Index of object in objectBaffer
};
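The padding issue most likely comes from alignment: in OpenCL C a float4 is aligned to 16 bytes, so a struct that starts with a uint followed by float4-based members gets implicit padding after the uint, and the host and device compilers have to agree on it. The sketch below checks such a layout on the host. It uses a stand-in Float4 type instead of cl_float4 so that it compiles without the OpenCL headers, and the RTVertex definition (position and normal as two 4-float vectors) is an assumption, since the real one isn’t shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for cl_float4 (assumption: 16-byte aligned, as float4 is in OpenCL C).
struct alignas(16) Float4 { float s[4]; };

// Assumed layout; the thread does not show the real RTVertex.
struct RTVertex
{
    Float4 position;
    Float4 normal;
};

struct RTPrimitive
{
    std::uint32_t primitiveType;
    RTVertex      vertexA;      // compiler inserts 12 bytes of padding before this member
    RTVertex      vertexB;
    RTVertex      vertexC;
    std::uint32_t objectIndex;  // followed by 12 bytes of tail padding
};

// Same layout with the padding spelled out, so nothing is left to the compiler.
struct RTPrimitivePadded
{
    std::uint32_t primitiveType;
    std::uint32_t pad0[3];
    RTVertex      vertexA;
    RTVertex      vertexB;
    RTVertex      vertexC;
    std::uint32_t objectIndex;
    std::uint32_t pad1[3];
};

static_assert(offsetof(RTPrimitive, vertexA) == 16, "vertexA starts after implicit padding");
static_assert(sizeof(RTPrimitive) == 128, "16 + 3*32 + 4, rounded up to a multiple of 16");
static_assert(sizeof(RTPrimitivePadded) == sizeof(RTPrimitive), "explicit padding matches");
```

If the struct declared in the .cl file doesn’t end up with the same offsets and size on the device, every element after the first in the buffer is read from the wrong place, so asserts like these on the host are a cheap first check.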

There is something I don’t understand: why are you using m_memCamera? You never use it in your CL code. And you are trying to write data into a cl_mem that has nothing in it. You just reserved the memory to use it in the kernel.

Another point:
in the declaration of m_memCamera you set the flag CL_MEM_READ_ONLY. That means the kernel will only read the data, so you need to put data into it before passing it to the kernel. The problem is that you don’t say which variable should fill m_memCamera. That may be why clEnqueueWriteBuffer fails. You should write (note that passing a host pointer also requires CL_MEM_COPY_HOST_PTR, otherwise clCreateBuffer fails with CL_INVALID_HOST_PTR):

m_memCamera = clCreateBuffer(m_pDevice->GetContext(), CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(RTCamera), &m_rtCam, &err);

if it really is data you only want to read.

I hope it helps :)

Hi

m_memCamera is used in the other kernel. The kernel code I posted here is the first kernel; if I remove it, I don’t have any problem with clEnqueueWriteBuffer. Also, m_memCamera is read-only from the kernel’s point of view, but not for the application as a whole: it is updated every frame.

Thanks for your suggestion.