clEnqueueWriteBuffer excessive memory usage

I’m having a problem with an OpenCL program: it works fine with small arrays, but in a more realistic case it goes wrong and eats up all my host memory.

The code does quite a lot of host-to-device transfers, but they aren’t a speed problem, since I do the transfers while the CPU is working on something else.

The real problem is that the clEnqueueWriteBuffer function is allocating most of the memory (according to valgrind with massif).
Why does it need to allocate memory to transfer from host to device memory? It is allocating a lot more than I expected. I’d even say it’s allocating another host buffer for every uncompleted transfer.

What’s happening?
How can I ask it to use just the memory I allocate, without allocating new buffers for each write?


Nobody? I’m really stuck with this problem…

I managed to isolate the problem, and it looks like a bug to me.

Regardless of the size of the data transferred, calling the clEnqueueWriteBuffer function in asynchronous (non-blocking) mode leaks memory. This memory is freed only when the program terminates.

Here is a simple program that reproduces the behaviour:

#include <iostream>
#include <vector>
#include <time.h>
#include <unistd.h>
#include "CL/cl.hpp"

int main() {
	size_t MAX_SIZE = 10;
	::clock_t start, finish;
	typedef double ScalarType;

	// Host array that gets transferred to the device
	ScalarType* ram_vec1 = new ScalarType[MAX_SIZE];
	for (unsigned int i = 0; i < MAX_SIZE; ++i) {
		ram_vec1[i] = 1.;
	}

	cl_int err = CL_SUCCESS;
	std::vector<cl::Platform> platforms;
	cl::Platform::get(&platforms);
	if (platforms.size() == 0) {
		std::cout << "Platform size 0\n";
		return -1;
	}

	cl_context_properties properties[] = { CL_CONTEXT_PLATFORM, (cl_context_properties) (platforms[0])(), 0 };
	cl::Context context(CL_DEVICE_TYPE_GPU, properties);
	std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES> ();
	cl::Event event;
	cl::CommandQueue queue(context, devices[0], 0, &err);
	cl::Buffer buffer(context, CL_MEM_WRITE_ONLY, sizeof(ScalarType) * MAX_SIZE, 0, &err);

	// Calling clEnqueueWriteBuffer in async mode a million times
	for (int i = 0; i < 1000000; i++) {
		start = clock();
		queue.enqueueWriteBuffer(buffer, CL_FALSE, 0, sizeof(ScalarType) * MAX_SIZE, ram_vec1, 0, &event);
		event.wait();
		finish = clock();
		std::cout << "[" << i << "] Time: " << (double(finish - start) / CLOCKS_PER_SEC) << "sec" << std::endl;
	}

	// using up to 1.5 GB of RAM on my computer now
	sleep(10);
	// the memory usage is constant in these 10 seconds

	std::cout << "Terminating now." << std::endl;
	// all the memory is freed upon exit
	return 0;
}

The problem disappears if I change the lines

queue.enqueueWriteBuffer(buffer, CL_FALSE, 0, sizeof(ScalarType) * MAX_SIZE, ram_vec1, 0, &event);
event.wait();

to

queue.enqueueWriteBuffer(buffer, CL_TRUE, 0, sizeof(ScalarType) * MAX_SIZE, ram_vec1, 0, 0);

even though I’d expect the two versions to behave more or less the same.

Obviously, in my real program I don’t call event.wait() immediately after clEnqueueWriteBuffer.

Still, I’d like to use async transfers; the whole program is slow otherwise.

I’d really appreciate it if anybody could double-check this behaviour and possibly explain it to me.



I found out that the problem isn’t in the NVIDIA OpenCL implementation but in the cl.hpp header, so it’s up to Khronos to fix it.
Here is the thread I posted on their forum:

It was just that nobody was releasing the cl_event, due to a strange behaviour of the CommandQueue async methods.