Persistent storage on NVIDIA OpenCL GPU


Hi. Is there any way for the GPU to have persistent data storage (across kernel invocations), so that we don't have to send the same data every time we invoke the kernel?

For example, suppose the kernel takes 2 arguments: constant_data (say 10000 bytes, and the data is constant) and variable_data (say 1000-10000 bytes, and the data varies between invocations), and we invoke the kernel from the host program 10000 times, each time with the same constant_data but different variable_data. I then have to bear the overhead of sending the same constant_data every time, despite this data being constant across kernel invocations. So, across multiple kernel invocations, can the constant data be stored on the GPU, so that for the next invocation I don't have to send the constant data, but only the variable data?

A case where this issue comes up is pattern matching. Suppose I have a kernel that implements a pattern-matching algorithm, the host receives data as a stream, and the host invokes the kernel repeatedly, passing as arguments a chunk of the data stream and the set of patterns (where the set of patterns is always constant). One then has to bear the cost of sending the patterns again each time the kernel is invoked with the next chunk of data. In such a case, it would be useful to be able to store the patterns on the GPU, so that each kernel invocation from the host supplies only the next chunk of data against which the pattern matching has to be done.
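To make the overhead concrete, here is roughly what the host loop looks like today (just a sketch; the queue, kernel, and buffer names are made up for illustration):

    for (int i = 0; i < num_chunks; i++) {
        /* the redundant transfer: the same pattern set, every iteration */
        clEnqueueWriteBuffer(queue, patterns_buf, CL_TRUE, 0,
                             patterns_size, patterns, 0, NULL, NULL);
        /* the genuinely new data: the next chunk of the stream */
        clEnqueueWriteBuffer(queue, chunk_buf, CL_TRUE, 0,
                             chunk_size, chunks[i], 0, NULL, NULL);

        clSetKernelArg(kernel, 0, sizeof(cl_mem), &patterns_buf);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &chunk_buf);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);
    }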

Thanks

Anything you cudaMemcpy() to the card will be persistent until your program exits, or until you cudaFree() the memory.

Thanks for replying.

So do clCreateBuffer() or clSetKernelArg() use cudaMemcpy() internally?

How can I find out which OpenCL API calls invoke cudaMemcpy() behind the scenes?

When I call clSetKernelArg() for a kernel, does the memory object get copied to the GPU immediately, or only when we do a clEnqueueNDRangeKernel()?

Whoa, reading failure on my part. I didn’t realize I was reading the OpenCL forum rather than the CUDA forum. I have no idea what I’m talking about (haven’t played with OpenCL yet), so you should ignore me. :)

Oh :). No problem. Thanks for taking the time to reply anyway.

OpenCL has the same behaviour as CUDA: global and constant memory is persistent across kernel invocations. Data is copied from the host to the device only when you do clEnqueueWriteBuffer().
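In other words, you can create both buffers once, write the constant data once, and then on each iteration write only the variable data before launching the kernel. A minimal sketch of that pattern (context/queue/kernel setup and error handling omitted; all names are made up):

    /* one-time setup: both buffers live in device memory and persist
     * for the lifetime of the context */
    cl_int err;
    cl_mem patterns_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                         patterns_size, NULL, &err);
    cl_mem chunk_buf    = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                         max_chunk_size, NULL, &err);

    /* the constant data crosses the bus exactly once */
    clEnqueueWriteBuffer(queue, patterns_buf, CL_TRUE, 0,
                         patterns_size, patterns, 0, NULL, NULL);

    /* kernel arguments also persist until changed, so set them once */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &patterns_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &chunk_buf);

    for (int i = 0; i < num_chunks; i++) {
        /* per iteration: only the variable data is transferred */
        clEnqueueWriteBuffer(queue, chunk_buf, CL_TRUE, 0,
                             chunk_size, chunks[i], 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, NULL, 0, NULL, NULL);
    }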

One quick question. When I write using clEnqueueWriteBuffer(), to which address space does the memory object get copied? Or, for a particular kernel, is the address space for an argument resolved based on the qualifier that argument has been given in the kernel parameter list?
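To make the question concrete, these are the qualifiers I mean, as they would appear in a kernel's parameter list (a made-up signature):

    /* a made-up kernel signature showing the address-space qualifiers
     * the question is about */
    __kernel void match(__constant uchar *patterns,    /* __constant address space */
                        __global const uchar *chunk,   /* __global address space */
                        __global uint *match_counts)
    {
        /* ... pattern matching ... */
    }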