Hi,
I am seeing strange timing behaviour with the function clReleaseMemObject.
First example
[codebox]
float *pfTest = (float *)calloc(50 * 500000, sizeof(float));
cl_mem clTest = clCreateBuffer(Context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(cl_float) * 500000 * 50, pfTest, NULL);
clReleaseMemObject(clTest);
[/codebox]
In this case clReleaseMemObject takes 0.01 s.
Second Example
[codebox]
float *pfTest = (float *)calloc(50 * 500000, sizeof(float));
cl_mem clTest = clCreateBuffer(Context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(cl_float) * 500000 * 50, pfTest, NULL);
cl_mem clTest2 = clCreateBuffer(Context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(cl_float) * 500000 * 50, pfTest, NULL);
// Here I run a kernel with parameters clTest and clTest2 (for example clTest[iID] = clTest2[iID];)
clReleaseMemObject(clTest);
[/codebox]
In the second case clReleaseMemObject takes 0.2 s.
Why does running a kernel on clTest make clReleaseMemObject take so much longer???
Thx
J
It is possible that clCreateBuffer does NOT actually create the buffer on the device side until a kernel is run… (creation on demand)
When the kernel is run, the buffer gets created, and hence it takes time to free it afterwards…
Hmm…
If that is really the case, your host buffer must remain unchanged until you launch your kernel… You cannot reuse it immediately after calling clCreateBuffer().
I am not sure what the OpenCL spec has to say… But you can easily try this out:
Copy a host buffer using clCreateBuffer(), then change the host buffer to some known value, and then launch the kernel…
If the copy is deferred, the changed values will be visible to the kernel…
If you have the time and interest, do try this out…
Thanks for the answer, but 0.2 s to free a buffer on the device is HUGE!!
I run a lot of kernels looping over 50 * 500000 elements and they take only 0.2 s, and then I spend the same amount of time just freeing the device memory, which cancels out part of the GPU's advantage.
Example :
My program takes 90 s on one CPU core.
On the GPU it takes 2.5 s, and of that 2.5 s, 1.2 s is spent freeing memory on the device!!!
I will try your test with a write buffer.
Thx
J
EDIT: I tried your test by putting a WriteBuffer call between the clCreateBuffer and the kernel, but the release time is still 0.2 s.
I think there is a bug in releasing device memory, because 0.2 s is huge.
Thx
Hi J,
1.2 sec to free a buffer is definitely crazy… I am not sure how cudaFree() would behave. Maybe we should profile that first.
If you have a CUDA version of your OpenCL code, you may want to measure it.
btw,
The test I suggested was NOT meant to minimize the 0.2 s latency. I am not sure what you mean by “putting writeBuffer”.
What I meant was to “write to the host ptr” after calling clCreateBuffer(), i.e. write some other data into “pfTest” after createBuffer().
Now,
What data does your kernel see? Is it the data present in “pfTest” when clCreateBuffer was called, or the data present in “pfTest” when the kernel was launched?? That would tell us why there is no latency when there is no kernel launch. Hope that clarifies.
btw, Thanks for all your time and testing stuff…
Best Regards,
Sarnath
Hi,
Thanks for the answer
In fact, if I change the values in pfTest between the clCreateBuffer and the kernel launch, the contents of clTest are not affected by the modification of pfTest.
Thx
J
PS: in my example of 90 s on one CPU core vs 2.5 s on the GPU, the 1.2 s to free memory is not for one buffer but for several buffers; that is why I gave a very simplified example of the problem.
From what I remember, cudaFree was not this slow.
All rightey,
So the theory flunks. Thanks for testing!
I cannot think of anything else at the moment.
Do post your findings - if you find anything relevant to this problem!
Thank you,
Hi,
With the profiler, I see that clReleaseMemObject triggers a huge memcpyDtoHAsync…
Why does OpenCL copy device memory to host memory in order to destroy the device memory?