We use openCL for a geophysical program an want to use the hole memory of 6GB of our C2070 Tesla GPUs. Our software development team received a “clEnqueueWriteBuffer: Memory object allocation failure” at 4.3 GB. So we can only use 4.2GB.
We tested a CUDA4 program and it works fine with 6GB.
Is anybody here who could help us? Where is the error?
What values does the OpenCL runtime return for CL_DEVICE_GLOBAL_MEM_SIZE and CL_DEVICE_MAX_MEM_ALLOC_SIZE on your Tesla device (also see this and this thread)? It might be that you’re allocating more than CL_DEVICE_MAX_MEM_ALLOC_SIZE but less than CL_DEVICE_GLOBAL_MEM_SIZE. Currently, it seems there’s no NVIDIA device that supports CL_DEVICE_MAX_MEM_ALLOC_SIZE == CL_DEVICE_GLOBAL_MEM_SIZE, so the only way to access the full CL_DEVICE_GLOBAL_MEM_SIZE amount of memory is to use more than one buffer, each with <= CL_DEVICE_MAX_MEM_ALLOC_SIZE in size.
Hello,
You could use 6GB of memory on 64 bit OS and if your program is compiled in 64 bits otherwise you will 6GB in the GLOBAL_MEM_SIZE but you can allocate only 4GB due to 32bits limitation
Thanks
Jonathan
We have no change to reduce the Data. We bought 6gb C2070 and want to use the whole memory. Is there any change to setup the CL_ADRESS_BITS to 64? (BIOS update, special driver, another linux distribution …)
Hello, I have the same problem with a radar processing application: I can allocate the 6Gb from two different processes running simultaneously, but a single process cannot allocate the full 6Gb (it is limited to 4Gb).
Same as here, I am on CentOS 64bits or Ubuntu 64bits (two different machines, same problem).
By the way, since a single memory object cannot exceed 1/4 of the GPU memory, 32 bits could suffice when kernels use only one or two of the 4 memory objects.
It is suprising that the restriction to 4G is at the PROCESS level because when two processes allocate the full 6G as 4 buffers (2 each) of 1.5G there is no reason that the 2 buffers of one process are not interleaved with the two buffers of the other process (i.e. it would fail if there is someting as a ‘‘base’’ address fo the process and an ‘‘offset’’ for each buffer in the case the order in memory is (buffer 1 of proc 1)-(buffer 1 of proc 2)-(buffer 2 of proc 1)-(buffer 2 of proc 2))
The reason I do not believe that there is such a ‘‘base’’ for each process is that I can start two processes which progressively allocate
1+1G and 1.33+1.33+1.33G
or
1.5+1.5G and 1.5+1.5G
or
1.33+1.33+1.33G and 1+1G
and there is no problem (I guess the interface has no way of predicting how much RAM process #1 will eventually use and set the base for process #2 at:
2G
or
3G
or
4G
in my examples…)
The 32 bits of addressing are more probably restrictive for some housekeeping in the OpenCL interface than at the kernel level (as I guess the 1/4 allocation restricion is due to some segmentation of the graphic RAM in 4 banks for access parallelizing)
We are also using a quite large buffer in our system and noticed some quite weird things. Untill reading this thread i was unaware of this limitation of buffer size usually to 1/4 of the VRam size. Somehow our kernels worked beyond this limit. Our system works flawlessly(only slow due to weak GPU) with a 220MB large buffer on a small Quadro NVS 3100 with 512MB VRam (Max Alloc Size: 128MB). On a Quadro 2000M (2GB Vram, Max Alloc Size: 512MB) it worked up to about 1.8GB with older drivers. We only noticed that there might be a problem as after a driver update to version 300+ all attempts to request a buffer larger than about 1270MB failed (on 2GB 2000M), though there were enougth unoccupied resources. We found a similar behaviour on a Geforce GTX 580 (3GB VRam) with proportionally larger limits.
We are still using buffer sizes beyond the specified allocation size and haven’t experienced any corruption of data. I assume that this size is not always the real possible maximum, but if you go beyond and are lucky to have it still working, you are solely depending on luck at the next device/driver version.
Did someone else encounter similar behaviour? Could it even be considered a bug (missing/wrong condition) in the NVIDIA OpenCL Implementation that API let’s you procceed without returning an error using a buffer larger than the specified maximum size?
Hi,
I am having the same exact problem with Tesla K20c which has 5gb of global memory. I can allocate up to 2.5gb in two separate processes. However, I cannot allocate more than 4gb in one process. I believe this is to do with CL_DEVICE_ADDRESS_BITS. Anybody who has a workaround?
the same problem with the nvidia opencl driver. The CL_DEVICE_ADDRESS_BITS is hardcoded on 32 and it is not possible to allocate more than 4gb.
Here is the answer of the nvidia-support:
'we do not support >4GB memory using OpenCL. We recommend the customer
uses CUDA to access the full 6GB of memory.'
Looks as if nvidia will push cuda with the driver limitation.
Does anyone know what is the current situation? Is the CL_DEVICE_ADDRESS_BITS still hardcoded to 32? If it is not, how do I enable 64-bit mode? I have a Nvidia Quadro K6000 with 12GB memory but I can only access about 3GB through OpenCL. The computer/server is running Ubuntu.