GTX 780 Ti as seen by OpenCL

I realize this isn’t perfectly on topic, but there is no OpenCL subforum and I figured CUDA is pretty close.

I’m trying to write an algorithm in OpenCL that should be optimized for the GTX 780 Ti, but I don’t have access to the card just yet (so I cannot interrogate it or run benchmarks).

I am having trouble locating the specifications for the GTX 780 Ti in terms of what OpenCL sees.

For instance:
Processing Elements:               2880
Compute Units:                     ?
Wavefronts (warps) / CU:           ?

Private Memory / PE:               ?
Local Memory / CU:                 ?
Constant Memory:                   ?
Global Memory:                     ?

Bytes/read from Global Memory:     ?
Other things I should optimize for?

Is there a place to look this stuff up that I do not know about? I realize CUDA uses different terms for similar concepts, so perhaps there are CUDA specs and a CUDA → OpenCL dictionary?
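Once you do get access to the card, the most reliable way to fill in that list is to interrogate the device directly with `clGetDeviceInfo`. A minimal host-side sketch (assuming an installed OpenCL SDK and at least one GPU device; link against the OpenCL library, e.g. `-lOpenCL`):

```c
// Minimal device-property dump via clGetDeviceInfo.
// Error checking omitted for brevity.
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_uint cus;
    cl_ulong local_mem, const_mem, global_mem;
    size_t wg_size;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(cus), &cus, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                    sizeof(const_mem), &const_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(wg_size), &wg_size, NULL);

    printf("Compute Units:        %u\n", cus);
    printf("Local Memory / CU:    %llu KB\n",
           (unsigned long long)(local_mem / 1024));
    printf("Constant Buffer:      %llu KB\n",
           (unsigned long long)(const_mem / 1024));
    printf("Global Memory:        %llu MB\n",
           (unsigned long long)(global_mem / (1024 * 1024)));
    printf("Max Work-Group Size:  %zu\n", wg_size);
    return 0;
}
```

The `clinfo` utility, where available, prints the same information without writing any code.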

This is the best I have found so far:

Also, are there other things I should be aware of performance-wise? I am trying to learn both OpenCL and GPU computing in general through this project, so any clues I might have missed would be useful.

Thank you in advance to any respondents.

None of this is a specific answer to your question about the 780 Ti, but some of it may help.

The relationship between CUDA terms and OpenCL terms is covered in many places, one such place is this AMD document:

I believe that if you use the above terminology translation, you can answer some of your questions about the 780 Ti’s OpenCL resources (Private Memory, etc.).

Also, NVIDIA published some documents which, although old now, may still be of interest:

Thank you, I’ll have a look.

I’ll fill in findings here as I make them in case anyone finds this while looking for the same thing.

GTX 780 Ti:
Global Memory: 3072 MB?
Constant Memory: 64 KB
Local Memory (CUDA: Shared Memory): 48 KB
Private Memory (CUDA: Local Memory): 512 KB

I find this odd: I would have expected local memory to be larger, since it is a pool shared by many processing elements, while private memory is per processing element. I guess I can simulate a larger local memory by storing data in private memory and writing it to local memory when I need to broadcast it.

The local memory/private memory is not a separate physical resource, but a logical space. The 512 KB figure is the maximum size of this logical space per thread (it is thread-local/thread-private). As for the physical resource backing local memory/private memory: it uses either registers or, effectively, global memory (i.e. the on-board DDR memory), including the cache backing of global memory.

Shared memory/local memory, on the other hand, is a physical resource as well as a logical space. As a physical resource it resides on-chip, which helps to explain its “relatively small” size and the fact that it is not a per-thread resource: it is shared amongst all threads in a threadblock, which can be a variable number. Its size in this case is fixed to the size of the physical resource (ignoring for the moment the slight differentiation in cc 5.x between the physical and logical size of this resource).
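To make the register-vs-spill distinction concrete, here is a hypothetical OpenCL C kernel fragment (not from this thread; it would be compiled via `clBuildProgram` on the host). Small private variables typically live in registers, while a large private array exceeds register capacity and gets backed by cached off-chip global memory:

```c
__kernel void spill_example(__global float *out)
{
    float a = 1.0f;      // private memory: a small scalar, almost certainly
                         // held in a register

    float big[4096];     // also private memory, but 16 KB per work-item --
                         // far beyond register capacity, so the compiler
                         // backs it with (cached) off-chip global memory

    int i = (int)get_global_id(0) & 4095;
    big[i] = a;
    out[get_global_id(0)] = big[i];
}
```

When building with the NVIDIA toolchain, passing `-cl-nv-verbose` as a build option reports register usage and spill statistics, which makes it possible to see which variables ended up where.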

So private memory, as I had understood it, consists only of the registers, but the address space is 512 KB and data gets “paged” out to global memory. I was wondering why there was so much of it relative to local (CUDA: shared) memory.

I should try to keep private (CUDA: local) memory confined to the registers, then, and store additional data that I will need to access soon in local (CUDA: shared) memory rather than in private (paged-to-global) memory.
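That staging pattern can be sketched as an OpenCL C kernel (a hypothetical example, assuming the host passes a `__local` buffer sized to the work-group via `clSetKernelArg` with a NULL pointer): each work-group copies a tile of global data into on-chip local memory once, synchronizes, and then every work-item reads from the tile instead of re-fetching from global memory.

```c
__kernel void stage_example(__global const float *in,
                            __global float *out,
                            __local  float *tile)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    tile[lid] = in[gid];             // one coalesced global read per work-item
    barrier(CLK_LOCAL_MEM_FENCE);    // make the tile visible to the whole group

    // Every work-item can now cheaply read any element written by its
    // neighbours -- here, the next element, wrapping around the group.
    out[gid] = tile[(lid + 1) % get_local_size(0)];
}
```

The `barrier` before the broadcast reads is essential: without it, a work-item may read a tile slot its neighbour has not written yet.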

Thank you for clearing this up for me, you may have saved me a lot of head-scratching.