The Next Wave of Enterprise Performance with Java, POWER Systems and NVIDIA GPUs

Originally published at:

The Java ecosystem is the leading enterprise software development platform, with widespread industry support and deployment on platforms like the IBM WebSphere Application Server product family. Java provides a powerful object-oriented programming language with a large developer ecosystem and developer-friendly features like automated memory management, program safety, security and runtime portability, and high performance features…

I'm just wondering how CUDA4J differs from the available open source JNI wrappers (such as JCuda).
It seems that the programmer is again responsible for everything.

Hi Mehran,

You are correct that CUDA4J is a toolkit where the programmer is working with concepts from the CUDA programming model. However, it is designed to provide an API that is consistent with Java best practices rather than providing a direct 1:1 mapping to the CUDA runtime and driver APIs.

We believe the result is an API that is more concise, more natural to Java programmers, more secure, and less error-prone than existing offerings. The CUDA4J API is available to application developers, but it also forms the basis for the higher-level GPU usage from existing Java APIs, as described in the article.


Hello Tim,
Thank you for your reply. Will CUDA4J be publicly available anytime soon?

Mehran, you'll appreciate that I cannot pre-announce availability, but you will have seen the recent press releases around IBM POWER8 support for NVIDIA GPUs, and you can keep an eye on IBM developerWorks for our latest Java SDK releases for 64-bit Power LE hardware here:

Tim, I understand that and thank you for your response.

Hi Tim,

I'm a graduate student doing research involving some simple string processing using MapReduce and the GPU. Currently, I'm having to take my strings out of their collection and translate them into arrays of pointers. Will CUDA4J be able to use complex types or will it be limited to C primitives?

CUDA4J supports transferring data found in NIO buffers or in arrays of primitive types. The best performance is obtained using a ByteBuffer allocated via Cuda.allocatePinnedHostBuffer().
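For the string case in the question, one common way to avoid arrays of pointers is to flatten the collection into a single byte buffer plus an int array of offsets, both of which are shapes that can be transferred. The following is only a sketch under assumptions: `FlatStrings` is a hypothetical helper, not part of CUDA4J, and `ByteBuffer.allocateDirect` stands in for a pinned host buffer so the example runs on any JVM.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not a CUDA4J class): strings flattened to UTF-8 bytes plus offsets. */
final class FlatStrings {
    final ByteBuffer bytes; // all strings, UTF-8 encoded, back to back
    final int[] offsets;    // string i occupies bytes [offsets[i], offsets[i + 1])

    FlatStrings(List<String> strings) {
        byte[][] encoded = new byte[strings.size()][];
        offsets = new int[strings.size() + 1];
        for (int i = 0; i < encoded.length; i++) {
            encoded[i] = strings.get(i).getBytes(StandardCharsets.UTF_8);
            offsets[i + 1] = offsets[i] + encoded[i].length;
        }
        // Assumption: a direct buffer is used as a stand-in for a pinned host buffer.
        bytes = ByteBuffer.allocateDirect(offsets[encoded.length])
                          .order(ByteOrder.nativeOrder());
        for (byte[] e : encoded) {
            bytes.put(e);
        }
        bytes.flip(); // make the buffer readable from position 0
    }

    /** Decodes string i again on the host, e.g. to check results. */
    String get(int i) {
        byte[] out = new byte[offsets[i + 1] - offsets[i]];
        ByteBuffer view = bytes.duplicate(); // independent position, shared content
        view.position(offsets[i]);
        view.get(out);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        FlatStrings fs = new FlatStrings(Arrays.asList("map", "reduce", "gpu"));
        System.out.println(Arrays.toString(fs.offsets) + " " + fs.get(1));
    }
}
```

A kernel would then receive the byte buffer and the offsets array and index into them, instead of chasing one pointer per string.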

If your 'complex' type is a POD C structure, you can create a simple API to represent an array of your objects stored in a ByteBuffer. JIT compilers found in modern JVMs do a remarkable job of eliminating the apparent overhead of doing so, leaving you with excellent performance on both the host and on devices.
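As a rough illustration of such an API, here is a minimal sketch of an accessor class over an array of a C structure `struct Point { float x; float y; int id; };` stored in a ByteBuffer. The `PointArray` name and the struct layout are assumptions made up for this example, and `ByteBuffer.allocateDirect` again stands in for a pinned host buffer so that the code runs without a GPU.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical view over an array of: struct Point { float x; float y; int id; }; */
final class PointArray {
    private static final int X_OFFSET = 0;
    private static final int Y_OFFSET = 4;
    private static final int ID_OFFSET = 8;
    static final int STRIDE = 12; // size of one struct: three 4-byte fields, no padding

    private final ByteBuffer buffer;
    private final int length;

    PointArray(int length) {
        // Assumption: a direct buffer in native byte order, standing in for a pinned buffer.
        this.buffer = ByteBuffer.allocateDirect(length * STRIDE)
                                .order(ByteOrder.nativeOrder());
        this.length = length;
    }

    int length() { return length; }

    // Absolute gets and puts: element i starts at byte i * STRIDE.
    float x(int i) { return buffer.getFloat(i * STRIDE + X_OFFSET); }
    float y(int i) { return buffer.getFloat(i * STRIDE + Y_OFFSET); }
    int id(int i)  { return buffer.getInt(i * STRIDE + ID_OFFSET); }

    void set(int i, float x, float y, int id) {
        int base = i * STRIDE;
        buffer.putFloat(base + X_OFFSET, x);
        buffer.putFloat(base + Y_OFFSET, y);
        buffer.putInt(base + ID_OFFSET, id);
    }

    /** The backing storage, in the layout the device-side struct expects. */
    ByteBuffer buffer() { return buffer; }

    public static void main(String[] args) {
        PointArray points = new PointArray(4);
        points.set(2, 1.5f, -2.0f, 42);
        System.out.println(points.x(2) + " " + points.y(2) + " " + points.id(2));
    }
}
```

Host code works with `points.x(i)` and friends, while the device sees a plain array of structs; the accessors are small enough that a JIT compiler can typically inline them.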

In the (hopefully not too distant) future, you should be able to express POD types directly in Java, removing some of the programming burden. There's a lot in common between IBM's 'Packed Objects' approach [1] and Oracle's 'Value Types' [2], so it seems safe to assume that the situation will improve.


It looks like details about the API are starting to show up at
However, it's not clear where to download the SDK...

Hi Jim,

I think the CUDA4J API and the GPU-optimized Java SE sorting implementation are included in the IBM SDK, Java Technology Edition 7.0:

GPU Docs:

Mehran, the CUDA4J APIs have been released in IBM Java 7 Release 1 for Linux PowerPC LE, available from the IBM developerWorks site. You can read the release documentation here

Thanks Mark! Since I'm a Windows-based Java developer, it looks like the Eclipse-based download will make the most sense.

Hi all,
Which version of the IBM JDK with CUDA should I download for Jetson Tegra TK1?
Many thanks

Is there a getting-started guide for CUDA4J?
If it were 1:1 with the low-level API, I guess I could look at the low-level getting-started material.
As it is not, I think I will need a little more help.

Hello Keith,

The concepts are close to the low-level API, and if you grab the latest IBM Java SDK from there are a number of examples showing the use of CUDA4J in the package. The online documentation for GPU programming with CUDA4J is at

In addition, there is a tutorial being written at the moment, to be delivered at the GTC conference (see If you are able to attend you can get some first-hand help with getting started!

We'll look to distill this tutorial material into some additional online articles to help you get started.

I'm going to test the latest version 8 today on a Gentoo AMD64 system. The IBM SDK has been abandoned on Gentoo for reasons related to IBM's license policy, among others. I will try to revive it and get it into the main tree, as CUDA4J is really promising.

I would love to see a performance comparison against pure C on more common parallel algorithms (compression, AES, radix sort, hash generation, etc.), but I will eventually create my own comparison :-)