Java and CUDA connection

Hi all,

I am a new CUDA user.
I downloaded the RadixSort example, and it runs successfully in Visual Studio.
However, my project is Java-based. Can I connect to CUDA from Java, so that the result of RadixSort is returned to my Java program?

Can anyone give me a suggestion or an example? Thanks.


Hello Lemon,

Calling the RadixSort example directly from Java could be difficult. It might be possible to compile the example with the -keep parameter in the CUDA Build Rule, which keeps the intermediate files. One of these files is the “CUBIN” file, which contains the kernels. This CUBIN could possibly be loaded using JCuda, and the respective sorting functions could then be called. But this would involve some effort. I already had a short look at this example, and one problem might be that it makes heavy use of C++ template magic to generate dozens of versions of the function that actually performs one sort step. Calling these functions properly from Java, considering their oddly mangled names, would be no fun…

Fortunately, the algorithm implemented in the RadixSort example is already available in a library, namely CUDPP 1.1. And fortunately, Java bindings for CUDPP are already available as well: JCudpp. The JCudpp sample already shows how this fast parallel RadixSort algorithm may be called from Java, using ~15 lines of code :)

Note that the binaries currently available for download are compiled for CUDA 2.3 (this should probably be pointed out more clearly on the website…). If you are using a different version (like the CUDA 3.0 beta), you might have to recompile the JCudaRuntime library and the CUDPP library, but the source code and Visual Studio project files are available for both libraries, so this should not be too much effort.



Hi Marco,

That is very kind of you. I will follow your steps and try it.

Thank you very much.



Hi Marco,

Do you have any idea why the sorting test in the sample fails?

I am trying to sort 10 elements and the result is:

Creating input data
[0, 8, 9, 7, 5, 3, 1, 1, 9, 4]
Performing sort with Java…
[0, 1, 1, 3, 4, 5, 7, 8, 9, 9]
Performing sort with JCudpp…
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
testSort FAILED

I have no idea what is going wrong…



Strange. Is there anything special about your environment? E.g. did you use a CUDA version other than 2.3?

Note that in the sample, all error checks are omitted. You may add lines that enable exceptions in the main method, so that error checks are performed internally, and a (hopefully helpful and explanatory) exception will be thrown if anything goes wrong.
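Ignoring status codes can produce exactly the kind of all-zero output shown earlier in the thread: a failing call returns an error code, nobody looks at it, and the program goes on to print an output array that was never written. The following plain-Java sketch only simulates this pattern. The method deviceSort, the checked wrapper and the status values are hypothetical stand-ins made up for illustration; they are not JCuda or JCudpp calls.

```java
import java.util.Arrays;

final class UncheckedErrorDemo {
    // Hypothetical status codes, not the real CUDA values
    static final int SUCCESS = 0;
    static final int ERROR_NO_DEVICE = 1;

    /** Hypothetical stand-in for a native sort: on failure the output is left untouched. */
    static int deviceSort(int[] input, int[] output, boolean deviceAvailable) {
        if (!deviceAvailable) {
            return ERROR_NO_DEVICE;
        }
        System.arraycopy(input, 0, output, 0, input.length);
        Arrays.sort(output);
        return SUCCESS;
    }

    /** Turns a non-success status into an exception, so a failure cannot go unnoticed. */
    static void checked(int status) {
        if (status != SUCCESS) {
            throw new IllegalStateException("device call failed with status " + status);
        }
    }

    public static void main(String[] args) {
        int[] input = {0, 8, 9, 7, 5, 3, 1, 1, 9, 4};
        int[] output = new int[input.length];

        // Status ignored: the program continues and prints the zero-initialized array
        deviceSort(input, output, false);
        System.out.println(Arrays.toString(output));

        // Status checked: the failure surfaces immediately
        try {
            checked(deviceSort(input, output, false));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With exceptions enabled, the first failing call stops the program with a message, instead of a wrong result showing up much later.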




My CUDA version is 2.3.

After adding the error checks, the code throws an exception:

Exception in thread "main" jcuda.CudaException: cudaErrorNoDevice

I think it is a problem with my settings. I will check it later.

Thanks for your help.



Hi Marco,

I am using a GT240 graphics card.
I can run the radixSort example in Visual Studio, so I do have a CUDA device.
I installed CUDA toolkit v2.3 and cudasdk_2.3.

Do you have any idea why I get the cudaErrorNoDevice exception?



Currently, I have no idea what might be the reason for this error. I assume the exception is thrown at the first call to cudaMalloc?

You could try running the following program, which is the JCuda-version of the cudaDeviceQuery sample, to see which information about available devices it prints:


import jcuda.runtime.*;
import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaError.*;
import static jcuda.runtime.cudaComputeMode.*;

public class JCudaDeviceQueryTest
{
    public static void main(String args[])
    {
        int deviceCountArray[] = new int[1];
        if (cudaGetDeviceCount(deviceCountArray) != cudaSuccess)
        {
            System.out.printf("cudaGetDeviceCount failed! CUDA Driver and Runtime version may be mismatched.\n");
            System.out.printf("\nTest FAILED!\n");
            return; // nothing to query if the device count is unavailable
        }
        int deviceCount = deviceCountArray[0];

        // This function call returns 0 if there are no CUDA capable devices.
        if (deviceCount == 0)
        {
            System.out.println("There is no device supporting CUDA");
        }

        int driverVersionArray[] = new int[1];
        int runtimeVersionArray[] = new int[1];
        for (int dev = 0; dev < deviceCount; ++dev)
        {
            cudaDeviceProp deviceProp = new cudaDeviceProp();
            cudaGetDeviceProperties(deviceProp, dev);

            if (dev == 0)
            {
                // This function call returns 9999 for both major & minor fields, if no CUDA capable devices are present
                if (deviceProp.major == 9999 && deviceProp.minor == 9999)
                    System.out.printf("There is no device supporting CUDA.\n");
                else if (deviceCount == 1)
                    System.out.printf("There is 1 device supporting CUDA\n");
                else
                    System.out.printf("There are %d devices supporting CUDA\n", deviceCount);
            }

            // The device name is a zero-terminated byte array; cut it at the terminator
            String name = new String(deviceProp.name);
            int end = name.indexOf(0);
            if (end >= 0) name = name.substring(0, end);
            System.out.printf("\nDevice %d: \"%s\"\n", dev, name);

            // Versions are encoded as 1000 * major + 10 * minor, so 2030 is printed as 2.30
            cudaDriverGetVersion(driverVersionArray);
            int driverVersion = driverVersionArray[0];
            System.out.printf("  CUDA Driver Version:                           %d.%d\n", driverVersion / 1000, driverVersion % 100);

            cudaRuntimeGetVersion(runtimeVersionArray);
            int runtimeVersion = runtimeVersionArray[0];
            System.out.printf("  CUDA Runtime Version:                          %d.%d\n", runtimeVersion / 1000, runtimeVersion % 100);

            System.out.printf("  CUDA Capability Major revision number:         %d\n", deviceProp.major);
            System.out.printf("  CUDA Capability Minor revision number:         %d\n", deviceProp.minor);
            System.out.printf("  Total amount of global memory:                 %d bytes\n", deviceProp.totalGlobalMem);
            System.out.printf("  Number of multiprocessors:                     %d\n", deviceProp.multiProcessorCount);
            // 8 cores per multiprocessor holds for compute capability 1.x devices
            System.out.printf("  Number of cores:                               %d\n", 8 * deviceProp.multiProcessorCount);
            System.out.printf("  Total amount of constant memory:               %d bytes\n", deviceProp.totalConstMem);
            System.out.printf("  Total amount of shared memory per block:       %d bytes\n", deviceProp.sharedMemPerBlock);
            System.out.printf("  Total number of registers available per block: %d\n", deviceProp.regsPerBlock);
            System.out.printf("  Warp size:                                     %d\n", deviceProp.warpSize);
            System.out.printf("  Maximum number of threads per block:           %d\n", deviceProp.maxThreadsPerBlock);
            System.out.printf("  Maximum sizes of each dimension of a block:    %d x %d x %d\n",
                deviceProp.maxThreadsDim[0], deviceProp.maxThreadsDim[1], deviceProp.maxThreadsDim[2]);
            System.out.printf("  Maximum sizes of each dimension of a grid:     %d x %d x %d\n",
                deviceProp.maxGridSize[0], deviceProp.maxGridSize[1], deviceProp.maxGridSize[2]);
            System.out.printf("  Maximum memory pitch:                          %d bytes\n", deviceProp.memPitch);
            System.out.printf("  Texture alignment:                             %d bytes\n", deviceProp.textureAlignment);
            // clockRate is given in kHz
            System.out.printf("  Clock rate:                                    %.2f GHz\n", deviceProp.clockRate * 1e-6f);
            System.out.printf("  Concurrent copy and execution:                 %s\n", deviceProp.deviceOverlap != 0 ? "Yes" : "No");
            System.out.printf("  Run time limit on kernels:                     %s\n", deviceProp.kernelExecTimeoutEnabled != 0 ? "Yes" : "No");
            System.out.printf("  Integrated:                                    %s\n", deviceProp.integrated != 0 ? "Yes" : "No");
            System.out.printf("  Support host page-locked memory mapping:       %s\n", deviceProp.canMapHostMemory != 0 ? "Yes" : "No");
            System.out.printf("  Compute mode:                                  %s\n",
                deviceProp.computeMode == cudaComputeModeDefault ? "Default (multiple host threads can use this device simultaneously)" :
                    deviceProp.computeMode == cudaComputeModeExclusive ? "Exclusive (only one host thread at a time can use this device)" :
                        deviceProp.computeMode == cudaComputeModeProhibited ? "Prohibited (no host thread can use this device)" : "Unknown");
        }
    }
}




Maybe this will bring some insights…
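A side note on the two version lines of the program above, which print driverVersion / 1000 and driverVersion % 100. CUDA encodes a version number as 1000 * major + 10 * minor, so CUDA 2.3 is reported as 2030 and those expressions print "2.30" rather than "2.3". The plain-Java sketch below compares the two ways of formatting the value; the class and method names are made up for illustration and are not part of JCuda.

```java
final class CudaVersionFormat {
    /** Formats the version the way the query program does: version / 1000 and version % 100. */
    static String sampleStyle(int version) {
        return (version / 1000) + "." + (version % 100);
    }

    /** Decodes 1000 * major + 10 * minor into "major.minor". */
    static String decoded(int version) {
        int major = version / 1000;
        int minor = (version % 1000) / 10;
        return major + "." + minor;
    }

    public static void main(String[] args) {
        int version = 2030; // value reported for CUDA 2.3
        System.out.println(sampleStyle(version)); // prints 2.30
        System.out.println(decoded(version));     // prints 2.3
    }
}
```

So a "2.30" in the output of the query program refers to CUDA 2.3, which is what matters when checking that the driver, the runtime and the precompiled binaries match.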



Hello Marco,

After copying and running the code, the result is as follows:

There is 1 device supporting CUDA

Device 0: "GeForce GT 240"
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 2
Total amount of global memory: 1073414144 bytes
Number of multiprocessors: 12
Number of cores: 96
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.34 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: No
Compute mode: Default (multiple host threads can use this device simultaneously)


Hi Marco,

I haven't changed anything, but the sorting program works now!

The first time, the exception was thrown in JCuda.cudaMalloc.
But today it is OK. It's amazing…

Thanks for your great help.


Hi Marco,

Is it possible to sort data of type unsigned long?


Hello Lemon,

With JCudpp, this is not possible, basically because it is not possible with CUDPP. The original radixSort sample also applies only to unsigned ints (or floats, which are converted to unsigned ints for sorting). I assume that it would take considerable effort to extend it to long values (apart from some 32/64-bit issues that might come up), and as long as this is not available in a library, calling these functions from Java would not be so easy…
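As a CPU-side fallback for 64-bit keys, the least-significant-digit radix idea can be written in plain Java. This is only a minimal sketch: it is not part of JCudpp or CUDPP, it is not GPU-accelerated, and the class name UnsignedLongRadixSort is made up for illustration. It treats each Java long as an unsigned 64-bit value and sorts one byte per pass, using a stable counting sort for each pass.

```java
import java.util.Arrays;

final class UnsignedLongRadixSort {
    /** Sorts keys in place, treating each long as an unsigned 64-bit value. */
    static void sort(long[] keys) {
        long[] src = keys;
        long[] dst = new long[keys.length];
        // Eight passes of 8 bits each, from the least significant byte upwards
        for (int shift = 0; shift < 64; shift += 8) {
            int[] count = new int[257];
            // Histogram of the current byte, stored one slot to the right
            for (long k : src) {
                count[(int) ((k >>> shift) & 0xFF) + 1]++;
            }
            // Prefix sum: count[b] becomes the first output index for byte value b
            for (int i = 0; i < 256; i++) {
                count[i + 1] += count[i];
            }
            // Stable scatter into the destination buffer
            for (long k : src) {
                dst[count[(int) ((k >>> shift) & 0xFF)]++] = k;
            }
            long[] tmp = src;
            src = dst;
            dst = tmp;
        }
        // After an even number of passes the sorted data is back in keys
    }

    public static void main(String[] args) {
        long[] keys = {-1L, 5L, 0L, Long.MIN_VALUE, 42L};
        sort(keys);
        for (long k : keys) {
            System.out.println(Long.toUnsignedString(k));
        }
    }
}
```

Because the bytes are extracted with an unsigned shift, negative long values end up after all non-negative ones, which is the unsigned order: in the example, -1 is the largest key and Long.MIN_VALUE the second largest.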