This is the formal announcement of the Kappa library to the NVIDIA forums. (Kappa has been available, and growing in functionality, for a while; the current version is 1.3.2.)
The Kappa library runs on any CUDA platform. It has full support for multiple GPUs (it is currently compiled for 1024 GPUs per host; contact Psi Lambda LLC if more are needed). It has the largest set of language bindings of any CUDA framework. It is the only publicly announced CUDA framework that provides concurrent kernel execution, let alone automatic concurrent kernel execution. Overlapping memory transfers with kernel execution is trivially specified. Kappa is host multi-threaded, with speed-optimized APIs and data-flow scheduling that automatically drives both the CPUs and the GPUs toward maximal utilization. Because CUDA kernel launches can be fully specified with dynamic sizes, algorithms can generalize across problem sizes (see the SDK radix sort example for a non-generalized CUDA kernel launch).
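To illustrate what a dynamically sized kernel launch means in practice, here is a minimal C++ sketch (not Kappa's API; the helper name and the 256-thread cap are illustrative assumptions) of computing a launch configuration from the problem size at runtime, rather than hard-coding grid and block dimensions:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: choose a launch configuration for an arbitrary
// problem size n. maxThreadsPerBlock would normally come from the
// device properties; the 256-thread cap is an illustrative choice.
struct LaunchConfig {
    unsigned gridDim;
    unsigned blockDim;
};

LaunchConfig chooseLaunch(std::size_t n, unsigned maxThreadsPerBlock) {
    LaunchConfig cfg;
    cfg.blockDim = maxThreadsPerBlock < 256u ? maxThreadsPerBlock : 256u;
    // Round up so gridDim * blockDim covers all n elements.
    cfg.gridDim = static_cast<unsigned>((n + cfg.blockDim - 1) / cfg.blockDim);
    return cfg;
}
```

A kernel launched this way can handle any input size, which is the generalization the hard-coded SDK radix sort launch lacks.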
The Kappa framework allows specifying the flow of data through any mixture of CUDA C++, OpenMP C++, and C/C++ kernels and Perl, Python, and SQL routines and statements. The flows of data can be dynamically specified as (massively) parallel partitioned flows using index notation. Each flow of data can be independent and can stop at any point along the flow. This means that you can easily specify your problem as multiple parallel flows of data through all of the possible data-flow paths and let the kernels and routines along each flow cut off further processing of paths that are not desired. This form of processing maps naturally onto database transactions or pipelined operations. The indexes and parameters for the parallel flow of data may come from any kernel, routine, or (SQL) statement, so parallel data-flow execution can easily be controlled from any data source or calculation.
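The idea of independent indexed flows that any stage can cut off can be sketched in a few lines of plain C++ (this is conceptual only, not Kappa's scheduler or syntax; `Stage` and `runFlows` are hypothetical names):

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <vector>

// Conceptual sketch: each input index starts one independent data flow
// through a chain of stages. A stage returns std::nullopt to cut off
// further processing of that particular flow, without affecting others.
using Stage = std::function<std::optional<int>(int)>;

std::vector<std::optional<int>> runFlows(const std::vector<int>& inputs,
                                         const std::vector<Stage>& stages) {
    std::vector<std::optional<int>> results;
    for (int value : inputs) {               // one independent flow per index
        std::optional<int> current = value;
        for (const Stage& stage : stages) {
            if (!current) break;             // this flow was cut off upstream
            current = stage(*current);
        }
        results.push_back(current);
    }
    return results;
}
```

In Kappa the stages would be CUDA/OpenMP kernels or Perl/Python/SQL routines, the flows would run concurrently under data-flow scheduling, and the indexes themselves could come from any data source.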
To do this, Kappa has automatic host and GPU memory management and data authority transfer. In other words, Kappa tracks all memory, data access, modules, and kernels and provides automatic transfer and cleanup. Full access to the capabilities of the CUDA driver API (a superset of the CUDA runtime API) is usually available through options and is always accessible using kernels or custom keywords. Besides the obvious extensibility offered by CUDA or OpenMP C++ or Perl or Python routines, Kappa supports further DSL (Domain Specific Language) development through easily extensible API language keywords, with MIT-licensed examples provided. (The Perl and Python bindings and the keyword multi-threaded routine support are all provided as MIT-licensed source code, for further development or as examples.)
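The flavor of framework-tracked memory with automatic cleanup can be suggested by a small C++ sketch (this is not Kappa's implementation; `MemoryTracker` and its methods are hypothetical, and real GPU buffers would go through the CUDA driver API rather than host `new`):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Conceptual sketch: a tracker that owns every named buffer it hands
// out and releases anything still outstanding when it is destroyed,
// analogous to framework-managed host/GPU memory and cleanup.
class MemoryTracker {
public:
    void* allocate(const std::string& name, std::size_t bytes) {
        auto buf = std::make_unique<unsigned char[]>(bytes);
        void* p = buf.get();
        buffers_[name] = std::move(buf);   // tracker takes ownership
        return p;
    }
    bool tracked(const std::string& name) const {
        return buffers_.count(name) != 0;
    }
    void release(const std::string& name) { buffers_.erase(name); }
    std::size_t live() const { return buffers_.size(); }
    // Implicit destructor frees all remaining buffers automatically.
private:
    std::unordered_map<std::string, std::unique_ptr<unsigned char[]>> buffers_;
};
```

Because every allocation is registered by name, the framework always knows what exists, who holds authority over it, and what must be freed, so user code never leaks buffers across host and device.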
Kappa supports true lambda-calculus-style functionality, since all CUDA C++ and OpenMP C++ source code may easily be compiled, loaded, and executed dynamically at runtime. This allows, for example, generating code specific to the runtime options selected, so that the code executed is optimal for the specific operations requested. This technique has been used in the past, for example, in various bioinformatics search and alignment strategies, to produce code containing only the branches that will actually be executed. Another, more generic, usage is code that is optimal for the runtime hardware because it is (JIT) compiled for that hardware.
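The payoff of branch elimination can be shown with a small C++ analogue (this is not Kappa's JIT mechanism; the names are illustrative). Here each option is compiled into its own branch-free variant and a runtime option selects which variant runs; runtime compilation takes this one step further by generating and compiling exactly such a specialized kernel on demand:

```cpp
#include <cassert>
#include <vector>

// Conceptual illustration: the template parameter is a compile-time
// constant, so each instantiation's inner loop contains no runtime
// branch on the option -- the conditional is constant-folded away.
template <bool Squared>
int accumulate(const std::vector<int>& v) {
    int sum = 0;
    for (int x : v)
        sum += Squared ? x * x : x;   // folded per instantiation
    return sum;
}

using AccumFn = int (*)(const std::vector<int>&);

// Runtime option -> pre-specialized, branch-free code.
AccumFn select(bool squared) {
    return squared ? &accumulate<true> : &accumulate<false>;
}
```

With runtime compilation the set of variants need not be fixed in advance: the specialized source is generated from the options actually selected and compiled on the spot.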
Visit psilambda.com to download and try the Kappa library for free. The Quick Start Guides should have you using Kappa for CUDA and OpenMP (or Perl or Python) in very little time. The User Guide will help you complete your understanding of Kappa’s capabilities, while the Reference Manual provides the reference information that you need. Please feel free to use the forums at psilambda.com for further information and interaction.