Kappa library announcement: The Kappa Framework for parallel components

This is the formal announcement of the Kappa library to the NVIDIA forums. (Kappa has been released and gaining functionality for a while; the current version is 1.3.2.)

The Kappa library runs on any CUDA platform. It has full support for multiple GPUs (currently compiled for 1024 GPUs per host; contact Psi Lambda LLC if more are needed). It has the largest set of language bindings of any CUDA framework. It is the only publicly announced CUDA framework that provides concurrent kernel execution, let alone automatic concurrent kernel execution. Overlapped memory transfer and kernel execution is trivially specified. Kappa is host multi-threaded, with speed-optimized APIs and data flow scheduling that automatically drives maximal CPU and GPU utilization. By allowing full specification of dynamically sized CUDA kernel launches, algorithms can generalize their utilization (see the SDK radix sort example for a non-generalized CUDA kernel launch).
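
For comparison, here is roughly what a dynamically sized kernel launch looks like when coded by hand in plain CUDA. This is not Kappa code; the kernel and names are illustrative:

[codebox]
// Sketch: size the launch from the data instead of hard-coding it.
__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void launch_scale(float *d_x, int n)
{
    int block = 256;                      // threads per block
    int grid  = (n + block - 1) / block;  // enough blocks to cover n elements
    scale<<<grid, block>>>(d_x, n);       // sized at runtime, not compile time
}
[/codebox]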

The Kappa framework allows specifying the flow of data through any mixture of CUDA C++, OpenMP C++, and C/C++ kernels and Perl, Python, and SQL routines and statements. The flows of data can be dynamically specified as (massively) parallel partitioned flows using index notation. Each flow of data can be independent and can stop at any point along the flow. This means that you can easily specify your problem as multiple parallel flows of data through all of the possible data flow paths, and let the kernels and routines along the flow cut off further processing of paths that are not desired. This form of processing maps naturally onto database transactions or pipelined operations. The indexes and parameters for the parallel flow of data may come from any kernel, routine, or (SQL) statement, so the parallel data flow execution is easily controlled from any data source or calculation.
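
As a taste of the index notation, here is a hypothetical fragment built only from constructs that appear in the full example later in this thread:

[codebox]
// Hypothetical fragment: run the subroutine 'flow' across #num_indices
// parallel data flows; within 'flow', names suffixed with _$a expand into
// independent per-index instances (streams, buffers, handles, ...).
!Expand LABELSET=example -> flow(#num_indices);
[/codebox]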

To do this, Kappa has automatic host and GPU memory management and data authority transfer. In other words, Kappa tracks all memory, data access, modules, and kernels and provides automatic transfer and cleanup. Full access to the capabilities of the CUDA driver API (a superset of the CUDA runtime API) is usually available through options and is always available using kernels or custom keywords. Besides the obvious extensibility available through CUDA or OpenMP C++ or Perl or Python routines, Kappa supports further DSL (Domain Specific Language) development through easily extensible API language keyword functionality, with MIT License examples provided. (The Perl and Python bindings and the keyword multi-threaded routine support are all provided as MIT License source code for further development or as examples.)

Kappa has support for true lambda calculus functionality, since all CUDA C++ and OpenMP C++ source code may easily be compiled, loaded, and executed dynamically at runtime. This allows, for example, generating code specific to the runtime options selected, so that the code executed is optimal for the specific operations requested. This technique has been used in the past, for example, in bioinformatics search and alignment strategies, to run code containing only the branches that will actually be executed. Another, more generic, usage is code that is optimal for the runtime hardware and is JIT-compiled for that hardware.
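
Kappa handles this behind the scenes. Purely as an illustration of the general technique (this is not Kappa's implementation, and NVRTC is a later addition to the CUDA toolkit), here is runtime compilation and JIT loading in C++ with NVRTC and the CUDA driver API:

[codebox]
// Sketch: compile CUDA C++ source to PTX at runtime, then let the driver
// JIT the PTX for the actual hardware. Assumes a current CUDA context
// (cuInit/cuCtxCreate done elsewhere); error checking omitted for brevity.
#include <cuda.h>
#include <nvrtc.h>
#include <vector>

CUfunction compile_and_load(CUmodule *module)
{
    const char *src =                     // generated at runtime in practice
        "extern \"C\" __global__ void scale(float *x, int n) {\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i < n) x[i] *= 2.0f;\n"
        "}\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "scale.cu", 0, NULL, NULL);
    nvrtcCompileProgram(prog, 0, NULL);   // CUDA C++ -> PTX

    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // The driver JIT-compiles the PTX for the device in the current context.
    cuModuleLoadDataEx(module, ptx.data(), 0, NULL, NULL);
    CUfunction fn;
    cuModuleGetFunction(&fn, *module, "scale");
    return fn;
}
[/codebox]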

Visit psilambda.com to download and try the Kappa library for free. The Quick Start Guides should have you using Kappa for CUDA and OpenMP (or Perl or Python) in very little time. The User Guide will help you complete your understanding of Kappa’s capabilities while the Reference Manual provides the reference information that you need. Please feel free to use the forums at psilambda.com for further information and interaction.

We have a system with 1025 TESLAs on it… We have hired a thermal power station to keep it cool. Unfortunate that you guys support only 1024. ;-)

Anyway,
I tried downloading the sample. I could not make head or tail of it. Can you briefly explain what exactly Kappa is in a few lines? Thanks!
A sample code snippet would be gr8!

Let me know once you get the cooling going; I'll give you a version with 2048 GPUs or however many you need. I was just trying to do my part to prevent global warming.

This part of the Quick Start Guide shows processing data from a SQL data source, partitioned into four parallel data flows to the GPU. Here is what the output from that looks like:

[codebox]/usr/bin/time BUILD/m64/opengl/ikappa/ikappa sqltest/read.k
number of categories: 4 categories: 1 2 3 4
1048576 rows 40 bytes per row 65536 rows per batch
1048576 rows 40 bytes per row 65536 rows per batch
1048576 rows 40 bytes per row 65536 rows per batch
1048576 rows 40 bytes per row 65536 rows per batch
number of loops: 16 = 1048576 / 65536
number of loops: 16 = 1048576 / 65536
number of loops: 16 = 1048576 / 65536
number of loops: 16 = 1048576 / 65536
Processing time: 2730.76 (ms)
8.69user 2.00system 0:22.84elapsed 46%CPU (0avgtext+0avgdata 3821920maxresident)k
0inputs+0outputs (0major+233182minor)pagefaults 0swaps[/codebox]

The four parallel data flows are dynamically determined from the data source and used to expand the labels. Here is what was required (besides configuring the SQL connection parameters) to make that happen:

[codebox]
// The main IO loop
!SQL ASYNC=true FAST=true -> read@dbhandle_$a(OUT_$a, #chunk_size, #rows_read_$a);
!CUDA/Kernel STREAM=str_$a OUTPUT_WARNING=false -> sqltest@sqltest(OUT_$a, #rows_read_$a) [ = OUT_$a #rows_read_$a];

!SQL -> connect@dbhandle_$a('pgsql',{PGPARAMS});
!SQL ASYNC=true STAR=true -> select@dbhandle_$a('select pk_sid, dima, dimb, dimc, dimd, dime, dimf, measurea, measureb, measurec from star_table where cat_pk_sid= %u order by dima;', $a, Categories, '=%lu %u %u %u %u %u %u +%f %u %lf', #num_rows_$a, #num_cols_$a, #row_size_$a);
// Get the number of rows to process at once using an if evaluation.
!Value -> rows_allocate_$a = if ( ( #chunk_size < #num_rows_$a ) , #chunk_size , #num_rows_$a );
!Variable STREAM=str_$a VARIABLE_TYPE=%KAPPA{LocalToDevice} -> OUT_$a(#rows_allocate_$a, #row_size_$a);
// Calculate how many iterations based on the number of rows and
// how many rows to process at once.
!Value -> numloops_$a = ( #num_rows_$a / #chunk_size );
// Perform a synchronization so the #numloops_$a Value is ready
!Synchronize (#numloops_$a);
!Print ('number of loops: ', #numloops_$a, ' = ', #num_rows_$a, ' / ', #chunk_size );
!Subroutine LABELSET='sql' UNROLL=true LOOP=#numloops_$a -> sqlio;
!SQL -> disconnect@dbhandle_$a(); // disconnect dbhandle

!CUDA/Module -> sqltest = 'sqltest/sqltest.cu';
// Set the size of the data to process at once
!Value -> chunk_size = 65536;
// Connect to the database and get the categories to use for splitting into parallel processes
!SQL -> connect@dbmain('pgsql',{PGPARAMS});
!SQL -> select@dbmain('select distinct cat_pk_sid from star_table;', '%u', #num_rows_cat, #num_cols_cat, #row_size_cat);
!Variable -> Categories(#num_rows_cat,#row_size_cat);
!SQL -> read@dbmain(Categories,#num_rows_cat,#rows_read_cat);
!SQL -> disconnect@dbmain();
!Value -> cat_indice = Categories;
!Print ( 'number of categories: ', #rows_read_cat, 'categories: ', #cat_indice);
// Synchronize the Value of how many categories so that Expand can use it as an argument
!Synchronize (#rows_read_cat);
// Expand and run the processing in parallel across the categories
!Expand LABELSET=sql -> sqlprocess(#rows_read_cat);
// Unload, cleanup, stop
!CUDA/ModuleUnload -> sqltest;

// Setup the CUDA context and load the CUDA module
!Context -> context;
!Subroutine -> sqlread;
!ContextReset -> Context_reset;
!Stop;
!Finish;
[/codebox]

The CUDA streams, “str_$a”, are assigned manually in order to get the right overlap of memory transfers and kernel execution while still enabling concurrent kernel execution. The “$a” part of the stream name is expanded from the data source, giving four different CUDA streams for this example data source.
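
Written by hand against the CUDA runtime API, the pattern being arranged here looks roughly like the following sketch. The names are hypothetical; Kappa issues the equivalent driver API calls for you:

[codebox]
// Sketch: one stream per data flow so host-to-device copies overlap with
// kernel execution, and kernels from different streams can run concurrently.
#include <cuda_runtime.h>

__global__ void sqltest(void *out, int rows);   // illustrative declaration

void run_flows(void *d_out[], void *h_batch[], size_t batch_bytes,
               int rows_read[], int num_flows, dim3 grid, dim3 block)
{
    cudaStream_t streams[4];              // num_flows <= 4 assumed here
    for (int a = 0; a < num_flows; ++a)
        cudaStreamCreate(&streams[a]);

    for (int a = 0; a < num_flows; ++a) {
        // Asynchronous copy; h_batch[a] must be pinned host memory.
        cudaMemcpyAsync(d_out[a], h_batch[a], batch_bytes,
                        cudaMemcpyHostToDevice, streams[a]);
        // Queued in the same stream: starts after its own copy finishes,
        // independently of the other flows' copies and kernels.
        sqltest<<<grid, block, 0, streams[a]>>>(d_out[a], rows_read[a]);
    }

    for (int a = 0; a < num_flows; ++a) {
        cudaStreamSynchronize(streams[a]);
        cudaStreamDestroy(streams[a]);
    }
}
[/codebox]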

The sqltest.cu file containing the sqltest kernel is compiled to PTX if needed, then JIT-compiled and loaded as the sqltest module.

It may be confusing if you are looking for a lot of code to allocate and free memory, create and destroy streams, transfer data, load and unload modules, and launch kernels, since you are not going to find it; you are just going to find things like “Variable”, “STREAM”, “CUDA/Module”, “CUDA/Kernel”, or “C/Kernel”. I must admit that I have sometimes found assembly or machine code to be clearer. Fortunately, in this case, the simple, short form is usually as fast or faster. (Also, there are trace flags that can be turned on if you want to see under the hood.)

The first two subroutines could be provided as a compiled shared library that is loaded by the “process->LoadRoutine ();” method, and the rest of your code could have set up the CUDA context, so that you just invoke:

[codebox]
process->ExecuteString(string("\n!Subroutine -> sqlread;\n"));
[/codebox]

but that makes it too simple. (The ikappa example program can be used to turn the “sqlio” and “sqlread” subroutines shown earlier into a project for a C++ shared library; an example of the shared library project files is in the Quick Start Guide.)

As far as C++/C# examples go, the ikappa and wkappa examples are very simple but cover everything. The Kappa for Perl and Kappa for Python web pages (under Download) have simple examples for Perl and Python.

Hi Kappa,

I am sorry… but I really don't understand what “sql” is doing here… Maybe Kappa is getting domain-specific… And it is for this precise reason that it would be great if you could give a brief description of what the software achieves in a few lines… Just from a high level, what it all means. I don't even need a code snippet…

I just want to register Kappa in my mind so that I can look at it at the right time for the right job. Thanks!

To me this looks like a wrapper around CUDA so that you don’t have to worry about writing any host code with the CUDA driver or runtime API.

It offers a lot of interfaces to stream data in from various sources (including databases, files, etc.) and it offers a plethora of language bindings (Perl, Python, etc.).

The main drawback (to me) seems to be that you have to learn another API. But once you’re familiar with that, I guess you can mostly focus on writing the CUDA kernels, instead of dealing with all that nasty host-side stuff (memory allocations, copying data, streams API, …).

Judging by the activity on their web site forums, they don’t have a lot of active users yet. ;)

Christian

Hi Christian,

Thanks for a good explanation. I was just looking for that kind of explanation. I hope Kappa can confirm that.
It's good!

Best Regards,
Sarnath

I can confirm that Christian described it very well. Thanks! It was designed to let you focus on the kernels and algorithms and not get distracted by the details surrounding them. Even more important, it is designed to let you write abstracted functionality on top of the kernels and algorithms without the whole thing becoming a mess. By a mess, I mean not being able to hide the sizing of kernel launches, worrying about GPU occupancy and capability, and other implementation details that have nothing to do with the abstract class you are trying to implement. A common example of what I mean by abstracted functionality would be a Matrix class with kernels (or CUBLAS) providing the usual methods.
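
As a minimal sketch of that idea (not Kappa code; the cuBLAS calls are real, but the Matrix class itself is hypothetical):

[codebox]
// Sketch: a Matrix abstraction whose multiply hides launch sizing and
// occupancy concerns entirely inside cuBLAS.
#include <cublas_v2.h>
#include <cuda_runtime.h>

class Matrix {
public:
    int rows, cols;
    float *d;                                   // column-major device storage
    Matrix(int r, int c) : rows(r), cols(c) {
        cudaMalloc(&d, sizeof(float) * r * c);
    }
    ~Matrix() { cudaFree(d); }

    // C = this * B; the caller never sees grids, blocks, or streams.
    void multiply(cublasHandle_t h, const Matrix &B, Matrix &C) const {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                    rows, B.cols, cols,
                    &alpha, d, rows, B.d, B.rows, &beta, C.d, C.rows);
    }
};
[/codebox]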

I admit it is another API to learn. To make up for that, it is a simple API that you can extend and change. You can actually override any of the built-in keywords with your own if you wish; see the CSV, Perl, or Python keyword examples for how to do this. The attributes for keywords are checked, if they are checked or used at all, by the keywords themselves, so that new, unforeseen options and changes to functionality are easy to implement. I am open to adding new syntax patterns to the parser, or the parser can be bypassed or replaced.

There is some extra functionality that Kappa provides that can be ignored if you wish. I will describe it in case it helps anyone; if it just seems arcane, then ignore it and use the original announcement to understand this functionality. The equation (rendered as an image in the original post and not reproduced here) may at least be understood as finding the optimal distribution function solutions along different paths, where the implementation or representation of the different paths is specified by kappa. If this does not make much sense to you, then rest assured, whether you are a potential or current Psi Lambda LLC customer or a competitor, that the design and implementation of the Kappa Library is not only based on years of experience in enterprise-scale critical software implementations but also implements parallel computational models derived from the most advanced fundamental mathematical principles.
