I’m replying to this thread just to help people jump-start using CUDA with LabVIEW on Windows.
Update: NI announced their CUDA package:
[url=“Welcome to NI Labs LabVIEW GPU Computing - NI Community”]http://decibel.ni.com/content/thread/3524[/url]
but at first sight it doesn’t look so easy to use. Anyway, if somebody is interested in how to implement CUDA from scratch, feel free to read on.
This is not meant to be an exhaustive introduction, and some familiarity with CUDA is expected (please refer to the CUDA documentation; the first couple of chapters of the Programming Guide are enough to start):
[url=“http://www.nvidia.com/object/cuda_develop.html”]http://www.nvidia.com/object/cuda_develop.html[/url]
So far I’ve been successfully using CUDA 2.3 (in Visual Studio 2008) and LabVIEW 8.6 for some real-time image processing, and it has been working really well for me. My algorithm needs to perform around 10 filtering operations (FFT, pointwise matrix multiplication, IFFT), and it does that in less than 50 ms for 1024x1024 images (compared to over 300 ms on the CPU). This was enough for our real-time application.
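For reference, one filtering pass of the kind described above can be sketched with cuFFT roughly like this (a hedged sketch, not my actual code; `FilterPass`, `d_img`, `d_filter`, and the `PointwiseMultiply` kernel are hypothetical names, and error checking is omitted):

```cuda
#include <cufft.h>

// Hypothetical kernel that multiplies the spectrum by the filter, element by element.
__global__ void PointwiseMultiply(cufftComplex *a, const cufftComplex *b);

// One FFT -> pointwise multiply -> IFFT pass on a 1024x1024 complex image.
// d_img and d_filter are device pointers, already filled.
void FilterPass(cufftComplex *d_img, cufftComplex *d_filter)
{
    cufftHandle plan;
    cufftPlan2d(&plan, 1024, 1024, CUFFT_C2C);

    // Forward FFT, in place.
    cufftExecC2C(plan, d_img, d_img, CUFFT_FORWARD);

    // 1024*1024 points, 256 threads per block.
    PointwiseMultiply<<<1024 * 1024 / 256, 256>>>(d_img, d_filter);

    // Inverse FFT, in place. Note that cuFFT does not normalize the
    // inverse transform, so the result comes back scaled by 1024*1024.
    cufftExecC2C(plan, d_img, d_img, CUFFT_INVERSE);

    cufftDestroy(plan);
}
```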
-
So, first thing on your list is to install CUDA drivers, toolkit and sdk from nvidia website (you need all of them).
-
Second thing is to build DLLs with CUDA. In short, you want to start with the CUDA template project that comes in the “…SDK\C\src” folder. You need to modify the code and the project properties (under the Project tab) so the compiler knows it needs to build a DLL. Don’t forget to put __declspec(dllexport) in front of every function you want exported from your DLL.
Extensive information on how to build a DLL can be found in threads like this one:
[url=“http://forums.nvidia.com/index.php?showtopic=97928&pid=545650&mode=threaded&start=0#entry545650”]The Official NVIDIA Forums | NVIDIA[/url]
- You need to import that DLL into LabVIEW. You do this by using the Call Library Function Node (under Connectivity/Libraries and Executables).
In short, you need to specify the path to your DLL (by default it will be in the “…SDK\C\bin\win32\Debug” folder) and add the parameters that LabVIEW is going to use. The parameters must match the ones in your DLL source code exactly, otherwise LabVIEW crashes.
A detailed explanation of how the Call Library Function Node works can be found here (it’s far more than you need if you are a beginner):
[url=“Product Documentation - NI”]Product Documentation - NI[/url]
- Finally, if you want to deploy your DLL to other computers, don’t forget to compile a Release version of the DLL.
That’s it. I’m attaching simple code that scales an array. The VI creates a 1D array and a scaling constant, and passes those to a DLL that performs the scaling on the GPU (note: the scaling is done in place).
Note: For some reason, your LabVIEW VI needs to be closed while the DLL is being compiled. This is a slight nuisance when you go back and forth between LabVIEW and VS, because you have to close the VI every time.
I’m more than open to suggestions, comments, etc. that would improve this brief description of how to use CUDA with LabVIEW.
Hope it helps
Nenad
BU Biomicroscopy Lab
P.S. For some reason it turns out I cannot attach VIs, so I’m posting only a snapshot.
// includes, project
#include <cutil_inline.h>
// LabVIEW will pass array 'h_a' ('h' stands for host), scalar 'alpha', and the array size.
#define BLOCKSIZE 512 // 512 is the maximum number of threads in the block.
__global__ void ScaleMatrix_Kernel( float *d_a, float alpha, int arraySize)
{
    // Block index
    int bx = blockIdx.x;
    // Thread index
    int tx = threadIdx.x;
    int index = blockDim.x * bx + tx;
    // Copies the array into shared memory. This is important only when threads
    // communicate with each other; it's not necessary here, since we are only
    // scaling a vector.
    __shared__ float d_as[BLOCKSIZE];
    if (index < arraySize)   // bounds check, so a partial last block doesn't read past the array
        d_as[tx] = d_a[index];
    __syncthreads();         // every thread must reach this, so only the loads/stores are guarded
    // copies the result back to global device memory
    if (index < arraySize)
        d_a[index] = alpha * d_as[tx];
}
// extern "C" keeps the exported name unmangled, so LabVIEW can find the function by name.
extern "C" __declspec(dllexport) void ScaleMatrix(float *h_a, float alpha, int arraySize)
{
    unsigned int mem_size = sizeof( float) * arraySize;
    // allocate device memory
    float* d_a;
    cutilSafeCall( cudaMalloc( (void**) &d_a, mem_size));
    // copy host memory to global device memory
    cutilSafeCall( cudaMemcpy( d_a, h_a, mem_size, cudaMemcpyHostToDevice) );
    // setup execution parameters: enough blocks to cover the whole array, rounding up
    dim3 dimGrid( (arraySize + BLOCKSIZE - 1) / BLOCKSIZE, 1, 1);
    dim3 dimBlock( BLOCKSIZE, 1, 1);
    // execute the kernel
    ScaleMatrix_Kernel<<< dimGrid, dimBlock>>>( d_a, alpha, arraySize);
    // copy device memory back to host
    cutilSafeCall( cudaMemcpy( h_a, d_a, mem_size, cudaMemcpyDeviceToHost) );
    cutilSafeCall(cudaFree(d_a));
}
------------------ end ---------------------------------