Examples in SDK are too difficult.

Hello,

The examples in the GPU Computing SDK 4.0 are far too difficult for programmers trying to use the CUDA driver API from other languages such as Pascal/Delphi.

These examples assume that everything is already working, and this assumption is false.

Getting CUDA working from other languages is already difficult enough, especially passing parameters to CUDA kernel calls.

There should be some examples that focus purely on getting CUDA working.

For example, passing in integers and pointers, and passing out integers:

__global__ void Kernel( int ParameterIn, int *ParameterOut )
{
	*ParameterOut = ParameterIn;
}

Even this is already difficult enough to get working.

^ This is probably the most basic CUDA example that can be made to work fully.

A half-complete example (input only) would also be useful if the compiler did not optimize away code that has no effect.

So the code above is probably the minimum… there must be output, otherwise all the code is optimized away.

Furthermore:

What does “kernel parameters” expect on the host side? Pointers to host values? Or pointers to CUDA pointers?

It’s a bit confusing.

So far I am seeing CUDA pointers being passed in to my more complex test… it’s strange.

Now I will have to fall back to a really simple test like the one above to see what’s happening.

I am now unsure what exactly to pass in from the host side, since there are no simple examples available in the SDK.

Just finding the “launch call site” is already difficult enough with all that complex clutter code around it.

I will now try to get this working:

extern "C"
{ // extern c begin

__global__ void Kernel( int ParaIn, int *ParaOut )
{
	*ParaOut = ParaIn;
}

} // extern c end

Very strange…

I don’t know how to pass parameters correctly. It seems to work, but then freeing the CUDA memory no longer works?

Some questions:

  1. Does the parameter pointer array have to be cuda memory perhaps ?

  2. Does the parameter pointer array have to be host memory ?

  3. Can host values be passed for integers ?

  4. Must host pointers be passed for cuda integer references (int arrays) ?

  5. Should perhaps pointers to pointers be used ?!?

^ The way to pass data to CUDA kernels seems to be badly documented, unless somebody can point me to better documentation than the few comments in cuda.h.

What is needed is a complete basic example showing how to make the kernel above work without causing the memory free to fail.

So far it seems as if the CUDA pointer is being changed, which leads to problems? Very odd.

Perhaps there is a mistake somewhere in my code, but I don’t think so…

There seems to be some complexity with all of this…

What does the CUDA malloc call actually return?

Is it a host pointer to some CUDA pointer?

The declaration in C seems to use (void **).

Where can I find more information about all of this ???
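
As far as the headers go, the driver API declares `CUresult cuMemAlloc(CUdeviceptr *dptr, size_t bytesize)`: the caller passes the host address of a `CUdeviceptr` variable and the call writes the device address into it. The runtime API's `cudaMalloc(void **devPtr, size_t size)` follows the same out-parameter pattern, which is where the `(void **)` comes from. The sketch below is a host-only mock in plain C, so it runs without a GPU. `MockDevicePtr`, `mock_mem_alloc`, `alloc_one_int` and the base address and capacity are hypothetical stand-ins, not real CUDA names or values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for CUdeviceptr: an integer handle, not a host pointer. */
typedef uintptr_t MockDevicePtr;

#define MOCK_DEVICE_BASE     ((MockDevicePtr)0x10000)  /* made-up first device address */
#define MOCK_DEVICE_CAPACITY ((size_t)1024)            /* made-up device memory size   */

static size_t mock_device_used = 0;  /* bytes handed out so far */

/* Same shape as cuMemAlloc(CUdeviceptr *dptr, size_t bytesize):
   the device address is written THROUGH the host pointer dptr.
   Returns 0 on success, nonzero on failure. */
int mock_mem_alloc(MockDevicePtr *dptr, size_t bytesize)
{
    if (dptr == NULL || bytesize == 0)
        return 1;                                   /* invalid argument */
    if (bytesize > MOCK_DEVICE_CAPACITY - mock_device_used)
        return 2;                                   /* out of (mock) memory */
    *dptr = MOCK_DEVICE_BASE + mock_device_used;    /* result lands in the caller's variable */
    mock_device_used += bytesize;
    return 0;
}

/* Allocates room for one int and returns the device address by value. */
MockDevicePtr alloc_one_int(void)
{
    MockDevicePtr device_address = 0;   /* this variable itself lives in host memory */
    int status = mock_mem_alloc(&device_address, sizeof(int));
    assert(status == 0);
    return device_address;
}
```

So the thing handed to the allocation call is a host pointer to a host variable, and what ends up stored in that variable is the device address. It is that stored value that later goes to the copy and free calls.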

Ok,

I see I had a little bug in my code… it was returning false while the result was OK…

The code was:

if

which needed to be:

if not

But some more documentation would still help…

Now I can go back to my trial and error runs ;)

I seem to have figured it out and it goes like this:

There is an inconsistency in the way parameters are passed to kernels.

The inconsistency is this:

  1. Input integers can simply be passed as host memory.

  2. Output integers must be passed as CUDA memory.

^ Big inconsistency.

It would have been better if input integers also had to be CUDA memory.

Example:

ParameterCount := 2;
Parameter[0] := vParameterIn.Address; // input integer parameter must be passed as host pointer to host memory.
Parameter[1] := @vParameterOut.Handle; // output integer parameter must be passed as a host pointer to cuda memory pointer.

Address returns host address of host memory.
Handle returns cuda memory pointer.
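
The Pascal lines above can be restated in C. Going by the `cuLaunchKernel` documentation, `kernelParams` is a host array of host pointers, and each entry points at host storage holding that argument's value: for the `int` that is the integer itself, for the `int *` it is the device address (a `CUdeviceptr`). Read that way, both entries follow one rule. The sketch below is a host-only mock in plain C, so it runs without a GPU. `MockDevicePtr`, `mock_translate`, `mock_launch` and `run_example` are hypothetical names, and the real driver calls appear only in comments.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uintptr_t MockDevicePtr;   /* hypothetical stand-in for CUdeviceptr */

enum { MOCK_DEVICE_INTS = 4 };
static int mock_device_ints[MOCK_DEVICE_INTS];   /* fake device memory */

/* In this mock, device addresses are 1-based indices into mock_device_ints. */
static int *mock_translate(MockDevicePtr p)
{
    assert(p >= 1 && p <= MOCK_DEVICE_INTS);
    return &mock_device_ints[p - 1];
}

/* The kernel body from the post, run on the host for the mock. */
static void kernel_body(int ParaIn, int *ParaOut)
{
    *ParaOut = ParaIn;
}

/* Mock launcher: copies each argument VALUE out of host memory through
   kernel_params[i], then runs the kernel body with those values. */
void mock_launch(void **kernel_params)
{
    int para_in;
    MockDevicePtr para_out;
    memcpy(&para_in,  kernel_params[0], sizeof para_in);
    memcpy(&para_out, kernel_params[1], sizeof para_out);
    kernel_body(para_in, mock_translate(para_out));
}

int run_example(int value)
{
    int parameter_in = value;          /* host int */
    MockDevicePtr parameter_out = 1;   /* real code: filled in by cuMemAlloc */
    void *params[2];
    int result = 0;

    params[0] = &parameter_in;    /* host pointer to the int value      */
    params[1] = &parameter_out;   /* host pointer to the device address */

    /* real code: cuLaunchKernel(func, 1,1,1, 1,1,1, 0, NULL, params, NULL); */
    mock_launch(params);

    assert(parameter_out == 1);   /* the launch does not modify the handle */

    /* real code: cuMemcpyDtoH(&result, parameter_out, sizeof(int)); */
    memcpy(&result, mock_translate(parameter_out), sizeof result);
    return result;
}
```

In the mock, `run_example(42)` returns 42. Note that `params[1]` is the address of the variable holding the device address, not the device address itself, which matches `@vParameterOut.Handle` in the Pascal version.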

Now I am still having problems with multiple parameters and arrays, so moving on to next somewhat larger example…

Array kernel example:

extern "C"
{ // extern c begin

// para4 is array of 3 integers
// para5 is array of 4 integers
// return some values in them
__global__ void Kernel( int Para1, int Para2, int Para3, int *Para4, int *Para5 )
{
	Para4[0] = 111;
	Para4[1] = 222;
	Para4[2] = 333;

	Para5[0] = Para1;
	Para5[1] = Para2;
	Para5[2] = Para3;
	Para5[3] = 666;
}

} // extern c end

Using the same technique as above now doesn’t work… I wonder why ?!?

Ok, I spotted the problem.

The size parameter to the device-to-host copy function was zero: a little programming mistake in calculating the size somewhere… it wasn’t being assigned/stored.

These kinds of programming mistakes are hard to spot !
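
One way to make a forgotten size easier to catch is to derive the byte count from the element count right at the copy call and to treat a zero-byte copy as an error. The sketch below applies that to the array kernel above, again as a host-only mock in plain C that runs without a GPU. `MockDevicePtr`, `mock_translate`, `mock_launch`, `checked_copy_to_host` and `run_array_example` are hypothetical names. The real `cuMemcpyDtoH(void *dstHost, CUdeviceptr srcDevice, size_t ByteCount)` is only referenced in a comment, and the zero-size check is an assumption of this sketch, not documented driver behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { PARA4_COUNT = 3, PARA5_COUNT = 4 };

typedef uintptr_t MockDevicePtr;   /* hypothetical stand-in for CUdeviceptr */

static int mock_device_ints[PARA4_COUNT + PARA5_COUNT];   /* fake device memory */

/* In this mock, device addresses are 1-based indices into mock_device_ints. */
static int *mock_translate(MockDevicePtr p)
{
    assert(p >= 1 && p <= PARA4_COUNT + PARA5_COUNT);
    return &mock_device_ints[p - 1];
}

/* The array kernel body from the post, run on the host for the mock. */
static void kernel_body(int Para1, int Para2, int Para3, int *Para4, int *Para5)
{
    Para4[0] = 111;   Para4[1] = 222;   Para4[2] = 333;
    Para5[0] = Para1; Para5[1] = Para2; Para5[2] = Para3; Para5[3] = 666;
}

/* Mock launcher: copies every argument value out of host memory via params[i]. */
static void mock_launch(void **params)
{
    int p1, p2, p3;
    MockDevicePtr p4, p5;
    memcpy(&p1, params[0], sizeof p1);
    memcpy(&p2, params[1], sizeof p2);
    memcpy(&p3, params[2], sizeof p3);
    memcpy(&p4, params[3], sizeof p4);
    memcpy(&p5, params[4], sizeof p5);
    kernel_body(p1, p2, p3, mock_translate(p4), mock_translate(p5));
}

/* Stands in for cuMemcpyDtoH(dst, src, byte_count). A zero size is reported
   as an error instead of silently copying nothing. Returns 0 on success. */
int checked_copy_to_host(int *dst, MockDevicePtr src, size_t byte_count)
{
    if (dst == NULL || byte_count == 0)
        return 1;
    memcpy(dst, mock_translate(src), byte_count);
    return 0;
}

int run_array_example(int a, int b, int c, int *out4, int *out5)
{
    MockDevicePtr para4 = 1;                 /* first 3 ints of the fake device memory */
    MockDevicePtr para5 = 1 + PARA4_COUNT;   /* next 4 ints                            */
    void *params[5];

    params[0] = &a;       /* host pointers to the int values         */
    params[1] = &b;
    params[2] = &c;
    params[3] = &para4;   /* host pointers to the device addresses   */
    params[4] = &para5;
    mock_launch(params);

    /* Byte counts are computed from the element counts at the call site. */
    if (checked_copy_to_host(out4, para4, PARA4_COUNT * sizeof(int)) != 0) return 1;
    if (checked_copy_to_host(out5, para5, PARA5_COUNT * sizeof(int)) != 0) return 1;
    return 0;
}
```

Calling `run_array_example(1, 2, 3, out4, out5)` with `int out4[3]` and `int out5[4]` fills them with 111, 222, 333 and 1, 2, 3, 666 in the mock. A call such as `checked_copy_to_host(out4, 1, 0)` returns 1 instead of quietly leaving the buffer untouched.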

Glad I found it !

Now everything is working with above techniques ! ;)

I was already thinking about giving up on cuda and trying opencl… I’m glad it’s working now with cuda ! ;) =D