Problem with kernels

Hi, I’m having trouble getting my CUDA program to work. Apparently it is something to do with the way I call the kernel, and what I’m telling it to do compared to what I want it to do.

OK, what I want it to do is (for now):

input x

maths calculation using x

set answer to maths calculation as x

do this 100 times

output x

Here is my code:


[codebox]#include <cuda.h>
#include <windows.h>
#include <iostream>

__global__ void runge_4(float *x){
	*x = *x + 0.01;
}

using namespace std;

float x = 1;
float *gpu;

LARGE_INTEGER numTicksPerSecond;
LARGE_INTEGER startTime;
LARGE_INTEGER endTime;

int main(){
	QueryPerformanceFrequency(&numTicksPerSecond);
	QueryPerformanceCounter(&startTime);

	cudaMalloc((void**)&gpu, sizeof(float));
	cudaMemcpy(gpu, &x, sizeof(float), cudaMemcpyHostToDevice);

	runge_4<<<1, 100>>>(gpu);

	cudaMemcpy(&x, gpu, sizeof(float), cudaMemcpyDeviceToHost);

	QueryPerformanceCounter(&endTime);

	LONGLONG numTicks = endTime.QuadPart - startTime.QuadPart;
	double numSeconds = (((double) numTicks) / (double) numTicksPerSecond.QuadPart);

	cout << "Num Ticks Per Second : " << numTicksPerSecond.QuadPart << endl;
	cout << "Start " << startTime.QuadPart << endl;
	cout << "End : " << endTime.QuadPart << endl;
	cout << "Num Ticks : " << numTicks << endl;
	cout << "Num seconds : " << numSeconds << endl;

	cout << x << endl;

	cudaFree(gpu);
	return 0;
}[/codebox]

The output from this code is 1.01, although I tell the kernel to run 100 times, and changing the numbers in the kernel launch line produces some weird results:

e.g.

runge_4<<<100, 100>>>(gpu); gives an output of 1.07

runge_4<<<1, 1000>>>(gpu); gives an output of 1

Start checking errors. You can’t launch 1000 threads in a block (the limit is 512), and floating point math does not work the way you think it works.

This code compiles and does not give any errors. Also, I only tried a value of 1000 after the value of 100 didn’t work.

I want the function to run 100 times in series.

Anyone know how to do this?

The code certainly gives errors when you use 1000. CUDA calls have return values for a reason.
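
For example, a minimal sketch of what checking those return values could look like (the check helper is just something made up for illustration, it is not part of the CUDA API):

[codebox]#include <cuda.h>
#include <cstdio>

__global__ void runge_4(float *x){
	*x = *x + 0.01f;
}

// Minimal error-checking helper; the name "check" is only for illustration.
static void check(cudaError_t err, const char *what){
	if (err != cudaSuccess)
		printf("%s failed: %s\n", what, cudaGetErrorString(err));
}

int main(){
	float x = 1.0f, *gpu;
	check(cudaMalloc((void**)&gpu, sizeof(float)), "cudaMalloc");
	check(cudaMemcpy(gpu, &x, sizeof(float), cudaMemcpyHostToDevice), "cudaMemcpy H2D");

	runge_4<<<1, 1000>>>(gpu);			// 1000 threads is over the per-block limit
	check(cudaGetLastError(), "kernel launch");	// reports the invalid launch configuration
	check(cudaThreadSynchronize(), "kernel run");	// reports errors during kernel execution

	check(cudaMemcpy(&x, gpu, sizeof(float), cudaMemcpyDeviceToHost), "cudaMemcpy D2H");
	printf("x = %f\n", x);
	cudaFree(gpu);
	return 0;
}[/codebox]

With that in place, the <<<1, 1000>>> launch reports an invalid configuration instead of silently leaving x untouched.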

I am not using 1000, but the code still does not work as it should.

What I am intending to do with the following line of code is to get the kernel to execute 100 times sequentially.

runge_4<<<1, 100>>>(gpu);

Could you tell me where I am going wrong?

You’re running 100 threads in parallel, which are all writing to the same memory location without any sort of synchronization. This will not work, because the order in which those writes happen is undefined. If you want to compute the sum of a series of numbers in parallel, you’re going to have to do some sort of reduction, not just naive addition.
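
To make the reduction point concrete, a block-level sum looks roughly like the sketch below (assuming a single block of 256 threads; the names block_sum and partial are illustrative only):

[codebox]// Sketch of a block-level sum: each thread loads one value, then the threads
// cooperate through shared memory instead of racing on a single float.
__global__ void block_sum(const float *in, float *out, int n){
	__shared__ float partial[256];		// blockDim.x is assumed to be 256
	int tid = threadIdx.x;
	partial[tid] = (tid < n) ? in[tid] : 0.0f;
	__syncthreads();

	for (int stride = blockDim.x / 2; stride > 0; stride /= 2){
		if (tid < stride)
			partial[tid] += partial[tid + stride];
		__syncthreads();
	}
	if (tid == 0)
		*out = partial[0];		// exactly one thread writes the result
}[/codebox]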

It’s not meant to be addition, it is meant to be an iterative process, namely the Runge-Kutta 4th order algorithm. I realise the whole point of using the GPU is parallelism, but I’m meant to code this for my project.

Is there a way of just running 1 thread 100 times? (or doing something that has the same effect)

a for loop? 100 kernel calls?
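
For reference, the for-loop version just means one thread doing every iteration itself, roughly like this (a sketch; the name runge_4_serial is made up):

[codebox]// One thread performs all 100 iterations sequentially.
__global__ void runge_4_serial(float *x){
	for (int i = 0; i < 100; ++i)
		*x = *x + 0.01f;
}

// Launched with a single block and a single thread:
// runge_4_serial<<<1, 1>>>(gpu);[/codebox]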

I did this but the other objective for this project is speed.

You’re asking how to use a parallel processor in a serial fashion, but you want speed. These two objectives run counter to each other.

@tmurray is right, you use all the threads to update the same location “x”, so the result is undefined:

__global__ void runge_4(float *x)

{

	*x = *x + 0.01;	

}

I suppose that you want to do an ODE computation

dx/dt = f(t, x), where x is a vector of size n

consider forward Euler:

x_{k+1} = x_{k} + h * f(t_{k}, x_{k} )

the simple way is to use n threads, with each thread updating one element of x:

__global__ void runge_4(float *x)

{

	int inx = threadIdx.x;

	

	x[inx] = x[inx] + h * f(tk, x[inx]);	// one forward Euler step per element

}

int main()

{

  ....

runge_4<<<1, n>>>(gpu); // n < 512 

}

I think I understand what you are trying to do: you are doing the same calculation n times using n threads, and each time you do so you store the output into an array, with each thread writing to a separate element of the array instead of all threads writing to the same block of memory.

One very important thing I forgot to mention is that the function runge_4 is not meant to be that simple. I was just using something that simple so I could test it out.

I will try this out and see if it works faster. Thank you both for your help.

Just realised that what you suggested isn’t what I want to do.

I am using the Runge-Kutta 4th order algorithm for a differential equation of the type:
dx/dt = ax
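
For reference, what I am trying to end up with is roughly the classical RK4 update for this equation running inside a single thread, something like the sketch below (a, h and nsteps are placeholders, not my real values):

[codebox]// Sketch of the classical RK4 update for dx/dt = a*x, run serially by one thread.
__global__ void rk4_ax(float *x, float a, float h, int nsteps){
	float xi = *x;
	for (int i = 0; i < nsteps; ++i){
		float k1 = a * xi;
		float k2 = a * (xi + 0.5f * h * k1);
		float k3 = a * (xi + 0.5f * h * k2);
		float k4 = a * (xi + h * k3);
		xi += (h / 6.0f) * (k1 + 2.0f * k2 + 2.0f * k3 + k4);
	}
	*x = xi;
}

// e.g. rk4_ax<<<1, 1>>>(gpu, 1.0f, 0.01f, 100);[/codebox]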

Yes, it depends on the complexity of the forcing term f(t,x).

Suppose f(t,x) = A*x, where A is a matrix (this is a simple example).

Then the difficulty is how to compute A*x; it depends on whether A is sparse or not.
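
For the dense case, the usual starting point is one thread per row, roughly like this (a sketch only; A is assumed to be stored row-major and n small enough for a single block):

[codebox]// Sketch: dense matrix-vector product y = A*x with one thread per row.
__global__ void matvec(const float *A, const float *x, float *y, int n){
	int row = threadIdx.x;
	if (row < n){
		float sum = 0.0f;
		for (int col = 0; col < n; ++col)
			sum += A[row * n + col] * x[col];
		y[row] = sum;
	}
}

// e.g. matvec<<<1, n>>>(d_A, d_x, d_y, n);	// n < 512[/codebox]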