Just finished writing an OpenCL neural net optimization.

I’ve just finished optimizing the artificial neural network library FANN for OpenCL. I added three new functions to an OpenCL-optimized version of the FANN library:


  • fann_run_many(): This runs an ANN on thousands of inputs, resulting in an array of thousands of output values.


  • This trains an ANN on thousands of examples. When run on an appropriately fast GPU it should result in a major speedup. On a slow GPU like my 9500 GT, it runs at CPU speed.


  • One epoch of the above function. Does basically the same thing, but only once instead of iterating until the ANN is fully trained.

All of the above functions are designed to work with multiple ANNs simultaneously, but that hasn’t been tested, so I assume it doesn’t quite work yet.

I haven’t gone in and cleaned up or documented the functions well, but if you’re interested, start with fann/src/optimized/opencl/fann.c. I need to put FANN down for at least a few months and work on another pressing project. I just wanted to get it out for other people to use now. Feel free to build on my work, make it more user-friendly, do more error checking, clean up the build process, etc., and post the new code back here.

I apologize, but I don’t have much free time to offer assistance, so expect to be on your own apart from basic advice.

Re: speed

On my current GPU (GeForce 9500 GT), I’m getting roughly the same speed between the normal and OpenCL versions. I currently have a GTX 285 on order, and it should be at least 10x faster. With a modern GPU, such as the GTX 480, I expect it to be at least 20x faster than my 2.26GHz Nehalem Mac Pro. If you are able to benchmark, please post your results. Please try using an ANN with more neurons per layer than the xor example, though.
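If you want a starting point for timing, here is a rough sketch of a harness. It is not part of FANN or of my OpenCL code: time_batch, speedup, bench_fn and count_calls are made-up names for illustration, and count_calls is only a placeholder for a wrapper around whatever call you are measuring. It uses wall-clock time rather than clock(), on the assumption that CPU time may not count the time spent waiting on the GPU.

```c
#include <time.h>

/* Hypothetical callback type: wrap the call you want to time in one of these. */
typedef void (*bench_fn)(void *ctx);

/* Wall-clock time in seconds (C11 timespec_get). */
double now_seconds(void)
{
	struct timespec ts;
	timespec_get(&ts, TIME_UTC);
	return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Calls fn(ctx) `repeats` times and returns the average seconds per call. */
double time_batch(bench_fn fn, void *ctx, int repeats)
{
	double start = now_seconds();
	for (int r = 0; r < repeats; ++r)
		fn(ctx);
	return (now_seconds() - start) / repeats;
}

/* How many times faster the candidate is than the baseline (0 if unknown). */
double speedup(double baseline_seconds, double candidate_seconds)
{
	return candidate_seconds > 0.0 ? baseline_seconds / candidate_seconds : 0.0;
}

/* Placeholder workload: counts how often it was called. */
void count_calls(void *ctx)
{
	++*(int *)ctx;
}
```

Time the scalar build and the OpenCL build with the same network and inputs, then pass the two averages to speedup(). Use enough runs that the fixed cost of setting up the GPU does not dominate the measurement.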

Here is some code to test fann_run_many() to get you started in your comparisons:

#include <assert.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "fann.h"	// adjust to whichever FANN header your build uses

// Scalar reference version: runs each ANN on num_runs packed input vectors
void fann_run_many(struct fann **anns, fann_type * input,
			   fann_type **output, int num_anns, int num_runs)
{
	unsigned int ann_num, i;

	printf("Running Scalar!\n");
	for(ann_num = 0; ann_num < num_anns; ++ann_num) {
		unsigned int num_outputs, num_inputs;
		struct fann *ann = anns[ann_num];

		num_inputs = ann->num_input;
		num_outputs = ann->num_output;

		// fann_run() returns a pointer to the ANN's internal output buffer,
		// so copy the results out before the next run overwrites them
		for(i=0; i<num_runs; ++i)
			memcpy(&(output[ann_num][num_outputs*i]),
				   fann_run(ann, &input[num_inputs*i]),
				   sizeof(fann_type)*num_outputs);
	}
}

int main(int argc, char *argv[])
{
	fann_type *input, *output;
	int i, j;
//	int num_runs = 1000000;
	int num_runs = 6;

	// Note: fopen() does not expand "~"; use a full path if this fails
	struct fann *ann = fann_create_from_file("~/fann/tests/xor_float.net");
	assert(ann != NULL);

	// Use this to make sure we're linking to the right header
	int chk = ann->first_layer->num_inputs;

	//Get net params
	int num_in = fann_get_num_input(ann);
	int num_out = fann_get_num_output(ann);

	printf("Inputs:%5d Outputs:%5d Total:%5d\n", ann->num_input, ann->num_output, ann->num_neurons);

	input = (fann_type*)calloc(num_runs*num_in, sizeof(fann_type));
	output = (fann_type*)calloc(num_runs*num_out, sizeof(fann_type));

	//Make a gamut of input values
	for(i=0; i<num_runs*num_in; ++i){
		float dig_frac = ((float)(i % num_in)+1.0)/((float)num_in);
		float tot_frac = ((float)i)/((float)num_runs*num_in);
		input[i] = fmodf(tot_frac, dig_frac)*2.0/dig_frac-1.0;
	}

	fann_run_many(&ann, input, &output, 1, num_runs);

	// Use for output comparisons: one line per run, inputs then outputs
	for(i=0; i<num_runs; ++i){
		for(j=0; j<num_in; ++j){
			printf("%9f ", input[i*num_in+j]);
		}
		for(j=0; j<num_out; ++j){
			printf(" %9f", output[i*num_out+j]);
		}
		printf("\n");
	}

	free(input);
	free(output);
	fann_destroy(ann);
	return 0;
}
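
For comparing the scalar output against the OpenCL output without reading the printed columns by eye, something like the following may help. It is only a sketch: max_abs_diff and outputs_match are hypothetical helpers, not FANN functions, and it assumes fann_type is float and that both output arrays have the same length (num_runs*num_out).

```c
#include <stddef.h>

/* Largest absolute difference between two equally sized output arrays. */
float max_abs_diff(const float *a, const float *b, size_t n)
{
	float worst = 0.0f;
	for (size_t i = 0; i < n; ++i) {
		float d = a[i] - b[i];
		if (d < 0.0f)
			d = -d;
		if (d > worst)
			worst = d;
	}
	return worst;
}

/* Returns 1 if every pair of values differs by at most tol, else 0. */
int outputs_match(const float *a, const float *b, size_t n, float tol)
{
	return max_abs_diff(a, b, n) <= tol;
}
```

Expect small differences between CPU and GPU floating-point results, so compare with a tolerance instead of testing for exact equality.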


Here’s a link to download my code:


Sounds so great! :thanks:
I haven’t tested your code yet, but this is exactly what I was hoping for!

Dear mailseth,

I’m trying to use your code, but I’ve run into some problems.

Can you help me?

P.S. This first message is just to check whether you are still active on this forum :)