Iterative computations: creating efficient iterative computations using CUDA

Hello,

I’m sorry if the same topic has been created before; in that case I’d be glad to receive a link. I would also be grateful if somebody could point me towards a solution to the following problem:

I have to develop an iterative algorithm based on CUDA. The problem looks like this:

  1. CUDA kernel call…
  2. Receive the data from the device (cudaMemcpy)
  3. Write the data to a file
  4. Return to step 1

Step 2 is very expensive; it takes much longer than step 1. So I need suggestions on how to optimize this scheme.

Thanks for your time

Obviously the device to host copy is the first thing to optimize then. How fast is that transfer - do you get anywhere near the results from the bandwidth test in the SDK?
How much data is transferred in each iteration? If it is not much, you might want to save it in a buffer and then transfer results from several iterations at once.

If optimizing the transfer somehow gets the transfer time closer to the execution time of the kernel, overlapping them becomes worthwhile too.
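The buffering idea could be sketched roughly like this (all names here, such as step_kernel, dev_buf, N, BATCH and write_to_file, are assumptions for illustration, not from the original code): each iteration writes its output to a different slice of a device buffer, and a single cudaMemcpy then moves several iterations’ worth of results at once, amortizing the per-transfer overhead.

```cuda
// Sketch: accumulate BATCH iterations of output on the device, then copy once.
// Assumes ITERATIONS is a multiple of BATCH; all names are hypothetical.
float *dev_buf;                        // room for BATCH iterations of N floats
cudaMalloc((void **)&dev_buf, BATCH * N * sizeof(float));

for (int i = 0; i < ITERATIONS; i += BATCH) {
    for (int b = 0; b < BATCH; b++)    // each iteration writes its own slice
        step_kernel<<<grid, block>>>(dev_state, dev_buf + b * N, N);

    // one big transfer instead of BATCH small ones
    cudaMemcpy(host_buf, dev_buf, BATCH * N * sizeof(float),
               cudaMemcpyDeviceToHost);
    write_to_file(host_buf, BATCH * N);  // step 3 of the original scheme
}
```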

I don’t have much advice other than normal methods for optimizing data transfers (use pinned memory and perhaps zero copy).
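A minimal sketch of the pinned-memory suggestion (step_kernel, dev_state, dev_out, N, ITERATIONS and write_to_file are hypothetical names, not from the original post): the only change from the naive version is allocating the host buffer with cudaMallocHost instead of malloc.

```cuda
// Sketch: page-locked ("pinned") host memory typically speeds up PCIe
// transfers noticeably and is a prerequisite for cudaMemcpyAsync.
float *host_buf, *dev_out;
cudaMallocHost((void **)&host_buf, N * sizeof(float));  // pinned, not malloc
cudaMalloc((void **)&dev_out, N * sizeof(float));

for (int i = 0; i < ITERATIONS; i++) {
    step_kernel<<<grid, block>>>(dev_state, dev_out, N);
    // The copy implicitly waits for the kernel, then runs at full PCIe speed.
    cudaMemcpy(host_buf, dev_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    write_to_file(host_buf, N);   // true overlap of copy and compute would
}                                 // additionally need streams and two buffers

cudaFreeHost(host_buf);           // pinned memory is freed with cudaFreeHost
```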

FYI, DirectCompute includes a “DispatchIndirect” function where the number of threads to launch can be specified indirectly from the contents of a buffer in GPU memory, which avoids reading back data to the CPU. This kind of functionality will probably come to CUDA at some point.

I’ve implemented a SelectNth/Median algorithm which involves an arbitrary number of kernel launches. See here. My solution was relatively simple and effective:
Implement a cutoff, such that after the data gets reduced to a small size (where kernel launch time dominates), I just switch to sorting the remaining data instead of continuing to iterate.

Other solutions:

  1. Wait for an on-die CPU, which will greatly reduce latency.
  2. Speculative execution: launch the kernel multiple times speculatively, but check at the beginning of each launch whether the termination condition has already been met, and do nothing if so.

Thanks to all… I’m going to try your suggestions one by one. I was also thinking about copying the data to a buffer and then copying it to the host; that seems to be a good idea.

I want to make a short report.

I followed the suggestions which tera gave me. I use a GeForce 9800 GT; the bandwidth test indicates that the read/write speed

HostToDevice (DeviceToHost) is approximately 1.5 GB per second. I suppose that must be enough to transfer 12 MB of data, mustn’t it?..

I also ran a simple test (without exact numbers yet), and it shows that as the data size approaches 12 MB, my application slows down dramatically.

I concluded that I need to cut the data into a series of small parts (which seems to be Uncle Joe’s solution) and then process them separately.

Sorry, I misunderstood your question, Dmitrii. My comments don’t really apply to your situation.

If the copy step takes much longer than the calculation step, then your GPU utilization will be low, even if you overlap transfers and computation perfectly.

The question I would ask is whether all that data really needs to be written to the file. What is your application, and can you reduce the amount of data being written?

If not, then I don’t think you want to do that part of the application on the GPU.

If transfer time dominates your iteration and you need to transfer 12 MB at 1.5 GB/s, you should still be able to run about 100 iterations per second.
If you actually get to that order of magnitude, writing the data out to disk will be the bottleneck, not the PCIe transfer.

Are you aware that kernel launches are asynchronous? I.e. if you launch a kernel and then do a device-to-host memcpy (your steps 1 and 2), the kernel launch will return immediately, and the memcpy will then wait for the kernel to finish before doing the actual transfer.

Here is some code which illustrates the idea. I don’t think it’s anything new…

The program calculates the sum of two matrices and stores the result in a third. Very simple…

#include "stdafx.h"
#include <cuda.h>
#include <stdio.h>
#include <time.h>

// Data declaration
float *host_a;
float *host_b;
float *host_c;
float *dev_a;
float *dev_b;
float *dev_c;
size_t SIZE;

// Data properties
const int COLS = 128;
const int ROWS = 128;

// GPU properties
const int BLOCK_SIZE = 8;

// Test options
bool MAKE_DATA_COPY = false;
bool MAKE_CPU_TEST = false;

// The GPU computational procedure. Each thread computes the index of its item
// in the flattened matrix and adds the corresponding elements.
__global__ void MatSumGPU(float *A, float *B, float *result, int cols, int rows)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	if ((i < cols) && (j < rows))
	{
		int index = i + j * cols;
		result[index] = A[index] + B[index];
	}
}

// The CPU computational procedure
void MatSumCPU(float *A, float *B, float *result, int cols, int rows)
{
	for (int i = 0; i < cols * rows; i++)
	{
		result[i] = A[i] + B[i];
	}
}

// Memory initialization procedure
void InitMem()
{
	SIZE = COLS * ROWS * sizeof(float);
	host_a = (float *)malloc(SIZE);
	host_b = (float *)malloc(SIZE);
	host_c = (float *)malloc(SIZE);
	for (int i = 0; i < COLS * ROWS; i++)
	{
		host_a[i] = 1;
		host_b[i] = 2;
		host_c[i] = 0;
	}
	cudaMalloc((void **)&dev_a, SIZE);
	cudaMalloc((void **)&dev_b, SIZE);
	cudaMalloc((void **)&dev_c, SIZE);
}

// Memory releasing procedure
void FreeMem()
{
	free(host_a);
	free(host_b);
	free(host_c);
	cudaFree(dev_a);
	cudaFree(dev_b);
	cudaFree(dev_c);
}

int _tmain(int argc, _TCHAR* argv[])
{
	// 1. Initialize memory
	InitMem();

	// 2. Copy the data from the host to the device
	cudaMemcpy(dev_a, host_a, SIZE, cudaMemcpyHostToDevice);
	cudaMemcpy(dev_b, host_b, SIZE, cudaMemcpyHostToDevice);

	// 3. Run the GPU computation
	dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
	int Cols = (COLS + BLOCK_SIZE - 1) / BLOCK_SIZE;
	int Rows = (ROWS + BLOCK_SIZE - 1) / BLOCK_SIZE;
	printf("%d\n", Cols);
	printf("%d\n", Rows);
	dim3 dimGrid(Cols, Rows);

	printf("GPU computation started\n");
	int start = time(NULL);
	for (int i = 0; i < 10000; i++)
	{
		MatSumGPU<<<dimGrid, dimBlock>>>(dev_a, dev_b, dev_c, COLS, ROWS);

		// Copy the data from the device to the host if this option is enabled
		if (MAKE_DATA_COPY == true)
		{
			cudaMemcpy(host_c, dev_c, SIZE, cudaMemcpyDeviceToHost);
		}
	}
	int end = time(NULL);
	printf("Finished at: %i\n", end - start);
	printf("\n");

	// 4. Run the CPU computation if this option is enabled
	if (MAKE_CPU_TEST == true)
	{
		printf("CPU computation started\n");
		start = time(NULL);
		for (int i = 0; i < 10000; i++)
		{
			MatSumCPU(host_a, host_b, host_c, COLS, ROWS);
		}
		end = time(NULL);
		printf("Finished at: %i\n", end - start);
	}

	FreeMem();
	return 0;
}

My computer has the following specs:

Pentium Dual-Core E5500 2.8 GHz
2.00 GB RAM
GeForce 9800 GT

The test results are as follows:

I.
MAKE_DATA_COPY flag is TRUE
MAKE_CPU_TEST flag is TRUE

The GPU computation time is approximately 110 ms
The CPU computation time is approximately 77 ms!!!

II.
MAKE_DATA_COPY flag is FALSE
MAKE_CPU_TEST flag is TRUE

The GPU computation time is approximately 48 ms
The CPU computation time is approximately 77 ms

Maybe I have a logical error?
It should work faster!
Maybe I’m looking for the solution in the wrong place?
The solution should be similar to a solver for a system of linear equations…

As I wrote earlier, you need a cudaThreadSynchronize(); after the kernel launch in order to measure the time to execute the kernel. Without this, you just measure the time to launch the kernel, and the cudaMemcpy will appear to take up all the time while in fact it is just waiting for the kernel to finish.

To make your kernel run faster, use a blockSize.x that is a multiple of 16. This will allow the 16 memory accesses from a half-warp to be coalesced into a single memory transaction. You will however never be able to reach the speed of the CPU version as this simple kernel is entirely bound by the time it takes to transfer the data through PCIe to and from the GPU. You need a task with more computational intensity (more computation per byte transferred) in order to beat the CPU.
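The suggested change to the launch configuration could look like this (a sketch based on the names from the code posted above; the 16x8 block shape is just one choice that keeps the thread count the same as the original 8x8x2 blocks would not):

```cuda
// Sketch: with blockDim.x a multiple of 16, each half-warp reads/writes a
// contiguous, aligned run of floats, which coalesces into a single memory
// transaction on compute capability 1.x hardware like the 9800 GT.
const int BLOCK_SIZE_X = 16;  // multiple of 16 along the fastest-varying axis
const int BLOCK_SIZE_Y = 8;
dim3 dimBlock(BLOCK_SIZE_X, BLOCK_SIZE_Y);
dim3 dimGrid((COLS + BLOCK_SIZE_X - 1) / BLOCK_SIZE_X,
             (ROWS + BLOCK_SIZE_Y - 1) / BLOCK_SIZE_Y);
MatSumGPU<<<dimGrid, dimBlock>>>(dev_a, dev_b, dev_c, COLS, ROWS);
```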

Times for both CPU and GPU execution seem to be rather large (it shouldn’t take the computer long to add just 64 KB of data), which probably is due to the fact that the time() function has quite low precision (somewhere between 10 and 55 milliseconds I think, depending on your hardware). Use a more precise timer to get accurate timings. gettimeofday() would be the way to go under Linux and other Unix-like systems. QueryPerformanceCounter() is the function of choice under Windows, but I am not familiar with it.
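For the GPU side specifically, CUDA events give sub-millisecond resolution and sidestep the asynchronous-launch pitfall; a sketch using the variable names from the code above:

```cuda
// Sketch: timing the kernel loop with CUDA events. cudaEventSynchronize
// also serves as the missing wait for the kernels to actually finish.
cudaEvent_t start, stop;
float elapsed_ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (int i = 0; i < 10000; i++)
    MatSumGPU<<<dimGrid, dimBlock>>>(dev_a, dev_b, dev_c, COLS, ROWS);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);             // wait until all kernels have finished
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("GPU time: %.3f ms\n", elapsed_ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```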

Finally, next time you post code in the forum please enclose it between [code] [/code] tags.

I can see only 3 solutions:

  1. Merge all your kernels into 1 big kernel with manual block synchronisation.
    The biggest drawback of this method is that the register count will increase significantly. For example, if you are merging kernels which use 30 regs each, the merged kernel may use anywhere from 30 to 60 regs.
  2. As mentioned, do speculative kernel launches on the CPU, and spin checking some variable status which you update every time in the kernel. The biggest drawback here is that this method actually introduces a CPU-GPU synchronization, and if you want to submit further work to the GPU after that, there will be a gap when the GPU doesn’t do anything (this can be up to a couple of milliseconds) because the CPU hasn’t submitted anything yet.
  3. If you know the maximum iteration count, you can simply submit the maximum number of iterations, and inside the kernel, before doing any work, check whether this iteration still needs to run or can be skipped because everything was already done during previous iterations.

As for the first method, I could provide you with some code which manages the virtual block configuration/scheduling inside the “merged” kernel, including the inter-block synchronization, since I’ve been playing with this recently.
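The third method could be sketched like this (the kernel, the state update, and the convergence test are all hypothetical placeholders): the termination flag lives in device memory, each launch checks it first and returns immediately once the work is done, so the superfluous launches cost only their launch overhead.

```cuda
// Sketch of method 3: submit the maximum number of iterations up front;
// each launch first checks a device-side flag and bails out if a previous
// iteration already reached the termination condition.
__device__ int done_flag;   // 0 = keep iterating, 1 = finished

__global__ void step_kernel(float *state, int n)
{
    if (done_flag)                      // an earlier launch finished the job
        return;                         // this launch costs only its overhead

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        state[idx] = 0.5f * state[idx];        // hypothetical update step
        if (idx == 0 && state[0] < 1e-6f)      // hypothetical convergence test
            done_flag = 1;   // not guaranteed visible to other blocks in this
    }                        // launch, but reliably seen by later launches
}

// Host side: no readback inside the loop at all.
// int zero = 0;
// cudaMemcpyToSymbol(done_flag, &zero, sizeof(int));
// for (int i = 0; i < MAX_ITERATIONS; i++)
//     step_kernel<<<grid, block>>>(dev_state, N);
```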

Thank you very much for your suggestions. I have been developing my program, and it seems to execute well.

Just one thing I noticed: the code shown before is not stable. I mean that the host gets wrong values (host_c) after performing an iteration.

for (int i = 0; i < 10000; i++)
{
	MatSumGPU<<<dimGrid, dimBlock>>>(dev_a, dev_b, dev_c, COLS, ROWS);
	cudaThreadSynchronize();
	cudaMemcpy(host_c, dev_c, SIZE, cudaMemcpyDeviceToHost);
	// ... Saving to the file
}

What could this error be connected with?

I wrote a simple test which illustrates the error situation I described before.

#include "stdafx.h"
#include <stdio.h>

float3 *host_a;
const int N = 10;

void InitMem()
{
	host_a = (float3 *)malloc(N * sizeof(float3));
	for (int i = 0; i < N; i++)
	{
		host_a[i] = make_float3(0, 0, 0);
	}
}

void FreeMem()
{
	free(host_a);
}

int _tmain(int argc, _TCHAR* argv[])
{
	InitMem();
	for (int i = 0; i < N; i++)
	{
		printf("%i %i %i\n", host_a[i].x, host_a[i].y, host_a[i].z);
	}
	FreeMem();
	return 0;
}

If the InitMem() function looks like this:

void InitMem()
{
	host_a = (float3 *)malloc(N * sizeof(float3));
	for (int i = 0; i < N; i++)
	{
		host_a[i] = make_float3(0, 0, 0);
	}
}

the program seems to work correctly, and after launching it we see the following output:

0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0

But if I make a simple change:

void InitMem()
{
	host_a = (float3 *)malloc(N * sizeof(float3));
	for (int i = 0; i < N; i++)
	{
		host_a[i] = make_float3(1, 0, 0);
	}
}

then I get an absolutely wrong result:

0 1072693248 0
0 1072693248 0
0 1072693248 0
0 1072693248 0
0 1072693248 0
0 1072693248 0
0 1072693248 0
0 1072693248 0
0 1072693248 0
0 1072693248 0

That’s because you print a float variable as an integer.

Tera, thanks. That was a stupid question. I have to be more attentive.