Asynchronous memory transfers with several streams

I want to calculate the cosine of a large batch of input values as fast as possible, using asynchronous memory transfers so that the GPU can be computing and transferring at the same time. This is just a starting point; later on I would like to perform operations on several input arrays i1, i2, …, iN to produce several output arrays o1, o2, …, oM, where for a given array index K, o1[K], o2[K], … depend only on i1[K], i2[K], … .
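To make that concrete, here is a rough, untested sketch of the kind of multi-array elementwise kernel I have in mind (the names i1, i2, o1, o2 and the particular expressions are just placeholders):

__global__ void elementwise(const float *i1, const float *i2, float *o1, float *o2, int n){
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n){
        // each output at index k depends only on the inputs at index k
        o1[k] = cosf(i1[k]) * i2[k];
        o2[k] = i1[k] + i2[k];
    }
}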

At this point, I have the following code, which computes the cosine of 31*2^20 floating-point values and compares the results against the CPU:

#include <stdio.h>
#include <cutil.h>
#include <math.h>

__global__ void mycos(float *input, float *output, int size){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    output[i] = cosf(input[i]);
}

float RandFloat(float low, float high){
    float t = (float)rand() / (float)RAND_MAX;
    return (1.0f - t) * low + t * high;
}

int main(int argc, char *argv[]){
    cudaSetDevice(1);
    dim3 threads, blocks;
    int dataSize = 1024*1024*31;
    int myStreams = 8, errors = 0;
    float *CPU_IN, *GPU_IN, *CPU_OUT, *GPU_OUT, time_memcpy;

    CUDA_SAFE_CALL( cudaMallocHost((void**)&CPU_IN, sizeof(float)*dataSize) );
    CUDA_SAFE_CALL( cudaMallocHost((void**)&CPU_OUT, sizeof(float)*dataSize) );
    for(int i = 0; i < dataSize; i++){
        CPU_IN[i] = RandFloat(-1.0, 1.0);
    }
    CUDA_SAFE_CALL( cudaMalloc((void**)&GPU_IN, dataSize*sizeof(float)) );
    CUDA_SAFE_CALL( cudaMalloc((void**)&GPU_OUT, dataSize*sizeof(float)) );

    cudaEvent_t start_event, stop_event;
    CUDA_SAFE_CALL( cudaEventCreate(&start_event) );
    CUDA_SAFE_CALL( cudaEventCreate(&stop_event) );

    cudaEventRecord(start_event, 0);
    cudaStream_t *streams2 = (cudaStream_t*) malloc(myStreams * sizeof(cudaStream_t));
    for(int i = 0; i < myStreams; i++){
        CUDA_SAFE_CALL( cudaStreamCreate(&(streams2[i])) );
    }
    for(int i = 0; i < myStreams; i++){
        cudaMemcpyAsync(GPU_IN + (i * dataSize / myStreams), CPU_IN + (i * dataSize / myStreams), sizeof(float)*dataSize/myStreams, cudaMemcpyHostToDevice, streams2[i]);
    }
    for(int i = 0; i < myStreams; i++){
        mycos<<<dataSize/512, 512, 0, streams2[i]>>>(GPU_IN, GPU_OUT, dataSize/myStreams);
    }
    for(int i = 0; i < myStreams; i++){
        cudaMemcpyAsync(CPU_OUT + i * dataSize / myStreams, GPU_OUT + i * dataSize / myStreams, sizeof(float)*dataSize/myStreams, cudaMemcpyDeviceToHost, streams2[i]);
    }
    cudaEventRecord(stop_event, 0);
    cudaEventSynchronize(stop_event);
    CUDA_SAFE_CALL( cudaEventElapsedTime(&time_memcpy, start_event, stop_event) );
    cudaThreadSynchronize();
    printf("Total Time:\t%.2f\n", time_memcpy);

    for(int i = 0; i < dataSize; i++){
        if (fabs(cos(CPU_IN[i]) - CPU_OUT[i]) > 0.001){
            errors++;
        }
    }
    printf("Errors = %d out of a possible %d.\n", errors, dataSize);
    CUT_EXIT(argc, argv);
    return 0;
}

It is loosely based on the simpleStreams SDK example. I call cudaSetDevice(1) so that the program runs on an 8500GT instead of the 9600GT, since the latter drives the primary display; that way, if the code crashes, the whole system won’t lock up (as it has a few times on me). Both cards have 512MB of memory.
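In case it helps, here is a small untested sketch of enumerating the devices, so the index passed to cudaSetDevice isn’t a magic number (device ordering is system-specific):

int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);
for (int d = 0; d < deviceCount; d++){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, d);
    // pick the non-display card by name/memory instead of hard-coding the index
    printf("device %d: %s, %lu bytes of global memory\n", d, prop.name, (unsigned long)prop.totalGlobalMem);
}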

If I set the input data size to 32*2^20, I start getting errors all over the place, even though all of that data fits onto either of the cards. How can I fix that?

In section 4.5.2.4 of the programming guide, there is a code example showing how to manage streams, but it doesn’t show how to write the kernel so that streams, blocks, and threads all map onto the indexes of the input and output arrays. How can I fix the kernel and its call in the main program so that I don’t get errors (caused by index mismatches or by data not being processed at all)?

Hi, I haven’t coded with streams, but your code looks nice ;) For debugging, you might try:
use small data first, 32 elements (not 32M) and 2 streams, and use emuDebug to step through the whole code.
then small data + Debug: copy the results to the host and check that they’re as expected.
then large data + Debug. 32M floats = 128MB shouldn’t overflow the GPU memory on its own, but you could binary-search for the size at which things break down.

That’s a good draft, but a few observations:

  • the kernel kept getting launched over the same area of memory (it kept getting fed the same input/output addresses); this was the main error

  • you should check for runtime errors at various points (cudaGetLastError), which will give you hints as to whether you are running out of memory, had a launch failure, etc.

  • you should have each thread do more work: instead of each thread simply computing one cosine, have each thread run a for-loop and calculate a bunch of cosines. (I didn’t do this in the version below)

  • in case the data size doesn’t divide evenly into blocks, you should round up and have each thread check whether it’s reading/writing out of bounds; that divup() function below is handy

Here’s an updated version:

#include <stdio.h>
#include <cutil.h>
#include <math.h>

__global__ void mycos(float *input, float *output, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) /* in bounds? */
        output[i] = cosf(input[i]);
}

// ...

// round up: one extra block when n doesn't divide evenly
int divup(int a, int b){ return (a / b) + (a % b != 0); }

int main(int argc, char *argv[]){
    // ....
    cudaStream_t *streams2 = (cudaStream_t*) malloc(myStreams * sizeof(cudaStream_t));
    for(int i = 0; i < myStreams; i++){
        CUDA_SAFE_CALL( cudaStreamCreate(&(streams2[i])) );
    }
    printf("last error %d\n", cudaGetLastError());

    cudaThreadSynchronize();
    cudaEventRecord(start_event, 0);
    for(int i = 0; i < myStreams; i++){
        // each stream gets its own slice of the input/output arrays
        float *gpu_in  = GPU_IN  + (i * dataSize / myStreams);
        float *gpu_out = GPU_OUT + (i * dataSize / myStreams);
        float *cpu_in  = CPU_IN  + (i * dataSize / myStreams);
        float *cpu_out = CPU_OUT + (i * dataSize / myStreams);
        int n = dataSize / myStreams;

        cudaMemcpyAsync(gpu_in, cpu_in, n*sizeof(float), cudaMemcpyHostToDevice, streams2[i]);
        mycos<<<divup(n,256), 256, 0, streams2[i]>>>(gpu_in, gpu_out, n);
        cudaMemcpyAsync(cpu_out, gpu_out, n*sizeof(float), cudaMemcpyDeviceToHost, streams2[i]);
    }
    cudaEventRecord(stop_event, 0);
    cudaEventSynchronize(stop_event);
    CUDA_SAFE_CALL( cudaEventElapsedTime(&time_memcpy, start_event, stop_event) );
    cudaThreadSynchronize();
    printf("Total Time:\t%f   (%d)\n", time_memcpy, cudaGetLastError());
    // ....
}

For the record, I was wrong about this. In a loop, I had each thread run over its own span, and this had terrible performance: anywhere from 20-70x slower using this span idea. I tried playing with various lengths of span. I’m on a MacBook Pro with a 256MB GeForce 8600M GT.

#define SPAN 32

__global__ void mycos_span(float *input, float *output, int n){
    // each thread handles a contiguous span of SPAN elements
    const int base = (blockIdx.x * blockDim.x + threadIdx.x) * SPAN;
    for (int i = 0; i < SPAN && (base + i < n); i++)
        output[base+i] = cosf(input[base+i]);
}

// ...

mycos_span<<<divup(n,512*SPAN), 512, 0, streams2[i]>>>(gpu_in, gpu_out, n);

Also, you’re unlikely to want more than a couple of streams. For example, if the device can only overlap two tasks (memory transfer and kernel execution), then two or three streams should give it enough opportunity to overlap. On my laptop, I get baseline performance with one stream, the best performance with two or three, and slightly worse than the 1-stream baseline with four or more.

I haven’t done much with streams, and I’m on a laptop, so someone correct me if I’m wrong here. With the seemingly infinite ways to tweak CUDA for better or worse, I may have constructed an unfair or poor comparison.

It’s 20-70x slower because you are breaking CUDA optimization rule #0: you are no longer coalescing your memory accesses.

Still, even if you did coalesce your accesses, putting more work into each thread would likely slow things down some; it always has in my experience. There is no overhead for thread switching on the GPU, and it gets its performance from interleaving tens of thousands of threads. The general rule of thumb is to process one array element per thread, especially when you need to maximize memory bandwidth (which is likely what limits the cosine kernel).

Agreed, it is poor practice, but that part (reading from one global array and writing to another) was kept the same across all runs. So it was always uncoalesced, and yet it still ran 20-70x slower. So I guess that slowdown is mostly explained by its being faster to schedule more threads computing 1-to-1 than to have existing threads compute 1-to-n. Sound reasonable?

So I went back and did a coalesced version (below), but it was slightly slower than the original global-to-global version, and slightly slower again if I keep that seemingly unnecessary __syncthreads() in there, which I thought would sequence the reads and writes to produce coalescing. I’m guessing that I’m simply coalescing wrong, true?

__global__ void mycos_coallesced(float *input, float *output, int n){
    __shared__ float tmp[512];
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < n)
        tmp[threadIdx.x] = cosf(input[x]);
/*  __syncthreads(); */
    if (x < n)
        output[x] = tmp[threadIdx.x];
}

Josh, what’s the optimal way to write a simple “expression” kernel like this?

Oh, I thought this was your original code:

__global__ void mycos(float *input, float *output, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) /* in bounds? */
        output[i] = cosf(input[i]);
}

Which is coalesced (assuming blockDim.x is a multiple of 32). You don’t need __syncthreads() for coalescing, since coalescing only happens at the warp level.

Thanks for the help!

I now have two kernels:

__global__ void mycos(float *input, float *output, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        output[i] = cosf(input[i]);
}

__global__ void mycos_coallesced(float *input, float *output, int n){
    __shared__ float tmp[512];
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < n)
        tmp[threadIdx.x] = cosf(input[x]);
    if (x < n)
        output[x] = tmp[threadIdx.x];
}

mycos, oddly enough, takes 216 ms, while the mycos_coallesced version takes 240 ms; the data size is 31*2^20, running on the 8500GT (512MB) at PCIe 1.0 x8.

I still get errors when I go to 32*2^20; at that size the kernel launch uses 65536 blocks of 512 threads each. Do I have too many blocks?

Sorry for drifting a little OT in my previous posts:

Your mycos() kernel is going to be the fastest possible version: it has coalesced reads and writes. You don’t need shared memory in this kernel because you don’t share any values among multiple threads in the block. I was referring to the previous kernel, with the SPAN loop per thread, when I pointed out that it was uncoalesced.

And yes, you are running too many blocks: 65535 is the maximum in the x direction. You can use a 2D block index to get more threads.
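A minimal sketch of what I mean (untested; splitting the block count across grid.x and grid.y this way is just one option):

__global__ void mycos2D(float *input, float *output, int n){
    // flatten the 2D grid back into a linear element index
    int block = blockIdx.y * gridDim.x + blockIdx.x;
    int i = block * blockDim.x + threadIdx.x;
    if (i < n)
        output[i] = cosf(input[i]);
}

// host side: keep grid.x under the 65535 limit and spill the rest into grid.y;
// the bounds check in the kernel absorbs any excess blocks
int totalBlocks = divup(n, 512);
int gridX = totalBlocks < 65535 ? totalBlocks : 65535;
dim3 grid(gridX, divup(totalBlocks, 65535));
mycos2D<<<grid, 512, 0, streams2[i]>>>(gpu_in, gpu_out, n);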

Okay, I’ll use 2D blocks if I need more than 65535 blocks. But now I want to use double-precision values on a GTX 260. I have the following double-precision kernel:

__global__ void mycosD(double *input, double *output, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    //output[i] = input[i];
    output[i] = cos(input[i]);
}

I get really strange numbers if I use this, but if I just copy the input to the output instead (the commented-out line), the values are just fine. I’m using more or less the same architecture, with streams and so on, as before.

Are you compiling with -arch sm_13?
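For example, something along these lines (file names are placeholders):

nvcc -arch sm_13 -o mycos mycos.cu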

Forgot about that, thanks!

Everything seems to be working okay now.