Stream Concurrency (or lack thereof) on GTX 480

I am trying to write a program that takes smaller chunks of data and runs a histogram over them to find the frequency of symbols. This will eventually be used as a portion of a GPU compression algorithm.

I’m trying to look at it from the standpoint of making a single block of data use a single CUDA block, and rely on concurrent kernels to maximize the occupancy of the card. This allows me to compress shorter files without wasting the rest of the GPU, and to keep the GPU full while it’s doing the scatter/gather portions of the algorithm that work best in a single block.

Here’s the problem. When I attempt to split the execution into multiple streams and execute the same kernel on different data in different streams (the kernels only use one block), the kernels serialize behind each other. I would expect them to run at the same time on different multiprocessors, but it doesn’t seem to work that way. Even worse, it won’t overlap communication and computation: the downloads back to main memory wait until all kernels have completed, not just the kernel in their own stream. What am I missing?

[codebox]
#include <iostream>
#include <iomanip>
#include <cuda_runtime.h>
#include <sys/time.h>
#include <stdlib.h>

#define MAX_CUDA_SIZE 524288 // 512 threads per block * 1024 bytes per thread
#define THREADS_PER_BLOCK 512
#define BYTES_PER_THREAD 1024
#define NUMBER_OF_STREAMS 4 // should be more than the number of blocks needed to occupy the GPU
#define DATA_BUCKETS 256 // must be a power of 2

typedef unsigned char DATA;
typedef unsigned int uint;
typedef unsigned char uchar;

using namespace std;

// wall-clock time in seconds
double etime() {
	struct timeval t;
	gettimeofday(&t, NULL);
	return t.tv_sec + 1.e-6*t.tv_usec;
}

// single-block histogram: zero the table, then count symbol frequencies
__global__ void gatherFrequencies(DATA *input, uint inputLength, uint *freqTable)
{
	for (int i = threadIdx.x; i < DATA_BUCKETS; i += blockDim.x) freqTable[i] = 0;
	__syncthreads();
	for (int i = threadIdx.x; i < inputLength; i += blockDim.x) {
		atomicAdd(freqTable + input[i], 1);
		// busy-work to lengthen the kernel so overlap (or its absence) is visible
		int waste = 0;
		for (waste = 0; waste < 100000; waste++) waste += 1;
		input[i] = waste;
		//freqTable[input[i]]++;
	}
}

int main(int argc, char **argv)
{
	cudaSetDevice(0);

	cout << "Encoding random" << endl;
	cudaStream_t streams[NUMBER_OF_STREAMS];
	DATA *Hdata[NUMBER_OF_STREAMS], *Ddata[NUMBER_OF_STREAMS];
	uint *Hfreq[NUMBER_OF_STREAMS], *Dfreq[NUMBER_OF_STREAMS];
	int stream;
	for (stream = 0; stream < NUMBER_OF_STREAMS; stream++)
	{
		cudaStreamCreate(&streams[stream]);
		cudaMallocHost(&Hdata[stream], MAX_CUDA_SIZE * sizeof(DATA));
		cudaMalloc(&Ddata[stream], MAX_CUDA_SIZE * sizeof(DATA));
		cudaMallocHost(&Hfreq[stream], DATA_BUCKETS * sizeof(int));
		cudaMalloc(&Dfreq[stream], DATA_BUCKETS * sizeof(int));
		for (int i = 0; i < DATA_BUCKETS; i++) // was NUMBER_OF_STREAMS, which only seeded the first 4 buckets
			Hfreq[stream][i] = 43;
	}
	stream = 0;
	cudaEvent_t preupload[NUMBER_OF_STREAMS];
	cudaEvent_t upload[NUMBER_OF_STREAMS];
	cudaEvent_t pregather[NUMBER_OF_STREAMS];
	cudaEvent_t gather[NUMBER_OF_STREAMS];
	cudaEvent_t predownload[NUMBER_OF_STREAMS];
	cudaEvent_t download[NUMBER_OF_STREAMS];
	for (int i = 0; i < NUMBER_OF_STREAMS; i++) {
		cudaEventCreate(&preupload[i]);
		cudaEventCreate(&upload[i]);
		cudaEventCreate(&pregather[i]);
		cudaEventCreate(&gather[i]);
		cudaEventCreate(&predownload[i]);
		cudaEventCreate(&download[i]);
	}

	int bytesRead = MAX_CUDA_SIZE*sizeof(DATA);
	cout.setf(ios::fixed);
	cout << setprecision(3);
	cout << setw(8);
	cout << "Processing " << bytesRead << " bytes of data.\n";
	for (stream = 0; stream < NUMBER_OF_STREAMS; stream++)
		for (int i = 0; i < MAX_CUDA_SIZE; i++)
			Hdata[stream][i] = rand();
	int nStreams = stream;
	double s = etime();
	for (stream = 0; stream < nStreams; stream++)
	{
		double a = etime();
		cudaEventRecord(preupload[stream], streams[stream]);
		cudaMemcpyAsync(Ddata[stream], Hdata[stream], bytesRead, cudaMemcpyHostToDevice, streams[stream]); //checkError("Copy to device");
		cudaEventRecord(upload[stream], streams[stream]);
		double b = etime();
		cout << "Upload queued " << (b-s)*1000000 << " - " << (a-s)*1000000 << " = " << (b-a)*1000000 << endl;
	}

	for (stream = 0; stream < nStreams; stream++)
	{
		double a = etime();
		cudaEventRecord(pregather[stream], streams[stream]);
		gatherFrequencies<<<1, 512, 0, streams[stream]>>>(Ddata[stream], bytesRead/sizeof(DATA), Dfreq[stream]); //checkError("gatherFrequencies kernel");
		cudaEventRecord(gather[stream], streams[stream]);
		double b = etime();
		cout << "Histogram complete " << (b-s)*1000000 << " - " << (a-s)*1000000 << " = " << (b-a)*1000000 << endl;
	}

	cout << "retrieving counts\n";
	for (stream = 0; stream < nStreams; stream++)
	{
		double a = etime();
		cudaEventRecord(predownload[stream], streams[stream]);
		cudaMemcpyAsync(Hfreq[stream], Dfreq[stream], DATA_BUCKETS * sizeof(int), cudaMemcpyDeviceToHost, streams[stream]);
		cudaEventRecord(download[stream], streams[stream]);
		double b = etime();
		cout << "download complete " << (b-s)*1000000 << " - " << (a-s)*1000000 << " = " << (b-a)*1000000 << endl;
	}
	cudaThreadSynchronize();

	float fpreupload, fupload, fpregather, fgather, fpredownload, fdownload;

	for (stream = 0; stream < nStreams; stream++)
	{
		cudaEventElapsedTime(&fpreupload, preupload[0], preupload[stream]);
		cudaEventElapsedTime(&fupload, preupload[0], upload[stream]);
		cudaEventElapsedTime(&fpregather, preupload[0], pregather[stream]);
		cudaEventElapsedTime(&fgather, preupload[0], gather[stream]);
		cudaEventElapsedTime(&fpredownload, preupload[0], predownload[stream]);
		cudaEventElapsedTime(&fdownload, preupload[0], download[stream]);
		cout << stream << ": upload " << fupload << " - " << fpreupload << " = " << (fupload-fpreupload) << endl;
		cout << stream << ": gather " << fgather << " - " << fpregather << " = " << (fgather-fpregather) << endl;
		cout << stream << ": download " << fdownload << " - " << fpredownload << " = " << (fdownload-fpredownload) << endl;
	}

	for (stream = 0; stream < NUMBER_OF_STREAMS; stream++)
	{
		cudaFreeHost(Hdata[stream]);
		cudaFree(Ddata[stream]);
		cudaFreeHost(Hfreq[stream]);
		cudaFree(Dfreq[stream]);
		cudaStreamDestroy(streams[stream]);
	}
}
[/codebox]

And the output:

[codebox]
nvcc -o histogram -O3 -arch compute_20 -L/usr/local/cuda/lib64 -lcudart encode.cu
./histogram
Encoding random
Processing 524288 bytes of data.
Upload queued 41.962 - 7.153 = 34.809
Upload queued 300.169 - 282.049 = 18.120
Upload queued 382.185 - 360.966 = 21.219
Upload queued 462.055 - 440.121 = 21.935
Histogram complete 846.148 - 566.006 = 280.142
Histogram complete 931.978 - 907.183 = 24.796
Histogram complete 1017.094 - 991.106 = 25.988
Histogram complete 1099.110 - 1075.983 = 23.127
retrieving counts
download complete 1240.969 - 1219.034 = 21.935
download complete 1309.156 - 1300.097 = 9.060
download complete 1393.080 - 1384.020 = 9.060
download complete 1464.128 - 1456.022 = 8.106
0: upload 0.200 - 0.000 = 0.200
0: gather 3051.659 - 0.822 = 3050.837
0: download 12224.588 - 12224.565 = 0.022
1: upload 0.450 - 0.264 = 0.187
1: gather 6115.826 - 3051.661 = 3064.164
1: download 12224.611 - 12224.590 = 0.021
2: upload 0.635 - 0.452 = 0.182
2: gather 9170.156 - 6115.828 = 3054.329
2: download 12224.634 - 12224.613 = 0.021
3: upload 0.820 - 0.637 = 0.183
3: gather 12224.563 - 9170.158 = 3054.405
3: download 12224.656 - 12224.636 = 0.021
[/codebox]

The most telling part is the stream 0 download not starting until after the stream 3 gather has completed. (Those last 12 lines are end - start = duration.) Notice how the order of the timestamps follows exactly the order of enqueueing (0: upload, 1: upload, 2: upload, 3: upload, 0: gather, 1: gather, …). Any ideas?

There was a recent thread about a bug in CUDA 3.1 that makes cudaMemcpyAsync not asynchronous. Are you using CUDA 3.1? You might see if you get any differences for CUDA 3.0, although there you will run into the 4 concurrent kernel limit.

Hmmm, I updated to 3.1 because I couldn’t get it to work in 3.0 either. Then again, I don’t think I found out that CUDA_PROFILE=1 serializes everything until after I updated, so it may have been the culprit before. I’ll give it a shot when I get back to it tonight.

With that said, it still doesn’t explain why the kernel execution didn’t overlap, unless everything is affected by that bug.

So, downgrading to CUDA 3.0 did help, sort of. The pre-timers (which, as far as I know, technically measure the time after the last preceding entry in the stream has executed) all fired off right after the initial cudaMemcpyAsync completed. However, the kernel completion times still show an obvious serialization of the kernels, and the download back to main memory still waited for all the kernels to complete before starting.

Here’s the output:

[codebox]
nvcc -o histogram -O3 -arch compute_20 -L/usr/local/cuda/lib64 -lcudart encode.cu
./histogram
Encoding random
Processing 524288 bytes of data.
Upload queued 32.902 - 0.000 = 32.902
Upload queued 158.072 - 145.912 = 12.159
Upload queued 175.953 - 167.847 = 8.106
Upload queued 205.994 - 195.026 = 10.967
Histogram complete 257.969 - 213.861 = 44.107
Histogram complete 272.989 - 266.075 = 6.914
Histogram complete 288.010 - 279.903 = 8.106
Histogram complete 300.884 - 294.924 = 5.960
retrieving counts
download complete 320.911 - 309.944 = 10.967
download complete 331.879 - 329.018 = 2.861
download complete 342.846 - 339.031 = 3.815
download complete 352.859 - 349.998 = 2.861
0: upload 0.233 - 0.000 = 0.233
0: gather 3058.948 - 0.862 = 3058.086
0: download 12231.388 - 12231.345 = 0.043
1: upload 0.440 - 0.143 = 0.296
1: gather 6120.845 - 0.883 = 6119.962
1: download 12231.432 - 12231.347 = 0.085
2: upload 0.641 - 0.163 = 0.478
2: gather 9176.970 - 6115.828 = 9176.065
2: download 12231.472 - 12231.349 = 0.123
3: upload 0.841 - 0.205 = 0.636
3: gather 12231.343 - 0.925 = 12230.418
3: download 12231.506 - 12231.351 = 0.155
[/codebox]

Anyone have any more ideas?

Record the cumulative kernel execution time rather than the kernel execution time per stream. See if that changes anything.
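One way to try that (a sketch, untested, reusing the names from the program above): record a single event pair around the whole batch instead of one pair per stream, so the measurement itself cannot be queued between two kernels.

```cuda
// Time the whole batch of kernels with one event pair on the default
// stream. On pre-CUDA-7 runtimes the default (legacy) stream synchronizes
// with all other streams, so batchStop is recorded only after every
// stream's kernel has finished -- which is exactly the cumulative time.
cudaEvent_t batchStart, batchStop;
cudaEventCreate(&batchStart);
cudaEventCreate(&batchStop);

cudaEventRecord(batchStart, 0); // after all previously queued work
for (int s = 0; s < NUMBER_OF_STREAMS; s++)
	gatherFrequencies<<<1, THREADS_PER_BLOCK, 0, streams[s]>>>(
		Ddata[s], MAX_CUDA_SIZE, Dfreq[s]);
cudaEventRecord(batchStop, 0);  // after all streams complete
cudaEventSynchronize(batchStop);

float msTotal;
cudaEventElapsedTime(&msTotal, batchStart, batchStop);
// If the kernels really overlap, msTotal should be close to one kernel's
// runtime rather than NUMBER_OF_STREAMS times that.
```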

Bah! I commented out the cudaEventRecord calls around the kernel execution, and this is what I get:

[codebox]
nvcc -o histogram -O3 -arch compute_20 -L/usr/local/cuda/lib64 -lcudart encode.cu
./histogram
Encoding random
Processing 524288 bytes of data.
Upload queued 35.048 - 0.954 = 34.094
Upload queued 234.127 - 228.167 = 5.960
Upload queued 286.102 - 277.996 = 8.106
Upload queued 334.024 - 327.110 = 6.914
Histogram complete 450.134 - 406.027 = 44.107
Histogram complete 497.103 - 489.950 = 7.153
Histogram complete 555.992 - 549.078 = 6.914
Histogram complete 601.053 - 595.093 = 5.960
retrieving counts
download complete 709.057 - 689.030 = 20.027
download complete 748.158 - 741.959 = 6.199
download complete 813.007 - 807.047 = 5.960
download complete 857.115 - 852.108 = 5.007
0: upload 0.213 - 0.000 = 0.213
0: gather 0.000 - 0.000 = 0.000
0: download 3054.231 - 3054.201 = 0.031
1: upload 0.429 - 0.210 = 0.219
1: gather 0.000 - 0.000 = 0.000
1: download 3056.320 - 3056.291 = 0.030
2: upload 0.627 - 0.267 = 0.360
2: gather 0.000 - 0.000 = 0.000
2: download 3056.440 - 3056.410 = 0.030
3: upload 0.821 - 0.320 = 0.501
3: gather 0.000 - 0.000 = 0.000
3: download 3060.393 - 3060.363 = 0.029
[/codebox]

The zeroes for the gather calls are because I didn’t record those events. But look at the start times for the download calls: they all start together, 3 seconds into the execution (rather than the 12 I had been getting). So it seems cudaEventRecord synchronizes with stream 0 (all streams) rather than just the stream you specify. This seems to be a bug. The spec states:

“If stream is non-zero, the event is recorded after all preceding operations in the stream have been completed; otherwise, it is recorded after all preceding operations in the CUDA context have been completed.”

I have verified that the streams are numbered 1-4, yet the addition of the record statement causes the streams to synchronize across the whole context.

Now I have verified that I get the same result in CUDA 3.1. I am overlapping my kernel execution, and I also get a ~4x performance increase, in accordance with the bump from 4 to 16 concurrent-kernel slots. But not quite: up to the number of MPs minus one, the behavior is consistent (I guess minus one because some extra computation happens on the other MP from the partitioning; when I add the 15th stream, the time goes up by 50%). Beyond that, there starts to be some variability, probably depending on exactly how the threads get split across 15 MPs, 16 concurrent-kernel slots, and 45 occupancy slots.

Either way, at this point concurrent kernel execution doesn’t let you split the application into single blocks and have CKE fill out the occupancy of the hardware; there just aren’t enough concurrent slots yet. 32 slots would get us there if you make your block size 768: then you can have 1536 threads in flight per MP, and 30 blocks filling a GTX 480 to the brim (assuming you can keep the register count / local memory usage low enough). You can get 2/3 of the way there with 1024-thread blocks, but I have yet to figure out why the maximum block size is less than the maximum thread count per MP.

The take-home is this: the Heisenberg uncertainty principle applies to concurrent kernel execution. You can’t know both the precise timing and the concurrency of two streams; the simple act of measuring the timing (in any way) destroys the concurrency. I believe we need a way to measure the timing of concurrent streams, but that may be easier said than done.

In short, I’m going back to trying to use the entire GPU in a single kernel execution, and saving concurrent streams for (in my case) compressing two large files in parallel, rather than for each block of the file.
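For what it’s worth, the whole-GPU single-kernel version of the histogram might look something like this (a sketch, untested; it assumes the `DATA`, `DATA_BUCKETS`, and `THREADS_PER_BLOCK` definitions from the program above, and that `freqTable` has been zeroed with `cudaMemset` before launch):

```cuda
// Many-block histogram: each block accumulates a private copy of the
// counts in shared memory, then merges it into the global table with
// atomics. This trades the single-block/CKE approach for one launch
// that fills the card by itself.
__global__ void gatherFrequenciesAll(DATA *input, uint inputLength,
                                     uint *freqTable)
{
	__shared__ uint localFreq[DATA_BUCKETS];
	for (int i = threadIdx.x; i < DATA_BUCKETS; i += blockDim.x)
		localFreq[i] = 0;
	__syncthreads();

	// grid-stride loop over the whole input
	for (uint i = blockIdx.x * blockDim.x + threadIdx.x;
	     i < inputLength;
	     i += gridDim.x * blockDim.x)
		atomicAdd(&localFreq[input[i]], 1);
	__syncthreads();

	// merge this block's counts into the global table
	for (int i = threadIdx.x; i < DATA_BUCKETS; i += blockDim.x)
		atomicAdd(&freqTable[i], localFreq[i]);
}

// Launched with enough blocks to fill the card, e.g.:
//   cudaMemset(Dfreq[0], 0, DATA_BUCKETS * sizeof(uint));
//   gatherFrequenciesAll<<<30, THREADS_PER_BLOCK>>>(Ddata[0],
//       MAX_CUDA_SIZE, Dfreq[0]);
```

Shared-memory atomics are supported on Fermi, and with only 256 buckets the per-block table fits comfortably in shared memory.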

Now on to my next portion… 64k+ bucket histograms and optimal prefix sums - toward an arithmetic/range encoder.

Well, here’s something else you can try: do three loops instead of one. Record the event in every stream, then launch the kernel in every stream, then record the next event in every stream.
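The three-loop ordering described above would look something like this (a sketch, untested, using the arrays from the original program), so that no event record is ever queued between two kernel launches:

```cuda
// Phase 1: queue all the "before" records
for (int s = 0; s < NUMBER_OF_STREAMS; s++)
	cudaEventRecord(pregather[s], streams[s]);

// Phase 2: queue all the kernels back to back
for (int s = 0; s < NUMBER_OF_STREAMS; s++)
	gatherFrequencies<<<1, THREADS_PER_BLOCK, 0, streams[s]>>>(
		Ddata[s], MAX_CUDA_SIZE, Dfreq[s]);

// Phase 3: queue all the "after" records
for (int s = 0; s < NUMBER_OF_STREAMS; s++)
	cudaEventRecord(gather[s], streams[s]);
```

If the serialization observed earlier is triggered by an event record landing between two launches, this ordering at least confines any such synchronization to the phase boundaries.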