Problem using cuda streams

Hi everyone,
I have some problems using cuda streams. Could someone please help?
I’ve developed a kernel that perform the rotation of non-squared images (2336x1200) of 90 degrees:

__global__ void kernel(BYTE *odata, BYTE *idata, int width, int height)
    __shared__ BYTE block[TILE_DIM][TILE_DIM+1];

	int col = blockIdx.x * TILE_DIM + threadIdx.x;
	int row = blockIdx.y * TILE_DIM + threadIdx.y;
	for (int i=0; i<TILE_DIM; i+= BLOCK_ROWS){

	if((col < width) && (row < height))
			int tid_in = row * width + col;
			block[threadIdx.y][threadIdx.x] = idata[tid_in+i*IMAGE_WIDTH];

	col = blockIdx.y * TILE_DIM + threadIdx.x;
	row = blockIdx.x * TILE_DIM + threadIdx.y;

	for(int i=0; i<TILE_DIM; i+= BLOCK_ROWS){

	if((col < height) && (row < width))
			//90° anticlockwise
			//int tid_out = (width-row-1) * height + col;

			//90° clockwise
			int tid_out = row * height + height - col - 1;
			odata[tid_out+i*IMAGE_HEIGHT] = block[threadIdx.x][threadIdx.y];

I would like to apply this kernel to a block of images of fixed size (for example a block of 45 images).
Following the post i’ve written this code:

float ms;
    //IMAGE_WIDTH = 2336, IMAGE_HEIGHT = 1200, NIMAGES = 30

    // size of memory required to store the matrix
    const  int mem_size = sizeof(BYTE) * n;

    //host input and output data
    BYTE *h_idata,*h_odata;
    //device input and output data
    BYTE *d_idata,*d_odata;
   checkCudaErrors(cudaHostAlloc((void **)&h_idata, mem_size, cudaHostAllocWriteCombined));
	 checkCudaErrors(cudaHostAlloc((void **)&h_odata, mem_size, cudaHostAllocDefault));
	 checkCudaErrors(cudaMalloc((void **) &d_idata, NSTREAMS * chunk_size*sizeof(BYTE)));
	 checkCudaErrors(cudaMalloc((void **) &d_odata, NSTREAMS * chunk_size*sizeof(BYTE)));

// initalize host input data

	cudaEvent_t start, stop;
	cudaStream_t stream[NSTREAMS];
	double dim_grid_x = ((double)IMAGE_WIDTH/TILE_DIM);
	double dim_grid_y = ((double)IMAGE_HEIGHT/TILE_DIM);
	dim3 grid(ceil(dim_grid_x),ceil(dim_grid_y),1), threads(TILE_DIM,BLOCK_ROWS,1);
	const int streamSize = n / NSTREAMS;
	const int streamBytes = streamSize * sizeof(BYTE);
	for (int i = 0; i < NSTREAMS; ++i)
    checkCudaErrors( cudaStreamCreate(&stream[i]) );

	checkCudaErrors( cudaEventCreate(&start) );
	checkCudaErrors( cudaEventCreate(&stop) );

	checkCudaErrors( cudaEventRecord(start,0) );
	for (int i = 0; i < NSTREAMS; ++i) 
		//int offset = i * streamSize;
		checkCudaErrors( cudaMemcpyAsync(&d_idata[i * streamSize], &h_idata[i * streamSize], streamBytes, cudaMemcpyHostToDevice,stream[i]) );

	for (int i = 0; i < NSTREAMS; ++i)
		//int offset = i * streamSize;
		kernel<<<grid, threads, 0, stream[i]>>>(d_odata, d_idata, IMAGE_WIDTH, IMAGE_HEIGHT,i);

	for (int i = 0; i < NSTREAMS; ++i)
		//int offset = i * streamSize;
		checkCudaErrors( cudaMemcpyAsync(&h_odata[i * streamSize], &d_odata[i * streamSize], streamBytes, cudaMemcpyDeviceToHost,stream[i]) );
	checkCudaErrors( cudaEventRecord(stop, 0) );
	checkCudaErrors( cudaEventSynchronize(stop));
	checkCudaErrors( cudaEventElapsedTime(&ms, start, stop) );
	printf("Computation Time (ms): %f \n\n", ms);

	checkCudaErrors( cudaEventDestroy(start));
	checkCudaErrors( cudaEventDestroy(stop) );

	for (int i = 0; i < NSTREAMS; ++i)
		checkCudaErrors( cudaStreamDestroy(stream[i]) );

The code works correctly only if i set a number of streams exactly equal to the number of images to rotate (NSTREAMS = NIMAGES).
On the contrary, if i have a number of images per stream greater than 1 (NSTREAMS = 15, NIMAGES = 45), the rotation don’t work as expected: only the first 15 images are ok.
How can i solve this problem?? I’m new on using cuda so i’ll appreciate any suggestions.

Thanks in advance!!


you have 2 dimensions in your code/ project - the number of streams used; the number of blocks/ chunks of work

the way it is now, you are essentially equating the one equal to the other, as you only declare and define 1 dimension really
a number of sections in your code are dependent on the number of streams; when they are essentially dependent on the number of blocks
hence, the code only works when streams == blocks

you need to introduce the other dimension - block pointer - so that you can separately alter the stream pointer and block pointer, without the one automatically influencing the other
in plain words, the block pointer normally continues to increment up to the number of blocks, whilst the stream pointer normally rolls over at a smaller count

and why do you actually use 3 separate for loops to issue h2d memory copies, kernel launches, d2h memory copies?
i would think that a single loop to mind all 3 tasks would mean more space between kernel launches, which is good to offset kernel global memory reads…?

Thanks for the reply!!:)
the use of 3 separate loops isn’t a final solution:i’m testing different configurations to evaluate experimentally the one that provides the best performance in terms of total time of execution, that is the most critical aspect of my application.
I read, in this regard, that the introduction of the third dimension in the kernel code isn’t an optimal choice for the purpose of optimizing it.Am i wrong??
Do you have any suggestions to speed up the kernel execution and/or the data transfer between host and device?
Currently i’m using an ASUS GeForce® GTX 780 graphic card (only 1 copy engines and so it isn’t capable of concurrent data transfer).

I’d be interested in seeing the reference for this.

“the introduction of the third dimension in the kernel code”

what is “the third dimension in the kernel code”?

perhaps i should not have referred to the number of streams and number of blocks as dimensions, but simply as factors
and although i did make the distinction, i should perhaps again note that i generally refer to blocks where you refer to chunks; either way this is completely separate from kernel dimensions in general
i like to refer to blocks, because the work is divided into blocks, and because the data is divided into blocks - something like data chunks sounds silly, to me at least

hence, i meant this:

checkCudaErrors(cudaMalloc((void **) &d_odata, NSTREAMS * chunk_size*sizeof(BYTE)));

this is indicated to be dependent on the number of streams, but it is really conditional on the number of blocks (chunks); so the code only works when streams == chunks

and i really meant this:

number_of_blocks = number_of_chunks;

block_pointer = 0;
stream_pointer = 0;

for (1 to number_of_blocks)
cudaMemcpyAsync(d, h, cudaMemcpyHostToDevice, stream[stream_pointer]);

the_kernel<<<dG, dB, 0, stream[stream_pointer]>>>(the_kernel_parameters);

cudaMemcpyAsync(h, d, cudaMemcpyDeviceToHost, stream[stream_pointer]);


if (stream_pointer == number_of_streams)
stream_pointer = 0;


evidently, the block pointer - generally used as a pointer to the next block to launch, and data offset pointer - can now change independently of the stream pointer - generally used to point to the next stream to use

blocks, work blocks or chunks, is an additional dimension - or factor - to be accounted for in addition to thread blocks of kernel dimensions
work blocks may be related to thread blocks, as the former would likely then also shape the latter
but the former is separate from, and not identical to the latter

Sorry little_jimmy…I misunderstood your explanation of the previous post :)
I’ll try to follow the guidelines of your example!
Thank you for your attention!

Hi Little Jimmy,
i’ve rewritten my code following your advices:

int block_pointer = 0;
int stream_pointer = 0;
int img_size = img_width*image_height;

	for (int i = 0; i < NIMAGES ; ++i) 
		//int offset = i * streamSize;
		checkCudaErrors( cudaMemcpyAsync(d_idata+block_pointer*img_size, h_idata+block_pointer*img_size, img_size, cudaMemcpyHostToDevice,stream[stream_pointer]) );
		//int offset = i * streamSize;
		kernel<<<grid, threads, 0, stream[stream_pointer]>>>(d_odata+block_pointer*img_size, d_idata+block_pointer*img_size, IMAGE_WIDTH, IMAGE_HEIGHT);
		//int offset = i * streamSize;
		checkCudaErrors( cudaMemcpyAsync(h_odata+block_pointer*img_size, d_odata+block_pointer*img_size, img_size, cudaMemcpyDeviceToHost,stream[stream_pointer]) );


		if (stream_pointer == NSTREAMS)
			stream_pointer = 0;


Is it correct?? In any case now, using a number of streams different from the number of images to rotate, the kernel works correctly.
Thank you for your attention!

“Is it correct??”

rather a philosophical question, if you ask me

next, you could reduce the device memory footprint, by introducing yet another dimension - a memory pointer
you generally only need to allocate device memory to contain the working arrays of a work-block - i.e. that used by a single kernel - times the number of streams, times 2, to prevent assigning more device memory than you would really need at any given time, whilst simultaneously preventing kernels from stepping on each others’ toes, without the need for/ use of additional synchronization

int block_pointer = 0;
int stream_pointer = 0;
int memory_pointer = 0;

for (int i = 0; i < NIMAGES ; ++i)
checkCudaErrors( cudaMemcpyAsync(d_idata+memory_pointerimg_size, h_idata+block_pointerimg_size, img_size, cudaMemcpyHostToDevice,stream[stream_pointer]) );

if (memory_pointer == (NSTREAMS * 2))
memory_pointer = 0;

hence, the size of d_idata only really needs to be img_size * number_of_streams * 2, and not: img_size * number_of_work_blocks

this is helpful when the number of work blocks becomes (really) large

I just implemented also this approach: in my case is a key element because I need to work on large amounts of data.
Now everything seems to work as expected.
Thanks a lot for your attention!!!;)

i may have made a mistake
to prevent kernels stepping on each others’ toes, i said 2 times working arrays
but i am preoccupied with occasionally issuing (some of) the memory copies in a separate, common stream
for such a case, it would be 2 times working arrays
for your more simpler case, 1 times working arrays is fine

just note what the profiler has to say about your global memory reads; you may need to pad your grand 1d array to attain alignment
in plain words, a particular kernel would find its data in device memory at block_pointer * padded_image_size, rather than block_pointer * image_size
just see what the profiler says

I launched the profiler as you suggested and i’ve obtained the following results:


Analyzing the section about global memory, I could not figure out if I execute aligned accesses (i’m new with the NSIGHT tool).
However the profiler reports me some bank conflicts accessing the shared memory.
Any suggestions about how can i solve them?

i have not studied your shared memory accesses in depth, but a) i would imagine that non-ideal shared memory access is less painful than non-ideal global memory access, b) you are doing rotations, and part of the strategy is to rather move non-ideal access to shared memory, away from global memory; hence i am not sure whether you would be able to not bleed in some way

your global memory transactions/ requests ratios suggest that alignment may be an issue
thinking of a simple test to verify this, one could, for test purposes, reduce the number of blocks processed - and thus kernels run - to 1, as the 1st block ought to be naturally aligned
but this would be conditional on the dimensions of your kernels, as well as IMAGE_WIDTH; TILE_DIM

i think it is safe to say, if the 1st thread in a warp has a global memory address such that:
thread_global_memory_address % warp size > 0
then you may have an alignment issue if a sufficient amount of threads of the warp participate in the global memory request, as you may end up needing multiple transactions to service the request
this is not a function of cudaMalloc, as cudaMalloc naturally aligns
it is a function of participating threads, and how threads offset themselves within the global memory array

“i think it is safe to say, if the 1st thread in a warp has a global memory address such that:
thread_global_memory_address % warp size > 0…”

how can i read the global memory address of threads?

if the alignment was a problem, what kind of solutions can i test to solve or at least reduce the problem?
i’m interested to the question not only for this case study but also for future kernels that I’m going to develop.

“how can i read the global memory address of threads?”

look at how threads access global memory. like here:


hence, IMAGE_WIDTH; TILE_DIM must be warp size multiples

you can manually pad your arrays
or, you could use apis like cudaMallocPitch and others