Can CUDA do sequential processing?

Hi, all
I would like to know if CUDA can do sequential processing. For illustration:

int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

if (row >= 0 && row < HEIGHT && col >= 0 && col < WIDTH) {
    if (row == 0 && col == 0) {
        *(d_Label + row * WIDTH + col) = 10;
    } else {
        *(d_Label + row * WIDTH + col) = *(d_Label + row * WIDTH + (col - 1)) + 10;
    }
}

If the above concept were applied in C/C++, the output value at each position would be a multiple of 10 (10, 20, 30, …), but when I run this code under CUDA, every value just stays at 10. I assume this is because CUDA processes the elements in parallel. Is there any way to make CUDA do sequential processing, i.e. wait until the first output value has been computed before moving on to the next one?

Thanks for any help.

~Fendi

You would need to synchronize the threads:

if (row == 0 && col == 0) {
    *(d_Label + row * WIDTH + col) = 10;
}
__syncthreads();

if (row > 0 && row < HEIGHT && col > 0 && col < WIDTH) {
    *(d_Label + row * WIDTH + col) = *(d_Label + row * WIDTH + (col - 1)) + 10;
}

Hope it helps.

Luis Ernesto.

Hi, Luis, thanks for the pointer, but the values I see are all still 10; the output doesn’t increase from one position to the next. As additional information, I use a 16x16 blockDim and a 20x20 gridDim. Perhaps you could give me another hint?

thanks.

~Fendi

Sorry, I didn’t read thoroughly the first time. The only way you could do that is to launch just one block to do all the work:

unsigned threadId = threadIdx.x;
unsigned icol = threadId;
unsigned numberThreads = blockDim.x;
unsigned numberCycles = (WIDTH - 1) / numberThreads + 1;
unsigned i, j, k;
unsigned irow = 0;

// First row.
if (threadId == 0)
    matrix[0] = 10;

for (i = 1; i < numberThreads; i++) {
    if (i == threadId && icol < WIDTH) {
        matrix[icol] = matrix[icol - 1] + 10;
    }
    __syncthreads();
}
icol += numberThreads;

for (j = 1; j < numberCycles; j++) {
    for (i = 0; i < numberThreads; i++) {
        if (i == threadId && icol < WIDTH) {
            matrix[icol] = matrix[icol - 1] + 10;
        }
        __syncthreads();
    }
    icol += numberThreads;
}
icol = threadId;
irow += WIDTH;

// The rest of the rows.
for (k = 1; k < HEIGHT; k++) {
    for (j = 0; j < numberCycles; j++) {
        for (i = 0; i < numberThreads; i++) {
            if (i == threadId && icol < WIDTH) {
                matrix[irow + icol] = matrix[irow + (icol - 1)] + 10;
            }
            __syncthreads();
        }
        icol += numberThreads;
    }
    icol = threadId;
    irow += WIDTH;
}

A surprisingly inefficient way of doing it.

Hi, Luis, thanks again for the help. By the statement that only one block should do all the processing, do you mean that I can only have a dimBlock(WIDTH, HEIGHT) and a dimGrid(1, 1)?

If I have a large WIDTH and HEIGHT, won’t CUDA get pretty slow? In other words, if I use CUDA for sequential processing, the processing speed won’t be anything delightful, and things may even get messed up.

I’ve tried your code, and I think I’ve got the grasp of it.
When I ran it on a buffer of size 50x50, I could still see something, and the results came out in sequential order. But when I tried a larger buffer, say 1000x2000, it took CUDA so long to finish that it even crashed after some time (the program exited by itself).

I think this might be caused by the limit on the number of threads allowed per block, say 256 or 512 depending on the GPU type. Can you give me some insight about this? Is there a way for CUDA to do fast sequential processing?

Thanks, I really appreciate your help.

~Fendi

By design, CUDA targets massively parallel architectures. If your algorithm is intrinsically sequential, you have two options:

  1. Run it on a CPU and forget about CUDA.

  2. Change the algorithm to an equivalent parallel one.

For instance, in your example, an algorithm change suitable for CUDA would be:

*(d_Label + row * WIDTH + col) = (row * WIDTH + col + 1)*10;

You might also have a look at how a histogram is parallelized (see the histogram sample in the CUDA Toolkit documentation) or at CUDPP for parallel prefix (scan) operations.

About your kernel that exits without completing: there is a 5-second watchdog enabled if an X display is attached to the GPU.

Hi, Joky
Thanks for the suggestion.

About your first suggestion of giving up CUDA: that is my last resort. My system is pretty big; I have about 10 processes, where eight of them can be done in parallel but two of them must be done sequentially. That is, processes 1-4 are parallel, 5 is sequential, 6-8 are parallel, 9 is sequential, and 10 is parallel.

I’m still considering the possibility of changing the algorithm to a parallel one, but no luck so far. The simple logic behind my algorithm is out[thread] = out[thread-1]; that is, computing the current output value requires the previous output to have been calculated first.

I could just split the processing in two, CUDA for the parallel parts and the CPU for the sequential parts, but that would require at least two extra memory transfers between device and host, which is somewhat expensive in time.

So I’m wondering if there is a way to do all of them in CUDA, in the hope that the processing time can be pushed down to its limit.

Thanks for the link about the histogram; I’ll take a good look at it.
If there is any additional information that you could share with me, I’d really appreciate it.
Thanks.

~Fendi

Start your sequential kernel with a <<<1, 1>>> block/thread configuration. Obviously there is no need to use threadIdx/blockIdx then, and you should resort to registers/shared memory as temporary storage.

Or, if the configuration is <<<1, x>>> for whatever reason, the following pattern can be used:

__syncthreads();

if (threadIdx.x == 0)
{
    <sequential bit>
}

__syncthreads();
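
To make that concrete for the 10-process pipeline described above, here is a sketch (kernel names, sizes, and the per-step work are placeholders, not the actual processes) of how parallel and sequential steps can be interleaved entirely on the device, avoiding the device-to-host transfers:

```cuda
#define WIDTH  20
#define HEIGHT 20

// Steps 1-4, 6-8, 10: ordinary wide parallel kernels (placeholder work).
__global__ void parallelStep(int *d_Label)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < WIDTH * HEIGHT)
        d_Label[idx] += 1;
}

// Steps 5 and 9: launched as <<<1, 1>>>, a single thread walks the
// buffer in order, so the sequential dependency is honored and the
// data never leaves the device.
__global__ void sequentialStep(int *d_Label)
{
    d_Label[0] = 10;
    for (int i = 1; i < WIDTH * HEIGHT; i++)
        d_Label[i] = d_Label[i - 1] + 10;
}

// Host side, no device-to-host copies in between (kernel launches on
// the same stream execute in order):
//   parallelStep<<<25, 16>>>(d_Label);
//   sequentialStep<<<1, 1>>>(d_Label);
//   parallelStep<<<25, 16>>>(d_Label);
```

The single-thread kernel is slow in absolute terms, but it may still win overall if it saves two large cudaMemcpy calls per sequential step.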