Can CUDA do sequential processing?

Hi, all
I would like to know if CUDA can do sequential processing. For illustration:

int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;

if (row >= 0 && row < HEIGHT && col >= 0 && col < WIDTH) {
    if (row == 0 && col == 0) {
        *(d_Label + row * WIDTH + col) = 10;
    } else {
        *(d_Label + row * WIDTH + col) = *(d_Label + row * WIDTH + (col - 1)) + 10;
    }
}

If the above concept were applied in C/C++, the output value at each position would be a multiple of 10 (10, 20, 30, …), but when I run this code under CUDA, every value just stays at 10. I assume this is because CUDA processes the elements in parallel. Is there any way to make CUDA do sequential processing, i.e. wait until the first output value has been computed before moving on to the next one?

Thanks for any help.

~Fendi

You would need to synchronize the threads:

if (row == 0 && col == 0) {
    *(d_Label + row * WIDTH + col) = 10;
}
__syncthreads();

if (row > 0 && row < HEIGHT && col > 0 && col < WIDTH) {
    *(d_Label + row * WIDTH + col) = *(d_Label + row * WIDTH + (col - 1)) + 10;
}

Hope it helps.

Luis Ernesto.

Hi, Luis, thanks for the pointer, but the values I see are all still 10; the output doesn’t increase from one position to the next. As additional information, I use a 16x16 blockDim and a 20x20 gridDim. Perhaps you could give me another hint?

thanks.

~Fendi

Sorry, I didn’t read thoroughly the first time. The only way you could do that is to launch just one block to do all the work:

unsigned threadId = threadIdx.x;
unsigned icol = threadId;
unsigned numberThreads = blockDim.x;
unsigned numberCycles = (WIDTH - 1) / numberThreads + 1;
unsigned i, j, k;
unsigned irow = 0;

// First row.
if (threadId == 0)
    matrix[0] = 10;

for (i = 1; i < numberThreads; i++) {
    if (i == threadId && icol < WIDTH) {
        matrix[icol] = matrix[icol - 1] + 10;
    }
    __syncthreads();
}
icol += numberThreads;

for (j = 1; j < numberCycles; j++) {
    for (i = 0; i < numberThreads; i++) {
        if (i == threadId && icol < WIDTH) {
            matrix[icol] = matrix[icol - 1] + 10;
        }
        __syncthreads();
    }
    icol += numberThreads;
}
icol = threadId;
irow += WIDTH;

// The rest of the rows.
for (k = 1; k < HEIGHT; k++) {
    for (j = 0; j < numberCycles; j++) {
        for (i = 0; i < numberThreads; i++) {
            if (i == threadId && icol < WIDTH) {
                matrix[irow + icol] = matrix[irow + (icol - 1)] + 10;
            }
            __syncthreads();
        }
        icol += numberThreads;
    }
    icol = threadId;
    irow += WIDTH;
}

A surprisingly inefficient way of doing it.

Hi, Luis, thanks again for the help. By the statement that only one block should do all the processing, do you mean that I can only have a dimBlock(WIDTH, HEIGHT) and a dimGrid(1, 1)?

If I have a large WIDTH and HEIGHT, won’t CUDA get pretty slow? In other words, if I use CUDA for sequential processing, the processing speed won’t be anything delightful, and things may even get messed up.

I’ve tried your code, and I think I’ve got the grasp of it.
When I ran it on a buffer of size 50x50, I could still see something, and the results came out in sequential order. But when I tried a larger buffer, say 1000x2000, it took CUDA so long to finish that it even crashed after some time (the program exited by itself).

I think this might be caused by the limit on the number of threads allowed per block, say 256 or 512 depending on the GPU type. Can you give me some insight about this? Is there a way for CUDA to do fast sequential processing?

Thanks, I really appreciate your help.

~Fendi

By design, CUDA targets massively parallel architectures. If your algorithm is intrinsically sequential, you have two options:

  1. Run it on a CPU and forget about CUDA.

  2. Change the algorithm to an equivalent parallel one.

For instance, in your example, an algorithm change suitable for CUDA would be:

*(d_Label + row * WIDTH + col) = (row * WIDTH + col + 1)*10;

You might also have a look at how a histogram is parallelized (see the histogram sample in the CUDA Toolkit documentation) or at CUDPP for parallel prefix (scan) operations.

About your kernel that exits without completing: there is a 5-second watchdog enabled if an X display is attached to the GPU.

Hi, Joky
Thanks for the suggestion.

About your first suggestion of giving up CUDA: that is my last resort. My system is pretty big; I have about 10 processes, where eight of them can be done in parallel but two of them must be done sequentially. That is, processes 1-4 are parallel, 5 is sequential, 6-8 are parallel, 9 is sequential, and 10 is parallel.

I’m still considering the possibility of changing the algorithm to a parallel one, but no luck so far. The simple logic behind my algorithm is out[thread] = out[thread-1]; that is, computing the current output value requires the previous output to have been calculated first.

I could just split the processing in two, CUDA for the parallel parts and the CPU for the sequential parts, but that would require at least two extra memory transfers between device and host, which is somewhat expensive in time.

So I’m wondering if there is a way to do all of them in CUDA, in the hope that the processing time can be pushed down to its limit.

Thanks for the link about the histogram; I’ll take a good look at it.
If there is any additional information that you could share with me, I’d really appreciate it.
Thanks.

~Fendi

Start your sequential kernel with a <<<1, 1>>> block/thread configuration. Obviously there is no need to use threadIdx/blockIdx then, and you should resort to registers/shared memory as temporary storage.

Or, if the configuration is <<<1, x>>> for whatever reason, the following pattern can be used:

__syncthreads();

if (threadIdx.x == 0)
{
    <sequential bit>
}

__syncthreads();
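
To make that concrete for the 10-process pipeline described above, here is a sketch (kernel names, sizes, and the per-step work are placeholders, not the actual processes) of how parallel and sequential steps can be interleaved entirely on the device, avoiding the device-to-host transfers:

```cuda
#define WIDTH  20
#define HEIGHT 20

// Steps 1-4, 6-8, 10: ordinary wide parallel kernels (placeholder work).
__global__ void parallelStep(int *d_Label)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < WIDTH * HEIGHT)
        d_Label[idx] += 1;
}

// Steps 5 and 9: launched as <<<1, 1>>>, a single thread walks the
// buffer in order, so the sequential dependency is honored and the
// data never leaves the device.
__global__ void sequentialStep(int *d_Label)
{
    d_Label[0] = 10;
    for (int i = 1; i < WIDTH * HEIGHT; i++)
        d_Label[i] = d_Label[i - 1] + 10;
}

// Host side, no device-to-host copies in between (kernel launches on
// the same stream execute in order):
//   parallelStep<<<25, 16>>>(d_Label);
//   sequentialStep<<<1, 1>>>(d_Label);
//   parallelStep<<<25, 16>>>(d_Label);
```

The single-thread kernel is slow in absolute terms, but it may still win overall if it saves two large cudaMemcpy calls per sequential step.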