syncthreads problem I guess this is a syncthreads problem

tadasv20 · October 11, 2008, 12:33am

I have a kernel

__global__ void sha1_kernel_global (unsigned char *data, sha1_gpu_context *ctx, int total_threads, unsigned long *extended)

{

        int thread_index = threadIdx.x + blockDim.x * blockIdx.x;

        int e_index = thread_index * 80;

        int block_index = thread_index * 64;

        unsigned long temp, t;

       if (thread_index > total_threads -1)

                return;

       GET_UINT32_BE( extended[e_index    ], data + block_index,  0 );

        GET_UINT32_BE( extended[e_index + 1], data + block_index,  4 );

        GET_UINT32_BE( extended[e_index + 2], data + block_index,  8 );

        GET_UINT32_BE( extended[e_index + 3], data + block_index, 12 );

        GET_UINT32_BE( extended[e_index + 4], data + block_index, 16 );

        GET_UINT32_BE( extended[e_index + 5], data + block_index, 20 );

        GET_UINT32_BE( extended[e_index + 6], data + block_index, 24 );

        GET_UINT32_BE( extended[e_index + 7], data + block_index, 28 );

        GET_UINT32_BE( extended[e_index + 8], data + block_index, 32 );

        GET_UINT32_BE( extended[e_index + 9], data + block_index, 36 );

        GET_UINT32_BE( extended[e_index +10], data + block_index, 40 );

        GET_UINT32_BE( extended[e_index +11], data + block_index, 44 );

        GET_UINT32_BE( extended[e_index +12], data + block_index, 48 );

        GET_UINT32_BE( extended[e_index +13], data + block_index, 52 );

        GET_UINT32_BE( extended[e_index +14], data + block_index, 56 );

        GET_UINT32_BE( extended[e_index +15], data + block_index, 60 );

       for (t = 16; t < 80; t++) {

                        temp = extended[e_index + t - 3] ^ extended[e_index + t - 8] ^

                                extended[e_index + t - 14] ^ extended[e_index + t - 16];

                        extended[e_index + t] = S(temp,1);

        }

        __syncthreads();

       if (thread_index == total_threads - 1) {

                for (t = 0; t < total_threads; t++)

                        sha1_gpu_process (ctx, (unsigned long*)&extended[t * 80]);

        }

}

And it is executed something like this:

if (k - 1 > 0) {

                for (i = 0; i < k; i++) {

        //              printf ("offset: %d\n", total_threads * i * 64);

                        sha1_kernel_global <<<blocks_per_grid, threads_per_block>>>(d_message + threads_per_block * i * 64, d_ctx, threads_per_block, d_extended);

                        //CUDA_SAFE_CALL (cudaThreadSynchronize());

                }

        }

I use only one block per grid. If i process the data with 128 or less threads I get the results as I want, but the computation is really slow. If thread number is greater than 128 the algorithm finishes extremely fast, however the results are wrong. I just don’t understand how this can happen. I guess there is something with syncthreads ? Or maybe I am wrong. Any hints? Thank you.

Tadas

MisterAnderson42 · October 11, 2008, 1:56am

Yeah, one block per grid will be slow. That only exercises a tiny fraction of the GPUs capability. Dozens of blocks can be processed in the same time as one, and you need 100’s to thousands to reach peak utilization.

Probably because of an error. Are you checking for CUDA errors after the kernel launch? This is most likely because your kernel will use more resources then are available with larger block sizes.

_Big_Mac · October 11, 2008, 8:33am

It seems you have many uncoalesced memory operations, this will kill your performance but shouldn’t be the reason the program fails with a bigger blocksize. Is it possible that your offset indexing leads to reads/writes outside of arrays sometimes?

tadasv20 · October 11, 2008, 1:30pm

Yes, I have checked for errors with CUT_CHECK_ERROR, if this is the way to check them. And I have no errors.

I am testing algorithm with different data sizes and different number of threads. Sizes are 3, 56, 1000, 10000, 100000, 1000000, 10000000, 100000000. I executed the kernel with 512 threads, and it gave correct results for data of size 3, 56 and 1000. I can’t see anything reasonable why it does not work with other sizes. I checked everything very carefully and I don’t see where the mistake is.

tadasv20 · October 11, 2008, 1:59pm

I also tested the algorithm in device emulation mode and it works as expected. All results are correct.

MisterAnderson42 · October 11, 2008, 5:45pm

Are you compiling in debug mode? CUT_CHECK_ERROR does nothing if you compile in release mode.

I only suggested a kernel launch error because you mentioned that the kernel returned instantly with incorrect results. An instantly returning kernel is usually a sure sign of a kernel launch error.

tadasv20 · October 11, 2008, 8:59pm

You are right, I forgot to turn on _DEBUG :(. Now I have this error:

Cuda error: Kernel execution failed in file ‘sha1test.cu’ in line 238 : too many resources requested for launch.

At this point I am trying to use 157 threads per block. And the limit for one block is 512. Are there any limits per thread? I don’t use a lot of local thread memory, I just have few temporary variables and the rest of the data is in the global memory. Thanks.

MisterAnderson42 · October 12, 2008, 12:08am

Just because the maximum size is 512 doesn’t mean your kernel can be run to the maximum. The condition that number of regs/thread * num_threads_per_block <= num_regs_per_MP needs to be satisfied. If you increase the number of threads/block beyond this limit, then you are requesting more resources than are available and get the error message that says exactly that.

tadasv20 · October 12, 2008, 12:16am

How can I calculate how many registers I am using?

MisterAnderson42 · October 12, 2008, 12:40pm

It is not a quantity that you can calculate, it is a measurable quantity of the output of the compiler. See [url=“The Official NVIDIA Forums | NVIDIA”]The Official NVIDIA Forums | NVIDIA . There is also an option something like -ptxas-options=-v in recent versions of nvcc (search the forums for the exact syntax) that will provide the same information.

Topic		Replies	Views
Run a million threads or blocks on a single kernel function, and still works. It supposed to be 512 at maximum, isn't it? CUDA Programming and Performance	4	1319	January 6, 2017
Kernel launch failed while number of threads per block smaller than largest number allowed CUDA Programming and Performance cuda	12	2276	October 12, 2021
Launching Kernel Fail CUDA Programming and Performance	15	3412	May 28, 2014
Cuda code performance CUDA Programming and Performance	14	3163	December 16, 2014
How can I be certain my Kernel runs with 32 threads in one block and thus perfect synchrony? (ie. via __syncthreads()) CUDA Programming and Performance	15	50	August 21, 2024
Wrong output when adding blocks what am I doing wrong? CUDA Programming and Performance	13	11980	December 4, 2007
Max threads/block CUDA Programming and Performance	10	22221	March 7, 2011
I wonder maximum number of threads per block really limits the number of threads in each block. CUDA Programming and Performance	5	3983	February 9, 2024
CUDA kernels keep on crashing CUDA Programming and Performance	6	3649	October 27, 2008
How to debug kernel throwing an exception? CUDA Programming and Performance	16	7963	June 14, 2013

syncthreads problem I guess this is a syncthreads problem

Related topics