syncthreads problem I guess this is a syncthreads problem

I have a kernel

__global__ void sha1_kernel_global (unsigned char *data, sha1_gpu_context *ctx, int total_threads, unsigned long *extended)


        int thread_index = threadIdx.x + blockDim.x * blockIdx.x;

        int e_index = thread_index * 80;

        int block_index = thread_index * 64;

        unsigned long temp, t;

       if (thread_index > total_threads -1)


       GET_UINT32_BE( extended[e_index    ], data + block_index,  0 );

        GET_UINT32_BE( extended[e_index + 1], data + block_index,  4 );

        GET_UINT32_BE( extended[e_index + 2], data + block_index,  8 );

        GET_UINT32_BE( extended[e_index + 3], data + block_index, 12 );

        GET_UINT32_BE( extended[e_index + 4], data + block_index, 16 );

        GET_UINT32_BE( extended[e_index + 5], data + block_index, 20 );

        GET_UINT32_BE( extended[e_index + 6], data + block_index, 24 );

        GET_UINT32_BE( extended[e_index + 7], data + block_index, 28 );

        GET_UINT32_BE( extended[e_index + 8], data + block_index, 32 );

        GET_UINT32_BE( extended[e_index + 9], data + block_index, 36 );

        GET_UINT32_BE( extended[e_index +10], data + block_index, 40 );

        GET_UINT32_BE( extended[e_index +11], data + block_index, 44 );

        GET_UINT32_BE( extended[e_index +12], data + block_index, 48 );

        GET_UINT32_BE( extended[e_index +13], data + block_index, 52 );

        GET_UINT32_BE( extended[e_index +14], data + block_index, 56 );

        GET_UINT32_BE( extended[e_index +15], data + block_index, 60 );

       for (t = 16; t < 80; t++) {

                        temp = extended[e_index + t - 3] ^ extended[e_index + t - 8] ^

                                extended[e_index + t - 14] ^ extended[e_index + t - 16];

                        extended[e_index + t] = S(temp,1);



       if (thread_index == total_threads - 1) {

                for (t = 0; t < total_threads; t++)

                        sha1_gpu_process (ctx, (unsigned long*)&extended[t * 80]);



And it is executed something like this:

if (k - 1 > 0) {

                for (i = 0; i < k; i++) {

        //              printf ("offset: %d\n", total_threads * i * 64);

                        sha1_kernel_global <<<blocks_per_grid, threads_per_block>>>(d_message + threads_per_block * i * 64, d_ctx, threads_per_block, d_extended);

                        //CUDA_SAFE_CALL (cudaThreadSynchronize());



I use only one block per grid. If i process the data with 128 or less threads I get the results as I want, but the computation is really slow. If thread number is greater than 128 the algorithm finishes extremely fast, however the results are wrong. I just don’t understand how this can happen. I guess there is something with syncthreads ? Or maybe I am wrong. Any hints? Thank you.


Yeah, one block per grid will be slow. That only exercises a tiny fraction of the GPUs capability. Dozens of blocks can be processed in the same time as one, and you need 100’s to thousands to reach peak utilization.

Probably because of an error. Are you checking for CUDA errors after the kernel launch? This is most likely because your kernel will use more resources then are available with larger block sizes.

It seems you have many uncoalesced memory operations, this will kill your performance but shouldn’t be the reason the program fails with a bigger blocksize. Is it possible that your offset indexing leads to reads/writes outside of arrays sometimes?

Yes, I have checked for errors with CUT_CHECK_ERROR, if this is the way to check them. And I have no errors.

I am testing algorithm with different data sizes and different number of threads. Sizes are 3, 56, 1000, 10000, 100000, 1000000, 10000000, 100000000. I executed the kernel with 512 threads, and it gave correct results for data of size 3, 56 and 1000. I can’t see anything reasonable why it does not work with other sizes. I checked everything very carefully and I don’t see where the mistake is.

I also tested the algorithm in device emulation mode and it works as expected. All results are correct.

Are you compiling in debug mode? CUT_CHECK_ERROR does nothing if you compile in release mode.

I only suggested a kernel launch error because you mentioned that the kernel returned instantly with incorrect results. An instantly returning kernel is usually a sure sign of a kernel launch error.

You are right, I forgot to turn on _DEBUG :(. Now I have this error:

Cuda error: Kernel execution failed in file ‘’ in line 238 : too many resources requested for launch.

At this point I am trying to use 157 threads per block. And the limit for one block is 512. Are there any limits per thread? I don’t use a lot of local thread memory, I just have few temporary variables and the rest of the data is in the global memory. Thanks.

Just because the maximum size is 512 doesn’t mean your kernel can be run to the maximum. The condition that number of regs/thread * num_threads_per_block <= num_regs_per_MP needs to be satisfied. If you increase the number of threads/block beyond this limit, then you are requesting more resources than are available and get the error message that says exactly that.

How can I calculate how many registers I am using?

It is not a quantity that you can calculate, it is a measurable quantity of the output of the compiler. See . There is also an option something like -ptxas-options=-v in recent versions of nvcc (search the forums for the exact syntax) that will provide the same information.