Changing from __shared__ to __shfl()

Hi,

I’ve learned a bit of CUDA by porting an OpenCL-based crypto-hashing kernel to CUDA. I got it working, and even got it running faster than its OpenCL counterpart. But I want to see if I can stretch it a bit further by replacing shared memory with __shfl(). I have implemented the change, but my kernel just crashes hard and I have no clue where to find the error.

First I have the following custom types:

typedef union
{
	uint64_t uint64s[32 / sizeof(uint64_t)];
	uint32_t uint32s[32 / sizeof(uint32_t)];
} hash32_t;

typedef union
{
	uint64_t uint64s[64 / sizeof(uint64_t)];
	uint4	 uint4s[64 / sizeof(uint4)];
} hash64_t;

typedef union
{
	hash64_t init;	
	hash32_t mix;
} compute_hash_share;

And here is the original use of shared memory:

__device__ hash32_t compute_hash(
	hash32_t const* g_header,
	hash128_t const* g_dag,
	uint64_t nonce
	)
{
	extern __shared__  compute_hash_share share[];

	// Compute one init hash per work item.
	hash64_t init = init_hash(g_header, nonce);

	// Threads work together in this phase in groups of 8.
	uint32_t const thread_id = threadIdx.x & (THREADS_PER_HASH-1);
	uint32_t const hash_id   = threadIdx.x >> 3;	// i.e. threadIdx.x / THREADS_PER_HASH

	hash32_t mix;
	uint32_t i = 0;
	
	do
	{
		// share init with other threads
		if (i == thread_id)
			share[hash_id].init = init;
		
		uint4 thread_init = share[hash_id].init.uint4s[thread_id & 3];
		
		uint32_t thread_mix = inner_loop(thread_init, thread_id, share[hash_id].mix.uint32s, g_dag);

		share[hash_id].mix.uint32s[thread_id] = thread_mix;

		if (i == thread_id)
			mix = share[hash_id].mix;

	} while (++i != THREADS_PER_HASH );

	return final_hash(&init, &mix);
}

So what happens here is that a warp is basically divided into 4 groups of 8 threads, and on each iteration of the do-while loop, each of the 8 threads in a group works out one of the 8 uint32s of share.mix. Brilliant. Somebody else worked that out, by the way.
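In code terms (just an illustration, assuming THREADS_PER_HASH == 8):

// lane within the 8-thread group:  thread_id = threadIdx.x & 7
// which 8-thread group:            hash_id   = threadIdx.x >> 3
// e.g. threadIdx.x == 13  =>  thread_id == 5, hash_id == 1
// so on iteration i of the loop, all 8 lanes of a group consume lane i's init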

Then I thought about how to replace that with __shfl(). Instead of sharing the init hash with the other threads through shared memory, I’d shuffle it in on the current thread:

__device__ hash32_t compute_hash_shuffle(
	hash32_t const* g_header,
	hash128_t const* g_dag,
	uint64_t nonce
	)
{
	compute_hash_share share;

	// Compute one init hash per work item.
	hash64_t init = init_hash(g_header, nonce);

	// Threads work together in this phase in groups of 8.
	uint32_t const thread_id = threadIdx.x & (THREADS_PER_HASH - 1);
	uint32_t const hash_id = threadIdx.x >> 3;

	hash32_t mix;
	int i = 0;

	do
	{

		// read init from other thread

		if (i == thread_id) 
			share.init = init;
		else {
			share.init.uint4s[0] = __shfl(init.uint4s[0], i, THREADS_PER_HASH);
			share.init.uint4s[1] = __shfl(init.uint4s[1], i, THREADS_PER_HASH);
			share.init.uint4s[2] = __shfl(init.uint4s[2], i, THREADS_PER_HASH);
			share.init.uint4s[3] = __shfl(init.uint4s[3], i, THREADS_PER_HASH);
		}
		
		uint4 thread_init = share.init.uint4s[thread_id & 3];

		uint32_t thread_mix = inner_loop(thread_init, thread_id, share.mix.uint32s, g_dag);

		share.mix.uint32s[thread_id] = thread_mix;
		
		if (i == thread_id)
			mix = share.mix;

	} while (++i != THREADS_PER_HASH);

	return final_hash(&init, &mix);
}

together with this uint4 overload of __shfl:

// Shuffle a uint4 one 32-bit component at a time; the built-in __shfl
// only handles 32-bit types. 'width' is the sub-warp size (note: naming
// this parameter warpSize would shadow the CUDA built-in).
__device__ uint4 __shfl(uint4 val, unsigned int lane, int width)
{
	return make_uint4(
		(uint32_t)__shfl((int)val.x, lane, width),
		(uint32_t)__shfl((int)val.y, lane, width),
		(uint32_t)__shfl((int)val.z, lane, width),
		(uint32_t)__shfl((int)val.w, lane, width));
}
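In case it’s useful: since the built-in __shfl only takes 32-bit types, the 64-bit words of hash32_t would need a similar helper built from two 32-bit shuffles. A sketch (untested, same idea as the uint4 overload above):

__device__ uint64_t __shfl(uint64_t val, unsigned int lane, int width)
{
	// split the 64-bit word into two 32-bit halves, shuffle each,
	// then stitch them back together
	uint32_t lo = (uint32_t)(val & 0xFFFFFFFFu);
	uint32_t hi = (uint32_t)(val >> 32);
	lo = (uint32_t)__shfl((int)lo, lane, width);
	hi = (uint32_t)__shfl((int)hi, lane, width);
	return ((uint64_t)hi << 32) | lo;
}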

I understand this may not be enough info to help me out, but first of all I was wondering whether this is “legal” use of __shfl at all. The kernel aborts with “unspecified launch failure”, unless I skip the inner_loop call. I could post that here too, but you’re probably better off looking at the full source:

https://github.com/Genoil/cpp-ethereum/blob/cudaminer/libethash-cu/ethash_cu_miner_kernel.cu

Thanks for having a look at this.

Let’s improve your debugging skills, slightly.

Follow the method outlined in the answer here:

http://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy

In a nutshell:

  1. compile your code with the -lineinfo switch

  2. run your code with cuda-memcheck

cuda-memcheck should not only show the type of error in the kernel, but also identify the specific line of kernel code on which the error occurred.
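For example (file and binary names are placeholders for your own build):

nvcc -lineinfo -o app app.cu
cuda-memcheck ./app

Unlike -G, the -lineinfo switch embeds source line information without disabling device code optimizations, so you can debug the same binary you benchmark.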

Thanks for your answer. I’ve run with cuda-memcheck and the causes of the crashes are now clear to me. The hashing algo basically works like this: first, the header of the block we’re trying to find a valid nonce for is hashed together with the nonce using SHA-3 (a.k.a. Keccak). This is the starting point for a 128-fold lookup into a precalculated ~1GB file called the DAG (which resides on the GPU and ensures the memory hardness of the algo). The final result of that is then hashed once more. With my __shfl implementation, the index into the DAG on these lookups runs out of bounds.
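For anyone who hits the same thing: my best guess at the root cause (an assumption on my part, but consistent with what cuda-memcheck flagged) is that __shfl only returns a defined value when the source lane actually executes the shuffle. In my loop, the lane where i == thread_id takes the if branch and skips the __shfl calls, so the other lanes in the group read garbage, and that garbage ends up in the DAG index. Having all lanes execute the shuffle unconditionally avoids the divergence; lane i simply gets its own value back:

		// all 8 lanes of the group execute the shuffle; the source lane
		// receives its own init, so the if/else is no longer needed
		share.init.uint4s[0] = __shfl(init.uint4s[0], i, THREADS_PER_HASH);
		share.init.uint4s[1] = __shfl(init.uint4s[1], i, THREADS_PER_HASH);
		share.init.uint4s[2] = __shfl(init.uint4s[2], i, THREADS_PER_HASH);
		share.init.uint4s[3] = __shfl(init.uint4s[3], i, THREADS_PER_HASH);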

OK, got it working. Only, it’s (marginally) slower than the shared memory approach. Are there any standard approaches to optimizing __shfl-ed code, like what to look for in the PTX, etc.?
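(For anyone following along: one way to get at the PTX is nvcc -ptx, or cuobjdump -ptx on the compiled binary; the names below are placeholders.)

nvcc -ptx app.cu -o app.ptx
cuobjdump -ptx app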

–edit–

Upon inspecting the PTX, I noticed that most of my shfl.idx instructions are followed by a st.local, rather than keeping everything in registers. Is there a way to force this 64-byte struct that I’m shuffling around to stay in registers?

Fully unroll your do…while loop (e.g. with #pragma unroll).

Any runtime access with a non-constant index forces an array into local memory.
If all indices are known at compile time, arrays can go into registers (assuming they are not too big).
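A minimal sketch of the difference (hypothetical code, not from your kernel):

__device__ uint32_t reg_array_demo(uint32_t seed)
{
	uint32_t a[4];

	// a runtime index such as a[seed & 3] would force 'a' into local
	// memory -- that is where the st.local after your shuffles comes from;
	// with a fully unrolled loop every index is a compile-time constant,
	// so 'a' can live entirely in registers
	#pragma unroll
	for (int k = 0; k < 4; k++)
		a[k] = seed ^ k;

	return a[0] + a[1] + a[2] + a[3];	// constant indices again
}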

Christian

I “solved” it in a different way: I got rid of all the custom types with unions of arrays and just declared a lot of separate uint32s to shuffle around. I’m now back at exactly the same speed as the shared memory approach. The bottleneck is clearly in a different part of the code… oh well, it was a good exercise :)
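Roughly like this (a simplified sketch of the idea, not the actual kernel code):

// instead of a union wrapping uint4s[4], keep plain scalar words
// (16 of them for the full 64 bytes; only two shown here)
uint32_t init0, init1;	// ... up to init15

// each scalar stays in a register and shuffles directly
uint32_t t0 = (uint32_t)__shfl((int)init0, i, THREADS_PER_HASH);
uint32_t t1 = (uint32_t)__shfl((int)init1, i, THREADS_PER_HASH);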

BTW, adding #pragma unroll to this code only made it slower.