cub::BlockLoad not working

Hello all,

I'm trying to use cub::BlockLoad to load 4 items per thread from global memory in a kernel. Although the kernel successfully accesses global memory, BlockLoad then appears to fail to load these values into the thread-local array. Below is a minimal working example.

// to remove squiggly Intellisense lines
#ifndef __CUDACC__
#define __CUDACC__
#endif

// CUDA headers
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

// cub library
#include "cub/cub.cuh"

// C/C++ headers
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define BLOCK_SIZE 256

#define NUM_BLOCKS 100

#define ITEMS_PER_THREAD 4

void checkCUDAError(const char* msg)
{
	cudaError_t err = cudaGetLastError();
	if (cudaSuccess != err)
	{
		fprintf(stderr, "CUDA error: %s: %s.\n", msg, cudaGetErrorString(err));
		exit(EXIT_FAILURE);
	}
}

// myStruct is not used as of yet
typedef struct myStruct
{
	double* a;

} myStruct;


__global__ void myKernel(int* d_in)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;

	if (x >= BLOCK_SIZE * NUM_BLOCKS) return;
		
	typedef cub::BlockLoad<int, BLOCK_SIZE, ITEMS_PER_THREAD> BlockLoad;

	__shared__ typename BlockLoad::TempStorage temp_storage;

	int local[4];

	BlockLoad(temp_storage).Load(d_in, local);

	__syncthreads();

	printf("Thread: %d, value of item 0: %d\n", x, local[0]);
}

int main()
{
	int* h_in  = new int[BLOCK_SIZE * NUM_BLOCKS];
	int* d_in;

	cudaMalloc( &d_in, BLOCK_SIZE * NUM_BLOCKS * sizeof(int) );
	checkCUDAError("malloc");
	
	for (int i = 0; i < BLOCK_SIZE * NUM_BLOCKS; i++)
	{
		h_in[i] = i % 4;
	}

	cudaMemcpy
	(
		d_in,
		h_in,
		BLOCK_SIZE * NUM_BLOCKS * sizeof(int),
		cudaMemcpyDefault
	);
	checkCUDAError("copy");

	myKernel<<<NUM_BLOCKS, BLOCK_SIZE>>>(d_in);

	delete[] h_in;

	cudaFree(d_in);

	return 0;
}

As far as I can tell, this is virtually identical to the example on the page I have linked. However, it simply prints:

Thread: 3872, value of item 0: 0
Thread: 3873, value of item 0: 0
Thread: 3874, value of item 0: 0
Thread: 3875, value of item 0: 0
Thread: 3876, value of item 0: 0
Thread: 3877, value of item 0: 0
Thread: 3878, value of item 0: 0
Thread: 3879, value of item 0: 0
Thread: 3880, value of item 0: 0
Thread: 3881, value of item 0: 0
...

Until the kernel finishes.

I have been working on this seemingly trivial issue for the best part of a day as part of a larger code. After discovering that even this simple sandbox problem wasn't working, I felt compelled to ask for advice, so any help would be much appreciated.

For reference, I am using Visual Studio, a notebook GTX 1050 (Compute Capability 6.1), and CUDA Toolkit 11.1. Please let me know if any other information is needed and I'll do my best to provide it.

Thanks in advance!

Study these two lines carefully:

	h_in[i] = i % 4;

and

printf("Thread: %d, value of item 0: %d\n", x, local[0]);

If you need an additional clue, try changing the local index to 1 in your printout:

printf("Thread: %d, value of item 1: %d\n", x, local[1]);

Aside:

I don’t recommend this coding practice:

// to remove squiggly Intellisense lines
#ifndef __CUDACC__
#define __CUDACC__
#endif

If you want to manage IntelliSense, I suggest wrapping that so it only takes effect for the IntelliSense parsing pass (__INTELLISENSE__ is defined only by Visual Studio's IntelliSense engine, not by nvcc):

// to remove squiggly Intellisense lines
#ifdef __INTELLISENSE__
#ifndef __CUDACC__
#define __CUDACC__
#endif
#endif
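
To spell the clue out completely (stop here if you'd rather work it out yourself): with the default blocked arrangement (BLOCK_LOAD_DIRECT), thread t of each block loads the contiguous items 4*t through 4*t+3 relative to the pointer passed to Load(). So local[0] always comes from an index that is a multiple of 4, and h_in[i] = i % 4 makes every such element zero, whereas local[1] would print all 1s. A quick host-side sketch of that mapping (not your kernel, just the index arithmetic):

#include <cstdio>

int main()
{
    // Which global element lands in local[k] of thread t, for the default
    // blocked arrangement with 4 items per thread and no per-block offset:
    //   local[k] = d_in[4 * t + k]
    for (int t = 0; t < 4; t++)          // a few example threads
        for (int k = 0; k < 4; k++)      // the 4 items each thread holds
            printf("thread %d: local[%d] = d_in[%d] = %d\n",
                   t, k, 4 * t + k, (4 * t + k) % 4);  // h_in[i] = i % 4
    // local[0] is always d_in[4 * t], and (4 * t) % 4 == 0 for every t,
    // hence the all-zero printout.
    return 0;
}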

Hi, thank you very much for your quick response (and for all your other posts generally!).

I spotted my mistake; I can see that BlockLoad is working fine. May I now turn your attention to the main issue I have been working on, condensed below:

template< bool SINGLE_BLOCK>
__global__ void encode_and_threshold
(
	....
        index_1D          next_lvl_idx,
	....
)
{
	__shared__ real shared_coeffs[NUM_VARS * THREADS_PER_BLOCK_MRA];
	
	...

	index_1D t_idx = threadIdx.x;
	index_1D idx   = blockIdx.x * blockDim.x + t_idx;

	if (idx >= num_threads) return;

	...

	if (SINGLE_BLOCK)
	{
		// ...
	}
	else
	{
		real* z_global = &d_scale_coeffs.z[next_lvl_idx];

		//printf("Global value: %f\n", z_global[idx]);

		typedef cub::BlockLoad<real, THREADS_PER_BLOCK_MRA, 4> BlockLoad;

		__shared__ typename BlockLoad::TempStorage temp_storage;

		real z_local[4];

		BlockLoad(temp_storage).Load(z_global, z_local);

		__syncthreads();

		printf("Thread: %d, local value: %f", idx, z_local[0]);
		printf("Thread: %d, local value: %f", idx, z_local[1]);
		printf("Thread: %d, local value: %f", idx, z_local[2]);
		printf("Thread: %d, local value: %f", idx, z_local[3]);     
		// ...		
	}
}

void dummy_template_instantiator_encode_and_threshold
(
	[args]
)
{
	encode_and_threshold<true><<<1, THREADS_PER_BLOCK_MRA>>>
	(
		[args]
	);

	encode_and_threshold<false><<<1, THREADS_PER_BLOCK_MRA>>>
	(
		[args]
	);
}

Here, I get all zeros, even though printing z_global gives me the values I expected.

For reference, real is an alias for double, index_1D is an integer alias used for indexing 'flattened' arrays, and next_lvl_idx is an offset I need for my array (encapsulated in a struct) in global memory. Thanks again; please let me know if you need any clarification.

I’m not a big fan of incomplete codes. I generally have to remind myself to ignore posts that ask about debugging an incomplete code. But anyway, I was unable to spot any issues.

Here’s my attempt at building a complete code out of what you have shown:

$ cat t26.cu
#include <cstdio>
#include <cub/cub.cuh>
const int THREADS_PER_BLOCK_MRA = 256;
const int NUM_VARS=4;
template< bool SINGLE_BLOCK>
__global__ void encode_and_threshold
(
        double *d, int lim
)
{
        __shared__ double shared_coeffs[NUM_VARS * THREADS_PER_BLOCK_MRA];

        int t_idx = threadIdx.x;
        int  idx  = blockIdx.x * blockDim.x + t_idx;
                double * z_global = d;
                if (idx < lim)
                  printf("Global value: %f\n", z_global[idx]);

                typedef cub::BlockLoad<double, THREADS_PER_BLOCK_MRA, 4> BlockLoad;

                __shared__ typename BlockLoad::TempStorage temp_storage;

                double z_local[4];

                BlockLoad(temp_storage).Load(z_global, z_local);

                __syncthreads();
                if (idx < lim){

                  printf("Thread: %d, local value: %f\n", idx, z_local[0]);
                  printf("Thread: %d, local value: %f\n", idx, z_local[1]);
                  printf("Thread: %d, local value: %f\n", idx, z_local[2]);
                  printf("Thread: %d, local value: %f\n", idx, z_local[3]);
                }
}

void dummy_template_instantiator_encode_and_threshold
(
    double *d
)
{
        encode_and_threshold<false><<<1, THREADS_PER_BLOCK_MRA>>>
        (
                d,2
        );
}

int main(){

    double *d, *h;
    const int ds = THREADS_PER_BLOCK_MRA*4;
    cudaMalloc(&d, ds*sizeof(double));
    h = new double[ds];
    for (int i = 0; i < ds; i++) h[i] = (i%4)+1;
    cudaMemcpy(d, h, ds*sizeof(double), cudaMemcpyHostToDevice);
    dummy_template_instantiator_encode_and_threshold(d);
    cudaDeviceSynchronize();
}
$ nvcc -o t26 t26.cu
t26.cu(11): warning: variable "shared_coeffs" was declared but never referenced

$ cuda-memcheck ./t26
========= CUDA-MEMCHECK
Global value: 1.000000
Global value: 2.000000
Thread: 0, local value: 1.000000
Thread: 1, local value: 1.000000
Thread: 0, local value: 2.000000
Thread: 1, local value: 2.000000
Thread: 0, local value: 3.000000
Thread: 1, local value: 3.000000
Thread: 0, local value: 4.000000
Thread: 1, local value: 4.000000
========= ERROR SUMMARY: 0 errors
$

It seems to work correctly. Therefore I conclude the most likely possibility is that the problem lies in something you haven’t shown. I think there is a reasonable possibility that if you create a minimal but complete example, you may well discover the problem yourself.

I thought I'd leave out most of my code for brevity's sake, but from now on I will do a better job of including all the necessary information, e.g. macros or const ints. Sorry about that.

I will compare your code with mine to see where the difference lies. Can I also ask whether you suspect the array offset next_lvl_idx could cause issues for BlockLoad? That is the most immediate difference I see between your code and mine.

I have run cuda-memcheck and racecheck as well but have not had any errors.

Whatever the effect of next_lvl_idx usage may be, based on what I see of your code it should already be factored in through the use of z_global. And since you are using z_global both for the printf you refer to and for the subsequent code (BlockLoad, printf, etc.), I don't see why next_lvl_idx should matter. However, I don't know what I don't know.

Let me pose the question to you: Why should next_lvl_idx matter? If you can print the values that z_global points to, why should BlockLoad not be able to load from z_global? Moreover, you now state that cuda-memcheck reports no issues, so your usage of z_global (based on your printf statement) seems fine.

Hello,

I created a working, complete example mimicking my main problem. My problem is essentially populating a quadtree, which I represent as a flattened array. A quadtree of height L has (4^(L+1) - 1) / 3 elements in total; using this relationship, we can calculate the index within the array at which level n begins, as in the function get_lvl_idx in my example below.
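
For concreteness, level n then begins at index (4^n - 1) / 3, so levels 0 through 6 begin at indices 0, 1, 5, 21, 85, 341 and 1365, and a tree with MAX_LVL = 6 holds get_lvl_idx(MAX_LVL + 1) = 5461 elements in total.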

In my problem, the information at level n of the quadtree is derived from the information at level n + 1. I am using the information from four elements at level n + 1 to populate a single element at level n. My example, below, appears to be running properly, as evidenced by the printf statement showing non-zero values. cuda-memcheck also reports no issues.

// RC's suggestion to manage Intellisense
#ifdef __INTELLISENSE__
#ifndef __CUDACC__
    #define __CUDACC__
#endif
#endif

// CUDA headers
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

// cub library
#include "cub/cub.cuh"

// C/C++ headers
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define CHECK_CUDA_ERROR(ans) { CUDAAssert((ans), __FILE__, __LINE__); }
inline void CUDAAssert(cudaError_t error, const char* file, int line, bool abort = true)
{
	if (error != cudaSuccess)
	{
		fprintf(stderr, "CUDA error: %s, %s, %d\n", cudaGetErrorString(error), file, line);
		if (abort) exit(error);
	}
}

const int BLOCK_SIZE = 256;

const int ITEMS_PER_THREAD = 4;

const int MAX_LVL = 6;

const int LVL_SINGLE_BLOCK = 4;

int get_lvl_idx(int level)
{
	return ((1 << (2 * level)) - 1) / 3;
}

int get_num_blocks(int num_threads, int block_size)
{
	return num_threads / block_size + (num_threads % block_size != 0);
}

typedef struct TreeStruct
{
	double* data;

} TreeStruct;

template<bool SINGLE_BLOCK>
__global__ void populate_lower_levels
(
	TreeStruct tree_struct,
	int        current_lvl_idx,
	int        next_lvl_idx,
	int        num_threads
)
{
	__shared__ double shared_coeffs[BLOCK_SIZE * 4];

	typedef cub::BlockLoad<double, BLOCK_SIZE, ITEMS_PER_THREAD> BlockLoad;

	__shared__ typename BlockLoad::TempStorage temp_storage;
	
	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	if (idx >= num_threads) return;

	int tree_idx = current_lvl_idx + idx;

	if (SINGLE_BLOCK)
	{
		tree_struct.data[idx] = 0.5 * idx;
	}
	else
	{
		double* next_lvl = &tree_struct.data[next_lvl_idx];
		
		//printf("Thread: %d, value: %f\n", idx, next_lvl[idx]);

		double local[ITEMS_PER_THREAD];

		BlockLoad(temp_storage).Load(next_lvl, local);

		__syncthreads();

		/*printf("Thread: %d, value: %f\n", idx, local[0]);
		printf("Thread: %d, value: %f\n", idx, local[1]);
		printf("Thread: %d, value: %f\n", idx, local[2]);
		printf("Thread: %d, value: %f\n", idx, local[3]);
		*/

		double a = local[0];
		double b = local[1];
		double c = local[2];
		double d = local[3];

		double sum = a + b + c + d;

		printf("Device tree element: %d, sum: %f\n", current_lvl_idx + idx, sum);
		
		__syncthreads();
		
		tree_struct.data[current_lvl_idx + idx] = sum;
	}
}

int main()
{
	int num_elem_tree = get_lvl_idx(MAX_LVL + 1);

	double* h_tree = new double[num_elem_tree];

	int current_lvl_idx = get_lvl_idx(MAX_LVL);
	int next_lvl_idx    = get_lvl_idx(MAX_LVL + 1);

	for (int i = current_lvl_idx; i < next_lvl_idx; i++)
	{
		h_tree[i] = 0.01 * i;
	}

	TreeStruct tree_struct;

	CHECK_CUDA_ERROR( cudaMalloc( &tree_struct.data, num_elem_tree * sizeof(double) ) );

	CHECK_CUDA_ERROR( cudaMemcpy(tree_struct.data, h_tree, num_elem_tree * sizeof(double), cudaMemcpyDefault) );

	printf("PRINTING DEVICE VALUES\n");

	for (int level = MAX_LVL - 1; level > LVL_SINGLE_BLOCK - 1; level--)
	{
		int prev_lvl_idx    = get_lvl_idx(level - 1);
		int current_lvl_idx = get_lvl_idx(level);
		int next_lvl_idx    = get_lvl_idx(level + 1);
		int num_threads     = next_lvl_idx - current_lvl_idx;
		int num_blocks      = get_num_blocks(num_threads, BLOCK_SIZE);

		populate_lower_levels<false><<<num_blocks, BLOCK_SIZE>>>
		(
			tree_struct, 
			current_lvl_idx, 
			next_lvl_idx, 
			num_threads
		);

		CHECK_CUDA_ERROR( cudaPeekAtLastError() );
		CHECK_CUDA_ERROR( cudaDeviceSynchronize() );	
	}

	int num_threads = get_lvl_idx(LVL_SINGLE_BLOCK + 1);
	int num_blocks  = get_num_blocks(num_threads, BLOCK_SIZE);

	populate_lower_levels<true><<<num_blocks, BLOCK_SIZE>>>
	(
		tree_struct, 
		current_lvl_idx, 
		next_lvl_idx, 
		num_threads
	);

	CHECK_CUDA_ERROR( cudaPeekAtLastError() );
	CHECK_CUDA_ERROR( cudaDeviceSynchronize() );

	CHECK_CUDA_ERROR(cudaMemcpy(h_tree, tree_struct.data, num_elem_tree * sizeof(double), cudaMemcpyDeviceToHost));

	printf("\n");

	printf("PRINTING HOST VALUES\n");

	for (int level = MAX_LVL - 1; level > 0; level--)
	{
		int prev_lvl_idx = get_lvl_idx(level - 1);
		int current_lvl_idx = get_lvl_idx(level);
		int next_lvl_idx = get_lvl_idx(level + 1);

		for (int j = current_lvl_idx; j < next_lvl_idx; j++)
		{
			printf("Host tree element: %d, sum: %f\n", j, h_tree[j]);
		}

		printf("\n");
	}

	delete[] h_tree;

	CHECK_CUDA_ERROR( cudaFree(tree_struct.data) );

	return 0;
}

Indeed, your question highlights my key frustration with this bug. z_global evidently prints non-zero values as expected, but BlockLoad is not loading those values into z_local. Most bizarre of all, let me highlight a modified portion of the code from my previous post:

     real* z_global = &d_scale_coeffs.z[next_lvl_idx];

    //printf("Global value: %f\n", z_global[idx]);
	real z_local[4];

	BlockLoad(temp_storage).Load(z_global, z_local);
	__syncthreads();

	if ( z_local[0] || z_local[1] || z_local[2] || z_local[3] )
	{
		printf("Thread: %d, local value: %f\n", idx, z_local[0]);
		printf("Thread: %d, local value: %f\n", idx, z_local[1]);
		printf("Thread: %d, local value: %f\n", idx, z_local[2]);
		printf("Thread: %d, local value: %f\n", idx, z_local[3]);
	}

	__syncthreads();

Were it the case that in some other portion of my code I was erroneously modifying z_global somehow, such that the values are always zero, then the above should still print non-zero values. I thought to add an additional __syncthreads() to make sure all the block-wide printing is done before any calculations, but nothing is printed, since all the values in z_local are zero.

Understandably, I am baffled. At this point, I also tried to debug BlockLoad’s code but it is a bit dense for me.

Thank you for your help so far. I will definitely be taking your advice, among others, to make working, complete sandbox examples to narrow down issues when presenting my bug to others. Any further help would be really appreciated.

EDIT: Sorry, sum was mistakenly written as int instead of double.

The code that you have now posted generates over 100,000 lines (!) of printout. I have no idea how to interpret that. You say that this code mimics your main problem, but the majority of output lines that I see print a non-zero value. You also say your example appears to be running properly. In the cases where the printed value is zero, I observe that next_lvl is also pointing to zero data, so I don't think there is any surprise that BlockLoad is populating your local array with zeros. If that is the issue you are concerned about, you have an indexing problem, and it has nothing to do with BlockLoad from what I can see. You've removed the printout of next_lvl from this code (as compared to the previous code, where you printed z_global prior to the BlockLoad). If you add that back in, you'll see consistency.
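
That is, un-comment the line that is already in your kernel just before the Load call:

	printf("Thread: %d, value: %f\n", idx, next_lvl[idx]);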

I’m not sure I can be of any assistance. I’m not sure you understand what I was suggesting. I have no idea how to interpret whether your code is doing something correct or not, based on the output, and I don’t have time to try to reverse engineer your intent.

Since you are evidently on Windows, and some of your kernel launches are taking ~22s or more on my Tesla V100, I'll mention that in general you should be on the lookout for WDDM TDR timeouts if your GPU is in WDDM driver mode. Based on your descriptions so far, I doubt that is the issue here, but I'll mention it anyway.

Sorry, my latest code was meant to be a counterexample in the sense that it mimics my original code in terms of indexing, but, in contrast to my original code, BlockLoad works.

The main issue I am facing is that z_global is being accessed correctly, but BlockLoad fails to load its values (please see the second code snippet in my previous post). I don't know if using real as the template parameter could be causing any issues. It is defined as:

#pragma once

typedef double real;
#define C(x) x

//typedef float real;
//#define C(x) x##f

Other than this, there is no difference between my sandbox example and my original code.

I will read up on potential WDDM TDR timeouts.

EDIT: I edited my previous post to show additions to my sandbox example. In those outputs, you can see that the device and host printouts for identical tree elements are the same, i.e. my example works.

I will mark the discussion as solved, since, short of posting my whole code, I see no other way to describe the issue I am facing: the array is read from global memory correctly, but BlockLoad does not load it, even though my example does the exact same thing and, for some reason, works.

That early return (if (idx >= num_threads) return;) will eventually break. cub's block-wide algorithms usually require all threads in the block to execute the collective; however, in the last block of the grid some threads may have already exited before the call.
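
A safer pattern is to let every thread of the block reach the collective and guard only the per-thread work; BlockLoad also has guarded Load overloads that take the number of valid items and an out-of-bounds default. A minimal sketch of that pattern (illustrative names, not your actual kernel; each thread here still consumes ITEMS_PER_THREAD inputs and produces one output):

#include <cub/cub.cuh>

template <int BLOCK_SIZE, int ITEMS_PER_THREAD>
__global__ void load_guarded(const double* next_lvl, double* out, int num_threads)
{
    typedef cub::BlockLoad<double, BLOCK_SIZE, ITEMS_PER_THREAD> BlockLoad;
    __shared__ typename BlockLoad::TempStorage temp_storage;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Items this block is responsible for; the last block may get fewer.
    int block_offset = blockIdx.x * BLOCK_SIZE * ITEMS_PER_THREAD;
    int valid_items  = min(num_threads * ITEMS_PER_THREAD - block_offset,
                           BLOCK_SIZE * ITEMS_PER_THREAD);

    double local[ITEMS_PER_THREAD];

    // Guarded load: out-of-range items are filled with 0.0 instead of being
    // read. No thread has returned early, so the whole block participates.
    BlockLoad(temp_storage).Load(next_lvl + block_offset, local, valid_items, 0.0);
    __syncthreads();

    // Only in-range threads consume their items and write a result.
    if (idx < num_threads)
    {
        double sum = 0.0;
        for (int k = 0; k < ITEMS_PER_THREAD; k++)
            sum += local[k];
        out[idx] = sum;
    }
}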

Thank you for your reply. I have the if statement only for safety, but in fact, no threads ever return: num_threads is always equal to the length of the portion of the array I wish to process, by design.