Thrust remove_if memory corruption

While tracking down a problem in more complicated code that uses copy_if() with transform iterators, I reduced the test to a single call to remove_if(), which appears to exhibit the same memory corruption.

#include <thrust/remove.h>
#include <thrust/execution_policy.h>

struct T
{
	float m1[16];
	float m2[12];
	unsigned int i;
};

struct AlwaysFalse { __device__ bool operator()(const T&) const { return false; } };

T* start; // device allocated array of 512 items, aligned to 512 bytes
thrust::remove_if(thrust::device, start, start + 512, AlwaysFalse());

I get corrupted elements at the start of every 128-entry block, i.e. at array indices 0, 128, 256 and 384. The problem goes away in debug builds. I have also tried various alignments and paddings of struct T, to no avail.

Does anybody have an idea why this might be?

CUDA 9.2, VS2017 15.7.3, Win 7, GTX 1080Ti, Driver 398.11
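To pin down exactly which elements change, it can help to compare a host snapshot taken before the call with the data copied back afterwards. The following is a minimal host-side sketch, not part of Thrust: make_pattern and corrupted_indices are hypothetical helper names, and the byte-wise comparison assumes a type without padding, which holds for struct T as posted (116 bytes of float and unsigned int members).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Same members as struct T in the post (116 bytes, no padding).
struct T
{
    float m1[16];
    float m2[12];
    unsigned int i;
};

// Hypothetical helper: gives every element distinct, recognizable values so
// an element overwritten with another element's data still shows up as changed.
std::vector<T> make_pattern(std::size_t n)
{
    std::vector<T> v(n);
    for (std::size_t k = 0; k < n; ++k)
    {
        for (int j = 0; j < 16; ++j) v[k].m1[j] = static_cast<float>(k * 100 + j);
        for (int j = 0; j < 12; ++j) v[k].m2[j] = static_cast<float>(k * 100 + 16 + j);
        v[k].i = static_cast<unsigned int>(k);
    }
    return v;
}

// Hypothetical helper: returns the indices whose bytes differ between the
// snapshot and the copied-back result. Raw byte comparison is fine for T,
// which has no padding; for a padded type, compare members instead, since
// padding bytes need not be preserved by a copy.
std::vector<std::size_t> corrupted_indices(const std::vector<T>& before,
                                           const std::vector<T>& after)
{
    assert(before.size() == after.size());
    std::vector<std::size_t> bad;
    for (std::size_t k = 0; k < before.size(); ++k)
        if (std::memcmp(&before[k], &after[k], sizeof(T)) != 0)
            bad.push_back(k);
    return bad;
}
```

Usage would be: fill a host vector with make_pattern(512), copy it to the device with cudaMemcpy, run the remove_if call, copy the range back with cudaMemcpyDeviceToHost, then call corrupted_indices(snapshot, result). With AlwaysFalse nothing should be removed, so the result should be empty and remove_if should return start + 512.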

Stepping through on the host in a debug build (where the problem does not occur), I see the following:

copy_if.h

cuda_cub::copy_if::doit_step

tile_size = 128
num_tiles = 4
vshmem_size = 0
allocation_sizes[0] = 288

cuda_cub::copy_if::copy_if

storage_size = 1279
temp_storage_bytes = 767
num_items = 512
allocation_sizes[2] = {4, 767}
allocations[2] = {0x0000000b29000000, 0x0000000b29000100}
ptr = 0x0000000b29000000

cuda_cub::copy_if::doit_step

init_plan.block_threads = 128
init_plan.items_per_thread = 1
init_plan.items_per_tile = 128
init_plan.shared_memory_size = 0
init_plan.grid_size = 0
copy_if_plan.block_threads = 128
copy_if_plan.items_per_thread = 1
copy_if_plan.items_per_tile = 128
copy_if_plan.shared_memory_size = 16384
copy_if_plan.grid_size = 0
num_selected = 512

So the 128 items assigned to each block (tile_size = 128) seem suspiciously related to the period of the corruption in the array.
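That suspicion can be made concrete with a little index arithmetic. This is a minimal sketch, assuming each tile covers a contiguous run of items_per_tile elements (tile t handling indices t * 128 through t * 128 + 127); num_tiles, tile_of and offset_in_tile are hypothetical helper names, not CUB or Thrust functions. With the values from the dump, every corrupted index turns out to be the first element of its tile.

```cpp
#include <cstddef>

// Number of tiles needed to cover num_items (ceiling division).
constexpr std::size_t num_tiles(std::size_t num_items, std::size_t items_per_tile)
{
    return (num_items + items_per_tile - 1) / items_per_tile;
}

// Assumption: tile t covers indices [t * items_per_tile, (t + 1) * items_per_tile).
constexpr std::size_t tile_of(std::size_t index, std::size_t items_per_tile)
{
    return index / items_per_tile;
}

constexpr std::size_t offset_in_tile(std::size_t index, std::size_t items_per_tile)
{
    return index % items_per_tile;
}

// Values from the debug session: num_items = 512, items_per_tile = 128.
static_assert(num_tiles(512, 128) == 4, "matches num_tiles = 4 in the dump");
static_assert(tile_of(384, 128) == 3, "index 384 lies in the last tile");
static_assert(offset_in_tile(0, 128) == 0 && offset_in_tile(128, 128) == 0 &&
              offset_in_tile(256, 128) == 0 && offset_in_tile(384, 128) == 0,
              "every corrupted index is the first item of its tile");
```

As a side observation only: copy_if_plan.shared_memory_size / items_per_tile = 16384 / 128 = 128 bytes per item, which equals sizeof(Test) when V4 is 16-byte aligned (the struct shown later in the thread). Whether that shared memory actually stages one element per item is an assumption the dump does not confirm.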

You could also raise this issue on the thrust-users mailing list:

https://groups.google.com/forum/#!forum/thrust-users

Alternatively, file a bug report at https://developer.nvidia.com/ and be sure to attach your repro code (you need a registered developer account, which is free but requires approval).

Christian

OK, will do. The definition of struct T that I posted was wrong, and for some reason the real definition seems to be what triggers the issue. The problem manifests if the struct is:

	#ifdef __NVCC__
		#define S_ALIGN(n)	__align__(n)
	#else
		#define S_ALIGN(n)	__declspec(align(n))
	#endif


	struct S_ALIGN(16) V4 { float n[4]; };

	struct Test
	{
		V4 r[4];
		float k[4 * 3];
		int i;
	};

If I remove S_ALIGN(16) from the V4 definition, it works fine.

Since MSVC and NVCC now both seem to accept alignas(), I also tried:

	struct alignas(16) V4 { float n[4]; };

	struct Test
	{
		V4 r[4];
		float k[4 * 3];
		int i;
	};

This shows the same problem as well.
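To see what removing the alignment actually changes, the sketch below compares the layout of Test with and without it, using standard alignas(16) in place of S_ALIGN(16) (an assumption that the two behave the same here). V4Plain and TestPlain are hypothetical names for the unaligned variants, and the asserts assume a typical platform where float and int are 4 bytes with 4-byte alignment.

```cpp
#include <cstddef>

// V4 as posted, with standard alignas(16) standing in for S_ALIGN(16).
struct alignas(16) V4 { float n[4]; };

struct Test
{
    V4 r[4];
    float k[4 * 3];
    int i;
};

// Hypothetical unaligned variants (the case reported to work fine).
struct V4Plain { float n[4]; };

struct TestPlain
{
    V4Plain r[4];
    float k[4 * 3];
    int i;
};

// Members occupy 64 + 48 + 4 = 116 bytes at identical offsets in both versions.
static_assert(offsetof(Test, i) == 112, "i follows r and k");
static_assert(offsetof(TestPlain, i) == 112, "same member offsets");

// alignas(16) propagates to Test: its alignment becomes 16 and its size is
// padded from 116 up to the next multiple of 16.
static_assert(alignof(Test) == 16 && sizeof(Test) == 128, "aligned layout");
static_assert(alignof(TestPlain) == 4 && sizeof(TestPlain) == 116, "plain layout");
```

The member offsets are identical; the aligned version only gains a 16-byte alignment requirement and 12 bytes of tail padding. Struct T from the first post has the same 116-byte, 4-byte-aligned layout as TestPlain. That is consistent with the problem only appearing with the real definition, but it does not by itself show where the corruption comes from.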