Strange texture fetch behavior (release vs debug)

I recently added support for transparency in the pathtracer I’ve been working on.
The following piece of code handles transparency in the ray/triangle intersection routine.

CODE 1:

if ( alphaMasked() ) {
	// _device_texture_indices_texobj is a cudaTextureObject_t and resides in constant memory.
	const uint4 tex_idx = tex1Dfetch<uint4>( _device_texture_indices_texobj, tri_idx );
	// The fourth component of tex_idx is the index of the alpha mask image texture.
	// tex_objects is an array of cudaTextureObject_t referencing all image textures (diffuse, spec, bump, and alpha).
	const float alpha = tex2D<float4>( tex_objects[tex_idx.w], uv.x, 1.0f - uv.y ).x;
	if ( alpha != 1.0f ) { return false; }
}
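
For context, here is a minimal sketch of how a texture object like _device_texture_indices_texobj can be set up over a linear buffer of uint4 (one entry per triangle). The helper name and the omission of error checking are mine; the actual setup code may differ:

#include <cuda_runtime.h>

cudaTextureObject_t make_index_texobj( const uint4* d_indices, size_t num_tris )
{
	cudaResourceDesc res_desc = {};
	res_desc.resType                = cudaResourceTypeLinear;
	res_desc.res.linear.devPtr      = const_cast<uint4*>( d_indices );
	res_desc.res.linear.desc        = cudaCreateChannelDesc<uint4>();
	res_desc.res.linear.sizeInBytes = num_tris * sizeof( uint4 );

	cudaTextureDesc tex_desc = {};
	tex_desc.readMode = cudaReadModeElementType;

	cudaTextureObject_t texobj = 0;
	cudaCreateTextureObject( &texobj, &res_desc, &tex_desc, nullptr );
	return texobj;
}

The returned handle is then copied into a __constant__ variable (e.g., with cudaMemcpyToSymbol), which is what "resides in constant memory" means above.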

tex_idx contains indices for diffuse, specular, bump, and alpha image textures per triangle.
I made a simple tree leaf as a test case.
The leaf is modeled as a quad (two triangles) with a diffuse and an alpha texture only.

The tex_idx for both triangles looks like:
tex_idx.x = 1 (diffuse)
tex_idx.y = 0 (no specular)
tex_idx.z = 0 (no bump)
tex_idx.w = 2 (alpha)
For the leaf test case, tex_idx.x and tex_idx.w are guaranteed to be 1 and 2, respectively.
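
The host-side buffer behind those fetches therefore holds the same uint4 for both triangles; a sketch (the buffer name is illustrative):

// x = diffuse, y = specular, z = bump, w = alpha mask (0 = no texture)
const uint4 leaf_tex_indices[2] = {
	make_uint4( 1, 0, 0, 2 ),	// triangle 0
	make_uint4( 1, 0, 0, 2 )	// triangle 1
};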
The following is the correct rendering for the leaf test case:

https://drive.google.com/open?id=0B2Cl1FjxGIekVnpmaWdlc3RDbE0

CODE 1 (in debug build) produces the correct rendering.

CODE 1 (in release build) produces the following incorrect rendering:
https://drive.google.com/open?id=0B2Cl1FjxGIekWlJtYV9Ra1NFbzQ
(wrong transparency for the second triangle)

However, the above code works fine if I hard-code the index for the alpha mask (notice the change on the tex2D line):
tex_objects[tex_idx.w] is changed to tex_objects[2]
CODE 2:

if ( alphaMasked() ) {
	// _device_texture_indices_texobj is a cudaTextureObject_t and resides in constant memory.
	const uint4 tex_idx = tex1Dfetch<uint4>( _device_texture_indices_texobj, tri_idx );
	// The fourth component of tex_idx is the index of the alpha mask image texture.
	// tex_objects is an array of cudaTextureObject_t referencing all image textures (diffuse, spec, bump, and alpha).
	const float alpha = tex2D<float4>( tex_objects[2], uv.x, 1.0f - uv.y ).x;
	if ( alpha != 1.0f ) { return false; }
}

Here’s the interesting part. I added a simple check to print an error message if tex_idx.w is not 2. The error message is never printed, yet the incorrect rendering is still displayed.
CODE 3:

if ( alphaMasked() ) {
	// _device_texture_indices_texobj is a cudaTextureObject_t and resides in constant memory.
	const uint4 tex_idx = tex1Dfetch<uint4>( _device_texture_indices_texobj, tri_idx );
	if ( tex_idx.w != 2 ) printf( " ERROR tex_idx=%u\n", tex_idx.w );
	// The fourth component of tex_idx is the index of the alpha mask image texture.
	// tex_objects is an array of cudaTextureObject_t referencing all image textures (diffuse, spec, bump, and alpha).
	const float alpha = tex2D<float4>( tex_objects[2], uv.x, 1.0f - uv.y ).x;
	if ( alpha != 1.0f ) { return false; }
}

As I mentioned, the original CODE 1 produces the correct rendering in a debug build (where device code is compiled with -G and most optimizations are disabled). No shared memory is used anywhere in the code.

Even though shared memory is never used, the behavior I described above still points to a sync
issue somewhere, so for the heck of it I added a __syncthreads() to CODE 1, and it “fixed” the problem.
CODE 4:

if ( alphaMasked() ) {
	// _device_texture_indices_texobj is a cudaTextureObject_t and resides in constant memory.
	const uint4 tex_idx = tex1Dfetch<uint4>( _device_texture_indices_texobj, tri_idx );
	__syncthreads();
	// The fourth component of tex_idx is the index of the alpha mask image texture.
	// tex_objects is an array of cudaTextureObject_t referencing all image textures (diffuse, spec, bump, and alpha).
	const float alpha = tex2D<float4>( tex_objects[tex_idx.w], uv.x, 1.0f - uv.y ).x;
	if ( alpha != 1.0f ) { return false; }
}

I’m still baffled by why a sync is somehow fixing the issue. I don’t believe __syncthreads() is the correct solution here; for one thing, calling it inside a potentially divergent branch like alphaMasked() is itself undefined behavior.
Any ideas would be greatly appreciated.
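
For what it’s worth, one less intrusive experiment than a block-wide sync would be a compiler-only barrier. A sketch, assuming the root cause is optimizer code motion around the two fetches (the empty inline-asm statement with a "memory" clobber blocks reordering without any runtime synchronization):

if ( alphaMasked() ) {
	const uint4 tex_idx = tex1Dfetch<uint4>( _device_texture_indices_texobj, tri_idx );
	// Compiler-only barrier: no execution sync, just prevents code motion across this point.
	asm volatile( "" ::: "memory" );
	const float alpha = tex2D<float4>( tex_objects[tex_idx.w], uv.x, 1.0f - uv.y ).x;
	if ( alpha != 1.0f ) { return false; }
}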

The following is a description of my hardware and build environment:
GPU: NVIDIA Titan Z (Kepler), driver 378.66 (same issue with the driver that came with the CUDA 8 toolkit)
OS: Windows 10
CUDA: 8.0
IDE: Microsoft Visual Studio Community 2015, Version 14.0.25431.01 (Update 3)

Just in case someone else has encountered the same problem:

I submitted a bug report to the CUDA support team on March 22nd, 2017. The bug is fixed (I tested it) in the CUDA Toolkit 9.0 RC.