Strange problem fetching 1D float4 texture

Hi all,

I have a strange problem when fetching float4 from 1D texture…

I need to load four 3D coordinates (then 12 floats) which are compacted into three float4.
This works fine when loading from global memory but when loading from texture, data is wrong.

Some code will help you to understand:

texture<float4, cudaTextureType1D, cudaReadModeElementType> t_data;
HANDLE_ERROR( cudaMalloc( &devData, nbData * sizeof(float4) ) );
HANDLE_ERROR( cudaMemcpy( devData, hostData, nbData * sizeof(float4), cudaMemcpyHostToDevice ) );
HANDLE_ERROR( cudaBindTexture( 0, t_data, devData, cudaCreateChannelDesc<float4>(), nbData * sizeof(float4) ) );

Then in my kernel:

const float4 f0 = tex1Dfetch( t_data, id );
const float4 f1 = tex1Dfetch( t_data, id + 1 );
const float4 f2 = tex1Dfetch( t_data, id + 2 );

The problem is that f2.x differs from the data in global memory and is always equal to f2.w !!!
Moreover, it works fine when compiling in debug mode, it appears only in release mode.
The problem is the same using either texture references or texture objects…

I really don’t understand, I develop CUDA applications since several years and I have never encountered such a problem.

I’m using CUDA 7.0 with Nsight Visual Studio Edition 4.6.

Hoping someone could help me.

I notice you have both nbData and nb_data being used, but maybe that is intentional.

Anyway, I don’t see a problem with the code you have posted:

$ cat t760.cu
#include <stdio.h>
#include <vector_types.h>
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

texture<float4, cudaTextureType1D, cudaReadModeElementType> t_data;

__global__ void mykernel(){
  int id = threadIdx.x+blockDim.x*blockIdx.x;
  const float4 f0 = tex1Dfetch( t_data, id );
  const float4 f1 = tex1Dfetch( t_data, id + 1 );
  const float4 f2 = tex1Dfetch( t_data, id + 2 );
  printf("f0.x = %f\n", f0.x);
  printf("f0.y = %f\n", f0.y);
  printf("f0.z = %f\n", f0.z);
  printf("f0.w = %f\n", f0.w);
  printf("f1.x = %f\n", f1.x);
  printf("f1.y = %f\n", f1.y);
  printf("f1.z = %f\n", f1.z);
  printf("f1.w = %f\n", f1.w);
  printf("f2.x = %f\n", f2.x);
  printf("f2.y = %f\n", f2.y);
  printf("f2.z = %f\n", f2.z);
  printf("f2.w = %f\n", f2.w);

}

int main(){

float4 *devData;
  float  *hostData;
  const int nbData = 1000;
  hostData = (float *)malloc(nbData*sizeof(float4));
  for (int i = 0; i < nbData*4; i++) hostData[i] = i;
  cudaMalloc( &devData, nbData * sizeof(float4) ) ;
  cudaMemcpy( devData, hostData, nbData * sizeof(float4), cudaMemcpyHostToDevice ) ;
  cudaBindTexture( 0, t_data, devData, cudaCreateChannelDesc<float4>(), nbData * sizeof(float4) ) ;

  mykernel<<<1,1>>>();
  cudaDeviceSynchronize();
  cudaCheckErrors("some error");
  return 0;
}
$ nvcc -o t760 t760.cu
$ cuda-memcheck ./t760
========= CUDA-MEMCHECK
f0.x = 0.000000
f0.y = 1.000000
f0.z = 2.000000
f0.w = 3.000000
f1.x = 4.000000
f1.y = 5.000000
f1.z = 6.000000
f1.w = 7.000000
f2.x = 8.000000
f2.y = 9.000000
f2.z = 10.000000
f2.w = 11.000000
========= ERROR SUMMARY: 0 errors
$

In the future, I’d suggest providing a short, complete code (DEFINITION: a code that someone else could copy, paste, compile, and run, and see the issue, without having to add anything or change anything), just as I have done, to demonstrate your problem. Don’t dump your entire project here or give a link to a github repo. Reduce your code down to demonstrate just the issue. The reason for this is (at least) threefold:

  1. If you desire help from someone else, then respect their time, and minimize the amount of work they have to do to help you. Why should I have the burden to create a complete code around the little snippets you have shown?

  2. Frequently (as I suspect is the case here), when you distill your problem down into snippets that you haven’t actually tested yourself, you eliminate (or overlook) the actual issue in creating your snippets. This is quite understandable since you don’t actually know what the problem is, or you wouldn’t be asking for help. By providing a complete code that demonstrates the issue, that you have tested yourself, you minimize the possibility of falling into this trap. Furthermore, since you have usually demonstrated the issue in a single step, it makes the help process efficient; it’s frequently the case that in a single response, the person who is trying to help you can communicate the problem, and a proposed fix. Instead we now have this situation where we are going back and forth because your code actually works.

  3. Very often in the process of creating this minimal reproducer code, you discover the actual problem yourself, and don’t need to ask for help.

I frequently just skip past questions where people have not demonstrated the effort to provide a complete code. In my view they are not serious about wanting help, and are not serious about respecting other people’s time. Occasionally, I break that pattern, as I am doing here.

Thank you for your reply.
I totally understand what you mean. I really didn’t want to disrespect the members of the forum.

The problem is that I can’t give my code (it is protected by my company) and I don’t manage to reproduce the issue in a simpler kernel. When I code a test kernel (only reading the texture), it works…

I just wanted to know if someone had already encountered such a case.

Sorry for the bother.

Best regards.