Reading an R8G8B8A8 texture with tex2D() gives strange results.

I’m writing an application that processes textures created by DX11.
Here’s how I pass the texture to the program:

ID3D11Texture2D* input_image; //DXGI_FORMAT_R8G8B8A8_UNORM

cuGraphicsD3D11RegisterResource(&cudaResource, input_image, cudaGraphicsRegisterFlagsNone);

cuGraphicsMapResources(1, &cudaResource, 0);
cuGraphicsSubResourceGetMappedArray(&cuArray, cudaResource, 0, 0);

CUtexref cu_texref;
cuModuleGetTexRef(&cu_texref, m_Module, "inTex");

cuTexRefSetArray(cu_texref, cuArray, CU_TRSA_OVERRIDE_FORMAT);
cuTexRefSetFilterMode(cu_texref, CU_TR_FILTER_MODE_LINEAR);
cuTexRefSetAddressMode(cu_texref, 0, CU_TR_ADDRESS_MODE_WRAP);
cuTexRefSetAddressMode(cu_texref, 1, CU_TR_ADDRESS_MODE_WRAP);

cuTexRefSetFlags(cu_texref, CU_TRSF_NORMALIZED_COORDINATES);
cuTexRefSetFormat(cu_texref, CU_AD_FORMAT_UNSIGNED_INT8, 4);

cuParamSetTexRef(m_Function, CU_PARAM_TR_DEFAULT, cu_texref);

CUDA(9.1) program:
texture<uchar4, 2, cudaReadModeElementType> inTex;

__global__ void cuda_main(unsigned char* surface, int width, int height, size_t pitch)
{
int x = blockIdx.x*blockDim.x + threadIdx.x;
int y = blockIdx.y*blockDim.y + threadIdx.y;

uchar4 output_image = tex2D(inTex, ((float)x / (float)width), ((float)y / (float)height));
// do nothing and send it to the output

}
Everything works, but the output_image does not match the input_image. Look at the picture:

[image: the input texture next to the corrupted CUDA output]

It seems I have missed something, but what? Could someone help me?

Are you affiliated with the creators of the “Function of Reality” 3D BluRay, by chance?
Their name happens to be SanBase too.

About your problem: where do you get width, height, and pitch from? The code does not show that.

Is the source texture mipmapped or something? It looks a bit like the texture access accidentally reads from different mip levels instead of the original full-resolution texture.

There is this CUresourcetype enum which explicitly distinguishes non-mipmapped CUDA arrays from mipmapped ones. I am just not sure yet where in your code this might be applicable.

CU_RESOURCE_TYPE_ARRAY = 0x00 (array resource)
CU_RESOURCE_TYPE_MIPMAPPED_ARRAY = 0x01 (mipmapped array resource)
CU_RESOURCE_TYPE_LINEAR = 0x02 (linear resource)
CU_RESOURCE_TYPE_PITCH2D = 0x03 (pitch 2D resource)
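For what it's worth, that enum comes into play when you create a texture object via cuTexObjectCreate (your code binds a legacy texture reference instead, but it shows where the array vs. mipmapped-array distinction lives). A rough, untested sketch for the non-mipmapped case, assuming cuArray is your mapped array:

CUDA_RESOURCE_DESC resDesc;
memset(&resDesc, 0, sizeof(resDesc));
resDesc.resType = CU_RESOURCE_TYPE_ARRAY;   // or CU_RESOURCE_TYPE_MIPMAPPED_ARRAY with res.mipmap.hMipmappedArray
resDesc.res.array.hArray = cuArray;

CUDA_TEXTURE_DESC texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.addressMode[0] = CU_TR_ADDRESS_MODE_WRAP;
texDesc.addressMode[1] = CU_TR_ADDRESS_MODE_WRAP;
texDesc.filterMode = CU_TR_FILTER_MODE_LINEAR;
texDesc.flags = CU_TRSF_NORMALIZED_COORDINATES;

CUtexObject texObj = 0;
cuTexObjectCreate(&texObj, &resDesc, &texDesc, NULL);   // texObj would then be passed to the kernel as a parameter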

Yes, it’s me :)
I did not include all the source code (lots of text). These values are correct; the problem is not in the coordinates.

Yes, but it is not significant. Actually the code looks like:
if(mips > 1)
checkCudaErrors(cuGraphicsResourceGetMappedMipmappedArray(&cuMipArray, cudaResource));
else
checkCudaErrors(cuGraphicsSubResourceGetMappedArray(&cuArray, cudaResource, 0, 0));

if (mips > 1)
{
checkCudaErrors(cuTexRefSetMipmappedArray(cu_texref, cuMipArray, CU_TRSA_OVERRIDE_FORMAT));
checkCudaErrors(cuTexRefSetMipmapFilterMode(cu_texref, CU_TR_FILTER_MODE_LINEAR));
}
else
{
checkCudaErrors(cuTexRefSetArray(cu_texref, cuArray, CU_TRSA_OVERRIDE_FORMAT));
checkCudaErrors(cuTexRefSetFilterMode(cu_texref, CU_TR_FILTER_MODE_LINEAR));
}

There is a cuGraphicsResourceGetMappedMipmappedArray() API call you could try instead of the cuGraphicsSubResourceGetMappedArray() one…

If you’re still into Mandelbulbs and 3D fractals, check out the Netflix movie Annihilation. This one uses Mandelbulbs a lot in their special effects.

Here’s an article about the visual effects

There are 2D IFS fractals (like Scott Draves’ Electric Sheep) in the credits scene too…

Unfortunately the movie does not have the best storytelling and characters.

Okay, I am officially out of ideas. Maybe someone else could chime in?

Having a complete, compilable repro case of the broken code (even if stripped down a lot) would allow people to reproduce the problem and would increase the likelihood of getting a solution.

Christian

I do not want to discuss my artworks here, I have a more important problem :) I bought a Titan V and I want to calculate 3D fractals with double precision.
Once again: the problem is not in the coordinates or the format; it looks like a normalization error. It feels as if CUDA is trying to normalize the linear uchar data or something like that. Obviously some parameter needs to be added, but I do not know what I'm missing.
I hope one of the moderators will show up; they know CUDA thoroughly.

This is a community-based forum. I think cbuchner1 knows CUDA pretty well, probably better than me. I value cbuchner1's contributions to this forum, and I believe the advice you were given was sound.

Christian, I cannot provide the complete source code, because CUDA is only a small part of a huge project, but I can show all the CUDA parts:
Cuda.cpp: http://www.sanbasestudio.com/tmp/CUDA.cpp
CudaCore.h: http://www.sanbasestudio.com/tmp/CudaCore.h
test.cu: http://www.sanbasestudio.com/tmp/test.cu
and (just in case)
Preset.cpp: http://www.sanbasestudio.com/tmp/Preset.cpp // create and run cuda class.

It seems to me that the symptoms of the disease are quite typical, and a professional could easily cure the patient.

You’re welcome to wait for the professional to come along and cure the patient.

Regarding the request, nobody is asking you for your whole code. Nobody really wants to look at it all anyway, and it’s far better if you remove the unnecessary items.

What is being asked for is a self-contained, minimal, and complete code, that someone else could run, and see the issue, without having to add anything or change anything. I don’t think this is a bizarre request, as it is codified on at least one other similar community-based software development help forum, and discussed in several canonical articles on the web.

There is no doubt this requires effort on your part. But as already mentioned, if you prefer, you’re welcome to wait for the professional to come along and cure the patient.

I may consider myself a professional, because I make heavy use of CUDA at work and in some hobby and blockchain-related projects.

Right now there’s only half the patient (source code) at the doctor’s office. That leaves some guesswork, even for a professional :-)

Here’s a bit of reasoning why I still think we’re seeing mipmap data in the output.

- Except for some weird color bands in between, you're getting the expected grayscale values back. We see the lower half of the grayscale gradient from your original image - but repeated several times.

- The color artifacts in between might be uninitialized memory between mip levels (graphics APIs often prefer to store textures at power-of-two sizes, so whenever your source texture isn't a power of two, some padding gets applied - and for efficiency it is not necessarily zero-initialized, so random data may show up there).

Is it possible for you to upload some other data into the DirectX texture? Say, the well-known Lenna test image. It would be interesting to see whether it can be recognized in the output.

1024x1024 grayscale TIFF version here: https://www.cis.rit.edu/~cnspci/courses/common/images/lenna-1024x1024.tif

That is the first thought I had as well. Has this been eliminated conclusively as the source of the observation? I would suggest reviewing all dimensions, modes, and types used for texture binding and access. If you are not 100% sure about an item (e.g. a texture mode), experiment by trying different settings to see what they do. I find experimentation helpful because failed experiments add information, too.

Is proper CUDA error checking in place? A failed call to a texture-related API could lead to funny results downstream.
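For the driver API, a minimal check could look something like this (just a sketch; cuGetErrorString returns the message text for a CUresult):

// requires <cuda.h>, <stdio.h>, <stdlib.h>
#define CU_SAFE_CALL(call)                                            \
do {                                                                  \
    CUresult err = (call);                                            \
    if (err != CUDA_SUCCESS) {                                        \
        const char *msg = NULL;                                       \
        cuGetErrorString(err, &msg);                                  \
        fprintf(stderr, "CUDA driver error in '%s' line %i: %s\n",    \
                __FILE__, __LINE__, msg ? msg : "unknown");           \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)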

Actually I do not care about the source code; in any case I'm going to publish it on GitHub, but the project is very big (10 GB). I can make a compact sample, although it will take a lot of work. To begin with, the application is actually written on DX12, not on DX11.
But I'm surprised that no one has encountered this problem, although it looks pretty typical. I noticed that almost all CUDA examples work with the float format, and the Driver API is very rarely used. It feels as if integration with the existing graphics APIs is of little concern to anyone.

OK. Let’s move step by step:

  1. Concerning mipmap: I created the texture without mips - the same result. So it is not connected with mipmaps.
  2. One more texture:
    [image: another test texture and its output]
  3. “Lena”: http://www.sanbasestudio.com/tmp/bug4.jpg (I converted it from grayscale to RGBA before usage).

As you can see, if the value is greater than 127, everything is OK (almost, because there are artifacts around 255).
If the value is less than 127, we get something like frag(1.0/v).

Yes, it is. If you look at Cuda.cpp (see above), you can see that I use checkCudaErrors at every step.

You may find the following anecdote elucidating (or not :-). I was the very first user of the CUDA runtime API when it was first created by a colleague. Prior to that, only the driver API existed. This was at a time when only a handful of people inside NVIDIA knew of CUDA, I think mid 2006. Never ever thereafter did I use the driver API.

That impression is probably accurate. CUDA was created by people who believed in the utility of general-purpose computing on GPUs, including high-performance computing. The primary goal was not to create yet another way to deal with graphics (besides OpenGL and DirectX, now Vulkan). Over the years, some graphics stuff crept back into CUDA. Compute and graphics share physical resources, so they may as well be shared in a programmer-visible way. Personally, I am not fond of this development.

But the primary use of CUDA is for high-performance computing of one kind or another (including AI). Few CUDA users use, or are well-versed in, graphics interop, I would say. I think it is best if graphics-oriented applications first look at using compute shaders in the graphics APIs for their compute needs before deciding to use CUDA and graphics interop. Other people may have a different take on this.

I also need high-performance computing, but my application displays the result of such computing on the screen using the DX12 interface. Besides, my CUDA program should read DX12 textures. Unfortunately CUDA does not have DX12 interop, but DX12 has a DX11 wrapper and I use it, so my code looks pretty eclectic :) Why the driver API? I need to load and modify CUDA programs on the fly (like shaders), so the program cannot be hardcoded.
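Loading a program on the fly with the driver API is roughly this (simplified sketch, error checking omitted; ptxSource stands for the PTX text generated at run time):

// Sketch: load freshly built PTX and look up the kernel
CUmodule module = NULL;
CUfunction function = NULL;
cuModuleLoadData(&module, ptxSource);                 // ptxSource: null-terminated PTX string
cuModuleGetFunction(&function, module, "cuda_main");  // kernel must be extern "C" (or use the mangled name)
// ... bind texture references, set parameters, launch ...
cuModuleUnload(module);                               // when replacing the program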

OK, but now I need to solve the texture problem! Has really nobody seen an example of reading an RGBA texture through the driver API?

P.S.
Why do I need CUDA if I use DX12? Double precision!!! Only one reason… HLSL does not have functions that work with double.

I don’t use DirectX myself, but from an old thread on these forums I once copied the following information:

Does NVIDIA’s DirectX implementation set these feature flags to FALSE, or do you need additional operations not covered by the above? If so, which ones? All NVIDIA GPUs support double precision, although the throughput of double-precision operations is very low for most of them (relative to the throughput of single-precision operations), in particular on consumer GPUs. It stands to reason that NVIDIA would offer the double-precision operations specified by DirectX, but I don't know one way or the other.

On a texture reference, cudaReadModeElementType can only be used with point sampling. For bilinear interpolation, normalized coordinates must be used (which you do correctly) and you would use cudaReadModeNormalizedFloat; for an RGBA texture you then get a float4 in the range 0.0f to 1.0f.

What you are seeing may be the result of the texture interpolator's floating-point output being cast to, and interpreted as, an integer, leading to this bizarre color banding and overflow behavior.
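Concretely, two things might work (untested sketches). Either keep linear filtering and read normalized floats, i.e. in the .cu file:

texture<uchar4, 2, cudaReadModeNormalizedFloat> inTex;

// in the kernel: components come back as floats in [0,1]
float4 c = tex2D(inTex, (float)x / (float)width, (float)y / (float)height);
uchar4 output_image = make_uchar4((unsigned char)(c.x * 255.0f),
                                  (unsigned char)(c.y * 255.0f),
                                  (unsigned char)(c.z * 255.0f),
                                  (unsigned char)(c.w * 255.0f));

(the host-side flags can stay as they are: without CU_TRSF_READ_AS_INTEGER the driver API promotes the uchar data to [0,1] floats anyway). Or, if you don't need filtering, switch to point sampling and raw integer reads, keeping the uchar4 / cudaReadModeElementType declaration:

cuTexRefSetFilterMode(cu_texref, CU_TR_FILTER_MODE_POINT);
cuTexRefSetFlags(cu_texref, CU_TRSF_NORMALIZED_COORDINATES | CU_TRSF_READ_AS_INTEGER);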

Below, I post a simple example of 2D interpolation using the runtime API, in case it helps. Note the use of cudaReadModeNormalizedFloat mentioned by cbuchner1. The output of the program should look as follows:

reading array straight
    10.00000      20.00000      30.00000
    40.00000      50.00000      60.00000
    70.00000      80.00000      90.00000
   100.00000     110.00000     120.00000
reading array shifted 0.5 in x-direction
    15.00000      25.00000      30.00000
    45.00000      55.00000      60.00000
    75.00000      85.00000      90.00000
   105.00000     115.00000     120.00000
reading array shifted 0.5 in y-direction
    10.00000      20.00000      30.00000
    25.00000      35.00000      45.00000
    55.00000      65.00000      75.00000
    85.00000      95.00000     105.00000
#include <stdlib.h>
#include <stdio.h>

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaThreadSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString( err) );      \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

texture<unsigned char, 2, cudaReadModeNormalizedFloat> tex;

__global__ void kernel (int m, int n, float shift_x, float shift_y) 
{
    float val;
    for (int row = 0; row < m; row++) {
        for (int col = 0; col < n; col++) {
            val = 255.0 * tex2D (tex, col+0.5f+shift_x, row+0.5f+shift_y);
            printf ("%12.5f  ", val);
        }
        printf ("\n");
    }
}

int main (void)
{
    int m = 4; // height = #rows
    int n = 3; // width  = #columns
    size_t pitch, tex_ofs;
    unsigned char arr[4][3]= {{10,20,30},{40,50,60},{70,80,90},{100,110,120}};
    unsigned char *arr_d = 0;

    CUDA_SAFE_CALL(cudaMallocPitch((void**)&arr_d,&pitch,n*sizeof(*arr_d),m));
    CUDA_SAFE_CALL(cudaMemcpy2D(arr_d, pitch, arr, n*sizeof(arr[0][0]),
                                n*sizeof(arr[0][0]),m,cudaMemcpyHostToDevice));
    tex.normalized = false;
    tex.filterMode = cudaFilterModeLinear;
    CUDA_SAFE_CALL (cudaBindTexture2D (&tex_ofs, &tex, arr_d, &tex.channelDesc,
                                       n, m, pitch));
    if (tex_ofs !=0) {
        printf ("tex_ofs = %zu\n", tex_ofs);
        return EXIT_FAILURE;
    }
    printf ("reading array straight\n");
    kernel<<<1,1>>>(m, n, 0.0f, 0.0f);
    CHECK_LAUNCH_ERROR();
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    printf ("reading array shifted 0.5 in x-direction\n");
    kernel<<<1,1>>>(m, n, 0.5f, 0.0f);
    CHECK_LAUNCH_ERROR();
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    printf ("reading array shifted 0.5 in y-direction\n");
    kernel<<<1,1>>>(m, n, 0.0, -0.5f);
    CUDA_SAFE_CALL (cudaDeviceSynchronize());
    CUDA_SAFE_CALL (cudaFree (arr_d));
    return EXIT_SUCCESS;
}

Yes, first of all I need vector operations, such as dot(), cross(), and length(). Also pow(), exp(), log(), etc. Some trigonometric functions would not be superfluous either. So simple arithmetic is definitely not enough for me.
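All of that exists in double precision in CUDA device code; the vector helpers are trivial to write on top of the built-in double3 type, roughly like this (a sketch, the helper names are my own):

__device__ double dot3(double3 a, double3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

__device__ double3 cross3(double3 a, double3 b)
{
    return make_double3(a.y * b.z - a.z * b.y,
                        a.z * b.x - a.x * b.z,
                        a.x * b.y - a.y * b.x);
}

__device__ double length3(double3 a)
{
    return sqrt(dot3(a, a));   // pow(), exp(), log(), sin(), cos() etc. also have double overloads
}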

I know, therefore I bought the Titan V :)

Thank you! I'll try to disassemble this code tomorrow; today I'm tired like a dog… I already tested the runtime API and EVERYTHING WORKS CORRECTLY. So it is a peculiarity of the Driver API, and perhaps a result of my meager knowledge…

Yes, most likely you are right. It remains to understand how to fix this. I created a simple example, as you asked. Take a look, please:

http://sanbasestudio.com/tmp/DX11_CUDA_Test.zip

I hope you can force this application to work properly.

So, this is a bug in CUDA, isn’t it?
P.S.
Before you compile the sample, do not forget to change the path to the CUDA toolkit in Project->Directory. (In my case it is C:/CUDA/Tooolkit_9.1)