Library for conversion of CUDA datatypes?

Is there a way to generically convert a vector of one type to another that does the right thing with regard to coalesced reads/writes and bank conflicts?

For my case, I need to convert a uint8_t vector to float. I can’t find anything in Thrust, cuBLAS, or cuDNN that does this (cuDNN’s 1x1 convolutions on UINT8 need the number of feature channels to be a multiple of 4, so that route is out).

If not, does anyone know of a gist or git repo for this? I can’t find anything there either.

trove may be of interest to load/store certain numbers of elements per thread. Thereafter it should just be a casting operation.

https://github.com/bryancatanzaro/trove

I may be misinterpreting what you are asking for.

Close, but the T type in trove looks like it needs a size that is a multiple of 4 bytes.

I’m looking for a function that takes N elements of uint8_t and copies them into N elements of float on the GPU.

The only efficient way I can see is to have each thread read 128 bytes and store them into shared memory in an order like [0,32,64,96,1,33,65,97…], so that thread 0 can write element 0, thread 1 can write element 1, and so on, in coalesced writes. This seems like a nightmare…

I think trove can help a lot here, though it might not do everything.

The operation that you want to perform can be decomposed into several steps. It’s important to understand that this can all be handled at the warp level. To figure out the algorithm, don’t get confused into thinking that you have to deal with indexing beyond what a single warp handles.

  1. Do a coalesced load of uchar4, one per warp lane
  2. Convert each uchar4 to a set of 4 float values
  3. Perform a 4x32->32x4 transpose at the warp level
  4. Do 4 successive (coalesced) writes at the warp level, of float values

Steps 1 and 4 are where the concern about coalescing comes in. Coalescing is a warp-level concept, so think of everything at the warp level. With that in mind, steps 1, 2, and 4 should be straightforward.

Step 3 is the sticky part. I’m pretty sure you can do this with trove.

However, here is another write-up that may be illustrative:

https://stackoverflow.com/questions/53600410/basic-cuda-load-and-warp-transpose/53625367#53625367

Again, it may be that I still don’t understand what you are asking.

Also note that the efficiency loss associated with just doing the trivially simple per-thread:

  1. load uchar
  2. convert to float
  3. store float

might not be that big a deal, especially considering the complexity of the above approach. YMMV. You may want to do some preliminary benchmarking to see if the effort is worth it.

and now that I think about it, just doing this:

  1. load uchar4
  2. convert to float4
  3. store float4

is probably best. Doh! It depends on proper alignment, of course.

So I learned a bunch in the last 30 minutes… I checked, and my previous benchmark code was using “i += gridDim.x” instead of “i += blockDim.x * gridDim.x”; my dull end-of-week thinking had concluded the speed issue was coming from serialized reads… I forgot gridDim.x is just the number of blocks.

With that fixed, you’re right: definitely not worth the effort. It’s still faster to read uint8 and write float than to read float and write float, plus I gain the bandwidth savings from the uint8 memcpy from host to device.

Second, for some reason I didn’t realize the shfl intrinsics can take a different lane id for each thread; that’s really good to know. I’d only ever seen them used in the “for (int offset = 16; offset > 0; offset /= 2)” reduction pattern.

Thanks for the help and useful info!

Loading sub-word (i.e. smaller than 32-bit) items did have a noticeable performance impact with older GPU architectures. My memory is a bit hazy, but I want to say up to and including the Kepler architecture. Taking care of pointer alignment and then using uchar4 to load byte-sized data in groups of four could have a noticeable impact, as I recall from specific cases of compute kernels used by NPP.

A not-very-scientific comparison:

$ cat t1522.cu

__global__ void k1(const unsigned char * __restrict__ i, float * __restrict__ o, size_t n){

  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < n)
    o[idx] = (float)i[idx];
}

__global__ void k2(const unsigned char * __restrict__ i, float * __restrict__ o, size_t n){

  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < n){
    uchar4 temp1 = reinterpret_cast<const uchar4 *>(i)[idx];
    float4 temp2;
    temp2.x = (float)temp1.x;
    temp2.y = (float)temp1.y;
    temp2.z = (float)temp1.z;
    temp2.w = (float)temp1.w;
    reinterpret_cast<float4 *>(o)[idx] = temp2;}
}

int main(){

  const int n  = 1048576*32;
  const int n4 = n>>2;
  float *d_o;
  cudaMalloc(&d_o, n*sizeof(float));
  unsigned char *d_i;
  cudaMalloc(&d_i, n*sizeof(unsigned char));
  k1<<<n/256, 256>>>(d_i, d_o, n);
  k1<<<n/256, 256>>>(d_i, d_o, n);
  k2<<<n4/256, 256>>>(d_i, d_o, n4);
  k2<<<n4/256, 256>>>(d_i, d_o, n4);
  cudaDeviceSynchronize();
}
$ nvcc -o t1522 t1522.cu
$ nvprof ./t1522
==16890== NVPROF is profiling process 16890, command: ./t1522
==16890== Profiling application: ./t1522
==16890== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   52.55%  505.77us         2  252.89us  252.68us  253.10us  k1(unsigned char const *, float*, unsigned long)
                   47.45%  456.75us         2  228.37us  227.97us  228.77us  k2(unsigned char const *, float*, unsigned long)
      API calls:   96.68%  354.22ms         2  177.11ms  291.41us  353.93ms  cudaMalloc
                    1.48%  5.4063ms         4  1.3516ms  629.97us  3.3603ms  cuDeviceTotalMem
                    1.38%  5.0563ms       388  13.031us     333ns  534.58us  cuDeviceGetAttribute
                    0.26%  940.51us         1  940.51us  940.51us  940.51us  cudaDeviceSynchronize
                    0.16%  598.02us         4  149.50us  98.163us  274.39us  cuDeviceGetName
                    0.03%  110.85us         4  27.713us  11.288us  71.846us  cudaLaunchKernel
                    0.01%  26.361us         4  6.5900us  3.3390us  11.862us  cuDeviceGetPCIBusId
                    0.00%  7.8680us         8     983ns     425ns  1.5470us  cuDeviceGet
                    0.00%  3.6060us         3  1.2020us     362ns  1.7520us  cuDeviceGetCount
                    0.00%  2.6740us         4     668ns     502ns     914ns  cuDeviceGetUuid
$ CUDA_VISIBLE_DEVICES="1" nvprof ./t1522
==16903== NVPROF is profiling process 16903, command: ./t1522
==16903== Profiling application: ./t1522
==16903== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   58.40%  2.7465ms         2  1.3733ms  1.3670ms  1.3795ms  k1(unsigned char const *, float*, unsigned long)
                   41.60%  1.9568ms         2  978.39us  977.72us  979.07us  k2(unsigned char const *, float*, unsigned long)
      API calls:   96.67%  201.04ms         2  100.52ms  276.27us  200.77ms  cudaMalloc
                    2.26%  4.6958ms         1  4.6958ms  4.6958ms  4.6958ms  cudaDeviceSynchronize
                    0.58%  1.2089ms        97  12.462us     348ns  506.92us  cuDeviceGetAttribute
                    0.30%  621.83us         1  621.83us  621.83us  621.83us  cuDeviceTotalMem
                    0.13%  262.93us         4  65.731us  10.371us  222.85us  cudaLaunchKernel
                    0.05%  112.98us         1  112.98us  112.98us  112.98us  cuDeviceGetName
                    0.01%  13.191us         1  13.191us  13.191us  13.191us  cuDeviceGetPCIBusId
                    0.00%  6.3200us         3  2.1060us     411ns  4.1660us  cuDeviceGetCount
                    0.00%  2.4830us         2  1.2410us     500ns  1.9830us  cuDeviceGet
                    0.00%     825ns         1     825ns     825ns     825ns  cuDeviceGetUuid
$

Speedup of k2 over k1: ~10% on V100, ~30% on K20.

It was nagging me to leave this unwritten, so I added a __shfl_sync implementation to your “unscientific” tests. You’re right, it definitely wasn’t as bad as I thought it would be to implement. The length has to be a multiple of 128, though.

#include <iostream>

using namespace std;

#define FULL_MASK 0xffffffff

__global__ void k1(const unsigned char * __restrict__ i, float * __restrict__ o, size_t n){

  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < n)
    o[idx] = (float)i[idx];
}

__global__ void k2(const unsigned char * __restrict__ i, float * __restrict__ o, size_t n){

  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < n){
    uchar4 temp1 = reinterpret_cast<const uchar4 *>(i)[idx];
    float4 temp2;
    temp2.x = (float)temp1.x;
    temp2.y = (float)temp1.y;
    temp2.z = (float)temp1.z;
    temp2.w = (float)temp1.w;
    reinterpret_cast<float4 *>(o)[idx] = temp2;}
}

__global__ void k3(const unsigned char * __restrict__ d_in, float * __restrict__ d_out, size_t n){
    const int mask = 0xff;
    const int src = threadIdx.x >> 2;
    const int rem = threadIdx.x & 0x3;

    const int shift = 8*rem;
    const int idx = threadIdx.x + blockDim.x*blockIdx.x;

    const int v = reinterpret_cast<const int *>(d_in)[idx];

    #pragma unroll
    for (int i = 0; i < warpSize; i+=8) {
        const float o = (float)((__shfl_sync(FULL_MASK, v, i+src) >> shift) & mask);
        d_out[threadIdx.x + 4*i + 128*blockIdx.x] = o;
    }
}
 
int main() {
      const int n  = 1048576*32;
      const int n4 = n>>2;
      float *d_o;
      cudaMalloc(&d_o, n*sizeof(float));
      unsigned char *d_i;
      cudaMalloc(&d_i, n*sizeof(unsigned char));
      k1<<<n/256, 256>>>(d_i, d_o, n);
      k1<<<n/256, 256>>>(d_i, d_o, n);
      k2<<<n4/256, 256>>>(d_i, d_o, n4);
      k2<<<n4/256, 256>>>(d_i, d_o, n4);
      k3<<<n4/256, 256>>>(d_i, d_o, n4);
      k3<<<n4/256, 256>>>(d_i, d_o, n4);
      cudaDeviceSynchronize();
}

My output on a 2080ti:

==24203== NVPROF is profiling process 24203, command: ./a.out
==24203== Profiling application: ./a.out
==24203== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   45.98%  817.76us         2  408.88us  407.61us  410.14us  k1(unsigned char const *, float*, unsigned long)
                   38.63%  687.10us         2  343.55us  340.16us  346.94us  k2(unsigned char const *, float*, unsigned long)
                   15.39%  273.70us         2  136.85us  136.00us  137.70us  k3(unsigned char const *, float*, unsigned long)
      API calls:   97.61%  107.56ms         2  53.779ms  89.507us  107.47ms  cudaMalloc
                    1.60%  1.7636ms         1  1.7636ms  1.7636ms  1.7636ms  cudaDeviceSynchronize
                    0.40%  445.20us        97  4.5890us     164ns  244.20us  cuDeviceGetAttribute
                    0.31%  340.68us         1  340.68us  340.68us  340.68us  cuDeviceTotalMem
                    0.04%  45.473us         1  45.473us  45.473us  45.473us  cuDeviceGetName
                    0.03%  35.743us         6  5.9570us  3.2230us  16.227us  cudaLaunchKernel
                    0.00%  3.3290us         1  3.3290us  3.3290us  3.3290us  cuDeviceGetPCIBusId
                    0.00%  1.5670us         3     522ns     173ns  1.1620us  cuDeviceGetCount
                    0.00%  1.0510us         2     525ns     213ns     838ns  cuDeviceGet
                    0.00%     282ns         1     282ns     282ns     282ns  cuDeviceGetUuid

And my test prog:

#include <iostream>

using namespace std;

#define FULL_MASK 0xffffffff

const int mask = 0xff;

__global__ void kernel(const unsigned char* __restrict__ d_in, float* __restrict__ d_out) {
    const int src = threadIdx.x >> 2;
    const int rem = threadIdx.x & 0x3;

    const int shift = 8*rem;
    const int idx = threadIdx.x + blockDim.x*blockIdx.x;

    const int v = reinterpret_cast<const int *>(d_in)[idx];

    #pragma unroll
    for (int i = 0; i < warpSize; i+=8) {
        const float o = (float)((__shfl_sync(FULL_MASK, v, i+src) >> shift) & mask);
        d_out[threadIdx.x + 4*i + 128*blockIdx.x] = o;
    }
}

int main() {
    const int n = 32*4*2; //128 elems processed per warp
    unsigned char* h_in = (unsigned char*)malloc(n*sizeof(unsigned char));
    for (int i = 0; i < n; i++) {
        h_in[i] = (unsigned char)i;
    }
    unsigned char* d_in;
    cudaMalloc(&d_in, n*sizeof(unsigned char));
    cudaMemcpy(d_in, h_in, n*sizeof(unsigned char), cudaMemcpyHostToDevice);
    
    float* h_out = (float*)malloc(n*sizeof(float));
    float* d_out;
    cudaMalloc(&d_out, n*sizeof(float));
    
    kernel<<<2,32>>>(d_in,d_out);
    cudaDeviceSynchronize();

    cudaMemcpy(h_out, d_out, n*sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) {
        if (h_out[i] != (float)i) {
            cout << "wrong value!" << endl;
        }
    }
}

I’m not sure your code does what you think it does, or else I’m unsure of what you’re claiming, exactly.

If you’re suggesting that your code (k3) successfully processed 32M elements of data on 2080Ti in 136us when my code (k2) took 340us to do the same thing, I don’t believe you.

32MB of data are being loaded, and 128MB of data are being written, for a total of 160MB

160MB/0.000340s = 470GB/s, which is roughly consistent with the published 2080Ti main memory bandwidth (616GB/s peak theoretical).

160MB/0.000136s = 1.1TB/s, well beyond the published 2080Ti bandwidth.

Seems fishy.

You’re right… Good catch! I didn’t handle the case where the block size is greater than 32.

#include <iostream>

using namespace std;

#define FULL_MASK 0xffffffff

__global__ void k1(const unsigned char * __restrict__ i, float * __restrict__ o, size_t n){

  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < n)
    o[idx] = (float)i[idx];
}

__global__ void k2(const unsigned char * __restrict__ i, float * __restrict__ o, size_t n){

  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < n){
    uchar4 temp1 = reinterpret_cast<const uchar4 *>(i)[idx];
    float4 temp2;
    temp2.x = (float)temp1.x;
    temp2.y = (float)temp1.y;
    temp2.z = (float)temp1.z;
    temp2.w = (float)temp1.w;
    reinterpret_cast<float4 *>(o)[idx] = temp2;}
}

__global__ void k3(const unsigned char * __restrict__ d_in, float * __restrict__ d_out, size_t n){
    const int mask = 0xff;
    const int src = threadIdx.x >> 2;
    const int rem = threadIdx.x & 0x3;

    const int shift = 8*rem; 
    const int idx = threadIdx.x + blockDim.x*blockIdx.x;

    const int v = reinterpret_cast<const int *>(d_in)[idx];

    #pragma unroll
    for (int i = 0; i < warpSize; i+=8) {
        const float o = (float)((__shfl_sync(FULL_MASK, v, i+src) >> shift) & mask);
        d_out[threadIdx.x + 4*i + 128*blockIdx.x] = o;
    }
}
 
int main() {
      const int n  = 1048576*32;
      const int n4 = n>>2;
      float *d_o;
      cudaMalloc(&d_o, n*sizeof(float));
      unsigned char *d_i;
      cudaMalloc(&d_i, n*sizeof(unsigned char));
      k1<<<n/256, 256>>>(d_i, d_o, n);
      k1<<<n/256, 256>>>(d_i, d_o, n);
      k2<<<n4/256, 256>>>(d_i, d_o, n4);
      k2<<<n4/256, 256>>>(d_i, d_o, n4);
      k3<<<n4/32, 32>>>(d_i, d_o, n);
      k3<<<n4/32, 32>>>(d_i, d_o, n);
      cudaDeviceSynchronize();
}

Is now giving me:

==30294== NVPROF is profiling process 30294, command: ./a.out
==30294== Profiling application: ./a.out
==30294== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   37.40%  830.34us         2  415.17us  415.07us  415.26us  k1(unsigned char const *, float*, unsigned long)
                   32.72%  726.47us         2  363.23us  319.78us  406.69us  k3(unsigned char const *, float*, unsigned long)
                   29.88%  663.46us         2  331.73us  331.62us  331.84us  k2(unsigned char const *, float*, unsigned long)
      API calls:   97.44%  110.67ms         2  55.334ms  87.873us  110.58ms  cudaMalloc
                    1.94%  2.2052ms         1  2.2052ms  2.2052ms  2.2052ms  cudaDeviceSynchronize
                    0.28%  316.20us        97  3.2590us     148ns  138.51us  cuDeviceGetAttribute
                    0.27%  311.61us         1  311.61us  311.61us  311.61us  cuDeviceTotalMem
                    0.03%  38.335us         1  38.335us  38.335us  38.335us  cuDeviceGetName
                    0.03%  35.678us         6  5.9460us  3.1190us  16.400us  cudaLaunchKernel
                    0.00%  1.6400us         1  1.6400us  1.6400us  1.6400us  cuDeviceGetPCIBusId
                    0.00%  1.1880us         3     396ns     148ns     807ns  cuDeviceGetCount
                    0.00%     919ns         2     459ns     158ns     761ns  cuDeviceGet
                    0.00%     249ns         1     249ns     249ns     249ns  cuDeviceGetUuid

A good 12us speedup on the minimum, lol.