cudaMemset or cudaMemset2D set memory with float values

hi,

I would like to initialize a float array in device memory so that every element is 1.0f.

I’d like to do something like this :

float* myArray;
cudaMalloc((void**)&myArray, nbElemt * sizeof(float));
cudaMemset(myArray, 1.0f, nbElemt * sizeof(float)); // <- here I want to tell CUDA to set 1.0f as a float

Of course this doesn't work, since cudaMemset, like memset, sets individual bytes, not floats.

Is there an alternative way to do this?

I found a solution but it’s kinda ugly :

cudaMemset(data, 0, data_size);
char* pointer = (char*)data + 2;
cudaMemset2D(pointer, 4, 128, 1, data_size/4);  // third byte of each float = 0x80
pointer += 1;
cudaMemset2D(pointer, 4, 63, 1, data_size/4);   // fourth byte of each float = 0x3F

since 1.0f corresponds to the four bytes 00 00 80 3F in little-endian order: the first two bytes are 0, the third is 128 (0x80) and the last is 63 (0x3F).

is there a better way to do that ?

And also, is CUDA faster doing this: cudaMemset(data, 0, data_size);

or this: cudaMemset2D(data, 4, 0, 2, data_size/4);

since I don't need to set the last two bytes of each float to 0 (they get overwritten anyway)…

I noticed that Appendix C, section C.6.10, mentions cuMemsetD32:

CUresult cuMemsetD32(CUdeviceptr dstDevice, unsigned int value, unsigned int count);

which sets 32-bit values at a time, instead of the byte values of a normal cudaMemset().

Then if you use 'int __float_as_int(float);', section 4.4.3, to get your float into an int without ruining its bit pattern, then I think this is it:

cuMemsetD32(myArray, __float_as_int(1.0f), nbElemt);

HTH

Garry

PS - Health Warning: I haven’t used these myself

it seems that nvcc doesn’t like it at all

first error :

trying to compile this: cuMemsetD32(myArray, __float_as_int(1.0f), nbElemt);

second error:

trying to compile this :

int i = 0;
cudaMemsetD32(data, i, data_size);

I know that in both cases I am in a host function, but presumably cudaMemset is supposed to work there, no?

I think __float_as_int is a device function. Maybe something like

float f = 1.0f;
cuMemsetD32(myArray, reinterpret_cast<int&>(f), nbElemt);

would work. Or, in C (not C++), replace reinterpret_cast<int&>(f) with *(int*)&f.

HTH.

/Pyry

Benoit, I am sorry for wasting your time. I should have thought more carefully; it did not occur to me that they would be device-only. My fault.

My function is incredibly slow for big arrays,

I need to find a faster way to memset a float array with a given value from the host.

How come there is no equivalent of cudaMemset() for float?
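One workaround is to skip memset-style calls entirely and launch a small kernel that stores the value directly; a sketch (the kernel name and launch configuration here are my own, untested):

```cuda
// set n floats on the device to the value v
__global__ void fillKernel(float* a, int n, float v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = v;
}

// host-side wrapper
void fillFloat(float* dev_array, int n, float v)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fillKernel<<<blocks, threads>>>(dev_array, n, v);
}
```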

And also, the snippet doesn't work on the host:

extern "C"
void allOnes(float* dev_array, int size)
{
  union u_float_int
  {
    int i;
    float f;
  };
  u_float_int v;
  v.f = 1;
  printf("allones bits: 0x%08x\n", v.i);  // %f would be wrong here; v.i is an int
  cudaMemsetD32(dev_array, v.i, size);
}

I get a compiler error : line 241: error: identifier “cudaMemsetD32” is undefined

and as Garry mentioned, the prototype is this:

CUresult cuMemsetD32(CUdeviceptr dstDevice,unsigned int value, unsigned int count);

so can anybody tell me what’s wrong ?

Are you using cudaMemsetD32 instead of cuMemsetD32?

Was anybody actually able to solve this problem? I mean, using cudaMemset (i.e. a runtime API call) with a float or any data type other than int? The type-casting method known from C's memset doesn't seem to work.

Only 0.0f gives the expected result (which is actually not surprising at all).

Not sure if this is still of interest, but I had to battle this myself and finally figured it out, I think.

pyrtsa was right. You need to transfer your float value’s bit pattern into an unsigned int using

unsigned int val = *((unsigned int*)&float_val);

If you pass the float value directly to cuMemsetD32, it won’t compile because the function wants an unsigned int. If you cast the value, as in

float val = 1000;
cuMemsetD32(pointer, (unsigned int)val, bytes / 4);  // note: the third argument counts 32-bit values, not bytes

the compiler won’t complain, but you will change the bit pattern of the float value to represent 1000 as an unsigned int, which looks very different:

float value: 1.000000e+03

float bits: 01000100 01111010 00000000 00000000

cast value: 1000

cast bits: 00000000 00000000 00000011 11101000

cast as float value: 1.401298e-42

cast as float bits: 00000000 00000000 00000011 11101000

As you can see, the same bit pattern is a very different value depending on whether it’s interpreted as an unsigned int or as a float.

So if you allocated the GPU memory as type float, your device and host functions will interpret the memory as floats; but if you memset using a cast to unsigned int, you will set 32-bit memory locations to the unsigned int representations of your values, which are very different from the float representations, as hopefully shown above.

I was very puzzled about this pseudo code for a day or so:

float* gpu_d;
cudaMalloc((void**)&gpu_d, bytes);
float val = 1000;
// print the value I'm setting, as float and as cast unsigned int:
OUTPUT: 1000 and 1000

OK

// cast to satisfy the compiler
cuMemsetD32(gpu_d, (unsigned int)val, bytes / 4);
float* gpu_h = (float*)malloc(bytes);
// download gpu_d to gpu_h, sync everything…
// print values in gpu_h:
OUTPUT: 1.401298e-42

what??

All the confusion about cuMemsetD32 in the forums made me think it wasn't doing anything at all (1.4e-42 looks a lot like memory trash), or was doing it wrong, but this is not so. I had a similar story with JIT compilation. Either the documentation could be better, or I missed an important part of it.

Also, CUDA runtime 4 could have addressed this; I haven't checked. I'm currently using driver API version 4010 and trying to stay portable… good times…

Here’s the code I use to print bit patterns, if anyone wants to fool around:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define kDisplayWidth 32

/* write the binary representation of x into so, grouped into bytes */
char* pBinFill(unsigned int x, char* so, char fillChar)
{
    char s[kDisplayWidth + 1];
    int i = kDisplayWidth;
    s[i--] = '\0';                      /* terminate string */
    do {                                /* fill in array from right to left */
        s[i--] = (x & 1) ? '1' : '0';
        x >>= 1;                        /* shift right 1 bit */
    } while (x > 0);
    while (i >= 0) s[i--] = fillChar;   /* pad the rest with fillChar */
    /* insert a space before each group of 8 bits */
    for (i = 0; i < kDisplayWidth; i += 8)
        sprintf(so + i + i/8, " %s", s + i);
    return so;
}

int main(int argc, char** argv)
{
    if (argc < 2) {
        printf("need 1 float arg\n");
        return 0;
    }
    float val = atof(argv[1]);
    char buff[kDisplayWidth + 4 + 1];   /* +4 for spaces, +1 for the terminator */

    unsigned int ival = *((unsigned int*)&val);  /* reinterpret the float's bits */
    printf("float value: %e\nfloat bits: %s\n", val, pBinFill(ival, buff, '0'));

    unsigned int cast = (unsigned int)val;       /* convert the value */
    printf("cast value: %u\ncast bits: %s\n", cast, pBinFill(cast, buff, '0'));

    float badval = *((float*)&cast);             /* reinterpret the int's bits as float */
    unsigned int ibad = *((unsigned int*)&badval);
    printf("cast as float value: %e\ncast as float bits: %s\n", badval, pBinFill(ibad, buff, '0'));
    return 0;
}

Credits to mrwes’ post here http://stackoverflow.com/questions/111928/is-there-a-printf-converter-to-print-in-binary-format

I stole his pBinFill function.

Please note that type punning via pointer cast in the following invokes undefined behavior according to the C/C++ standards:

unsigned int val = *((unsigned int*)&float_val)

Sometimes this will happen to work as intended, but many times it will “fail”, based on real life experience. For device code, CUDA offers specific type re-interpretation functions for this purpose such as __int_as_float() and __float_as_int(). For C host code I would recommend the use of a volatile union as shown below. Note that this usage is also not sanctioned by the standards, but it appears to be safe in practice, meaning I have not seen it fail in 20+ years of use across diverse platforms:

volatile union {
    float f;
    unsigned int i;
} x;

x.f = 1.0f;
printf ("x = % 15.8e  (%08x)\n", x.f, x.i);