64-bit add.cc (among others)

Is there a 64-bit equivalent for add.cc.u32? Using add.cc.u64 produces an error: “error : Modifier ‘.cc’ applies only to 32-bit integer types”

This is with CUDA 6.0 and VS 2012. If there’s not a defined add.cc.u64 (or something similar), what would be a good way to get around this?

Are you trying to do extended-precision integer adds beyond 64 bits? If so, note that, at this time, ptxas converts 64-bit integer adds to a sequence of 32-bit integer operations anyway. So just do your extended-precision work 32 bits at a time.

njuffa is a wizard with these things. The following SO questions may be of interest:



Note that if you are somewhat performance-tolerant, you do not have to resort to PTX for this, as the last example above demonstrates.
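A minimal sketch of the PTX-free approach, assuming the big integer is stored as an array of 32-bit limbs with the least-significant limb first. The name add_limbs and the HD macro are hypothetical and are not taken from the linked answers. The carry is recovered by checking whether each unsigned addition wrapped around.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper macro so the same source builds with nvcc or a host compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Adds two n-limb numbers (least-significant limb first) into r.
// Returns the carry out of the top limb (0 or 1).
HD inline uint32_t add_limbs(uint32_t* r, const uint32_t* a,
                             const uint32_t* b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; ++i) {
        uint32_t s  = a[i] + b[i];   // wraps modulo 2^32
        uint32_t c1 = s < a[i];      // 1 if that addition wrapped
        uint32_t t  = s + carry;     // fold in the carry from the previous limb
        uint32_t c2 = t < s;         // 1 if adding the carry wrapped
        r[i] = t;
        carry = c1 | c2;             // the two wraps cannot both happen
    }
    return carry;
}
```

Calling add_limbs(r, a, b, 32) would cover the 32-limb layout described in this thread. Whether the compiler turns the comparisons into a hardware carry chain is not guaranteed, which is the performance trade-off mentioned above.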

I should have been a little more descriptive. I’m trying to do integer adds and subs with 64-bit variables (unsigned long long). I’m working on modifying code written by someone else who used explicit PTX assembly for 32-bit operations. The .cc operations are the only ones that have an issue with 64-bit.

The numbers are being passed in as a 32-bit array of length 32. For simplicity, I’d like to keep the array (for now), as I’m not sure how I could pass in, say, 128-bit numbers without a lot of juggling.

What I’d like to do is to convert those 32-bit operations directly to 64-bit, if possible. I’m sure at some point it’d be nice to extend it beyond the 64-bit limit, but for now, 64-bit would be sufficient.

I’ll dig into njuffa’s examples from the links you provided. Thanks!

I’m still trying to figure this one out.

If I have a 64-bit variable that’s passed in, how can I do, for instance add.cc? Would I need to split the 64-bit variable into 2 32-bit variables and then do two add.cc.u32 operations?

Maybe something like this: http://stackoverflow.com/questions/2810280/how-to-store-a-64-bit-integer-in-2-32-bit-integers-and-convert-back-again (?)

The function __internal_add128() in the CUDA header file math_functions_dbl_ptx3.h provides a worked example. There is no support for 64-bit operands with add.cc, so you would need to split such operands into 32-bit chunks using the mov.b64 instruction and operate on those, propagating the carry from the low chunk into the high one. There is no native 64-bit integer support in the GPU; all such operations exposed at the PTX level are emulated.
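The split-and-add idea can be sketched as follows. This is an illustration under stated assumptions, not the code from math_functions_dbl_ptx3.h: the function name add64_via_32 and the HD macro are hypothetical, the device branch uses mov.b64 together with add.cc.u32, addc.cc.u32 and addc.u32 from the PTX ISA, and only the portable host branch is exercised when the file is built with an ordinary C++ compiler.

```cpp
#include <stdint.h>

// Hypothetical helper macro so the same source builds with nvcc or a host compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Adds two 64-bit values using only 32-bit additions.
// *carry_out receives the carry out of bit 63 (0 or 1).
HD inline uint64_t add64_via_32(uint64_t a, uint64_t b, uint32_t* carry_out)
{
#if defined(__CUDA_ARCH__)
    // Device path: unpack each operand with mov.b64, chain the carry
    // through add.cc.u32 / addc.cc.u32, then read the final carry
    // flag into a register with addc.u32.
    uint64_t r;
    uint32_t cy;
    asm("{\n\t"
        ".reg .u32 alo, ahi, blo, bhi, rlo, rhi;\n\t"
        "mov.b64 {alo, ahi}, %2;\n\t"
        "mov.b64 {blo, bhi}, %3;\n\t"
        "add.cc.u32  rlo, alo, blo;\n\t"
        "addc.cc.u32 rhi, ahi, bhi;\n\t"
        "addc.u32    %1, 0, 0;\n\t"
        "mov.b64 %0, {rlo, rhi};\n\t"
        "}"
        : "=l"(r), "=r"(cy)
        : "l"(a), "l"(b));
    *carry_out = cy;
    return r;
#else
    // Host path: the same steps written with shifts and comparisons.
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);
    uint32_t rlo = alo + blo;
    uint32_t c0  = rlo < alo;   // carry out of the low half
    uint32_t t   = ahi + bhi;
    uint32_t c1  = t < ahi;     // high halves alone wrapped
    uint32_t rhi = t + c0;
    uint32_t c2  = rhi < t;     // adding the low carry wrapped
    *carry_out = c1 | c2;
    return ((uint64_t)rhi << 32) | rlo;
#endif
}
```

If the data already lives in an array of 32-bit words, the unpacking step disappears and only the carry chain remains.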

It would probably simplify your code if big integers were simply represented as arrays of unsigned int, because you will need to process the data in 32-bit chunks anyhow.
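Since the question also covers subtraction, here is a comparable sketch for arrays of unsigned int, again assuming least-significant limb first. The name sub_limbs and the HD macro are hypothetical. The borrow logic is a portable stand-in for what a sub.cc.u32 / subc.u32 chain would do in PTX.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper macro so the same source builds with nvcc or a host compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Computes r = a - b over n limbs (least-significant limb first).
// Returns the borrow out of the top limb: 1 if a < b, else 0.
HD inline uint32_t sub_limbs(uint32_t* r, const uint32_t* a,
                             const uint32_t* b, size_t n)
{
    uint32_t borrow = 0;
    for (size_t i = 0; i < n; ++i) {
        uint32_t d  = a[i] - b[i];   // wraps modulo 2^32
        uint32_t b1 = a[i] < b[i];   // 1 if that subtraction wrapped
        uint32_t e  = d - borrow;    // take out the borrow from the previous limb
        uint32_t b2 = d < borrow;    // 1 if removing the borrow wrapped
        r[i] = e;
        borrow = b1 | b2;            // the two wraps cannot both happen
    }
    return borrow;
}
```

A returned borrow of 1 means the result wrapped modulo 2^(32n), which modular arithmetic code typically corrects by adding the modulus back.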

Currently they are, but that introduces a limitation (at least based on the way the rest of the code is written) of numbers no larger than 2^1018-1. I’d like to try and extend past that. The first obvious thing to try is simply to double the size of the array so more data can go in. But then I run into issues with the processing of the data, like a “too many resources requested at launch” error.

(ECM_GPU_NB_DIGITS = 64 here)

__shared__ VOL digit_t b_temp_r[ECM_GPU_NB_DIGITS][1024/ECM_GPU_NB_DIGITS]; // [1024/ECM_GPU_NB_DIGITS][ECM_GPU_NB_DIGITS] ECM_GPU_CURVES_BY_BLOCK
__shared__ VOL carry_t b_cy[ECM_GPU_NB_DIGITS][1024/ECM_GPU_NB_DIGITS];

__shared__ VOL digit_t b_t[ECM_GPU_NB_DIGITS][1024/ECM_GPU_NB_DIGITS];
__shared__ VOL digit_t b_u[ECM_GPU_NB_DIGITS][1024/ECM_GPU_NB_DIGITS];
__shared__ VOL digit_t b_v[ECM_GPU_NB_DIGITS][1024/ECM_GPU_NB_DIGITS];
__shared__ VOL digit_t b_w[ECM_GPU_NB_DIGITS][1024/ECM_GPU_NB_DIGITS];

I’m pretty sure these variables are the culprits. By default, they end up being 32 by 32, and the kernel is launched with blocks of 32 by 32 threads. This is pretty much where I got stuck before.
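One way to narrow down a “too many resources requested” error is to tabulate what a block asks for and compare it with the per-block limits. The sketch below is host-side arithmetic only and rests on several assumptions that the thread does not confirm: the block is launched as ECM_GPU_NB_DIGITS by curves_by_block threads, digit_t and carry_t are 4 bytes each, and the limits are 1024 threads and 48 KB of shared memory per block (the usual defaults on compute capability 2.x and later). The names BlockBudget and block_budget are hypothetical.

```cpp
#include <stddef.h>

// Assumed per-block limits (compute capability 2.x and later, default config).
const size_t kMaxThreadsPerBlock = 1024;
const size_t kMaxSharedPerBlock  = 48 * 1024;

struct BlockBudget {
    size_t threads;       // threads per block
    size_t shared_bytes;  // static shared memory used by the six arrays
    bool   fits;          // true if both limits are respected
};

// nb_digits:       value of ECM_GPU_NB_DIGITS
// curves_by_block: assumed second block dimension
// digit_bytes:     assumed sizeof(digit_t)
// carry_bytes:     assumed sizeof(carry_t)
inline BlockBudget block_budget(size_t nb_digits, size_t curves_by_block,
                                size_t digit_bytes, size_t carry_bytes)
{
    // Each array is declared as [nb_digits][1024 / nb_digits].
    const size_t elems = nb_digits * (1024 / nb_digits);
    BlockBudget b;
    b.threads = nb_digits * curves_by_block;
    // Five digit_t arrays (b_temp_r, b_t, b_u, b_v, b_w) plus one carry_t array (b_cy).
    b.shared_bytes = elems * (5 * digit_bytes + carry_bytes);
    b.fits = b.threads <= kMaxThreadsPerBlock &&
             b.shared_bytes <= kMaxSharedPerBlock;
    return b;
}
```

Under these assumptions, block_budget(32, 32, 4, 4) gives 1024 threads and 24576 bytes of shared memory, which fits. block_budget(64, 32, 4, 4) gives the same 24576 bytes but 2048 threads, so the thread count rather than the shared arrays would be what exceeds the limit. Register use per thread is a further possible cause that this sketch does not model.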

Thanks for taking the time to try and help out here. Your depth of knowledge is very clear from reading your other answers.

I had asked initially if you are trying to do extended precision (beyond 64 bits). It seems that you are?

If your approach to extending the range of an existing code is to convert things from 32-bit to 64-bit, I’d recommend against that. At some point, the processing will have to be done 32 bits at a time.

I’d suggest instead understanding the issues you ran into earlier, such as “too many resources requested for launch”: learn what causes them and how to work around them.

facepalm

Thanks.

As for the extended precision, that would be down the line. I figured going to 64-bit (or, put another way, doubling the data I could pass into the kernel) would be a little simpler and allow me to learn more about CUDA programming, memory management and so on.

We’re speaking at cross purposes. You mentioned “numbers no larger than 2^1018-1.” I call that extended precision. It appears you are working on an “extended precision” code. I would leave the fundamental computation unit at 32 bits.

Ah, I see what you mean. I thought you were referring to using larger than 32-bit variables to represent the number. My mistake.