When will CUDA 8 be released? 1080 can't run with 7.5

I can’t run my program on a 1080 with CUDA 7.5, so is there a release schedule for CUDA 8.0? Thanks

Have you updated the drivers? It seems that CUDA programs should be compiled on the fly from PTX by the driver.

I updated the drivers but did not recompile my program. Do I need to recompile?

Did you include PTX in the binaries? You may also try to compile the CUDA samples with the provided makefiles, just to check that the problem is not on your side.

(I’m just a plain user, not an NVIDIA engineer.)

The PTX should be in the binaries (I am not sure). I am using driver 367.18 with CUDA 7.5 on CentOS 7.2. The CUDA samples run now, but my project still has errors. I have recompiled my code, but my kernel launches always return error code 8, which means the requested device function does not exist or is not compiled for the proper device architecture.
It’s weird.
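
(For reference, this is roughly how I check for the error; dummyKernel is just a placeholder here, not my real kernel.)

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; my real kernels fail the same way.
__global__ void dummyKernel() {}

int main() {
    dummyKernel<<<1, 1>>>();
    // cudaGetLastError reports launch failures such as error code 8
    // (cudaErrorInvalidDeviceFunction) when the binary contains no
    // device code or PTX usable on the installed GPU.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel launch failed: %d (%s)\n", (int)err, cudaGetErrorString(err));
    cudaDeviceSynchronize();
    return 0;
}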

sounds like you’re not compiling with PTX

@txbob Does this matter? I have recompiled all my code, and it works fine on older architectures. One more question: is it certain that the 1080 can work with CUDA 7.5 for now?

The only way code compiled with CUDA 7.5 could work on a Pascal device is if:

  1. You compile the code with embedded PTX
  2. You have an appropriate driver (such as those that would be suitable for Pascal products) that can JIT-compile your code to run correctly on Pascal.

The above mechanism is how your sample code/projects are able to run successfully on Pascal.

Yes, it matters.
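
As a quick sanity check (just a sketch, nothing specific to your code), you can confirm what compute capability the driver reports for the card, which is what the embedded PTX gets JIT-compiled for:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // A GTX 1080 should report compute capability 6.1 (Pascal); the driver
    // JIT-compiles embedded PTX from an earlier virtual architecture to run on it.
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}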

@txbob Thanks for the prompt reply. Any pointers on compiling with embedded PTX? I did a quick Google search but had no luck. Should I wait for CUDA 8 or keep trying to work it out?

It’s described in the nvcc manual:

http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architectures

Since you are running on Linux, it may be as simple as:

nvcc -arch=sm_52 <your additional compile command line here>
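
If I recall the nvcc docs correctly, that shorthand is roughly equivalent to:

nvcc -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52 <your additional compile command line here>

i.e. it embeds both SASS for sm_52 and PTX for compute_52, and the PTX is what the driver can JIT-compile for Pascal.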

@txbob I am using CMake, and I am afraid I’ve already done this:

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -s")
set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} --default-stream per-thread")

LIST(APPEND CUDA_OPTS -gencode arch=compute_20,code=sm_20)
LIST(APPEND CUDA_OPTS -gencode arch=compute_20,code=sm_21)
LIST(APPEND CUDA_OPTS -gencode arch=compute_30,code=sm_30)
LIST(APPEND CUDA_OPTS -gencode arch=compute_35,code=sm_35)
LIST(APPEND CUDA_OPTS -gencode arch=compute_50,code=sm_50)
LIST(APPEND CUDA_OPTS -gencode arch=compute_52,code=sm_52)
LIST(APPEND CUDA_OPTS --ptxas-options=-v)

...

       CUDA_ADD_LIBRARY(sharedlib ${project_LIB_SRCS} OPTIONS ${CUDA_OPTS} SHARED)
       TARGET_LINK_LIBRARIES(sharedlib ${project_LIBS})

...

Am I doing something wrong here?

Yes.

None of these:

LIST(APPEND CUDA_OPTS -gencode arch=compute_20,code=sm_20)
LIST(APPEND CUDA_OPTS -gencode arch=compute_20,code=sm_21)
LIST(APPEND CUDA_OPTS -gencode arch=compute_30,code=sm_30)
LIST(APPEND CUDA_OPTS -gencode arch=compute_35,code=sm_35)
LIST(APPEND CUDA_OPTS -gencode arch=compute_50,code=sm_50)
LIST(APPEND CUDA_OPTS -gencode arch=compute_52,code=sm_52)

produce PTX in the final binary.

So add this:

LIST(APPEND CUDA_OPTS -gencode arch=compute_52,code=compute_52)

to the end of the above list
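
If you want to confirm the PTX actually made it into the binary, something like this should list an embedded PTX entry (adjust the name to your actual output file; libsharedlib.so is just a guess based on your target name):

cuobjdump --list-ptx libsharedlib.so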

@txbob Thanks a lot, my code works now. But it looks like the code does not run as fast as it should. It’s a multi-threaded server-side program, and the GPU utilization is not stable; sometimes it’s 0% or less than 20%. Is this normal? My program runs much faster on a 980 Ti.

I can’t really comment on code I haven’t seen.

If your program runs faster on a GTX 980 Ti than on a GTX 1080, then I would do two things:

  1. Wait for CUDA 8 RC to come out (should be soon), which I think will be able to compile directly for cc 6.1 (arch=compute_61,code=sm_61), which is what a GTX 1080 is. Then re-test.

  2. If your code produces the correct result but still runs more slowly after step 1, I would file a bug.

It would facilitate the process if you could produce a test version of your code that is not multi-threaded. This should be more-or-less orthogonal to CUDA performance anyway, so presumably you can demonstrate a single-threaded version that also displays the same disparity as the multi-threaded version.
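
If it helps, a bare-bones single-threaded timing harness along these lines (yourKernel and the launch configuration are placeholders for your real workload) is usually enough to show the disparity between the two cards:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real workload.
__global__ void yourKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so JIT/initialization cost is excluded from the timing.
    yourKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    yourKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaFree(d_data);
    return 0;
}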