Features:

  • C/C++ compiler
  • Visual Profiler
  • GPU-accelerated BLAS library
  • GPU-accelerated FFT library
  • GPU-accelerated Sparse Matrix library
  • GPU-accelerated RNG library
  • Additional tools and documentation

Highlights:

  • Easier Application Porting
    • Share GPUs across multiple threads
    • Employ all GPUs in the system concurrently from a single host thread
    • No-copy pinning of system memory, a faster alternative to cudaMallocHost()
    • C++ new/delete and support for virtual functions
    • Support for inline PTX assembly
    • Thrust library of templated performance primitives such as sort, reduce, etc.
    • NVIDIA Performance Primitives (NPP) library for image/video processing
    • Layered Textures for working with same size/format textures at larger sizes and higher performance
  • Faster Multi-GPU Programming
    • Unified Virtual Addressing
    • GPUDirect v2.0 support for Peer-to-Peer Communication
  • New & Improved Developer Tools
    • Automatic Performance Analysis in Visual Profiler
    • C++ debugging in CUDA-GDB for Linux and MacOS
    • GPU binary disassembler for Fermi architecture (cuobjdump)
    • Parallel Nsight 2.0 now available for Windows developers with new debugging and profiling features.

What's New:

This section summarizes the changes in CUDA 11.2.1 (11.2 Update 1) since the 11.2.0 GA release.

CUDA Compiler

Resolved Issues:

  • Previously, when using recent versions of the VS 2019 host compiler, a call to pow(double, int) or pow(float, int) in host or device code sometimes caused build failures. This issue has been resolved.
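
For reference, the affected pattern is an ordinary mixed-type call to pow. The host-only sketch below shows the two overload shapes named in the note; the wrapper function names are illustrative and do not come from the release notes, and the sketch does not reproduce the build failure itself.

```cpp
#include <cmath>

// Hypothetical wrappers around the two call shapes mentioned in the note.
// In standard C++11 and later host code, an integer argument is promoted
// to double, so both calls return double.
double pow_double_int(double base, int exponent) {
    return std::pow(base, exponent);  // pow(double, int)
}

double pow_float_int(float base, int exponent) {
    return std::pow(base, exponent);  // pow(float, int)
}
```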

cuSOLVER

New Features:

  • New singular value decomposition (GESVDR) is added. GESVDR computes a partial spectrum with random sampling, an order of magnitude faster than GESVD.
  • libcusolver.so no longer links libcublas_static.a; instead, it depends on libcublas.so. This reduces the binary size of libcusolver.so. However, it breaks backward compatibility. The user has to link libcusolver.so with the correct version of libcublas.so.

cuSPARSE

New Features:

  • New Tensor Core-accelerated Block Sparse Matrix - Matrix Multiplication (cusparseSpMM) and introduction of the Blocked-Ellpack storage format.
  • New algorithms for CSR/COO Sparse Matrix - Vector Multiplication (cusparseSpMV) with better performance.
  • Extended functionalities for cusparseSpMV:
    • Support for the CSC format.
    • Support for regular/complex bfloat16 data types for both uniform and mixed-precision computation.
    • Support for mixed regular-complex data type computation.
    • Support for deterministic and non-deterministic computation.
  • New algorithm (CUSPARSE_SPMM_CSR_ALG3) for Sparse Matrix - Matrix Multiplication (cusparseSpMM) with better performance, especially for small matrices.
  • New routine for Sampled Dense Matrix - Dense Matrix Multiplication (cusparseSDDMM), which deprecates cusparseConstrainedGeMM and provides better performance.
  • Better accuracy of cusparseAxpby, cusparseRot, cusparseSpVV for bfloat16 and half regular/complex data types.
  • All routines support NVTX annotation for enhancing the profiler time line on complex applications.
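
As a reminder of what cusparseSpMV computes, the operation is y = alpha * A * x + beta * y with A stored in a sparse format. The block below is a CPU reference sketch for the CSR case only. It is not the cuSPARSE API: the struct and function names are hypothetical, and it uses double precision and a non-transposed A for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR container: row_offsets has rows + 1 entries,
// col_indices and values have one entry per stored nonzero.
struct CsrMatrix {
    int rows;
    int cols;
    std::vector<int> row_offsets;
    std::vector<int> col_indices;
    std::vector<double> values;
};

// CPU reference for y = alpha * A * x + beta * y (A in CSR, not transposed).
void spmv_csr_reference(double alpha, const CsrMatrix& a,
                        const std::vector<double>& x,
                        double beta, std::vector<double>& y) {
    assert(a.row_offsets.size() == static_cast<std::size_t>(a.rows) + 1);
    assert(x.size() == static_cast<std::size_t>(a.cols));
    assert(y.size() == static_cast<std::size_t>(a.rows));
    for (int row = 0; row < a.rows; ++row) {
        double sum = 0.0;
        // Walk the stored nonzeros of this row.
        for (int k = a.row_offsets[row]; k < a.row_offsets[row + 1]; ++k) {
            sum += a.values[k] * x[a.col_indices[k]];
        }
        y[row] = alpha * sum + beta * y[row];
    }
}
```

A reference like this can be handy for checking GPU results on small inputs, including the zero-size case, where the loops simply do not execute.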

Deprecations:

  • cusparseConstrainedGeMM has been deprecated in favor of cusparseSDDMM.
  • cusparseCsrmvEx has been deprecated in favor of cusparseSpMV.
  • COO Array of Structure (CooAoS) format has been deprecated, including cusparseCreateCooAoS, cusparseCooAoSGet, and its support for cusparseSpMV.
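
For readers moving from cusparseConstrainedGeMM to cusparseSDDMM, the sampled product evaluates alpha * (A * B) only at the positions already stored in the sparse output C, then adds beta * C. The following is a CPU reference sketch of that operation under simplifying assumptions (row-major dense inputs, no transposes, double precision). The type and function names are hypothetical and are not cuSPARSE identifiers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR container for the sparse output C.
struct SparseCsr {
    int rows;
    int cols;
    std::vector<int> row_offsets;
    std::vector<int> col_indices;
    std::vector<double> values;
};

// CPU reference: for every stored position (i, j) of C,
//   C(i, j) = alpha * dot(A(i, :), B(:, j)) + beta * C(i, j).
// a is rows x inner and b is inner x cols, both dense and row-major.
void sddmm_reference(double alpha, const std::vector<double>& a,
                     const std::vector<double>& b, int inner,
                     double beta, SparseCsr& c) {
    assert(a.size() == static_cast<std::size_t>(c.rows) * inner);
    assert(b.size() == static_cast<std::size_t>(inner) * c.cols);
    for (int row = 0; row < c.rows; ++row) {
        for (int k = c.row_offsets[row]; k < c.row_offsets[row + 1]; ++k) {
            const int col = c.col_indices[k];
            double dot = 0.0;
            for (int p = 0; p < inner; ++p) {
                dot += a[row * inner + p] * b[p * c.cols + col];
            }
            c.values[k] = alpha * dot + beta * c.values[k];
        }
    }
}
```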

Known Issues:

  • cusparseDestroySpVec, cusparseDestroyDnVec, cusparseDestroySpMat, cusparseDestroyDnMat, cusparseDestroy with a NULL argument could cause a segmentation fault on Windows.

Resolved Issues:

  • cusparseAxpby, cusparseGather, cusparseScatter, cusparseRot, cusparseSpVV, cusparseSpMV now support zero-size matrices.
  • cusparseCsr2cscEx2 now correctly handles empty matrices (nnz = 0).
  • cusparseXcsr2csr_compress now uses the two-norm for the comparison of complex values instead of only the real part.
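
The practical effect of the cusparseXcsr2csr_compress change shows up for entries whose real part is small but whose imaginary part is not. The sketch below contrasts the two comparison rules on the CPU. The function names are hypothetical, and the strict greater-than comparison against the tolerance is an assumption; the release note does not state the exact inequality.

```cpp
#include <cmath>
#include <complex>

// Rule described as the old behavior: only the real part is compared
// against the tolerance (absolute value and strict '>' are assumptions).
bool keep_by_real_part(std::complex<double> value, double tol) {
    return std::abs(value.real()) > tol;
}

// Rule described as the new behavior: the two-norm (magnitude) of the
// complex value is compared against the tolerance.
bool keep_by_two_norm(std::complex<double> value, double tol) {
    return std::abs(value) > tol;
}
```

With a tolerance of 0.5, a purely imaginary entry such as 0 + 1i is dropped under the real-part rule but kept under the two-norm rule.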

NPP

New Features:

  • New APIs added to compute Distance Transform using Parallel Banding Algorithm (PBA):
    • nppiDistanceTransformPBA_xxxxx_C1R_Ctx() – where xxxxx specifies the input and output combination: 8u16u, 8s16u, 16u16u, 16s16u, 8u32f, 8s32f, 16u32f, 16s32f
    • nppiSignedDistanceTransformPBA_32f_C1R_Ctx()
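
A distance transform assigns each pixel the distance to its nearest "site" pixel. The sketch below is a brute-force CPU version of the unsigned Euclidean case, useful only as a mental model or a small-image cross-check. It is not the Parallel Banding Algorithm and not the NPP API: the function name is hypothetical, and treating nonzero pixels as sites is an assumption, since the note does not say how NPP selects sites or scales its output.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Brute-force Euclidean distance transform on a row-major image.
// Assumption: a nonzero input pixel is a site. Pixels get infinity
// when the image contains no site at all.
std::vector<float> distance_transform_bruteforce(
        const std::vector<unsigned char>& sites, int width, int height) {
    assert(sites.size() == static_cast<std::size_t>(width) * height);
    std::vector<float> out(sites.size(),
                           std::numeric_limits<float>::infinity());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float best = std::numeric_limits<float>::infinity();
            // Compare against every site in the image: O(N^2) overall.
            for (int sy = 0; sy < height; ++sy) {
                for (int sx = 0; sx < width; ++sx) {
                    if (sites[sy * width + sx] != 0) {
                        const float dx = static_cast<float>(x - sx);
                        const float dy = static_cast<float>(y - sy);
                        best = std::min(best, std::sqrt(dx * dx + dy * dy));
                    }
                }
            }
            out[y * width + x] = best;
        }
    }
    return out;
}
```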

Resolved Issues:

  • Fixed the issue in which Label Markers adds zero pixel as object region.

nvJPEG

New Features:

  • nvJPEG decoder added a new API to support region of interest (ROI) based decoding for batched hardware decoder:
    • nvjpegDecodeBatchedEx()
    • nvjpegDecodeBatchedSupportedEx()

cuFFT

Known Issues:

  • cuFFT planning and plan estimation functions may not restore the correct context, affecting CUDA driver API applications.
  • Plans with strides, primes larger than 127 in FFT size decomposition, and total size of transform including strides bigger than 32GB produce incorrect results.
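
To see whether a given transform length falls into the "primes larger than 127" category, it is enough to look at its largest prime factor. The helper below is a plain host-side sketch with hypothetical function names; it checks only that one condition, not the stride or the 32GB total-size conditions that the known issue also requires.

```cpp
#include <cstdint>

// Largest prime factor of n by trial division (returns 1 for n < 2).
std::uint64_t largest_prime_factor(std::uint64_t n) {
    std::uint64_t largest = 1;
    for (std::uint64_t p = 2; p * p <= n; ++p) {
        while (n % p == 0) {
            largest = p;
            n /= p;
        }
    }
    if (n > 1) {
        largest = n;  // whatever remains is itself a prime factor
    }
    return largest;
}

// True when the FFT size decomposes into at least one prime above 127.
bool has_prime_factor_above_127(std::uint64_t fft_size) {
    return largest_prime_factor(fft_size) > 127;
}
```

For example, a length of 1024 has 2 as its only prime factor, while 1048 = 8 * 131 contains the prime 131 and would meet this part of the condition.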

Resolved Issues:

  • Previously, reduced performance of power-of-2 single precision FFTs was observed on GPUs with sm_86 architecture. This issue has been resolved.
  • Large prime factors in size decomposition and real to complex or complex to real FFT type no longer cause cuFFT plan functions to fail.

CUPTI

Deprecations early notice:

The following functions are scheduled to be deprecated in 11.3 and will be removed in a future release:

  • NVPW_MetricsContext_RunScript and NVPW_MetricsContext_ExecScript_Begin from the header nvperf_host.h.
  • cuptiDeviceGetTimestamp from the header cupti_events.h

Complete release notes can be found here.