# NVIDIA Resources
For topics you wish to explore further, start with the official documentation, such as the CUDA C++ Programming Guide if you're aiming to become a "CUDA Ninja". Official documentation can be hard to digest, though; in that case, a Google search for the topic together with "NVIDIA Blog" or "GTC talk" often turns up excellent presentations and discussions. Prioritize blogs and talks from recent years, as they are more likely to cover the latest technologies.
The following are reference materials for further study.
## NVIDIA's Latest Hardware
- GB200 NVL72
- Grace Hopper & NVLink-C2C & Memory Coherent Architecture
- GPUDirect Storage & RDMA
- DGX SuperPOD
## CUDA & Optimization
- CUDA C++ Programming Guide
- [GitHub] NVIDIA/cuda-samples
- Coalesced Memory Access
- Shared Memory Bank Conflict
- CUDA Streams
- [Blog] How to Optimize Data Transfers in CUDA C/C++
- [Blog] How to Overlap Data Transfers in CUDA C/C++
- [Blog] GPU Pro Tip: CUDA 7 Streams Simplify Concurrency
- [Blog] Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 1
- [Blog] Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 2
- 9.2.1.4. Streams and Events | CUDA C++ Programming Guide
- 11. Stream Ordered Memory Allocator | CUDA C++ Programming Guide
- Atomics
- Vectorized Memory Access
- Warp-Level Primitives
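Several of the links above (CUDA streams, pinned memory, overlapping transfers) describe the same core pattern. As a minimal sketch of that pattern, not a tuned implementation: split the data into chunks, give each chunk its own stream, and let copies and kernels from different streams overlap. Buffer sizes and the `scale` kernel here are illustrative; error checking is omitted.

```cuda
// Sketch: overlapping H2D copies, kernel work, and D2H copies across
// CUDA streams. Pinned (page-locked) host memory is required for
// cudaMemcpyAsync to overlap with computation.
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int nStreams = 4, chunk = 1 << 20, n = nStreams * chunk;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));  // pinned host buffer
    cudaMalloc(&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream handles one chunk: copy in, compute, copy out.
    // Work queued in different streams may overlap on the hardware.
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk, 2.0f);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

The blog posts "How to Optimize/Overlap Data Transfers in CUDA C/C++" linked above walk through exactly why the pinned allocation and per-stream chunking matter, and show the resulting timelines in the profiler.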
## Advanced CUDA
- Fat Binary & PTX (Parallel Thread Execution) & SASS (Streaming Assembly)
- Cooperative Groups (CG)
- LDGSTS
Requires the Ampere architecture or later.
- [Blog] NVIDIA Ampere Architecture In-Depth
- [Blog] Controlling Data Movement to Boost Performance on the NVIDIA Ampere Architecture
- [Blog] Boosting Application Performance with GPU Memory Prefetching
- 7.26. Asynchronous Barrier | CUDA C++ Programming Guide
- 7.27. Asynchronous Data Copies | CUDA C++ Programming Guide
- 7.28. Asynchronous Data Copies using cuda::pipeline | CUDA C++ Programming Guide
- Hopper Architecture & Thread Block Cluster & Distributed Shared Memory (DSMEM) & Tensor Memory Accelerator (TMA)
Requires the Hopper architecture or later.
- CUDA Graphs
- [Blog] Getting Started with CUDA Graphs
- [Blog] Employing CUDA Graphs in a Dynamic Environment
- [Blog] Constructing CUDA Graphs with Dynamic Parameters
- [Blog] Enabling Dynamic Control Flow in CUDA Graphs with Device Graph Launch
- [Blog] Dynamic Control Flow in CUDA Graphs with Conditional Nodes
- [Blog] Optimizing llama.cpp AI Inference with CUDA Graphs
- 3.2.8.7. CUDA Graphs | CUDA C++ Programming Guide
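The simplest way into CUDA Graphs, covered in "Getting Started with CUDA Graphs" above, is stream capture: record a repeated launch sequence once, then replay it with a single `cudaGraphLaunch` per iteration to amortize launch overhead. A minimal sketch, assuming CUDA 12+ for the three-argument `cudaGraphInstantiate` signature; the `addOne` kernel and iteration counts are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void addOne(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture a short sequence of kernel launches into a graph once...
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 10; ++i)
        addOne<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12+ signature

    // ...then replay it many times with one launch call per iteration,
    // amortizing per-kernel launch overhead.
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

The "Dynamic Parameters" and "Conditional Nodes" posts linked above pick up where this leaves off: updating an instantiated graph in place instead of re-capturing it.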
## NVIDIA HPC SDK
- ISO C++/Fortran, OpenACC/OpenMP, CUDA
- Python
- [Blog] RAPIDS cuDF Accelerates pandas Nearly 150x with Zero Code Changes
- [Blog] Faster HDBSCAN Soft Clustering with RAPIDS cuML
- [Blog] Unifying the CUDA Python Ecosystem
- [Blog] Effortlessly Scale NumPy from Laptops to Supercomputers with NVIDIA cuPyNumeric
- [Blog] Creating Differentiable Graphics and Physics Simulation in Python with NVIDIA Warp
- NVIDIA CUDA-X Libraries
- CUDA C++ Core Libraries and stdexec
- NVSHMEM
- NVIDIA Collective Communications Library (NCCL)
- NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
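For a feel of what NCCL usage looks like, here is a hedged sketch of a single-process, multi-GPU sum all-reduce using `ncclCommInitAll` and `ncclAllReduce`; buffer contents and sizes are illustrative and error checking is omitted:

```cuda
#include <nccl.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);

    // One communicator per visible GPU, all in this process.
    std::vector<ncclComm_t> comms(nDev);
    ncclCommInitAll(comms.data(), nDev, nullptr);  // devices 0..nDev-1

    const int n = 1 << 20;
    std::vector<float*> buf(nDev);
    std::vector<cudaStream_t> streams(nDev);
    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&buf[d], n * sizeof(float));
        cudaStreamCreate(&streams[d]);
    }

    // Sum-reduce each rank's buffer in place across all GPUs.
    // The group calls batch the per-device launches together.
    ncclGroupStart();
    for (int d = 0; d < nDev; ++d)
        ncclAllReduce(buf[d], buf[d], n, ncclFloat, ncclSum,
                      comms[d], streams[d]);
    ncclGroupEnd();

    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(streams[d]);
        cudaStreamDestroy(streams[d]);
        cudaFree(buf[d]);
        ncclCommDestroy(comms[d]);
    }
    return 0;
}
```

Multi-node setups replace `ncclCommInitAll` with `ncclCommInitRank` plus an out-of-band exchange of the NCCL unique ID; the GTC talk "Multi GPU Programming Models for HPC and AI" listed below compares this with NVSHMEM and MPI-based approaches.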
## Dev Tools
- Compute Sanitizer
- Nsight Systems/Compute/Graphics & Nsight Visual Studio Code Edition
- Containerization
## Higher-Level SDKs
- NVIDIA NeMo
- NVIDIA Omniverse & Isaac
- NVIDIA PhysicsNeMo (formerly NVIDIA Modulus)
## Other GTC Talks
- GTC 24 - Advanced Performance Optimization in CUDA - S62192
- GTC 24 - CUDA, New Features and Beyond - S62400
- GTC 24 - How To Write A CUDA Program, The Ninja Edition - S62401
- GTC 24 - Introduction to CUDA Programming and Performance Optimization - S62191
- GTC 24 - Multi GPU Programming Models for HPC and AI - S61339
- GTC 24 - Warp, Advancing Simulation AI with Differentiable GPU Computing in Python - S63345
- GTC Spring 23 - Connect with the Experts, C++ Standard Parallelism and C++ Core Compute Libraries - CWES52064
- GTC Spring 23 - C++ Standard Parallelism - S51755
- GTC Spring 23 - CUDA, New Features and Beyond - S51225
- GTC Spring 23 - Robust and Efficient CUDA C++ Concurrency with Stream-Ordered Allocation - S51897
- GTC Spring 22 - C++ Standard Parallelism - S41960
- GTC Spring 22 - How CUDA Programming Works - S41487
- GTC Spring 22 - How to Understand and Optimize Shared Memory Accesses using Nsight Compute - S41723
- GTC Fall 21 - Accelerate Computing with CUDA Python - A31138
- GTC Fall 21 - GPU Acceleration in Python using CuPy and Numba - A31149
- GTC Fall 21 - Legate, Scaling the Python Ecosystem - A31168
- GTC Spring 21 - How GPU Computing Works - S31151
## Epilogue
Last updated on 2024-12-08.