In addition to the gaming-centric RDNA architecture, AMD just introduced a new CDNA architecture that is optimised for compute workloads.
Here are some key tech highlights of the new AMD CDNA architecture!
AMD CDNA Architecture : What Is It?
Unlike the fixed-function graphics accelerators of the past, GPUs are now fully-programmable accelerators using what’s called the GPGPU (General Purpose GPU) Architecture.
GPGPU allowed the industry to leverage the GPU's tremendous processing power for machine learning and scientific computing purposes.
Instead of continuing down the path of a single GPGPU architecture for both markets, AMD has decided to introduce two architectures :
- AMD RDNA : optimised for gaming, to maximise frames per second
- AMD CDNA : optimised for compute workloads, to maximise FLOPS (floating-point operations per second).
Designed to accelerate compute workloads, AMD CDNA augments scalar and vector processing with new Matrix Core Engines, and adds Infinity Fabric technology for scale-up capability.
This allows the first CDNA-based accelerator – the AMD Instinct MI100 – to break the 10 TFLOPS (FP64) barrier.
The GPU is connected to its host processor through a PCI Express 4.0 x16 interface, which delivers up to 32 GB/s of bandwidth in each direction.
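As a quick sanity check of that figure, here is a rough Python sketch, with the x16 link width, 16 GT/s lane rate and 128b/130b encoding as assumptions :

```python
# Rough check of the PCIe 4.0 figure above.
# Assumptions : x16 link, 16 GT/s per lane, 128b/130b line encoding.
lanes = 16
rate_gt_s = 16                    # PCIe 4.0 raw transfer rate per lane
encoding_efficiency = 128 / 130   # 128b/130b encoding overhead

gb_per_s_per_direction = lanes * rate_gt_s * encoding_efficiency / 8
print(f"{gb_per_s_per_direction:.1f} GB/s per direction")   # ~31.5 GB/s, i.e. "up to 32 GB/s"
```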
AMD CDNA Architecture : Compute Units
The command processor and scheduling logic receive API-level commands and translate them into compute tasks.
These compute tasks are implemented as compute arrays and managed by the four Asynchronous Compute Engines (ACEs), each of which maintains its own independent stream of commands to the compute units.
Its 120 compute units (CUs) are derived from the earlier GCN architecture, and are organised into four compute engines that execute wavefronts of 64 work-items each.
The CUs are, however, enhanced with new Matrix Core Engines, which are optimised for matrix data processing.
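To give a sense of scale, here is a purely illustrative Python sketch of how a one-dimensional compute dispatch breaks down into 64-wide wavefronts spread across the MI100's 120 CUs. The work size is an arbitrary assumption :

```python
import math

# Illustrative only : how a 1-D compute dispatch maps onto CDNA's 64-wide
# wavefronts and the MI100's 120 CUs. The work size is an arbitrary example.
work_items = 1_000_000
wavefront_size = 64          # each CDNA wavefront contains 64 work-items
compute_units = 120          # the MI100 has 120 CUs

wavefronts = math.ceil(work_items / wavefront_size)
print(wavefronts)                              # 15625 wavefronts to schedule
print(math.ceil(wavefronts / compute_units))   # ~131 wavefronts per CU, on average
```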
Here is the block diagram of the AMD Instinct MI100 accelerator, showing how its main blocks are all tied together with the on-die Infinity Fabric.
Unlike the RDNA architecture, CDNA removes all of the fixed-function graphics hardware – rasterisation, tessellation, blending, the graphics caches, and even the display engine.
However, CDNA retains the dedicated logic for HEVC, H.264 and VP9 decoding, which is sometimes needed by compute workloads that operate on multimedia data.
The new Matrix Core Engines add a new family of wavefront-level instructions – the Matrix Fused Multiply-Add (MFMA). The MFMA instructions perform mixed-precision arithmetic, operating on KxN matrices using four different types of input data (see the sketch after this list) :
- INT8 – 8-bit integers
- FP16 – 16-bit half-precision
- bf16 – 16-bit brain FP
- FP32 – 32-bit single-precision
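Here is a minimal sketch of what one MFMA operation conceptually computes – D = A x B + C with low-precision inputs and FP32 accumulation – using NumPy as a stand-in. The 16 x 16 x 16 tile size is an arbitrary illustration, not a specific MFMA variant :

```python
import numpy as np

# Conceptual sketch of an MFMA operation : D = A x B + C, with FP16 inputs
# and FP32 accumulation. Tile sizes here are illustrative assumptions.
M = N = K = 16
A = np.random.rand(M, K).astype(np.float16)   # low-precision input matrix
B = np.random.rand(K, N).astype(np.float16)   # low-precision input matrix
C = np.random.rand(M, N).astype(np.float32)   # FP32 accumulator

D = A.astype(np.float32) @ B.astype(np.float32) + C   # accumulate in full FP32
print(D.shape, D.dtype)   # (16, 16) float32
```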
The new Matrix Core Engines have several advantages over the traditional vector pipelines in GCN :
- the execution unit reduces the number of register file reads, since many input values are reused in a matrix multiplication
- narrower datatypes save energy and open up opportunities for workloads that do not require full FP32 precision, e.g. machine learning.
AMD CDNA Architecture : L2 Cache + Memory
Most scientific and machine learning data sets are gigabytes or even terabytes in size, so L2 cache and memory performance is critical.
In CDNA, the L2 cache is shared across the entire chip, and physically partitioned into multiple slices.
The MI100, specifically, has an 8 MB L2 cache that is 16-way set-associative and made up of 32 slices. Each slice can sustain 128 bytes per clock, for an aggregate bandwidth of over 6 TB/s across the GPU.
The CDNA memory controller can drive 4-high or 8-high stacks of HBM2 memory at 2.4 GT/s, for a maximum throughput of 1.23 TB/s.
The memory contents are also protected by hardware ECC.
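To see roughly where those numbers come from, here is a back-of-envelope Python calculation, assuming an illustrative ~1.5 GHz clock for the L2 slices and four 1,024-bit HBM2 stacks :

```python
# Back-of-envelope check of the bandwidth figures above.
# Assumptions : ~1.5 GHz GPU clock, four 1024-bit HBM2 stacks at 2.4 GT/s.
l2_slices, bytes_per_clock, clock_hz = 32, 128, 1.5e9
l2_bw_tb_s = l2_slices * bytes_per_clock * clock_hz / 1e12
print(f"L2 aggregate : ~{l2_bw_tb_s:.1f} TB/s")     # over 6 TB/s

hbm2_stacks, bus_bits, rate_hz = 4, 1024, 2.4e9
hbm2_bw_tb_s = hbm2_stacks * bus_bits / 8 * rate_hz / 1e12
print(f"HBM2 : ~{hbm2_bw_tb_s:.2f} TB/s")           # ~1.23 TB/s
```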
AMD CDNA Architecture : Communication + Scaling
CDNA is also designed for scaling up, using the high-speed Infinity Fabric technology to connect multiple GPUs.
AMD Infinity Fabric links are 16 bits wide and operate at 23 GT/s, with three links per GPU in CDNA to allow for full connectivity in a quad-GPU configuration.
While the last-generation Radeon Instinct MI50 only supports a ring topology, the new fully-connected Infinity Fabric topology boosts performance for common communication patterns like all-reduce and scatter / gather.
Unlike PCI Express, Infinity Fabric links support coherent GPU memory, which lets multiple GPUs share an address space and work closely together on a single task.
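Here is a quick Python calculation of what those link figures add up to, assuming the 23 GT/s rate applies in each direction :

```python
# Per-link and aggregate Infinity Fabric bandwidth, from the figures above.
# Assumption : the 23 GT/s rate applies in each direction of a link.
link_width_bits, rate_gt_s, links = 16, 23, 3

per_direction_gb_s = link_width_bits * rate_gt_s / 8   # 46 GB/s per direction
per_link_gb_s = per_direction_gb_s * 2                 # 92 GB/s bidirectional per link
aggregate_gb_s = per_link_gb_s * links                 # 276 GB/s per GPU across 3 links
print(per_direction_gb_s, per_link_gb_s, aggregate_gb_s)
```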
Recommended Reading
- AMD Instinct MI100 : 11.5 TFLOPS In A Single Card!
- NVIDIA RTX A6000 + Omniverse : Specs + Details!
- VMware vSphere 7 Now Supports AMD SEV-ES Encryption!
- Desktop Graphics Card Comparison Guide Rev. 37.3
- AMD Ryzen 9 5950X In-Depth Review : 16-Core Behemoth!
- AMD Ryzen 7 5800X In-Depth Review : 8-Core Powerhouse!
- AMD Ryzen 5 5600X In-Depth Review : A Leap Forward!
- AMD Ryzen 5000 CPUs : Malaysia Price List + FREE Game!
- AMD Infinity Cache Explained : L3 Cache Comes To The GPU!
- Google Cloud Confidential VM With 2nd Gen AMD EPYC!
- Radeon RX 6000 vs GeForce RTX 30 : Faster + Cheaper!
- RX 6900 XT, RX 6800 XT, RX 6800 : Features + Specifications!
- AMD Radeon RX 6000 Series : What You Need To Know!
- Dell G5 15 5500 Review : Affordable RTX Gaming @ 144 Hz!
- AMD Ryzen 5000 Series : What You Need To Know!
- XPS 13 (9310) with Intel Tiger Lake : What’s New?
- Ryzen 7 3700C | Ryzen 5 3500C | Ryzen 3 3250C Revealed!
- AMD Athlon Gold 3150C + Athlon Silver 3050C Revealed!
- 11th Gen Intel Core (Tiger Lake) : What You Need To Know!
- HUAWEI MateBook X Pro (2020) Review : Ultra-Light Beast!
- AMD Ryzen PRO 4000 Desktop APUs : All You Need To Know!
- AMD Athlon PRO 3000 Desktop APUs : All You Need To Know!
- AMD Athlon 3000 G-Series with Radeon Graphics Revealed!
- AMD Ryzen 4000 G-Series with Radeon Graphics Revealed!