
AMD Instinct MI100 : 11.5 TFLOPS In A Single Card!


AMD just announced the Instinct MI100 – the world’s fastest HPC GPU accelerator, delivering 11.5 TFLOPS in a single card!

 

AMD Instinct MI100 : 11.5 TFLOPS In A Single Card!

Powered by the new CDNA architecture, the AMD Instinct MI100 is the world’s fastest HPC GPU, and the first to break the 10 TFLOPS FP64 barrier!

Compared to the last-generation AMD accelerators, the AMD Instinct MI100 delivers almost 3.5X faster performance for HPC applications (FP32 matrix), and a nearly 7X boost in throughput for AI applications (FP16).

  • up to 11.5 TFLOPS of FP64 performance for HPC
  • up to 46.1 TFLOPS of FP32 Matrix performance for AI and machine learning
  • up to 184.6 TFLOPS of FP16 performance for AI training
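
For those curious how these peak figures relate to one another, here is a rough back-of-the-envelope sketch in Python. The ~1,502 MHz peak engine clock and the per-clock rate multipliers are assumptions on our part (they are not stated in this article), and the variable names are purely illustrative, so treat this as a sanity check rather than an official formula.

```python
# Back-of-the-envelope check of the MI100's peak throughput figures.
# Assumptions (not from the article): ~1.502 GHz peak engine clock,
# 1 FP32 FMA (= 2 FLOPs) per stream processor per clock, FP64 at half
# the FP32 vector rate, and matrix rates at 2x / 4x / 8x the FP32 FMA rate.

STREAM_PROCESSORS = 7_680      # 120 CUs x 64 stream processors (from the spec table)
PEAK_CLOCK_GHZ = 1.502         # assumed peak engine clock

fma32 = STREAM_PROCESSORS * 2 * PEAK_CLOCK_GHZ / 1_000   # ~23.1 TFLOPS vector FMA32
fp64 = fma32 / 2               # ~11.5 TFLOPS
fp32_matrix = fma32 * 2        # ~46.1 TFLOPS
bf16_matrix = fma32 * 4        # ~92.3 TFLOPS
fp16_matrix = fma32 * 8        # ~184.6 TFLOPS

print(f"FMA32 {fma32:.1f} | FP64 {fp64:.1f} | FP32 matrix {fp32_matrix:.1f} | "
      f"bf16 {bf16_matrix:.1f} | FP16 {fp16_matrix:.1f} TFLOPS")
```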

2nd Gen AMD Infinity Fabric

It also leverages 2nd Gen AMD Infinity Fabric technology to deliver twice the peer-to-peer IO bandwidth of PCI Express 4.0. Thanks to its three Infinity Fabric Links, it offers up to 340 GB/s of aggregate bandwidth per card.

In a server, MI100 GPUs can be configured as two fully-connected quad GPU hives, each providing up to 552 GB/s of P2P IO bandwidth.

Ultra-Fast HBM2 Memory

The AMD Instinct MI100 comes with 32 GB of HBM2 memory that delivers up to 1.23 TB/s of memory bandwidth to support large datasets.

PCI Express 4.0 Interface

The AMD Instinct MI100 supports PCI Express 4.0, allowing for up to 64 GB/s of peak bandwidth from CPU to GPU, when paired with 2nd Gen AMD EPYC processors.

AMD Instinct MI100 : Specifications

Specifications AMD Instinct MI100
Fab Process 7 nm
Compute Units 120
Stream Processors 7,680
Peak BFLOAT16 92.3 TFLOPS
Peak INT4 | INT8 184.6 TOPS
Peak FP16 184.6 TFLOPS
Peak FP32 46.1 TFLOPS
Peak FMA32 23.1 TFLOPS
Peak FP64 | FMA64 11.5 TFLOPS
Memory 32 GB HBM2
Memory Interface 4,096 bits
Memory Clock 1.2 GHz
Memory Bandwidth 1.23 TB/s
Reliability Full Chip ECC, RAS Support
Scalability 3 x Infinity Fabric Links
OS Support Linux 64-bit
Bus Interface PCIe Gen 3 / Gen 4
Board Form Factor Full Height, Dual Slot
Board Length 10.5 inches
Cooling Passively Cooled
Max Board Power 300 W TDP
Warranty 3-Years Limited

 

AMD Instinct MI100 : Availability

The AMD Instinct MI100 will be available in systems by the end of 2020 from OEM/ODM partners like Dell, Gigabyte, Hewlett Packard Enterprise (HPE), and Supermicro.

 



AMD CDNA Architecture : Tech Highlights!

In addition to the gaming-centric RDNA architecture, AMD just introduced a new CDNA architecture that is optimised for compute workloads.

Here are some key tech highlights of the new AMD CDNA architecture!

 

AMD CDNA Architecture : What Is It?

Unlike the fixed-function graphics accelerators of the past, GPUs are now fully-programmable accelerators using what’s called the GPGPU (General Purpose GPU) Architecture.

GPGPU allowed the industry to leverage the tremendous processing power of GPUs for machine learning and scientific computing.

Instead of continuing down the GPGPU path, AMD has decided to introduce two architectures :

  • AMD RDNA : optimised for gaming to maximise frames per second
  • AMD CDNA : optimised for compute workloads to maximise floating-point operations per second (FLOPS).

Designed to accelerate compute workloads, AMD CDNA augments scalar and vector processing with new Matrix Core Engines, and adds Infinity Fabric technology for scale-up capability.

This allows the first CDNA-based accelerator – the AMD Instinct MI100 – to break the 10 TFLOPS (FP64) barrier.

The GPU is connected to its host processor through a PCI Express 4.0 interface, which delivers up to 32 GB/s of bandwidth in each direction.
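
As a rough illustration, that figure follows from the standard PCI Express 4.0 parameters: 16 lanes, 16 GT/s per lane, and 128b/130b line encoding. This is generic PCIe arithmetic rather than anything MI100-specific, and the article's "64 GB/s" is simply the rounded bidirectional total:

```python
# Rough PCIe 4.0 x16 bandwidth, using the standard link parameters.
LANES = 16
RATE_GT_S = 16e9          # 16 GT/s per lane
ENCODING = 128 / 130      # 128b/130b line encoding overhead

per_direction_gb_s = LANES * RATE_GT_S * ENCODING / 8 / 1e9   # bits -> bytes
print(f"~{per_direction_gb_s:.1f} GB/s per direction, "
      f"~{2 * per_direction_gb_s:.0f} GB/s in total")          # ~31.5 and ~63 GB/s
```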

 

AMD CDNA Architecture : Compute Units

The command processor and scheduling logic receives API-level commands and translates them into compute tasks.

These compute tasks are implemented as compute arrays and managed by the four Asynchronous Compute Engines (ACE), which each maintain an independent stream of commands to the compute units.

Its 120 compute units (CUs) are derived from the earlier GCN architecture, and are organised into four compute engines that execute wavefronts of 64 work-items each.

The CUs are, however, enhanced with new Matrix Core Engines, that are optimised for matrix data processing.

In the AMD Instinct MI100 accelerator, all of these main blocks are tied together by the on-die Infinity Fabric.

Unlike the RDNA architecture, CDNA removes all of the fixed-function graphics hardware: rasterisation, tessellation, blending, the graphics caches, and even the display engine.

CDNA retains the dedicated logic for HEVC, H.264 and VP9 decoding, which is sometimes needed by compute workloads that operate on multimedia data.

The new Matrix Core Engines add a new family of wavefront-level instructions – the Matrix Fused Multiply-Add (MFMA). The MFMA instructions perform mixed-precision arithmetic and operate on KxN matrices using four different types of input data (see the conceptual sketch after this list) :

  • INT8 – 8-bit integers
  • FP16 – 16-bit half-precision
  • bf16 – 16-bit brain FP
  • FP32 – 32-bit single-precision
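
To make the idea concrete, here is a conceptual NumPy sketch of what a mixed-precision matrix fused multiply-add does: low-precision inputs (FP16 here), a full-precision FP32 accumulator, and D = A x B + C in a single operation. This only illustrates the semantics; the real MFMA instructions operate on fixed, wavefront-sized blocks in hardware, and the mfma_like helper and matrix shape below are made up for illustration.

```python
import numpy as np

def mfma_like(a_fp16, b_fp16, c_fp32):
    """Conceptual mixed-precision matrix fused multiply-add: D = A x B + C.

    Inputs are stored as FP16, but products are accumulated in FP32,
    which is what preserves accuracy for machine-learning workloads."""
    return np.matmul(a_fp16.astype(np.float32), b_fp16.astype(np.float32)) + c_fp32

M, N, K = 32, 32, 8                                   # arbitrary illustrative shape
A = np.random.randn(M, K).astype(np.float16)
B = np.random.randn(K, N).astype(np.float16)
C = np.zeros((M, N), dtype=np.float32)

D = mfma_like(A, B, C)
print(D.dtype, D.shape)                               # float32 (32, 32)
```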

The new Matrix Core Engines have several advantages over the traditional vector pipelines in GCN :

  • the execution unit reduces the number of register file reads, since many input values are reused in a matrix multiplication
  • narrower datatypes save energy and create opportunities for workloads that do not require full FP32 precision, e.g. machine learning.

 

AMD CDNA Architecture : L2 Cache + Memory

Most scientific and machine learning data sets are gigabytes or even terabytes in size. Therefore L2 cache and memory performance is critical.

In CDNA, the L2 cache is shared across the entire chip, and physically partitioned into multiple slices.

The MI100, specifically, has an 8 MB L2 cache that is 16-way set-associative and made up of 32 slices. Each slice can sustain 128 bytes per clock, for an aggregate bandwidth of over 6 TB/s across the GPU.
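
The "over 6 TB/s" figure follows from the slice count and per-slice width once a clock speed is assumed; the ~1.5 GHz engine clock used below is our assumption, not something stated in the article:

```python
# Aggregate L2 bandwidth = slices x bytes per slice per clock x clock speed.
SLICES = 32
BYTES_PER_CLOCK = 128
CLOCK_HZ = 1.5e9            # assumed engine clock

aggregate_tb_s = SLICES * BYTES_PER_CLOCK * CLOCK_HZ / 1e12
print(f"~{aggregate_tb_s:.1f} TB/s aggregate L2 bandwidth")   # ~6.1 TB/s
```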

The CDNA memory controller can drive 4-high or 8-high stacks of HBM2 memory at 2.4 GT/s, for a maximum throughput of 1.23 TB/s.
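
The 1.23 TB/s figure is simply the 4,096-bit HBM2 interface (from the spec table earlier) multiplied by the 2.4 GT/s data rate:

```python
# Peak HBM2 bandwidth = bus width (bytes) x data rate.
BUS_WIDTH_BITS = 4_096
DATA_RATE_GT_S = 2.4

bandwidth_gb_s = BUS_WIDTH_BITS / 8 * DATA_RATE_GT_S
print(f"~{bandwidth_gb_s:.0f} GB/s, i.e. ~{bandwidth_gb_s / 1000:.2f} TB/s")  # ~1229 GB/s
```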

The memory contents are also protected by hardware ECC.

 

AMD CDNA Architecture : Communication + Scaling

CDNA is also designed for scaling up, using the high-speed Infinity Fabric technology to connect multiple GPUs.

AMD Infinity Fabric links are 16 bits wide and operate at 23 GT/s, with three links per card to allow for full connectivity in a quad-GPU configuration.
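
From those two numbers, the per-link bandwidth works out as follows. The last line also checks the result against the 552 GB/s quad-GPU hive figure quoted earlier, assuming that figure sums bidirectional bandwidth over the six links of a fully-connected quad (an assumption on our part):

```python
# Infinity Fabric link bandwidth from the stated width and transfer rate.
LINK_WIDTH_BITS = 16
LINK_RATE_GT_S = 23

per_direction_gb_s = LINK_WIDTH_BITS / 8 * LINK_RATE_GT_S   # 46 GB/s per direction
bidirectional_gb_s = 2 * per_direction_gb_s                 # 92 GB/s per link

links_in_quad = 4 * 3 // 2                                  # 6 links fully connect 4 GPUs
hive_gb_s = links_in_quad * bidirectional_gb_s              # 552 GB/s aggregate P2P
print(per_direction_gb_s, bidirectional_gb_s, hive_gb_s)    # 46.0 92.0 552.0
```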

While the last-generation Radeon Instinct MI50 only used a ring topology, the new fully-connected Infinity Fabric topology boosts performance for common communication patterns like all-reduce and scatter / gather.

Unlike PCI Express, Infinity Fabric links support coherent GPU memory, which lets multiple GPUs share an address space and tightly work on a single task.

 
