AMD just announced the Instinct MI100 – the world’s fastest HPC GPU accelerator, delivering 11.5 TFLOPS in a single card!
AMD Instinct MI100 : 11.5 TFLOPS In A Single Card!
Powered by the new CDNA architecture, the AMD Instinct MI100 is the world’s fastest HPC GPU, and the first to break the 10 TFLOPS FP64 barrier!
Compared to the last-generation AMD accelerators, the AMD Instinct MI100 offers HPC applications almost 3.5X faster performance (FP32 matrix), and AI applications nearly 7X boost in throughput (FP16).
up to 11.5 TFLOPS of FP64 performance for HPC
up to 46.1 TFLOPS of FP32 Matrix performance for AI and machine learning
up to 184.6 TFLOPS of FP16 performance for AI training
2nd Gen AMD Infinity Fabric
It also leverages the 2nd Gen AMD Infinity Fabric technology to deliver twice the peer-to-peer IO bandwidth of PCI Express 4.0. Thanks to its triple Infinity Fabric Links, it offers up to 340 GB/s of aggregate bandwidth per card.
In a server, MI100 GPUs can be configured as two fully-connected quad GPU hives, each providing up to 552 GB/s of P2P IO bandwidth.
Ultra-Fast HBM2 Memory
The AMD Instinct MI100 comes with 32 GB of HBM2 memory that delivers up to 1.23 TB/s of memory bandwidth to support large datasets.
PCI Express 4.0 Interface
The AMD Instinct MI100 supports PCI Express 4.0, allowing for up to 64 GB/s of peak bandwidth from CPU to GPU, when paired with 2nd Gen AMD EPYC processors.
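The 64 GB/s figure can be sanity-checked with a little arithmetic. The sketch below assumes a x16 link, a 16 GT/s per-lane signalling rate and 128b/130b line encoding for PCI Express 4.0. None of these are stated above, and the function name is purely illustrative.

```python
# Back-of-envelope PCIe bandwidth estimate (illustrative sketch, not a vendor formula).
# Assumptions: 16 GT/s per lane, 128b/130b encoding, x16 link width.

def pcie_bandwidth_gbs(gt_per_s=16.0, lanes=16, encoding=128 / 130):
    """Return usable bandwidth per direction in GB/s."""
    gigabits_per_s = gt_per_s * lanes * encoding  # after encoding overhead
    return gigabits_per_s / 8                     # bits -> bytes

per_direction = pcie_bandwidth_gbs()
both_directions = 2 * per_direction

print(f"per direction  : {per_direction:.1f} GB/s")
print(f"both directions: {both_directions:.1f} GB/s")
```

Under these assumptions, the result lands just under 32 GB/s each way, which is consistent with the quoted 64 GB/s being a rounded total for both directions.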
In addition to the gaming-centric RDNA architecture, AMD just introduced a new CDNA architecture that is optimised for compute workloads.
Here are some key tech highlights of the new AMD CDNA architecture!
AMD CDNA Architecture : What Is It?
Unlike the fixed-function graphics accelerators of the past, GPUs are now fully-programmable accelerators using what’s called the GPGPU (General Purpose GPU) Architecture.
GPGPU allowed the industry to leverage their tremendous processing power for machine learning and scientific computing purposes.
Instead of continuing down the GPGPU path, AMD has decided to introduce two architectures :
AMD RDNA : optimised for gaming to maximise frames per second
AMD CDNA : optimised for compute workloads to maximise FLOPS.
Designed to accelerate compute workloads, AMD CDNA augments scalar and vector processing with new Matrix Core Engines, and adds Infinity Fabric technology for scale-up capability.
This allows the first CDNA-based accelerator – AMD Instinct MI100 – to break the 10 TFLOPS (FP64) barrier.
The GPU is connected to its host processor using a PCI Express 4.0 interface that delivers up to 32 GB/s of bandwidth in each direction.
AMD CDNA Architecture : Compute Units
The command processor and scheduling logic receive API-level commands and translate them into compute tasks.
These compute tasks are implemented as compute arrays and managed by the four Asynchronous Compute Engines (ACE), which maintain their independent stream of commands to the compute units.
Its 120 compute units (CUs) are derived from the earlier GCN architecture, and organised into four compute engines that execute wavefronts that contain 64 work-items.
The CUs are, however, enhanced with new Matrix Core Engines that are optimised for matrix data processing.
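The 11.5 TFLOPS FP64 peak can be roughly cross-checked from the 120 CUs. The sketch below assumes 64 FP32 lanes per CU (as in GCN), two floating-point operations per fused multiply-add, FP64 running at half the FP32 vector rate, and a peak engine clock of 1,502 MHz. The clock and the half-rate ratio are not stated in this article, so treat them as assumptions.

```python
# Rough peak-FLOPS estimate for a CU-based GPU (illustrative only).
# Assumptions not taken from the article: 64 FP32 lanes per CU,
# 2 FLOPs per fused multiply-add, FP64 at half rate, 1.502 GHz peak clock.

def peak_tflops(compute_units, lanes_per_cu, clock_ghz, rate=1.0):
    flops_per_clock = compute_units * lanes_per_cu * 2 * rate  # an FMA counts as 2 FLOPs
    return flops_per_clock * clock_ghz / 1000.0                # GFLOPS -> TFLOPS

fp32_vector = peak_tflops(120, 64, 1.502)
fp64_vector = peak_tflops(120, 64, 1.502, rate=0.5)

print(f"FP32 vector: {fp32_vector:.2f} TFLOPS")
print(f"FP64 vector: {fp64_vector:.2f} TFLOPS")
```

With these assumed values, the FP64 estimate comes out within a few hundredths of the quoted 11.5 TFLOPS.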
Here is the block diagram of the AMD Instinct MI100 accelerator, showing how its main blocks are all tied together with the on-die Infinity Fabric.
Unlike the RDNA architecture, CDNA removes all of the fixed-function graphics hardware for tasks like rasterisation, tessellation, graphics caches, blending and even the display engine.
CDNA retains the dedicated logic for HEVC, H.264 and VP9 decoding that is sometimes used for compute workloads that operate on multimedia data.
The new Matrix Core Engines add a new family of wavefront-level instructions – the Matrix Fused Multiply-Add (MFMA). The MFMA instructions perform mixed-precision arithmetic and operate on KxN matrices using four different types of input data :
INT8 – 8-bit integers
FP16 – 16-bit half-precision
bf16 – 16-bit brain FP
FP32 – 32-bit single-precision
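To give a feel for how the two 16-bit floating-point formats differ, the sketch below rounds a value to FP16 using Python's struct module and approximates bf16 by keeping only the top 16 bits of the FP32 encoding. Truncation is a simplification chosen for brevity (real hardware typically rounds), and the helper names are illustrative.

```python
import struct

def to_fp16(x):
    """Round-trip x through IEEE 754 half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Approximate bf16 (8 exponent bits, 7 mantissa bits) by truncating an FP32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# bf16 keeps the exponent range of FP32 but has a coarser mantissa than FP16.
print(to_fp16(0.1), to_bf16(0.1))  # FP16 lands closer to 0.1 than bf16 does
print(to_bf16(1e5))                # within bf16's range
try:
    to_fp16(1e5)                   # beyond FP16's largest finite value (65504)
except OverflowError as err:
    print("FP16 overflow:", err)
```

The trade-off shown here, range versus precision, is one reason machine learning workloads often favour bf16 while other workloads prefer FP16.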
The new Matrix Core Engines have several advantages over the traditional vector pipelines in GCN :
the execution unit reduces the number of register file reads, since many input values are reused in a matrix multiplication
narrower datatypes create opportunity for workloads that do not require full FP32 precision, e.g. machine learning – saving energy.
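The first point can be illustrated with an idealised operand count. The sketch compares a pipeline that fetches both inputs for every multiply-add against a matrix unit that fetches each input element once per block. This is a toy model of the reuse argument, not a description of how the MI100 hardware actually schedules register reads, and the 32 x 32 x 8 block shape is just an example.

```python
# Idealised operand-read count for D = A x B + C, with A of size M x K and B of size K x N.
# This is a toy model of data reuse, not a simulation of any real register file.

def vector_reads(m, n, k):
    # One multiply-add per (i, j, l) triple, each fetching one A value and one B value.
    return 2 * m * n * k

def matrix_reads(m, n, k):
    # Each element of A and B is fetched once and reused across the whole block.
    return m * k + k * n

m, n, k = 32, 32, 8
print("vector pipeline reads:", vector_reads(m, n, k))
print("matrix unit reads    :", matrix_reads(m, n, k))
print("reuse factor         :", vector_reads(m, n, k) / matrix_reads(m, n, k))
```

In this simplified model the saving grows with the block size, which is why reuse matters more for larger matrix tiles.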
AMD CDNA Architecture : L2 Cache + Memory
Most scientific and machine learning data sets are gigabytes or even terabytes in size. Therefore L2 cache and memory performance is critical.
In CDNA, the L2 cache is shared across the entire chip, and physically partitioned into multiple slices.
The MI100, specifically, has an 8 MB cache that is 16-way set-associative and made up of 32 slices. Each slice can sustain 128 bytes per clock for an aggregate bandwidth of over 6 TB/s across the GPU.
The CDNA memory controller can drive 4- or 8-high stacks of HBM2 memory at 2.4 GT/s for a maximum throughput of 1.23 TB/s.
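The aggregate L2 bandwidth follows from multiplying the slice count by the per-slice width and the clock. The sketch takes the 128 bytes as a per-clock figure and assumes a 1,502 MHz peak engine clock, which this article does not state.

```python
# Aggregate L2 bandwidth estimate (illustrative; the clock is an assumption).

def l2_bandwidth_tbs(slices=32, bytes_per_clock=128, clock_ghz=1.502):
    gb_per_s = slices * bytes_per_clock * clock_ghz  # bytes/clock x Gclocks/s = GB/s
    return gb_per_s / 1000.0                         # GB/s -> TB/s

l2_peak = l2_bandwidth_tbs()
print(f"aggregate L2 bandwidth: {l2_peak:.2f} TB/s")
```

Under these assumptions the estimate sits slightly above 6 TB/s, in line with the figure quoted above.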
The memory contents are also protected by hardware ECC.
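The 1.23 TB/s figure is consistent with simple bus arithmetic. The sketch assumes four HBM2 stacks, each with a 1,024-bit interface. The stack count and interface width are assumptions here, since the text above only gives the transfer rate.

```python
# HBM2 bandwidth from bus width and transfer rate (illustrative sketch).
# Assumptions: 4 stacks, 1,024 data bits per stack.

def hbm2_bandwidth_gbs(stacks, gt_per_s, bits_per_stack=1024):
    total_bits = stacks * bits_per_stack
    return total_bits * gt_per_s / 8  # gigabits/s -> GB/s

mi100_memory = hbm2_bandwidth_gbs(stacks=4, gt_per_s=2.4)
print(f"{mi100_memory:.1f} GB/s, or about {mi100_memory / 1000:.2f} TB/s")
```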
AMD CDNA Architecture : Communication + Scaling
CDNA is also designed for scaling up, using the high-speed Infinity Fabric technology to connect multiple GPUs.
AMD Infinity Fabric links are 16-bits wide, and operate at 23 GT/s, with three links in CDNA to allow for full connectivity in a quad-GPU configuration.
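These link parameters line up with the headline bandwidth figures quoted for the MI100. The sketch counts each link in both directions, treats the 340 GB/s per-card aggregate as three Infinity Fabric links plus a 64 GB/s PCI Express 4.0 connection, and counts six links in a fully-connected quad-GPU hive. That decomposition is our reading of the numbers rather than something spelled out above.

```python
# Infinity Fabric bandwidth arithmetic (illustrative; the decomposition is an assumption).

LINK_WIDTH_BITS = 16
LINK_RATE_GT_S = 23
PCIE4_X16_GBS = 64  # peak PCIe 4.0 figure quoted for the card

link_one_way = LINK_WIDTH_BITS * LINK_RATE_GT_S / 8  # GB/s in one direction
link_both_ways = 2 * link_one_way                    # GB/s counting both directions

card_aggregate = 3 * link_both_ways + PCIE4_X16_GBS  # three links plus PCIe

gpus = 4
hive_links = gpus * (gpus - 1) // 2                  # every GPU pair gets one link
hive_p2p = hive_links * link_both_ways

print(f"per link       : {link_both_ways:.0f} GB/s")
print(f"per card       : {card_aggregate:.0f} GB/s")
print(f"quad-GPU hive  : {hive_p2p:.0f} GB/s over {hive_links} links")
```

Three links per GPU are exactly what a fully-connected group of four needs, since each GPU has three peers.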
While the last generation Radeon Instinct MI50 GPU only uses a ring topology, the new fully-connected Infinity Fabric topology boosts performance for common communication patterns like all-reduce and scatter / gather.
Unlike PCI Express, Infinity Fabric links support coherent GPU memory, which lets multiple GPUs share an address space and tightly work on a single task.
In partnership with Intel, Dell Technologies announced the launch of five Dell AI Experience Zones across the APJ region!
Here is a quick primer on the new Dell AI Experience Zones, and what they mean for organisations in the APJ region!
The APJ Region – Ripe For Artificial Intelligence
According to the Dell Technologies Digital Transformation Index, Artificial Intelligence (AI) will be amongst the top spending priorities for business leaders in APJ.
Half of those surveyed plan to invest in AI in the next one to three years, as part of their digital transformation strategy. However, 95% of companies face a lack of in-house expertise in AI.
This is where the five new Dell AI Experience Zones come in…
The Dell AI Experience Zones
The new AI Experience Zones are designed to offer both customers and partners a comprehensive look at the latest AI technologies and solutions.
Built into the existing Dell Technologies Customer Solution Centres, they will showcase how the Dell EMC High-Performance Computing (HPC) and AI ecosystem can help them address business challenges and seize opportunities.
All five AI Experience Zones are equipped with technology demonstrations built around the latest Dell EMC PowerEdge servers. Powered by the latest Intel Xeon Scalable processors, they are paired with advanced, open-source AI software like OpenVINO, as well as Dell EMC networking and storage technologies.
Customers and partners who choose to leverage the new AI Experience Zones will receive help in kickstarting their AI initiatives, from design and AI expert engagements, to masterclass training, installation and maintenance.
“The timely adoption of AI will create new opportunities that will deliver concrete business advantages across all industries and business functions,” says Chris Kelly, vice president, Infrastructure Solutions Group, Dell Technologies, APJ.
“Companies looking to thrive in a data-driven era need to understand that investments in AI are no longer optional – they are business critical. Whilst complex in nature, it is imperative that companies quickly start moving from theoretical AI strategies to practical deployments to stay ahead of the curve.”
Dell AI Experience Zones In APJ
The five new AI Experience Zones that Dell Technologies and Intel announced are located within the Dell Technologies Customer Solution Centres in these cities :
NVIDIA CEO Jensen Huang (recently anointed as Fortune 2017 Businessperson of the Year) made a surprise reveal at the NIPS conference – the NVIDIA TITAN V. This is the first desktop graphics card to be built on the latest NVIDIA Volta microarchitecture, and the first to use HBM2 memory.
In this article, we will share with you everything we know about the NVIDIA TITAN V, and how it compares against its TITANic predecessors. We will also share with you what we think could be a future NVIDIA TITAN Vp graphics card!
Updated @ 2017-12-10 : Added a section on gaming with the NVIDIA TITAN V.
Originally posted @ 2017-12-09
NVIDIA Volta isn’t exactly new. Back at GTC 2017, NVIDIA revealed NVIDIA Volta, the NVIDIA GV100 GPU and the first NVIDIA Volta-powered product – the NVIDIA Tesla V100. Jensen even highlighted the Tesla V100 in his Computex 2017 keynote, more than 6 months ago!
Yet there has been no desktop GPU built around NVIDIA Volta. NVIDIA continued to churn out new graphics cards built around the Pascal architecture – GeForce GTX 1080 Ti and GeForce GTX 1070 Ti. That changed with the NVIDIA TITAN V.
The NVIDIA GV100 is the first NVIDIA Volta-based GPU, and the largest they have ever built. Even using the latest 12 nm FFN (FinFET NVIDIA) process, it is still a massive chip at 815 mm²! Compare that to the GP100 (610 mm² @ 16 nm FinFET) and GK110 (552 mm² @ 28 nm).
That’s because the GV100 is built using a whopping 21.1 billion transistors. In addition to 5376 CUDA cores and 336 Texture Units, it boasts 672 Tensor cores and 6 MB of L2 cache. All those transistors require a whole lot more power – to the tune of 300 W.
The NVIDIA TITAN V
That’s V for Volta… not the Roman numeral V or V for Vendetta. Powered by the NVIDIA GV100 GPU, the TITAN V has 5120 CUDA cores, 320 Texture Units, 640 Tensor cores, and a 4.5 MB L2 cache. It is paired with 12 GB of HBM2 memory (3 x 4GB stacks) running at 850 MHz.
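The TITAN V's figures can be related to the full GV100 by working out how many streaming multiprocessors (SMs) are enabled, and its memory bandwidth can be estimated from the HBM2 clock. The sketch assumes 64 CUDA cores per Volta SM, a 1,024-bit interface per HBM2 stack and double-data-rate signalling. These details are not stated above.

```python
# Relating TITAN V to the full GV100 (illustrative; per-SM and per-stack values are assumptions).

CORES_PER_SM = 64      # assumed Volta SM layout
BITS_PER_STACK = 1024  # assumed HBM2 interface width per stack

def sm_count(cuda_cores):
    return cuda_cores // CORES_PER_SM

def hbm2_bandwidth_gbs(stacks, clock_mhz):
    transfers_per_s = clock_mhz * 2                           # double data rate, in MT/s
    return stacks * BITS_PER_STACK * transfers_per_s / 8 / 1000  # MB/s -> GB/s

full_gv100_sms = sm_count(5376)
titan_v_sms = sm_count(5120)
titan_v_bandwidth = hbm2_bandwidth_gbs(stacks=3, clock_mhz=850)

print(f"full GV100: {full_gv100_sms} SMs, TITAN V: {titan_v_sms} SMs")
print(f"Tensor cores per SM: {640 // titan_v_sms}, Texture Units per SM: {320 // titan_v_sms}")
print(f"TITAN V memory bandwidth: {titan_v_bandwidth:.1f} GB/s")
```

On these assumptions the TITAN V has four SMs disabled relative to the full chip, and the same per-SM ratios hold for both sets of figures quoted above.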
The blowout picture of the NVIDIA TITAN V reveals even more details :
It has 3 DisplayPorts and one HDMI port.
It has 6-pin + 8-pin PCIe power inputs.
It has 16 power phases, and what appears to be the Founders Edition copper heatsink and vapour chamber cooler, with a gold-coloured shroud.
There is no SLI connector, only what appears to be an NVLink connector.
Here are more pictures of the NVIDIA TITAN V, courtesy of NVIDIA.
Can You Game On The NVIDIA TITAN V?
Right after Jensen announced the TITAN V, the inevitable question was raised on the Internet – can it run Crysis / PUBG?
The NVIDIA TITAN V is the most powerful GPU for the desktop PC, but that does not mean you can actually use it to play games. NVIDIA notably did not mention anything about gaming, only that the TITAN V is “ideal for developers who want to use their PCs to do work in AI, deep learning and high performance computing.”
In fact, the TITAN V is not listed in their GeForce Gaming section. The most powerful graphics card in the GeForce Gaming section remains the TITAN Xp.
Then again, the TITAN V uses the same NVIDIA Game Ready Driver as GeForce gaming cards, starting with version 388.59. Even so, it is possible that some or many games may not run well or properly on the TITAN V.
Of course, all this is speculative in nature. All that remains to crack this mystery is for someone to buy the TITAN V and use it to play some games!
If you like our work, you can help support our work by visiting our sponsors, participating in the Tech ARP Forums, or even donating to our fund. Any help you can render is greatly appreciated!
The NVIDIA TITAN V Specification Comparison
Let’s take a look at the known specifications of the NVIDIA TITAN V, compared to the TITAN Xp (launched earlier this year), and the TITAN X (launched late last year). We also inserted the specifications of a hypothetical NVIDIA TITAN Vp, based on a full GV100.
Model : Future TITAN Vp? | NVIDIA TITAN V | NVIDIA TITAN Xp | NVIDIA TITAN X
Fabrication Process : 12 nm FinFET+ | 12 nm FinFET+ | 16 nm FinFET | 16 nm FinFET
L2 Cache Size :
GPU Core Clock :
GPU Boost Clock :
Multi GPU Capability :
The NVIDIA TITAN Vp?
In case you are wondering, the TITAN Vp does not exist. It is merely a hypothetical future model that we think NVIDIA may introduce mid-cycle, like the NVIDIA TITAN Xp.
Our TITAN Vp is based on the full capabilities of the NVIDIA GV100 GPU. That means it will have 5376 CUDA cores with 336 Texture Units, 672 Tensor cores and 6 MB of L2 cache. It will also have a higher TDP of 300 watts.
The Official NVIDIA TITAN V Press Release
December 9, 2017 – NVIDIA today introduced TITAN V, the world’s most powerful GPU for the PC, driven by the world’s most advanced GPU architecture, NVIDIA Volta.
Announced by NVIDIA founder and CEO Jensen Huang at the annual NIPS conference, TITAN V excels at computational processing for scientific simulation. Its 21.1 billion transistors deliver 110 teraflops of raw horsepower, 9x that of its predecessor, and extreme energy efficiency.
“Our vision for Volta was to push the outer limits of high performance computing and AI. We broke new ground with its new processor architecture, instructions, numerical formats, memory architecture and processor links,” said Huang. “With TITAN V, we are putting Volta into the hands of researchers and scientists all over the world. I can’t wait to see their breakthrough discoveries.”
NVIDIA Supercomputing GPU Architecture, Now for the PC
TITAN V’s Volta architecture features a major redesign of the streaming multiprocessor that is at the center of the GPU. It doubles the energy efficiency of the previous generation Pascal design, enabling dramatic boosts in performance in the same power envelope.
New Tensor Cores designed specifically for deep learning deliver up to 9x higher peak teraflops. With independent parallel integer and floating-point data paths, Volta is also much more efficient on workloads with a mix of computation and addressing calculations. Its new combined L1 data cache and shared memory unit significantly improve performance while also simplifying programming.
Fabricated on a new TSMC 12-nanometer FFN high-performance manufacturing process customised for NVIDIA, TITAN V also incorporates Volta’s highly tuned 12GB HBM2 memory subsystem for advanced memory bandwidth utilisation.
Free AI Software on NVIDIA GPU Cloud
TITAN V’s incredible power is ideal for developers who want to use their PCs to do work in AI, deep learning and high performance computing.
Users of TITAN V can gain immediate access to the latest GPU-optimised AI, deep learning and HPC software by signing up at no charge for an NVIDIA GPU Cloud account. This container registry includes NVIDIA-optimised deep learning frameworks, third-party managed HPC applications, NVIDIA HPC visualisation tools and the NVIDIA TensorRT inferencing optimiser.
KUALA LUMPUR, April 12, 2017 – Mellanox Technologies, Ltd. (NASDAQ: MLNX) today unveiled its expansion plans for Malaysia. The announcement, which is in line with the country’s ambitions of becoming the leading Big Data Analytics (BDA) solutions hub in South East Asia, reiterated Mellanox’s commitment to Malaysia through its strategic investment roadmap.
Mellanox Technologies Expands Presence In Malaysia
“Malaysia’s investment in Big Data, data centers and the Cloud is impressive,” said Charlie Foo, Vice President and General Manager, Asia Pacific Japan, Mellanox Technologies. “With a year-over-year growth of more than 20 percent in the last five years, the field of digital data management is maturing rapidly. Mellanox’s investment in Malaysia looks to complement Malaysia’s advancing digital economy by providing intelligent 10, 25, 40, 50 and 100Gb/s interconnect solutions that serve today’s and future needs in Malaysia. This will enable organizations to be less concerned about today’s technological demands while concentrating on running their business, resulting in unparalleled operating efficiency for these organizations.”
Mellanox’s investment into Malaysia’s digital economy comes at a time when the country is ramping up its efforts to see its ICT roadmap to fruition. The country’s ICT custodian, Malaysia Digital Economy Corporation (MDEC), noted that MSC Malaysia — a national initiative designed to attract world-class technology companies to the country — reported a U.S. $3.88 billion in export sales in 2015, representing an 18 percent increase over 2014.
Today, the MSC Malaysia footprint has expanded to include 42 locations across the country, hosting more than 3,800 companies from more than 40 countries, employing more than 150,000 high-income knowledge workers, 85 percent of whom are Malaysians. This has propelled Malaysia to a top three ranking in AT Kearney’s Global Services Location Index since 2005, with only China and India ahead of Malaysia.
Mellanox’s Open Ethernet switch family delivers the highest performance and port density with a complete chassis and fabric management solution, enabling converged data centers to operate at any scale while reducing operational costs and infrastructure complexity.
Mellanox InfiniBand solutions have already been chosen to accelerate large High Performance Computing (HPC) customers in Malaysia. HPC customers use super computers and parallel processing techniques for solving complex computational problems and performing research activities through computer modeling, simulation and analysis. These HPC customers span various industries including education, bioscience, governments, finance, media and entertainment, oil and gas, pharmaceutical and manufacturing.
The company is actively seeking partnerships and collaboration opportunities to support customers from different industries, primarily within Big Data, data centers and the Cloud.
Kuala Lumpur, 4 July 2016 – Dell has announced advancements to its high performance computing (HPC) portfolio, including the availability of new Dell HPC Systems and technology partner collaborations for early access to innovative HPC technologies.
“While traditional HPC has been critical to research programs that enable scientific and societal advancement, Dell is mainstreaming these capabilities to support enterprises of all sizes as they seek a competitive advantage in an ever increasing digital world,” said William Tan, head of Enterprise Solutions, Dell Malaysia. “As a clear leader in HPC, Dell now offers customers highly flexible, precision built HPC systems for multiple vertical industries based upon years of experience powering the world’s most advanced academic and research institutions. With Dell HPC Systems, our customers can deploy HPC systems more quickly and cost effectively and accelerate their speed of innovation to deliver both breakthroughs and business results.”
Dell HPC Systems Portfolio Simplifies Powerful, Traditional HPC System for Enterprises of All Sizes
Available in Malaysia and globally, the Dell HPC Systems portfolio is a family of HPC and data analytics solutions that combine the flexibility of customised HPC systems with the speed, simplicity and reliability of pre-configured systems. Dell engineers and domain experts designed and tuned the new systems for specific science, manufacturing and analytics workloads with fully tested and validated building block systems, backed by a single point of hardware support and additional service options across the solution lifecycle.
With simplified configuration and ordering, organisations can more quickly select and deploy updated Dell HPC Systems at any scale today. As an Intel Scalable System Framework configuration, these systems, available today, include the latest Intel Xeon processor families, support for Intel Omni-Path Architecture (Intel OPA) fabric, and software in the Dell HPC Lustre Storage and Dell HPC NFS Storage solutions:
Dell HPC System for Life Sciences – Designed to meet the needs of life sciences organisations, this enables bioinformatics and genomics centers to deliver results and identify treatments in clinically relevant timeframes while maintaining compliance and protecting confidential data.
Dell HPC System for Manufacturing – Enables manufacturing and engineering customers to run complex design simulations, including structural analysis and computational fluid dynamics.
Dell HPC System for Research – Enables research centers to quickly develop HPC systems that match the unique needs of a wide variety of workloads, involving complex scientific analysis.
Dell Leads HPC Technology Advancements with Industry Partners to Help Accelerate Customer Innovation Cycles
Dell has instituted a customer early access program for early development and testing in preparation for Dell’s next server offering in the HPC solutions portfolio, the Dell PowerEdge C6320p server, which will be available in the second half of 2016, with the Intel Xeon Phi processor (formerly code-named Knights Landing). The PowerEdge C6320p’s unique server engineering and design will enable customers to:
Gain insights faster with a modular building block design, engineered to deliver faster insights for data-intensive computations and scale-up parallel processing.
Accelerate performance in dense and highly parallel HPC environments with 72 cores that are specifically optimised for parallel computing.
Simplify and automate systems management with the integrated Dell Remote Access Controller 8 (iDRAC8) with Lifecycle Controller. Customers can deploy, monitor and update PowerEdge C6320p servers faster and ensure higher levels of service and availability.
The Texas Advanced Computing Center (TACC) at The University of Texas at Austin has partnered with Dell and Intel to deploy an upgrade to its Stampede supercomputing cluster with Intel Xeon Phi processors and Intel OPA via Dell’s early access program.
Stampede, one of the main clusters for the Extreme Science and Engineering Discovery Environment (XSEDE), is a multi-use, cyberinfrastructure resource offering large memory, large data transfer, and GPU capabilities for data-intensive, accelerated or visualisation computing for thousands of projects ranging from cancer cure research to severe weather modeling.
This month, the U.S. National Science Foundation awarded US$30 million to TACC to acquire and deploy Stampede 2 as a strategic national resource to provide HPC capabilities for thousands of researchers in the U.S. The new Dell HPC System is expected to deliver a peak performance of up to 18 petaflops, more than twice the system performance of the current Stampede system. Three and a half years since its installation, Stampede ranks as the 12th most powerful supercomputer in the world, according to the June 2016 TOP500 list.
Additionally, Dell continues to bring HPC capabilities to mainstream enterprises through a series of evolving solutions and services designed to deliver a range of HPC-as-a-Service capabilities, giving HPC sites a choice of local or remote management services with deployment on-premise, off-premise or a hybrid of the two.
On June 20, 2016, NVIDIA officially unveiled their Tesla P100 accelerator for PCIe-based servers. This is a long-expected PCI Express variant of the Tesla P100 accelerator that was launched in April using the NVIDIA NVLink interconnect. Let’s check out what’s new!
NVIDIA Tesla P100
The NVIDIA Tesla P100 was originally unveiled at the GPU Technology Conference on April 5, 2016. Touted as the world’s most advanced hyperscale data center accelerator, it was built around the new NVIDIA Pascal architecture and the proprietary NVIDIA NVLink high-speed GPU interconnect.
Like all other Pascal-based GPUs, the NVIDIA Tesla P100 is fabricated on the 16 nm FinFET process technology. Even with the much smaller process technology, the Tesla P100 is the largest FinFET chip ever built.
Unlike the Pascal-based GeForce GTX 1080 and GTX 1070 GPUs designed for desktop gaming though, the Tesla P100 uses HBM2 memory. In fact, the P100 is actually built on top of the HBM2 memory chips in a single package. This new package technology, Chip on Wafer on Substrate (CoWoS), allows for a 3X boost in memory bandwidth to 720 GB/s.
The NVIDIA NVLink interconnect allows up to eight Tesla P100 accelerators to be linked in a single node. This allows a single Tesla P100-based server node to outperform 48 dual-socket CPU server nodes.
Now Available With PCIe Interface
To make Tesla P100 available for HPC (High Performance Computing) applications, NVIDIA has just introduced the Tesla P100 with a PCI Express interface. This is basically the PCI Express version of the original Tesla P100.
Massive Leap In Performance
Such High Performance Computing servers can already make use of the NVIDIA Tesla K80 accelerators, which are based on the older NVIDIA Kepler architecture. The new NVIDIA Pascal architecture, coupled with much faster HBM2 memory, allows for a massive leap in performance. Check out these results that NVIDIA provided :
Ultimately, the NVIDIA Tesla P100 for PCIe-based servers promises to deliver “dramatically more” performance for your money. As a bonus, the energy cost of running Tesla P100-based servers is much lower than CPU-based servers, and those savings accrue over time.
The NVIDIA Tesla P100 for PCIe-based servers will be slightly (~11-12%) slower than the NVLink version, turning out up to 4.7 teraflops of double-precision performance, 9.3 teraflops of single-precision performance, and 18.7 teraflops of half-precision performance.
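The "~11-12%" figure can be reproduced by comparing these numbers against the NVLink version's peak ratings. The NVLink figures used below (5.3, 10.6 and 21.2 teraflops) do not appear in this article. They are the commonly quoted ratings for that model, so treat them as assumptions.

```python
# Relative slowdown of the PCIe card versus the NVLink card (illustrative sketch).

nvlink_tflops = {"FP64": 5.3, "FP32": 10.6, "FP16": 21.2}  # assumed, not from this article
pcie_tflops = {"FP64": 4.7, "FP32": 9.3, "FP16": 18.7}     # from the paragraph above

def percent_slower(value, reference):
    return (1 - value / reference) * 100

slowdown = {
    precision: percent_slower(pcie_tflops[precision], nvlink_tflops[precision])
    for precision in pcie_tflops
}

for precision, pct in slowdown.items():
    print(f"{precision}: {pct:.1f}% slower")
```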
The Tesla P100 will be offered in two configurations. The high-end configuration will have 16 GB of HBM2 memory with a maximum memory bandwidth of 720 GB/s. The lower-end configuration will have 12 GB of HBM2 memory with a maximum memory bandwidth of 540 GB/s.
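The two memory configurations are consistent with the lower-end card simply having one fewer active HBM2 stack. The sketch assumes the 16 GB card uses four identical stacks. The stack count is our assumption and is not stated above.

```python
# Per-stack capacity and bandwidth for the Tesla P100 (illustrative; 4 stacks is assumed).

def per_stack(total_gb, total_gbs, stacks=4):
    return total_gb / stacks, total_gbs / stacks

stack_gb, stack_gbs = per_stack(16, 720)
low_end = (3 * stack_gb, 3 * stack_gbs)  # three active stacks

print(f"per stack: {stack_gb:.0f} GB at {stack_gbs:.0f} GB/s")
print(f"three stacks: {low_end[0]:.0f} GB at {low_end[1]:.0f} GB/s")
```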
Complete NVIDIA Slides
For those who are interested in more details, here are the NVIDIA Tesla P100 for PCIe-based Servers slides.