Advances in IT, from IoT and cloud to 5G and artificial intelligence, are driving explosive data growth, and the ability to handle that data is increasingly tied to competitiveness. In addition, the transition to cloud computing, the growing use of AI and analytics, and the cloudification of the network and edge are driving demand for changes in IT infrastructure. Intel expects this data-centric era to bring its largest market opportunity ever, with a total size of $200 billion. To prepare for this market, Intel introduced plans to offer solutions, optimized at the software and system level, that can process everything, store more, and move data faster.
As its new portfolio for the data-centric era, Intel introduced the second-generation Xeon Scalable Processors, the new Xeon D-1600 processor, Agilex FPGAs, Optane DC Persistent Memory, Optane DC SSDs, QLC 3D NAND-based DC series SSDs, and the 800 series Ethernet adapters. The new Xeon Scalable Processors, Optane DC Persistent Memory, Optane DC SSDs, and Ethernet technologies are expected to deliver superior performance and efficiency across a variety of workloads by being tightly coupled at the system level and optimized at the software level. What’s more, these innovations can be deployed faster through ‘Intel Select Solutions’, which offer proven, optimized configurations.
|▲ The second-generation Xeon Scalable Processors, code-named ‘Cascade Lake’
|▲ Ian Steiner, lead architect of ‘Cascade Lake’
|▲ Major technical features of the 2nd-Gen Xeon Scalable Processors
Ian Steiner, lead architect of the second-generation Xeon Scalable Processors, presented the processors, which are code-named ‘Cascade Lake’. He began by comparing the present with the situation seven years ago, when the Sandy Bridge-based Xeon E5-2600 series was introduced. At that time cloudification was in its early stages, but cloud is now in active use in every area. Also, while power consumption was the key concern seven years ago, every component is now counted as ‘cost’. In addition, the fields that require heavy computing power have expanded to HPC, analytics, AI and beyond, and the use of custom processors specialized for particular workloads has increased.
The second-generation Xeon Scalable Processors provide improved performance, scalability and efficiency while building on the features and platform of the existing Skylake architecture. For memory, the supported capacity has doubled thanks to 16Gb DDR4 devices, and the memory controller speed has increased to DDR4-2933. AI inference performance has been greatly improved through AVX-512 VNNI and DL Boost technology, and hardware-level countermeasures against vulnerabilities such as Meltdown and Spectre have been applied. Moreover, although the processors still use a 14nm process, the process has been refined to achieve higher operating speeds and better power efficiency.
The second-generation Xeon Scalable Processors offer up to 28 cores in the 8200 series and up to 56 cores in the 9200 series. Features such as the cache configuration, up to three 10.4GT/s UPI links for inter-processor connectivity, and up to 48 PCIe lanes are carried over. The maximum memory capacity has increased with support for 16Gb DDR4 devices, and the operating speed has increased with support for 6-channel DDR4-2933. With Optane DC Persistent Memory, up to 4.5TB of memory per processor is supported. On top of that, the AVX-512 vector units can handle 16 DP, 32 SP, or, with DL Boost, 128 INT8 MACs in a single cycle.
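The per-cycle figures above follow from the vector width. The sketch below reproduces them under the assumption (not stated at this point in the article, though two ports are mentioned later in the VNNI discussion) that each core has two 512-bit execution ports capable of these operations. The function and constant names are illustrative only.

```python
# Rough per-core, per-cycle MAC throughput implied by a 512-bit vector width.
VECTOR_BITS = 512  # AVX-512 register width in bits
PORTS = 2          # assumption: two 512-bit FMA/VNNI-capable ports per core


def macs_per_cycle(element_bits: int) -> int:
    """Number of multiply-accumulates per cycle for a given element size."""
    lanes = VECTOR_BITS // element_bits  # elements that fit in one register
    return lanes * PORTS


print(macs_per_cycle(64))  # double precision: 16
print(macs_per_cycle(32))  # single precision: 32
print(macs_per_cycle(8))   # INT8 with DL Boost: 128
```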
New in the second-generation Xeon Scalable Processor family, the Xeon Platinum 9200 series places two processor dies, linked by UPI, in one package. It supports configurations of up to two processors. A dual-processor Xeon Platinum 9200 system is logically identical to an existing four-socket system, but it offers advantages in latency and allows higher compute density or smaller form factors. The memory controllers provide up to 281 GB/s of bandwidth per processor in a 12-channel configuration that uses both dies. The Xeon Platinum 9200 series is supplied in BGA form, bonded to the motherboard, and has a TDP of 250 to 400W.
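The 281 GB/s figure is consistent with the theoretical peak of 12 DDR4-2933 channels. A minimal sketch of that arithmetic, assuming the standard 64-bit (8-byte) data bus per DDR4 channel and a nominal 2933 MT/s transfer rate; the function name is illustrative:

```python
def peak_ddr4_bandwidth_gbs(channels: int, mega_transfers_per_s: int,
                            bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_second = channels * mega_transfers_per_s * 1e6 * bus_bytes
    return bytes_per_second / 1e9


# One 9200-series package: two dies with six channels each.
print(peak_ddr4_bandwidth_gbs(12, 2933))  # about 281.6
# One die (or one 8200-series processor): six channels.
print(peak_ddr4_bandwidth_gbs(6, 2933))   # about 140.8
```

This is a theoretical ceiling; measured figures such as STREAM results will be lower.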
|▲ VNNI completes in a single cycle the inference-related operations that previously took three cycles
|▲ Software optimization and hardware support can lead to significant inference performance improvement
Matrix multiplication, the core operation in deep learning, multiplies the elements of a row by the elements of a column and accumulates the products into a single value, repeated across many rows and columns. Traditional HPC and AI training workloads use floating-point operations for this, and the wide range of possible values is a drawback for performance. Using INT8 instead of FP for inference, by contrast, was said to offer a greatly reduced range of values to consider, higher power efficiency through fewer multiplications, and reduced pressure on the cache and memory subsystems. When AVX-512 and VNNI are used on the second-generation Xeon Scalable Processors, operations that take INT8 inputs and produce INT32 outputs can run four times faster than with AVX2.
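To make the INT8 idea concrete, here is a minimal sketch of a dot product (one output element of a matrix multiplication) computed with 8-bit integers. It assumes simple symmetric per-vector quantization, which is one common scheme and not necessarily the one Intel's libraries use. All names are illustrative.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers in [-127, 127] plus a scale factor."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def int8_dot(a, b):
    """Approximate the float dot product using integer multiply-accumulate."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    acc = sum(x * y for x, y in zip(qa, qb))  # wide integer accumulator
    return acc * scale_a * scale_b            # convert back to float


a = [0.9, -0.3, 0.4, 1.2]
b = [0.7, 0.2, -0.5, 1.0]
exact = sum(x * y for x, y in zip(a, b))
print(exact, int8_dot(a, b))  # the two values should be close
```

The narrow value range costs a small amount of accuracy, which inference workloads can often tolerate.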
Previously, obtaining an INT32 result from INT8 inputs required three stages: multiplication, up-conversion, and accumulation. Processing up to 128 MACs took three cycles on two ports per core. With VNNI, these three steps are handled by a single instruction in a single cycle, which in theory can triple performance. With the MKL-DNN library, performance improves by 1.33 times when switching from AVX-512 based FP32 to INT8, and by 3 times when switching from AVX-512 based INT8 to VNNI-based INT8.
Intel reported that in an MKL-DNN micro-benchmark scenario, performance per watt increases greatly when VNNI is used. With VNNI, power consumption per socket is similar to that of FP32, so the power consumed per unit of work falls in proportion to the performance gain. In addition, with DL Boost technology the processor's L2 cache miss rate is significantly lower than with FP32, and memory bandwidth usage also decreases.
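The difference between the three-step sequence and the fused VNNI operation can be emulated for a single 32-bit lane. The sketch below is a simplified model, not Intel's implementation: it assumes the earlier sequence corresponds to the instructions commonly cited for it (VPMADDUBSW, VPMADDWD, VPADDD) and that the fused operation corresponds to VPDPBUSD, with unsigned 8-bit inputs on one side and signed 8-bit inputs on the other. The helper names are illustrative.

```python
def sat16(v):
    """Saturate to the signed 16-bit range, as the first legacy step does."""
    return max(-32768, min(32767, v))


def wrap32(v):
    """Wrap to the signed 32-bit range."""
    return (v + 2 ** 31) % 2 ** 32 - 2 ** 31


def legacy_three_step(acc, a_u8, b_s8):
    """One 32-bit lane: four u8 x s8 products added to acc in three steps."""
    # Step 1 (multiply): adjacent products summed into 16-bit values.
    pair_sums = [sat16(a_u8[i] * b_s8[i] + a_u8[i + 1] * b_s8[i + 1])
                 for i in (0, 2)]
    # Step 2 (up-convert): widen to 32 bits and combine the two pair sums.
    widened = pair_sums[0] + pair_sums[1]
    # Step 3 (accumulate): add into the running 32-bit total.
    return wrap32(acc + widened)


def vnni_fused(acc, a_u8, b_s8):
    """Same lane computed as one fused multiply-accumulate."""
    return wrap32(acc + sum(x * y for x, y in zip(a_u8, b_s8)))


a = [1, 2, 3, 4]     # unsigned 8-bit inputs
b = [-1, 2, -3, 4]   # signed 8-bit inputs
print(legacy_three_step(100, a, b), vnni_fused(100, a, b))  # 110 110
```

A 512-bit register holds 16 such lanes, so one fused instruction covers 64 MACs, and two ports give the 128 MACs per cycle quoted above. In this model the two paths can differ when the 16-bit intermediate saturates, which the fused form avoids.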
|▲ The memory bandwidth allocation function is added to Intel Resource Director Technology.
|▲ Types of Speed Select technologies applied to N-series products, which mainly specialize in network workloads
|▲ Types of Speed Select technologies applied to Y-series products, which specialize in data center workloads
Optane DC Persistent Memory, officially supported starting with the second-generation Xeon Scalable Processors, can be used in two modes. ‘Memory Mode’ uses DRAM as a cache and expands the total memory capacity. ‘App Direct Mode’ is a workload-optimized mode in which applications directly access DRAM or Optane DC Persistent Memory depending on the purpose. The modules are compatible with the DDR4 interface and will be offered in capacities from 128GB to 512GB. Intel also emphasized that the processors and memory modules were developed together from the beginning.
Intel Resource Director Technology (RDT) has also gained a new capability. RDT makes it possible to partition processor resources so that jobs do not affect one another's performance. By prioritizing jobs, it can maximize system utilization while maintaining SLA compliance. RDT also allows L3 cache and memory bandwidth to be monitored and controlled. In the second-generation Xeon Scalable Processors, Memory Bandwidth Allocation has been added to allocate or limit the memory bandwidth available to specific tasks, which minimizes their performance impact on the rest of the system and helps ensure SLA compliance.
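On Linux, RDT controls are typically exposed through the `resctrl` filesystem, where a memory bandwidth limit is written as an `MB:` line in a group's `schemata` file. The sketch below is not from the presentation. It assumes `resctrl` is mounted at `/sys/fs/resctrl`, that the process has root privileges, and that the group name, domain IDs, and percentages are placeholders. Valid percentages depend on the hardware's minimum and granularity.

```python
from pathlib import Path


def mba_schemata_line(percent_by_domain):
    """Build an 'MB:' schemata line, e.g. {0: 30, 1: 100} -> 'MB:0=30;1=100'."""
    parts = []
    for domain in sorted(percent_by_domain):
        pct = percent_by_domain[domain]
        if not 0 < pct <= 100:
            raise ValueError(f"bandwidth percentage out of range: {pct}")
        parts.append(f"{domain}={pct}")
    return "MB:" + ";".join(parts)


def limit_bandwidth(group, percent_by_domain, pids,
                    root=Path("/sys/fs/resctrl")):
    """Create a control group, set its bandwidth cap, and move tasks into it."""
    group_dir = Path(root) / group
    group_dir.mkdir(exist_ok=True)
    (group_dir / "schemata").write_text(
        mba_schemata_line(percent_by_domain) + "\n")
    for pid in pids:
        # resctrl expects one PID per write to the tasks file.
        (group_dir / "tasks").write_text(f"{pid}\n")


print(mba_schemata_line({0: 30, 1: 100}))  # MB:0=30;1=100
```

A call such as `limit_bandwidth("batch", {0: 30}, [1234])` would cap a hypothetical low-priority process at roughly 30% of the memory bandwidth on domain 0, leaving more headroom for latency-sensitive jobs.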
Intel Speed Select Technology (SST), intended for workload-optimized environments, consists of three specific technologies, and which of them is available depends on the product family. SST-CP keeps the operating speed high for priority tasks and lowers it for lower-priority tasks, while SST-BF (Base Frequency) sets certain cores to a higher operating speed so that specific workloads can be assigned to them. With these technologies, total power consumption can be held at a constant level while providing a suitable environment both for workloads that are sensitive to operating speed and for those that are not.
SST-PP adds flexibility in processor selection and server operation by allowing up to three profiles in a single product, each with its own maximum temperature, TDP, operating speed, and number of active cores. This makes it possible to choose, according to the situation, between settings such as fewer active cores at a higher operating speed or the maximum number of active cores at a lower operating speed. As a usage example, Intel described booting a server and provisioning a workload by selecting an SST-PP profile in Ironic, the OpenStack bare-metal provisioning system. The benefit is greater flexibility in infrastructure that handles workloads with differing and changing characteristics.
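The trade-off between profiles can be illustrated with a small selection routine. Everything in this sketch is hypothetical: the profile names, core counts, frequencies, and TDP values are invented for illustration and do not describe any real product, and the selection rule (highest base frequency among profiles with enough cores) is just one plausible policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    active_cores: int
    base_ghz: float
    tdp_watts: int


# Hypothetical profiles for a single processor (SST-PP allows up to three).
PROFILES = [
    Profile("profile-0", active_cores=24, base_ghz=2.1, tdp_watts=150),
    Profile("profile-1", active_cores=20, base_ghz=2.3, tdp_watts=150),
    Profile("profile-2", active_cores=16, base_ghz=2.6, tdp_watts=150),
]


def select_profile(profiles, min_cores):
    """Pick the fastest profile that still offers at least min_cores."""
    candidates = [p for p in profiles if p.active_cores >= min_cores]
    if not candidates:
        raise ValueError(f"no profile offers {min_cores} cores")
    return max(candidates, key=lambda p: p.base_ghz)


print(select_profile(PROFILES, min_cores=8).name)   # profile-2
print(select_profile(PROFILES, min_cores=24).name)  # profile-0
```

A provisioning system could run a rule like this before boot, so that a frequency-sensitive workload and a throughput-oriented workload receive different settings on the same hardware.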
|▲ The dual processor configuration of Xeon Platinum 9200 series processors is logically consistent with the existing 4-socket configuration.
Kartik Ananth, Senior Principal Engineer in the Intel Data Center Group, presented the Xeon Platinum 9200 series processors and platforms. One of the most significant features of this processor is its high performance per socket, achieved by combining two second-generation Xeon Scalable Processor dies in a single processor and socket. In addition, the two-die configuration provides twice the memory bandwidth per processor, and each die can still be reached with single-hop latency. So where the 'density' of computing power matters, the same capacity can be achieved in less space than with an existing 4-socket configuration.
The Xeon Platinum 9200 processor consists of two dies connected via UPI in a single package. It supports configurations of up to two processors, which are logically identical to an existing four-socket configuration, with three UPI links per die, each connected directly to another die. Each die has a 6-channel DDR4 memory controller, giving 12 DDR4 channels per processor. The processor package is a BGA with 5903 contacts at a 0.99mm pitch, and it is supplied at the system level together with the motherboard. The Intel Server System S9200WK, which features dual-processor configurations of the Xeon Platinum 9200 series, offers up to 80 PCIe 3.0 lanes.
The Xeon Platinum 9200 series processors are available in 32-, 48- and 56-core configurations, and all of them provide 12 DDR4 memory channels, delivering outstanding performance on memory-intensive workloads. Intel's test results show up to 407GB/s of STREAM-TRIAD performance on dual-processor configurations. Memory bandwidth works out to 3.6 GB/s per core on a 56-core processor and 6.2 GB/s per core on a 32-core processor, providing a favorable environment for memory bandwidth-sensitive applications such as HPC. Furthermore, in every product in the family the full TDP can be dissipated with a single heat sink.
|▲ Main features of Intel Server System S9200WK for Xeon Platinum 9200
|▲ Xeon Platinum 9200 series processors are supplied as system-level configurations.
Xeon Platinum 9200 series processors are supplied with the Intel Server System S9200WK. The S9200WK is a 2U rack system that holds up to four independent compute nodes, depending on the node configuration, and each node supports warm-swap. Memory is provided in 12-channel configurations with 12 DIMMs per processor, and for storage, a 2U compute module can use two hot-swap U.2 NVMe SSDs. The chassis uses three hot-swap power supplies of 2100W or 1600W and offers both air and liquid cooling options.
The compute modules include a 1U half-width liquid-cooled compute sled, a 2U half-width liquid-cooled service sled, and a 2U half-width air-cooled compute/service sled. Hot-swap storage is available only in 2U compute modules. For NVMe, each 1U node has two M.2 slots, and each 2U node has two M.2 slots and two U.2 bays. For PCIe expansion, each 1U node can use two low-profile PCIe cards and each 2U node can use four. The Intel Server Chassis FC2000 is Intel's disaggregated server configuration, providing shared power and cooling, with three 1600W or 2100W power supply units for high availability and a choice of air or liquid cooling.
In terms of software optimization across the architecture, Xeon Platinum 9200 processors report additional information about multi-chip packaging in CPUID. Without it, a two-die Xeon Platinum 9200 processor might be recognized as two processors; with it, software can logically recognize and operate the two dies as a single physical package. In addition, the benefits of the second-generation Xeon Scalable Processors, such as DL Boost technology, AVX-512 support, and various software optimizations for AI, apply equally to Xeon Platinum 9200 processors. On top of that, the AI inference performance of the Xeon Platinum 8280 is 14 times higher than that of the first Xeon Scalable Processors at launch, and the Xeon Platinum 9282 achieves a 30-fold improvement.