Hot Chips 2020 Live Blog: Baidu Kunlun AI Processor (4:30pm…


07:29PM EDT – Last session of Hot Chips is all about ML inference. Starting with Baidu, and its Kunlun AI processor

07:30PM EDT – We heard of Baidu’s Kunlun a few months ago through a press release from the company and Samsung stating that the silicon was using Interposer-Cube 2.5D packaging, as well as HBM2, and packing 260 TOPs into 150 W.

07:32PM EDT – Baidu and Samsung built the chip together

07:33PM EDT – Need a processor to cover a diverse AI workflow

07:33PM EDT – NLP = Natural Language Processing

07:33PM EDT – All of these techniques are a priority within Baidu

07:34PM EDT – Traditional AI computing is carried out in Cloud, Datacenter, HPC, Smart Industry, Smart City

07:35PM EDT – High-end AI chips cost a lot to create

07:36PM EDT – Try to find as much market volume as possible

07:36PM EDT – The challenge is the type of compute

07:36PM EDT – Design and implementation

07:38PM EDT – Kunlun (Kun-loon)

07:38PM EDT – Need flexible, programmable, high performance

07:38PM EDT – Moved from FPGA to ASIC

07:39PM EDT – 256 TOPs in 2019

07:42PM EDT – (the presenter is a bit slow fyi)

07:43PM EDT – Now some detail

07:43PM EDT – Samsung Foundry 14nm

07:43PM EDT – Interposer package, 2 HBM, 512 GB/s

07:43PM EDT – PCIe 4.0 x8

07:43PM EDT – 150W / 256 TOPs

07:43PM EDT – PCIe card

07:44PM EDT – 256TOPs for INT8

07:44PM EDT – 16 GB HBM

07:44PM EDT – Passive cooling
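
A quick back-of-the-envelope check on those headline numbers, as a minimal sketch. The inputs are the figures quoted on the slide; the even split of bandwidth and capacity across the two HBM stacks is my assumption, not something the presenter stated, and the function names are made up for illustration.

```python
# Figures quoted in the talk.
PEAK_INT8_TOPS = 256   # peak INT8 throughput, tera-ops per second
TDP_WATTS = 150        # board TDP
HBM_STACKS = 2
HBM_TOTAL_GBPS = 512   # aggregate HBM bandwidth, GB/s
HBM_TOTAL_GB = 16      # aggregate HBM capacity, GB


def tops_per_watt(tops, watts):
    """Peak efficiency in TOPS per watt."""
    return tops / watts


def ops_per_byte(tops, gb_per_s):
    """Peak ops the chip can issue per byte moved from HBM."""
    return (tops * 1e12) / (gb_per_s * 1e9)


print(f"Peak efficiency at TDP: {tops_per_watt(PEAK_INT8_TOPS, TDP_WATTS):.2f} TOPS/W")
print(f"Peak ops per HBM byte:  {ops_per_byte(PEAK_INT8_TOPS, HBM_TOTAL_GBPS):.0f}")

# ASSUMPTION: bandwidth and capacity are split evenly across the two stacks.
print(f"Per stack: {HBM_TOTAL_GBPS / HBM_STACKS:.0f} GB/s, {HBM_TOTAL_GB / HBM_STACKS:.0f} GB")
```

The ops-per-byte ratio is the more telling number: a workload needs a lot of data reuse in on-chip memory to get anywhere near the peak figure.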

07:45PM EDT – Same layout as XPUv1 shown at Hot Chips 2017

07:45PM EDT – XPU cluster

07:45PM EDT – Software defined neural network engine

07:45PM EDT – XPU-SDNN

07:46PM EDT – XPU-SDNN does tensor and vector

07:46PM EDT – XPU-Cluster does scalar and vector

07:46PM EDT – Each cluster has 16 tiny cores

07:46PM EDT – Each unit has 16 MB on-chip memory

07:47PM EDT – (what are the tiny cores?)

07:47PM EDT – Graph compiler

07:47PM EDT – Supports PaddlePaddle, TensorFlow, PyTorch

07:48PM EDT – XPU C/C++ for custom kernels

07:48PM EDT – 256 TOPs for 4096x4096x4096 GEMM INT8 inference
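
To put that GEMM size in context, here is a rough sketch of how much work a 4096x4096x4096 matrix multiply is and how long it would take at the quoted peak. It assumes the usual convention of counting each multiply-accumulate as two operations; the talk did not say how Baidu counts ops, and the helper names are hypothetical.

```python
def gemm_ops(m, n, k):
    """Operation count for an (m x k) by (k x n) matrix multiply.

    ASSUMPTION: each multiply-accumulate counts as 2 ops (one multiply, one add).
    """
    return 2 * m * n * k


def ideal_time_ms(ops, peak_tops):
    """Lower bound on runtime in milliseconds at a given peak TOPS rate."""
    return ops / (peak_tops * 1e12) * 1e3


ops = gemm_ops(4096, 4096, 4096)
print(f"Ops in one GEMM: {ops:,}")
print(f"Ideal time at 256 TOPs: {ideal_time_ms(ops, 256):.3f} ms")
```

Under that counting convention a single GEMM of this size is about 0.137 tera-ops, so sustaining 256 TOPs means finishing one roughly every half a millisecond.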

07:51PM EDT – These benchmarks are very odd

07:51PM EDT – Big edge = industrial

07:51PM EDT – Mask inspection

07:52PM EDT – Mask RCNN

07:52PM EDT – Available in Baidu Cloud

07:53PM EDT – Q&A time

07:54PM EDT – Q: Hardware image/video decode? A: No

07:55PM EDT – Q: INT4 throughput as INT8? A: INT4 same as INT8, but INT4 can leverage more of the capabilities

07:56PM EDT – Q: Size and BW of on-chip shared memory? A: BW is 512 GB/s for each port of each cluster (I don't think that answers the question)

07:56PM EDT – Q: Static scheduling of sources? A: Yes

07:57PM EDT – Q: Power? A: Real power 70-90W, almost the same as T4, but TDP 150W
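
For what that power answer could mean for efficiency, a minimal sketch using the 70-90W range against the 256 TOPs peak. Big caveat: the presenter did not say the chip reaches peak throughput at that power, so treat these as optimistic upper bounds. The function name is made up for illustration.

```python
def tops_per_watt_range(peak_tops, watts_low, watts_high):
    """Return (worst, best) TOPS/W across a power range.

    ASSUMPTION: peak throughput is reached somewhere inside this power range,
    which the talk did not confirm.
    """
    return peak_tops / watts_high, peak_tops / watts_low


worst, best = tops_per_watt_range(256, 70, 90)
print(f"Upper-bound efficiency at real power: {worst:.2f} to {best:.2f} TOPS/W")
```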

07:57PM EDT – That’s a wrap. Next talk is Alibaba NPU


