Blair Frank reports in VentureBeat:
Microsoft made a splash in the world of dedicated AI hardware when it unveiled a new system for high-speed, low-latency serving of machine learning models. The system, called Brainwave, lets developers deploy machine learning models onto programmable silicon and achieve performance beyond what they could get from a CPU or GPU.
Researchers at the Hot Chips conference in Cupertino, California, showed a Gated Recurrent Unit (GRU) model running on Intel’s new Stratix 10 field programmable gate array (FPGA) chip at a speed of 39.5 teraflops, without batching operations at all. The lack of batching means that it’s possible for the hardware to handle requests as they come in, providing real-time insights for machine learning systems.
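To make the workload concrete, here is a minimal NumPy sketch of a single GRU step at batch size one. The matrix-vector products are the kind of arithmetic the FPGA executes on each incoming request; all names and sizes below are illustrative, not taken from Microsoft’s model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step for a single request (batch size 1).

    p holds weight matrices (W_*, U_*) and biases (b_*); the names
    and sizes here are illustrative, not from Microsoft's model.
    """
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])              # update gate
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])              # reset gate
    h_cand = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h_prev) + p["b_h"])   # candidate state
    return (1.0 - z) * h_prev + z * h_cand                                # new hidden state

d_in, d_hid = 4, 8                       # toy sizes; the demonstrated model is vastly larger
rng = np.random.default_rng(0)
p = {n: rng.standard_normal((d_hid, d_hid if n.startswith("U") else d_in))
     for n in ("W_z", "W_r", "W_h", "U_z", "U_r", "U_h")}
p.update({n: np.zeros(d_hid) for n in ("b_z", "b_r", "b_h")})

h = np.zeros(d_hid)                      # hidden state carried between steps
h = gru_step(rng.standard_normal(d_in), h, p)
```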
The model that Microsoft chose is several times larger than convolutional neural networks like AlexNet and ResNet-50, which other companies have used to benchmark their own hardware.
Providing low-latency insights is important for deploying machine learning systems at scale. Users don’t want to wait long for their apps to respond.
“We call it real-time AI because the idea here is that you send in a request, you want the answer back,” said Doug Burger, a distinguished engineer with Microsoft Research. “If it’s a video stream, if it’s a conversation, if it’s looking for intruders, anomaly detection, all the things where you care about interaction and quick results, you want those in real time,” he said.
However, some previously published results on hardware-accelerated machine learning have optimized for throughput at the cost of latency. In Burger’s view, more people should ask how a machine learning accelerator performs without bundling requests into a batch and processing them all at once.
“All of the numbers [other] people are throwing around are juiced,” he said.
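The trade-off behind that complaint is easy to see with a toy calculation. The numbers in this Python sketch are invented, purely to illustrate the point: if an accelerator amortizes its compute well, batching inflates throughput while the first request in each batch sits waiting for the batch to fill.

```python
# Invented numbers, not measurements: they only illustrate the
# latency/throughput trade-off Burger describes.
arrival_gap_ms = 2.0   # assumed gap between incoming requests
compute_ms = 5.0       # assumed time to run one batch; held constant as
                       # batch size grows, i.e. compute amortizes well

for batch in (1, 8, 64):
    fill_wait = (batch - 1) * arrival_gap_ms     # first request waits for the batch to fill
    latency = fill_wait + compute_ms             # worst-case response time
    throughput = batch / (compute_ms / 1000.0)   # requests/s once the batch runs
    print(f"batch={batch:3d}  latency={latency:6.1f} ms  throughput={throughput:8.0f} req/s")
```

At batch size 1 the response comes back in 5 ms; at batch size 64 the headline throughput is 64x higher, but the first request in the batch waits over 130 ms.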
Microsoft is using Brainwave across the army of FPGAs it has installed in its data centers. According to Burger, Brainwave will allow Microsoft services to more rapidly support artificial intelligence features. In addition, the company is working to make Brainwave available to third-party customers through its Azure cloud platform.
FPGAs let programmers configure, prior to runtime, hardware optimized to execute particular functions, such as the math necessary to serve insights from neural networks. Microsoft has deployed hundreds of thousands of FPGAs in its data centers on boards that are slotted into servers and connected to the network.
Brainwave loads a trained machine learning model into the FPGA hardware’s memory, where it stays throughout the lifetime of a machine learning service. That hardware can then be used to compute whatever insights the model is designed to generate, such as a predicted string of text. In the event that a model is too big to run on a single FPGA, software deploys and executes it across multiple hardware boards.
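As a rough illustration of that splitting idea, here is a hypothetical Python sketch that assigns layers to boards by a memory budget. The greedy layer-wise strategy and every name in it are assumptions made for illustration, not Brainwave’s actual scheduler.

```python
# Hypothetical sketch of the partitioning idea described above: greedily
# assign layers to boards until a board's memory budget is exhausted.
# The strategy and all names are assumptions, not Brainwave internals.

def partition(layer_sizes_mb, board_budget_mb):
    boards, current, used = [], [], 0.0
    for i, size in enumerate(layer_sizes_mb):
        if used + size > board_budget_mb and current:
            boards.append(current)      # this board is full; start the next
            current, used = [], 0.0
        current.append(i)
        used += size
    boards.append(current)
    return boards                       # list of layer indices per board

# A model whose weights exceed one board's budget spans several boards.
print(partition([30, 40, 25, 35, 20], board_budget_mb=64))
# -> [[0], [1], [2, 3], [4]]
```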
Microsoft isn’t the only company investing in hardware meant to accelerate machine learning. Google announced the second revision of its Tensor Processing Unit, a dedicated chip for machine learning training and serving, earlier this year. A slew of startups are also building dedicated hardware accelerators for machine learning.
One of the common criticisms of FPGAs is that they are slower or less efficient than chips made specifically to execute machine learning operations. Burger said that this performance milestone should show that programmable hardware can deliver high performance as well.
In addition, the performance shown today comes from brand-new hardware, and Burger said there’s room for Intel and Microsoft to further optimize both the chip’s performance and Brainwave’s use of it. With those improvements, it should be possible for Microsoft to hit 90 teraflops on the Intel Stratix 10.
Right now, Brainwave supports trained models created with Microsoft’s CNTK framework and Google’s TensorFlow framework. Burger said the team is working on compatibility with other tools such as Caffe. Microsoft hasn’t given a roadmap for when it will make Brainwave available to customers, but it is working toward a future in which third parties can bring any trained model and run it on Brainwave.
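As a concrete example of the starting point, this is roughly what exporting a trained TensorFlow model looks like using the framework’s standard SavedModel format. Everything about how Brainwave would then consume the exported model is an assumption, since Microsoft hasn’t published that interface.

```python
import tensorflow as tf

# A small recurrent model, echoing the GRU demo; the architecture,
# sizes, and export path are illustrative assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.GRU(128, input_shape=(None, 64)),  # variable-length sequences, 64 features
    tf.keras.layers.Dense(10),
])
model.compile(optimizer="adam", loss="mse")
# ... training would happen here ...

# Serialize in TensorFlow's standard SavedModel format; how a service
# like Brainwave would ingest this artifact is not publicly documented.
tf.saved_model.save(model, "/tmp/gru_model")
```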