Google, Purdue & Harvard U’s Open-Source Framework for TinyML Achieves up to 75x Speedups on FPGAs

A research team from Google, Purdue University and Harvard University presents CFU Playground, a full-stack open-source framework for the rapid and iterative design of accelerators for embedded ML systems, enabling developers with minimal FPGA and hardware experience to achieve model speedups of up to 75x.

Running embedded machine learning (ML) systems on edge devices has become increasingly attractive in recent years, as migrating such systems from the cloud can improve privacy, latency, security and accessibility. However, given the ever-changing ML landscape, it is also desirable to avoid the massive up-front engineering costs associated with the creation of customized application-specific integrated circuits (ASICs) for these embedded systems.

In the new paper CFU Playground: Full-Stack Open-Source Framework for Tiny Machine Learning (tinyML) Acceleration on FPGAs, a research team from Google, Purdue University and Harvard University introduces CFU Playground, a full-stack open-source framework that integrates open-source software, RTL (register-transfer level) generators, and FPGA (field-programmable gate array) tools to enable the rapid and iterative design of accelerators for embedded ML systems. Developers can use the framework to design custom function units (CFUs) for distinct ML operations. In tests, even users with minimal FPGA or hardware experience were able to achieve model speedups of up to 75x.

The team summarizes their paper’s main contributions as:

  1. An out-of-the-box, full-stack framework that fully integrates open-source tools across the entire stack to facilitate rich community-driven ecosystem development.
  2. An agile methodology for developers to progressively and iteratively design tightly-coupled, bespoke accelerators for resource-constrained, latency-bound tinyML applications.
  3. Through cross-stack insights, enabled by a fully open-source stack, we demonstrate novel model-specific resource allocation trade-offs between the CFU, CPU, and memory system that enable optimal ML performance on resource-constrained FPGA platforms for two important use cases.

CFU Playground runs a complete system-on-chip (SoC) on an FPGA to capture the full-stack system effects of accelerating ML models. The framework includes software, gateware, and hardware components to provide developers with a fast and effective open-source design flow that can realize significant speedups by exploring the design space between the CPU and a tightly-coupled CFU.

CFU Playground’s gateware is built on the LiteX framework, which provides a convenient and efficient infrastructure for creating FPGA soft cores and SoCs. The gateware is adaptable to a wide range of hardware platforms and currently supports Xilinx 7-Series and Lattice iCE40, ECP5, and CrossLink FPGAs. On the software side, custom instructions are added to the CPU’s instruction set, and the program uses them to invoke the CFU.
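To make the idea of a custom instruction invoking a CFU more concrete, here is a minimal software model in Python. It is only an illustrative sketch, not the framework's actual API: the class `MacCfuModel`, its function codes `RESET` and `MAC4`, and the choice of a four-lane int8 multiply-accumulate are all hypothetical. The one assumption carried over from typical custom-instruction designs is the interface shape: a function selector plus two 32-bit operands in, one 32-bit result out.

```python
MASK32 = 0xFFFFFFFF


def pack_int8x4(values):
    """Pack four signed 8-bit integers into one 32-bit word (lane 0 in the low byte)."""
    word = 0
    for lane, value in enumerate(values):
        word |= (value & 0xFF) << (8 * lane)
    return word


def unpack_int8x4(word):
    """Split a 32-bit word into four signed 8-bit integers."""
    lanes = []
    for lane in range(4):
        byte = (word >> (8 * lane)) & 0xFF
        lanes.append(byte - 256 if byte >= 128 else byte)
    return lanes


def to_int32(word):
    """Reinterpret an unsigned 32-bit word as a signed integer."""
    return word - (1 << 32) if word & 0x80000000 else word


class MacCfuModel:
    """Hypothetical CFU: a 32-bit accumulator fed by a 4-lane int8 dot product."""

    RESET = 0  # hypothetical function code: clear the accumulator
    MAC4 = 1   # hypothetical function code: acc += dot(a lanes, b lanes)

    def __init__(self):
        self.acc = 0

    def op(self, funct, a, b):
        """Model one custom instruction: selector plus two operands, one result."""
        if funct == self.RESET:
            self.acc = 0
        elif funct == self.MAC4:
            dot = sum(x * y for x, y in zip(unpack_int8x4(a), unpack_int8x4(b)))
            self.acc = (self.acc + dot) & MASK32  # wrap like a 32-bit register
        else:
            raise ValueError(f"unknown function code: {funct}")
        return self.acc


cfu = MacCfuModel()
cfu.op(MacCfuModel.RESET, 0, 0)
result = cfu.op(
    MacCfuModel.MAC4,
    pack_int8x4([1, -2, 3, -4]),
    pack_int8x4([5, 6, -7, 8]),
)
print(to_int32(result))  # -60
```

In this model, a single call to `op` replaces four separate multiplies and adds on the CPU, which is the kind of work a tightly-coupled function unit is meant to absorb.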

CFU Playground enables a deploy→profile→optimize loop for iterative, guided optimization of resource-constrained ML systems. This loop helps developers quickly focus their design effort on any layer of the stack, measure its performance at a fine granularity, implement custom optimizations, and repeat the process.
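The profile step of such a loop can be sketched as ranking operators by their share of total cycles, so that optimization effort goes to the largest contributor first. The snippet below is a hedged illustration only: the function `rank_hotspots` is hypothetical, and the operator names and cycle counts are made-up placeholders, not measurements from the paper or output from the framework.

```python
def rank_hotspots(cycles_by_op):
    """Return (op_name, cycles, share_of_total) tuples, largest consumer first."""
    total = sum(cycles_by_op.values())
    if total <= 0:
        raise ValueError("profile must contain a positive number of cycles")
    ranked = sorted(cycles_by_op.items(), key=lambda item: item[1], reverse=True)
    return [(name, cycles, cycles / total) for name, cycles in ranked]


# Placeholder profile: invented numbers, for illustration only.
example_profile = {
    "CONV_2D": 800_000,
    "DEPTHWISE_CONV_2D": 150_000,
    "FULLY_CONNECTED": 30_000,
    "SOFTMAX": 20_000,
}

for name, cycles, share in rank_hotspots(example_profile):
    print(f"{name:<20} {cycles:>10} cycles  {share:6.1%}")
```

After each optimization, the same ranking would be recomputed on a fresh profile, since removing one bottleneck usually promotes a different operator to the top of the list.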

The team had developers with varying levels of expertise experiment with the CFU Playground on two common tinyML use cases: image classification and keyword spotting. In the image classification task, a senior engineer working part-time on the project for five weeks achieved a 55x speedup on the MobileNetV2 convolution operation. For keyword spotting, an undergraduate-level intern with minimal FPGA and hardware experience obtained a cumulative speedup of 75x in under four weeks.
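When reading these figures, it helps to keep two pieces of arithmetic in mind. A speedup on a single operation translates into a smaller whole-model speedup that depends on how much of the runtime that operation accounts for (Amdahl's law), and a cumulative speedup is the product of the per-step speedups, each measured against the previous step. The sketch below illustrates both; the function names are hypothetical, and the runtime fraction and step factors are arbitrary inputs, not values reported in the paper.

```python
from math import prod


def overall_speedup(fraction_accelerated, op_speedup):
    """Amdahl's law: whole-program speedup when only part of the runtime is accelerated."""
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be between 0 and 1")
    if op_speedup <= 0:
        raise ValueError("op_speedup must be positive")
    remaining = 1.0 - fraction_accelerated
    return 1.0 / (remaining + fraction_accelerated / op_speedup)


def cumulative_speedup(step_speedups):
    """Total speedup after a series of steps, each measured relative to the previous one."""
    return prod(step_speedups)


# Arbitrary illustration: accelerate an operation 55x, assuming (hypothetically)
# that it accounts for 80% of the baseline runtime.
print(overall_speedup(0.8, 55.0))

# Arbitrary illustration: two successive optimization steps of 2x and 4x.
print(cumulative_speedup([2.0, 4.0]))  # 8.0
```

The first function shows why a large per-operation gain shifts the bottleneck elsewhere: once the accelerated operation is fast, the untouched remainder of the runtime dominates.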

The empirical results show that CFU Playground helps developers produce iterative improvements with relative ease and that even inexperienced developers can co-optimize a CPU and CFU in severely resource-constrained environments using the framework.

The researchers say that in future work, they hope to integrate CFU Playground into a closed-loop learning-based system that uses ML methods to automatically optimize embedded systems.

The paper CFU Playground: Full-Stack Open-Source Framework for Tiny Machine Learning (tinyML) Acceleration on FPGAs is on arXiv.


Author: Hecate He | Editor: Michael Sarazen

