Integrating an NPU with PyTorch 2.0 Compile - Sol Kim & Juho Ha, FuriosaAI
We will share our experiences integrating our NPU and its compiler, written in Rust, with PyTorch 2.0 Compile, covering the following topics: 1. Leveraging PyTorch 2.0 Features for NPU-Accelerable Model Compilation: we will elaborate on how we leveraged PyTorch 2.0 features, such as GraphModule, TorchDynamo, and lowered IRs (Core ATen, Prims IR), to compile models written in PyTorch into NPU-accelerable models. 2. Utilizing GraphModule as a Frontend and NPU Execution Interface: we will explain how we utilized GraphModule both as a frontend for our compiler and as an interface for NPU execution, and discuss its capabilities for debuggability and programmability. 3. Influence of PyTorch 2.0 Compile on NPU Compiler Design: we will share how PyTorch 2.0 Compile has influenced the design of our NPU compiler, as well as our expectations as an NPU vendor and a PyTorch community member for future developments in this domain.
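As a rough illustration of the hand-off described above, the sketch below registers a custom torch.compile backend that receives the TorchDynamo-captured GraphModule. The backend name and the eager fallback are illustrative stand-ins, not FuriosaAI's actual compiler entry point.

    import torch
    from torch.fx import GraphModule

    def npu_backend(gm: GraphModule, example_inputs):
        # TorchDynamo hands the captured FX GraphModule to the backend;
        # a vendor compiler would lower it here. For illustration we only
        # inspect the graph and fall back to eager execution.
        print(gm.graph)
        return gm.forward

    model = torch.nn.Linear(16, 4)
    compiled = torch.compile(model, backend=npu_backend)
    out = compiled(torch.randn(2, 16))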
ONNX Runtime Web: Running Your PyTorch Model in Browser - Emma Ning, Microsoft
Running machine-learning-powered web applications in browsers has drawn a lot of attention from the AI community. ONNX Runtime Web (ORT Web) enables JavaScript developers to run and deploy machine learning models in browsers. It accelerates model inference in the browser on both CPUs and GPUs, through the WebAssembly (WASM) and WebGPU/WebGL backends respectively. Taking MobileNet V2 as an example, CPU inference can be accelerated by 3.4x with the ORT Web WASM backend using two threads and SIMD enabled, compared with plain WebAssembly.
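On the PyTorch side of this workflow, the model is first exported to ONNX and then loaded by ORT Web in the browser. A minimal export sketch is shown below; the file name and opset version are illustrative choices.

    import torch
    import torchvision

    # Export MobileNet V2 to ONNX so it can be served by ONNX Runtime Web.
    model = torchvision.models.mobilenet_v2(weights=None).eval()
    dummy_input = torch.randn(1, 3, 224, 224)
    torch.onnx.export(
        model, dummy_input, "mobilenet_v2.onnx",
        input_names=["input"], output_names=["output"],
        opset_version=17,
    )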
PyTorch to EmitC - Translating PyTorch Models to C++ - Marius Brehler & Simon Camphausen, Fraunhofer IML
Authors: Marius Brehler, Simon Camphausen, Lucas Camphausen. Deploying machine learning (ML) models to microcontrollers is of great interest as it enables intelligent decision-making on low-power embedded devices without relying on a network connection or cloud computing. One common approach to deploying an ML model to a microcontroller is to use TensorFlow Lite. To deploy a PyTorch model, the model can be converted to ONNX, which can then be executed with ONNX Runtime or deployed using deepC or cONNXr. A more flexible approach is to translate the PyTorch model into C or C++. This is realized with the MLIR framework, in particular the MLIR dialect EmitC. The approach is of special interest when deploying to bare-metal systems, since it enables the use of non-LLVM-based C and C++ compilers. Moreover, it is of interest when dependencies on libraries or runtimes should be reduced: it allows generating C++ code with no dependencies other than the standard library, enabling further use cases.
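The PyTorch-side entry into such an MLIR pipeline can be sketched roughly as below, assuming the torch-mlir compile API available around the PyTorch 2.0 timeframe (the exact entry point has moved between torch-mlir releases); the subsequent lowering through the EmitC dialect to C++ happens in MLIR tooling and is only indicated in comments.

    import torch
    import torchvision
    import torch_mlir  # assumes the torch-mlir Python package is installed

    # Import a traceable model into MLIR (TOSA chosen here as an example
    # intermediate); lowering to the EmitC dialect and emitting C++ is then
    # performed with MLIR passes and a tool such as mlir-translate --mlir-to-cpp.
    model = torchvision.models.resnet18(weights=None).eval()
    example_input = torch.randn(1, 3, 224, 224)
    module = torch_mlir.compile(model, example_input, output_type="tosa")
    print(module)  # MLIR module ready for further lowering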
Accelerating PyG with Torch.Compile on Intel Xeon CPUs - Mingfei Ma, Intel
torch.compile() is the flagship feature introduced in PyTorch 2.0 to speed up your PyTorch code. It can auto-generate optimized kernels across multiple backends/accelerators with minimal code changes. torch.compile() now works seamlessly with PyG models, and Intel has been working closely with Kumo.ai to improve performance and user experience through optimizations in the TorchInductor deep learning compiler for both float32 and bfloat16. Additionally, on 4th-generation Intel Xeon Scalable Processors (codename Sapphire Rapids), bfloat16 computation throughput is further enhanced through the Intel Advanced Matrix Extensions (Intel AMX) instruction set extension. Currently, all the commonly used message-passing patterns in PyG models can be fused into a single optimized kernel with torch.compile(), which reduces the memory-access payload and increases cache locality. Benchmark results on popular GNN models such as GCN, GraphSAGE, GIN, and EdgeCNN demonstrate a performance improvement of over 300% on Sapphire Rapids!
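As a rough sketch of how this looks from the user side, compiling a small PyG GCN on CPU is a one-line change; the layer sizes, the random graph, and the bfloat16 autocast region below are illustrative.

    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import GCNConv

    class GCN(torch.nn.Module):
        def __init__(self, in_dim, hidden_dim, out_dim):
            super().__init__()
            self.conv1 = GCNConv(in_dim, hidden_dim)
            self.conv2 = GCNConv(hidden_dim, out_dim)

        def forward(self, x, edge_index):
            x = F.relu(self.conv1(x, edge_index))
            return self.conv2(x, edge_index)

    model = GCN(16, 32, 7).eval()
    compiled_model = torch.compile(model)  # TorchInductor generates fused CPU kernels

    x = torch.randn(100, 16)
    edge_index = torch.randint(0, 100, (2, 400))
    with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        out = compiled_model(x, edge_index)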
Advanced PyTorch Model Optimization with Neural Network Compression Framework of OpenVINO - Yamini Nimmagadda & Raymond Lo, Intel
Neural Network Compression Framework (NNCF) is a model optimization library designed to improve inference performance with the Intel® Distribution of OpenVINO™ Toolkit. It is a cross-framework tool that supports PyTorch, ONNX, TensorFlow, and OpenVINO formats, as well as a diverse set of optimization methods including Post-Training Quantization (PTQ), Quantization-Aware Training, Mixed-Precision Quantization, Structured and Fine-Grained Pruning, NAS, and Knowledge Distillation. NNCF is a cross-hardware framework that follows the OpenVINO paradigm: write once, deploy everywhere. It is also integrated into the Hugging Face ecosystem, so users can benefit from an easy-to-use optimization and inference API for Transformer-based models. Post-Training Quantization is the most in-demand and scalable method supported in NNCF. In this talk, we will cover recent advances in Post-Training Quantization that are already integrated into NNCF and how users can leverage them for efficient deployment of different types of DNN models on various hardware.
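A rough sketch of NNCF's post-training quantization flow is shown below; the stand-in calibration data and the identity transform_fn would be replaced by a real dataloader and preprocessing in practice.

    import nncf
    import torch
    import torchvision

    model = torchvision.models.resnet18(weights=None).eval()

    # Stand-in calibration data; in practice this is a real dataloader.
    calibration_items = [torch.randn(1, 3, 224, 224) for _ in range(10)]

    def transform_fn(item):
        # Map a dataloader item to the model's expected input format.
        return item

    calibration_dataset = nncf.Dataset(calibration_items, transform_fn)
    quantized_model = nncf.quantize(model, calibration_dataset)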
A Journey to Enable Generative AI on a New Hardware Platform with PyTorch 2.0 - Kazuaki Ishizaki, IBM Research - Tokyo
This talk explains our journey enabling generative AI applications on a new hardware (HW) platform. We are working on running generative AI applications on IBM z from both correctness and runtime performance perspectives, and we share experiences useful for developers bringing PyTorch and its ecosystem to new HW. IBM z has a unique characteristic: it uses big-endian byte order. While most HW platforms are little-endian, big-endian was not well supported in PyTorch and its pip packages. We added support for both endiannesses, for example so that pre-trained models can be exchanged across platforms, by fixing test and application failures; our 32 PRs leave no test failures on IBM z, and ecosystem projects such as the Hugging Face (HF) Transformers framework now work well. We will share our experience enabling CI for new HW to keep the main branch healthy, and we enable HW acceleration features, such as SIMD, in the PyTorch runtime and TorchInductor. We will also briefly explain exploiting an in-core AI accelerator. Key takeaways: (1) enabling new HW without test failures in PyTorch and its ecosystem, such as HF Transformers; (2) adding CI for a new HW platform upstream; (3) enabling performance features for new HW.
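As a minimal illustration of the byte-order issue mentioned above (the tensor and dtype are arbitrary), the raw bytes of a tensor differ between little-endian x86 and big-endian IBM z, so exchanging raw buffers across platforms requires an explicit byteswap:

    import sys
    import numpy as np
    import torch

    print(sys.byteorder)  # 'little' on x86, 'big' on IBM z

    t = torch.arange(4, dtype=torch.int32)
    raw = t.numpy().tobytes()  # byte layout depends on the host's endianness

    # Reinterpreting bytes produced on a host with the opposite endianness
    # requires swapping the byte order explicitly.
    swapped = np.frombuffer(raw, dtype=np.int32).byteswap()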
OneDNN Graph with TorchInductor for Enhanced Graph Fusions and Performance - Ashok Emani, Intel Corporation & Frost Mitchell, University of Utah
The TorchInductor OpenMP backend has demonstrated promising performance on CPU DL inference workloads, thanks to optimizations such as Conv/GEMM post-op fusions and vectorization. The oneDNN Graph extension goes beyond Conv/GEMM post-op fusions, supporting aggressive fusion patterns such as multi-head attention and MLP blocks through its graph compiler backend, along with features such as low-precision support. Since PyTorch 1.12, it has been available in the TorchScript JIT fuser path, showing promising performance. This poster showcases how integrating oneDNN Graph with the TorchInductor OpenMP backend enables PyTorch 2.0 torch.compile use cases and opens up opportunities for more advanced performance optimizations in the future.
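For reference, the TorchScript JIT fuser path mentioned above can be enabled with a single switch; the model and warm-up count below are illustrative, and the TorchInductor integration described in this poster is instead driven through torch.compile.

    import torch
    import torchvision

    torch.jit.enable_onednn_fusion(True)  # turn on oneDNN Graph fusion in the JIT fuser

    model = torchvision.models.resnet50(weights=None).eval()
    example = torch.randn(1, 3, 224, 224)

    with torch.no_grad():
        traced = torch.jit.trace(model, example)
        frozen = torch.jit.freeze(traced)
        # A couple of warm-up runs let the fuser specialize and fuse the graph.
        frozen(example)
        frozen(example)
        output = frozen(example)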