🛜🤖Introduction to the distributed inference/training framework EXO, and a roadmap for ubiquitous edge-ML use cases

Frontier tech
Jul 21, 2024


Distributed inference using the Exo framework (Credits: Alex Cheema, X post)

Credits:

I want to express my sincere gratitude to Alex and Mohammed from Exolabs for providing excellent, no-nonsense real-world demonstrations of their framework on consumer devices and for open-sourcing their implementation.

About

The development of capabilities for running benchmark ML models on edge devices has progressed beyond optimizing vertical scaling (e.g., 1.58-bit quantization of models, building efficient RAG pipelines). However, it still relies heavily on:

  • Building custom underlying edge devices (including on-SoC drivers such as Qualcomm’s AI SDK).
  • Frameworks that compile and run models optimized for custom silicon (MLX for Apple silicon, TensorRT for NVIDIA, AMD equivalents, etc.).
  • The ability of deep learning compilers to generate better code (ranging from prominent frameworks like PyTorch and TensorFlow to more optimized development frameworks like ModularML Max + Mojo).

Therefore, there’s a need for horizontal scaling, achieved through:

  • Increasing compute bandwidth by performing large-scale coordinated processing of sharded jobs on devices (similar to the MapReduce approach).
  • Enabling devices with different ISAs and operating systems to run the model interoperably and combine their results for the user.

This tutorial aims to provide a comprehensive overview of how Exo performs model inference via sharding across model layers. It first describes the library in detail with a demo, and then explores potential ways to scale this approach further for running more complex use cases at a larger scale.

🕸️1. Overview of the exo implementation:

Exo consists of a flat peer-to-peer (P2P) model inference architecture, with the following components in the application:

  1. Exo.inference: This library section defines the characteristics of a model shard and the functions that fetch models hosted on Hugging Face (HF) and load them in quantized form. Each shard is a class object consisting of (the model ID, the initial layer index, the last layer index, and the total number of layers); the safetensors data from the HF Hub model is divided so that each node stores and loads only its own slice.
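For concreteness, here is a minimal sketch of what such a shard descriptor could look like (the field and method names are assumptions for illustration, not necessarily exo's exact definitions):

```python
from dataclasses import dataclass

# Minimal shard descriptor, mirroring the tuple described above.
# Field and method names are illustrative assumptions.
@dataclass(frozen=True)
class Shard:
    model_id: str      # e.g. a Hugging Face repo id
    start_layer: int   # first transformer layer this node owns
    end_layer: int     # last transformer layer this node owns (inclusive)
    n_layers: int      # total number of layers in the full model

    def is_first_layer(self) -> bool:
        return self.start_layer == 0

    def is_last_layer(self) -> bool:
        return self.end_layer == self.n_layers - 1
```

These two predicates are what let a node decide whether it should tokenize the incoming prompt (first layer) or sample the output token (last layer) as a request travels through the ring.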

Within each node, the state of the loaded shard of the overall model is held by the StatefulShardedModel class, which implements:

* The sampling technique for determining the next token for a given head and a given token sequence, based on the received input (parameterized by temperature); a generic sketch of this step follows the list below.
* The key-value (KV) cache that stores the intermediate attention outputs from each layer and each attention head while generating the prompt output, and is then reset.
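As a concrete illustration of the sampling step, here is a generic temperature-sampling routine over a logits vector (this shows the standard technique, not exo's exact code):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.7) -> int:
    """Pick the next token id from the final head's logits.

    temperature == 0 falls back to greedy argmax; higher values flatten
    the softmax distribution and increase output diversity.
    """
    if temperature == 0:
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled -= scaled.max()                         # numerical stability before exp
    probs = np.exp(scaled) / np.exp(scaled).sum()  # softmax
    return int(np.random.choice(len(probs), p=probs))
```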

NOTE: One thing this module does not implement is evaluation of the results a given shard produces for a given query. Integrating LLM evaluation platforms like W&B or Giskard could fill this gap, tracking the relative quality of prompt generation against the current state as progress is made.

  • At a higher level, it implements two interfaces for running the inference: on Apple devices it uses exo.inference.mlx.MLXShardedInferenceEngine, and otherwise it uses the tinygrad-based TinygradInferenceEngine. Both classes define the functions for inference from a query and from a tensor, along with initializing the initial shard and tokenizer.
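Both engines can be thought of as implementing a common surface along these lines (a simplified sketch; the method names and signatures here are assumptions, not exo's exact API):

```python
from abc import ABC, abstractmethod
import numpy as np

class InferenceEngine(ABC):
    """Shared surface of the MLX and tinygrad engines (illustrative)."""

    @abstractmethod
    async def infer_prompt(self, shard, prompt: str) -> np.ndarray:
        """Tokenize a user prompt and run it through this node's layer range."""

    @abstractmethod
    async def infer_tensor(self, shard, input_data: np.ndarray) -> np.ndarray:
        """Continue inference from an intermediate activation received from a peer."""
```

Keeping the surface identical is what lets devices with different ISAs and OSes interoperate: a node only ever exchanges tensors with its peers, regardless of which engine produced them.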

2. Exo.networking:

This section implements the gRPC-based P2P client discovery and distributed inference protocol. Before the inference pipeline can even be set up, an undirected communication graph needs to be established between the inferencing instances, in order to ensure that they retain sufficient communication bandwidth during the inference session.

First, it creates a device profile (consisting of the device ID, specifications (GPU/RAM), and storage) corresponding to each device node. It then creates asyncio connections between the different devices by building channels (as implemented in gRPCDiscovery) with the other peer nodes, while updating the topology of the connected nodes (as will be explained in the next section).

Then, once a sufficient number of GPUs are connected (and the peer discovery protocol reaches its timeout), it sets up the sharding and the messaging protocol between the user terminal and the other nodes, in order to tokenize the query and then distribute it across the shard models to generate the result. A toy sketch of the discovery step is shown below.
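As that sketch, each node could periodically announce a profile like the one described above over UDP broadcast (the port number and profile fields are placeholders; the real gRPCDiscovery additionally opens gRPC channels to each discovered peer):

```python
import asyncio
import json
import socket

# Hypothetical device profile with the fields described above.
DEVICE_PROFILE = {
    "node_id": "macbook-m2-1",
    "memory_gb": 16,
    "gpu": "Apple M2 (10-core)",
    "storage_gb": 512,
}

async def broadcast_presence(port: int = 52415, interval: float = 2.5) -> None:
    """Announce this node's profile over UDP broadcast so peers can
    discover it and add it to the communication topology."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    payload = json.dumps(DEVICE_PROFILE).encode()
    while True:
        sock.sendto(payload, ("255.255.255.255", port))
        await asyncio.sleep(interval)
```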

3. Exo.orchestration: This library defines the code for the node server (which is set up on each device). It consists of APIs that (see the sketch after this list):

  • set up each node to start the discovery process and then update its peers;
  • keep track of the nodes that remain connected over the course of running the shard, and dynamically re-adapt the corresponding topology in case one of the nodes goes down;
  • broadcast the generated prompts and tensors for the query after the evaluation results are received;
  • and gracefully shut down the server and store the result when prompted by the user.
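A toy sketch of the core forwarding step (run the local shard, then hand the activation to the next node) might look like this; the class and method names are illustrative assumptions, not exo's orchestration API:

```python
import numpy as np

class Node:
    def __init__(self, engine, shard, next_peer):
        self.engine = engine        # an InferenceEngine as sketched earlier
        self.shard = shard          # this node's layer range
        self.next_peer = next_peer  # downstream peer in the current topology

    async def process_tensor(self, request_id: str, tensor: np.ndarray):
        """Run the local layers; return final logits or forward to the next shard."""
        out = await self.engine.infer_tensor(self.shard, tensor)
        if self.shard.is_last_layer():
            return out                                     # final logits: sample here
        await self.next_peer.send_tensor(request_id, out)  # hand off downstream
        return None
```

If a peer goes down, the orchestration layer would rebind next_peer to a healthy node before the next hand-off, which is the dynamic re-adaptation described above.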

4. Exo.model: This package consists of the utility methods that fetch each device's processing capabilities and metadata in order to maintain the connected topology.

  • After parsing the given model's weights, it also defines the sharding strategy (a simple first-come-first-served (FCFS) assignment, unlike the randomization technique in NVIDIA's TensorRT-LLM implementation); a toy version is sketched below.
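Here is a toy version of such a contiguous partitioning, assigning layer blocks in join order (FCFS) and sizing them proportionally to each node's memory; this illustrates the idea rather than reproducing exo's code:

```python
# Assign each node a contiguous [start, end] block of layers, in the order
# the nodes joined, sized proportionally to their memory (illustrative only).
def partition_layers(nodes: list[dict], n_layers: int) -> list[tuple[int, int]]:
    total_mem = sum(n["memory_gb"] for n in nodes)
    shards, start = [], 0
    for i, node in enumerate(nodes):
        if i == len(nodes) - 1:
            end = n_layers - 1  # last node takes whatever remains
        else:
            end = start + round(n_layers * node["memory_gb"] / total_mem) - 1
        shards.append((start, end))
        start = end + 1
    return shards

# e.g. partition_layers([{"memory_gb": 16}, {"memory_gb": 8}], 32)
#   -> [(0, 20), (21, 31)]
```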

2. Potential use cases / cool side projects that can be built using exo:

Given that the framework is modular enough to fetch any pre-built model weights from Hugging Face and then run the models on distributed shards, the following are some potential short-term projects that I am interested in (and would like to add demos for in the upcoming articles of this series):

2.1 On the edge RAG pipeline:

There is the possibility of building an on-the-edge RAG pipeline, with each node receiving the corresponding embeddings and the tokenized query and passing them to the InferenceEngine, in order to generate the prompt results from the various shards. This would help on-device LLM projects (the ollama and llama.cpp frameworks) get better throughput while at the same time mitigating hallucination; a minimal sketch follows.
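Here is what that could look like, assuming an on-device embedding function and an exo-style distributed inference call (both are placeholders, not exo API):

```python
import numpy as np

def retrieve(query_vec: np.ndarray, store: list[tuple[np.ndarray, str]], k: int = 3) -> list[str]:
    """Return the k docs whose embeddings score highest against the query (dot product)."""
    scored = sorted(store, key=lambda pair: -float(query_vec @ pair[0]))
    return [doc for _, doc in scored[:k]]

async def rag_answer(query: str, embed, store, distributed_infer) -> str:
    """Ground the prompt in retrieved context, then hand it to the shard ring.

    `embed` is any on-device embedding function and `distributed_infer`
    stands in for the sharded inference call; both are assumptions.
    """
    context = "\n".join(retrieve(embed(query), store))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return await distributed_infer(prompt)
```

Because retrieval happens locally before any tokens leave the device, the grounding documents never traverse the P2P network; only the assembled prompt does.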

2.2 Model merging → quantization → inference:

There has been great progress in merging models across layers (mergekit by Maxime Labonne), and pairing such merging with quantization before distributed inference could become the norm: it would take real-time fine-tuning of models to the next level by removing the performance bottlenecks caused by network connectivity.
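As a toy illustration of the two steps, here is layer-wise linear merging (the simplest method mergekit supports) followed by naive symmetric int8 quantization:

```python
import numpy as np

def merge_linear(weights_a: dict, weights_b: dict, alpha: float = 0.5) -> dict:
    """Interpolate two state dicts parameter-by-parameter (linear merge)."""
    return {k: alpha * weights_a[k] + (1 - alpha) * weights_b[k] for k in weights_a}

def quantize_int8(tensor: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: tensor ≈ q * scale."""
    scale = float(np.abs(tensor).max()) / 127.0 or 1.0  # guard all-zero tensors
    return np.round(tensor / scale).astype(np.int8), scale
```

A merged-then-quantized checkpoint could be pushed to the HF Hub and picked up by the shard loader described in section 1, closing the merge → quantize → distributed-inference loop.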

2.3 Building real time collaborative multimodal pipelines for various contextual information:

This is something I have been working on while productionizing 3D reconstruction algorithms @ Extralabs: fusing photogrammetric models run as shards across multiple devices. Ever since we tested running the prominent algorithm Neuralangelo on compute-over-data, we have remained very constrained in running real-time reconstruction from video/photo images to a fine mesh. Implementing the exo framework with optimized models like Gaussian splatting would be an immense contribution toward building a decentralized version of Luma at a larger scale. I am eager to demonstrate a toy demo for this use case in the coming weeks with our first release; kindly follow our website/blog to learn more.

And that's it for now!! Thanks a lot to all readers for going through this theoretical description of one of the most popular open-source frameworks right now. I will soon write the 2nd part, sharing the notebooks and GitHub repo for developers who are interested in gaining experience building robust on-edge ML training pipelines.

Let me know any feedback, and let's continue bringing last-mile delivery of dev tools to the GPU-poor community ;).

Written by Frontier tech

A developer-focused newsletter presenting simple demonstrations of building products that combine AI and web3 for impactful use cases.
