Building a multimodal personal message archive retrieval and querying application: intro and specification requirements.

Frontier tech
8 min read · Jan 14, 2024

Credits: huggingface-hub diffusers model.

TL;DR

This is a system-design tutorial for building a multimodal RAG application that lets the user run reasoning queries over their historical Telegram data, by retrieving context from Telegram chat data along with its embeddings.

If you’d rather jump straight into the code, I will shortly publish another article with links to the repo and the deployments. Follow the upcoming blog for the detailed implementation.

About

On the eve of 2024, with time to endlessly scroll Telegram through various tech / web3 community channels (in order to pass time / do a personal SWOT analysis), I was startled by the sheer volume of communications and leads that we simply can’t read across the different community channels.

I quickly gave up on the idea of manually reading and bookmarking all of these within an evening ;).

Thus I got interested in the idea of implementing a RAG-powered LLM app on Telegram data that:

  1. Fetches the messages from a given channel that the user subscribes to.
  2. Runs the data engineering and embedding generation using an Unstructured-based process.
  3. Based on the user’s queries, retrieves the corresponding context (documents and embeddings to augment the query), passes it to the LLM, and generates the result.
general workflow

Frameworks to be used:

I will be talking about the following frameworks in the tech stack (as this tutorial only develops the design roadmap of the application, we discuss the characteristics of each framework so that better alternative stacks can be evaluated):

  • LlamaIndex: defines the various components of the RAG pipeline (prompt splitting / chunking, vectorisation, and development of the agents that serve the query with context).
  • Weaviate: stores the vector embeddings used for query and context generation.
  • OpenAI GPT-4V / AssemblyAI / Hugging Face: LLM as a service.

Optionally, we can use various libraries / frameworks for the data engineering and storage of multimodal data, as follows:

  • Unstructured: an open-source package that provides methods for building multimodal machine-learning data pipelines. I have followed the example of how Unstructured can be used to define data pipelines for LLMs with multimodal inputs (e.g. PDFs with images, videos and audio data); a short sketch follows this list.
  • DocArray: a Python library for managing the lifecycle of transmission, storage and retrieval of multimodal data.
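
As a quick illustration (a hedged sketch, not the project’s actual pipeline), Unstructured can partition a mixed document into typed elements before chunking and embedding; the file path below is a placeholder and the relevant extras (e.g. unstructured[pdf]) need to be installed:

## NOTE: illustrative sketch only; the file path is a placeholder.
from unstructured.partition.auto import partition

elements = partition(filename="tg_data/assets/whitepaper.pdf")
for el in elements[:5]:
    # each element carries its type (Title, NarrativeText, Image, ...) and text
    print(type(el).__name__, "->", el.text[:80])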

The pipeline consists of four major steps (plus an optional evaluation stage):

workflow description of the ingestion stage.
  1. Data ingestion:

This step uses the Telethon Python library to fetch all of the messages in the channel you want to index, filtered by criteria such as the number of messages or their date range. To later build vector embeddings that augment retrieval queries, the data should be ingested as a mapping from each user identity to that user’s messages (a minimal sketch follows the expected outputs below).

simple UML of the telethon provider class

The result should be:

  • A JSON file with the messages of the given channel (including links to the corresponding audio and video files).
  • All the corresponding media assets (images, videos, audio) stored in folders, with their reference links recorded in the above JSON file.
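
A minimal sketch of this step, assuming placeholder API credentials from my.telegram.org, an example channel name and the tg_data output layout described above (none of these names come from the final implementation):

## NOTE: hedged sketch of the ingestion step; credentials, channel name
## and output paths are placeholder assumptions.
import json
from telethon import TelegramClient

API_ID, API_HASH = 12345, "your-api-hash"   # placeholder credentials
client = TelegramClient("tg_archiver", API_ID, API_HASH)

async def fetch_channel(channel: str, limit: int = 200) -> list[dict]:
    records = []
    async for msg in client.iter_messages(channel, limit=limit):
        record = {
            "id": msg.id,
            "sender_id": msg.sender_id,   # maps each message to a user identity
            "date": str(msg.date),
            "text": msg.text or "",
        }
        if msg.media is not None:
            # download the media asset and keep its reference link in the record
            record["media_path"] = await msg.download_media(file="tg_data/assets/")
        records.append(record)
    return records

with client:
    messages = client.loop.run_until_complete(fetch_channel("example_channel"))

with open("tg_data/messages.json", "w") as f:
    json.dump(messages, f, indent=2)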

2. Data indexation: This starts by indexing the information (i.e. messages) corresponding to each user across all modalities. This can be done in LlamaIndex via the following steps:

  • Loading parsed data: parse the JSON file generated by Telethon, then combine the necessary sections of the data to create a JSON output in a given directory (let’s name it ‘tg_data’).

— In LlamaIndex, these are represented by the Document class, and each data chunk corresponding to a given user’s messages is represented by a Node. Users have the flexibility to define the strategy for splitting the loaded documents by directly using the parsers predefined by LlamaIndex for the nature of the data (like SentenceSplitter for chunking sentences from general text, CodeSplitter for splitting code along its AST, etc.).

— For audio, we can implement a modified version of the NodeParser class, along the lines of the AudioToTextTranscripter class defined in the tutorial.

— Similarly for images and video, we can implement a modified NodeParser that embeds the image and video transcriptions of the actions happening in the given video (subtitles, descriptions of the actions, or the base64 representation of the image frames).

Then, in order to create a hierarchical representation of the various modalities, we define a dedicated storage context for each modality in the vector store by using the StorageContext class.
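
A hedged sketch of this loading / parsing / storage-context setup, assuming a local Weaviate instance and the tg_data directory from the ingestion stage (collection names and chunk sizes are illustrative; import paths may differ across llama_index versions):

## NOTE: sketch only; import paths and defaults depend on the llama_index version.
import weaviate
from llama_index import SimpleDirectoryReader, StorageContext
from llama_index.node_parser import SentenceSplitter
from llama_index.vector_stores import WeaviateVectorStore

# load the text / image / audio assets dumped by the ingestion stage
documents = SimpleDirectoryReader("./tg_data").load_data()

# chunk the text messages into nodes (one node ~ a batch of a user's messages)
parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = parser.get_nodes_from_documents(documents)

# a dedicated vector store per modality, backed here by a local Weaviate instance
wv_client = weaviate.Client("http://localhost:8080")
text_store = WeaviateVectorStore(weaviate_client=wv_client, index_name="TgText")
image_store = WeaviateVectorStore(weaviate_client=wv_client, index_name="TgImages")
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)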

3. Embeddings: This is the central step of the pipeline. Here we use the MultiModalVectorStoreIndex class implemented in LlamaIndex to generate the vector embeddings, which relies on the CLIP embeddings developed by OpenAI. The stored mapping consists of:

1. The key, being the user identity (either the username or their unique id).

2. The value, being the floating-point vectors across the audio, video and text embeddings (note: the audio, video and image embeddings are generated from the transcriptions / captions produced in the previous step).

different categories of multimodal embeddings (credits to the tenkys blogger)

Note that you can also create custom embedding functions by checking out the various embedding algorithms implemented by vector DBs like Chroma and then writing your own implementation on top of the llama_index.core.BaseEmbedding class.
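
A sketch of this step, reusing the documents and storage_context objects from the indexation sketch above (again, module paths can vary between llama_index releases):

## NOTE: sketch only; reuses `documents` and `storage_context` from above.
from llama_index.indices.multi_modal.base import MultiModalVectorStoreIndex

# text nodes go through the default text embedding model, while images are
# embedded with CLIP; a custom model can be plugged in via BaseEmbedding
index = MultiModalVectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)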

4. Querying stage: This is the final stage, where the user query (in the form of text, or a collection of inputs across various modalities combined with their embeddings) is used to define the query and generate the response.

We will be using the pre-built classes in the llama_index.multi_modal_llms module, which has integrations with the various hosted models in this space; we will be using OpenAI GPT-4V. The prompt query can be done either by:

  • Doing a reasoning query on a given piece of information (in either modality) and then asking the query: by using the OpenAIMultiModal class on the retrieved documents, as in the example script below, we can let the user provide an input image / video / audio excerpt along with a query (like “what is the common theme of this image w.r.t. the images shared in this group?”, etc.) and get back the prompt result as plaintext output.
  • Or using the index.as_query_engine API: this is more applicable for commercial usage, as it allows the developer of the application to define a predefined template for the given user and then respond to the user query within that template. We can define the template like:
## NOTE: again, this is pseudocode; follow part 2 for more details.
from llama_index.prompts import QuestionAnswerPrompt
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

## here `index` is the MultiModalVectorStoreIndex object
## created in the indexation stage
qa_tmpl_str = (
    "Given the images provided, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_tmpl = QuestionAnswerPrompt(qa_tmpl_str)

openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview",
    api_key=OPENAI_API_TOKEN,  # assumes OPENAI_API_TOKEN holds your OpenAI key
    max_new_tokens=1500,
)

query_engine = index.as_query_engine(
    multi_modal_llm=openai_mm_llm,
    image_qa_template=qa_tmpl,
)

query_str = "explain the events happening in the flyer of the following event"
response = query_engine.image_query("./tg/wagmi.jpg", query_str)

## or using more advanced prompt templates
## as defined in llama_index.prompts
## custom_template = (…..)
## from llama_index.prompts import KnowledgeGraphPrompt, ….

This eventually translates into the following roadmap:

the indexation phase

Query generation from prompt + result.

5. Evaluation (optional): While optional from the perspective of testing the whole lifecycle, this becomes important for evaluating RAG systems once they are in production. There are various factors that prevent these systems from giving perfect responses, such as:

  • the inability of the embeddings and query engine to adapt to constantly changing data.
  • factually incorrect responses due to LLM hallucination, failure to transcribe image or audio content correctly, etc.

In general, there are many metrics for evaluating such results, and LlamaIndex provides many out-of-the-box integrations for the user to analyse the performance of the model.

In our case we can check the metrics using the llama_index.evaluation.MultiModalRetrieverEvaluator class by following the example mentioned here.
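
A hedged sketch of retrieval evaluation, assuming the evaluator follows LlamaIndex’s standard retriever-evaluator interface (the metric names, the query and the expected node ids below are made up for illustration):

## NOTE: illustrative only; the exact evaluator API may differ per version.
from llama_index.evaluation import MultiModalRetrieverEvaluator

retriever = index.as_retriever(similarity_top_k=3)
evaluator = MultiModalRetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

# evaluate one query against the node ids we expect it to retrieve
result = evaluator.evaluate(
    query="what events were shared in this group last week?",
    expected_ids=["node-id-1", "node-id-2"],
)
print(result.metric_vals_dict)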

F.A.Q

Q1. What are the improvement points in the llamaindex framework?

A: Although llama_index is indeed performant in terms of providing a modular design and well-documented APIs for integration across the various components of the stack, a few gaps remain:

  • There is a need for a huggingface_hub integration in llama_index.multi_modal_llms that would allow any given multimodal model task to plug directly into the rest of the pipeline.
  • The embeddings of visual and audio data are still represented via text representations, and the functions implemented in the embedding base class are oriented towards finding similarity between items of the same modality; there is a lack of methods to determine the correlation between an image and its corresponding text description. There has been active research on how to build and train models that learn abstract representations from incomplete images using self-supervised learning (a.k.a. I-JEPA from Yann LeCun et al.), and it is still an open research question how to extend this from unimodal to multimodal model architectures.

Q2. How do we parallelise the implementation of the workflow pipelines?

A: By using the IngestionPipeline abstraction, which helps create parallel workflows, combine various transformations at different stages, and lets the user parallelise processing across multiple multimodal documents. Combined with ETL workflow engines like Airflow / Flyte, this can make LLMOps with customized parameters a breeze.
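
A hedged sketch of such a pipeline, reusing the documents loaded earlier (the transformations, chunk sizes and worker count are illustrative, and num_workers support depends on the llama_index version installed):

## NOTE: sketch only; reuses `documents` from the indexation stage.
from llama_index.ingestion import IngestionPipeline
from llama_index.node_parser import SentenceSplitter
from llama_index.embeddings import OpenAIEmbedding

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        OpenAIEmbedding(),
    ],
)

# run the chunking + embedding transformations across worker processes
nodes = pipeline.run(documents=documents, num_workers=4)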

🎉 And that’s it! Thanks for reading the article this far; I hope you have learned the various steps of building a full-fledged RAG application. Do let me know how you like the format, share your feedback, and follow my blog (on Medium as well as Substack) to get notified about the second part of this article and more articles on the system design of LLMaaS applications.
