4. System Architecture

The architecture powering Vocalad is designed for modular scalability, multilingual compatibility, and real-time responsiveness. It is composed of four critical layers: Data Ingestion, RAG Retrieval Pipeline, Voice Generation (SLM + TTS), and Integration Layer.

4.1 Data Ingestion and Dataset Structuring

Vocalad begins with structured ingestion of project-specific data. The dataset is manually or semi-automatically curated by the Vocalad team and processed as follows:

  • Accepted Input Types: whitepapers, FAQs, governance documents, community threads, founder notes, and chat logs.

  • Preprocessing: Includes stopword removal, phrase normalization, token truncation, and document splitting using token-size constraints (e.g., 512–1024 tokens).

  • Contextual Tagging: Each chunk is annotated with metadata such as topic, section, version, and relevance rank.

  • Multilingual Embedding: Documents are embedded using cross-lingual encoders like LaBSE and E5-Mistral to create high-fidelity vector representations, regardless of the language of the original content.

Use in Vocalad: This ensures that even multilingual datasets (e.g., Urdu documentation + English FAQs) are unified into a single searchable index, allowing the agent to understand a query in any language and retrieve the correct response in real time.
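The chunking and tagging steps above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the whitespace tokenizer, chunk size, and metadata keys are assumptions, and a real deployment would additionally embed each chunk with a cross-lingual encoder such as LaBSE or E5-Mistral before indexing.

```python
def chunk_document(text: str, doc_meta: dict, max_tokens: int = 512) -> list[dict]:
    """Split a document into token-bounded chunks, each tagged with metadata.

    Whitespace tokens stand in for a real subword tokenizer here.
    """
    tokens = text.split()
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunks.append({
            "text": " ".join(tokens[i : i + max_tokens]),
            # contextual tags carried alongside every chunk
            "topic": doc_meta.get("topic"),
            "section": doc_meta.get("section"),
            "version": doc_meta.get("version"),
            "chunk_index": i // max_tokens,
        })
    return chunks
```

Because each chunk keeps its metadata, later retrieval can filter by topic or version as well as by vector similarity.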


4.2 Retrieval-Augmented Generation (RAG) Layer

The RAG module is the core of Vocalad’s intelligence. It performs scoped information retrieval and constrains generative responses to the ingested dataset. Its key components include:

  • Vector Indexing: FAISS or Weaviate is used for high-speed similarity search across chunked and embedded documents.

  • Top-K Retrieval: Given an incoming query (in any language), the model retrieves the K most semantically similar passages using cosine similarity or hybrid sparse-dense scoring.

  • Prompt Fusion: Retrieved chunks are dynamically injected into the system prompt of the language model before generation, ensuring context-awareness and restriction to source material.

  • Guardrails: A token-level post-filtering step ensures no out-of-scope or hallucinated content is included in the output.

Use in Vocalad: This prevents the agent from "making things up" and ensures it only speaks based on project-approved facts, critical in contexts involving tokenomics, roadmap timelines, or legal disclaimers.
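The Top-K retrieval and prompt-fusion steps can be sketched as follows. This is an illustrative toy, assuming pre-computed embeddings held in a plain list; the production system would run the same logic over a FAISS or Weaviate index, and the prompt template shown is an assumption, not Vocalad's actual system prompt.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """index: (chunk_text, embedding) pairs; returns the k best-matching chunks."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]

def fuse_prompt(query: str, passages: list[str]) -> str:
    """Inject retrieved chunks into the prompt, restricting generation to them."""
    context = "\n".join(f"- {p}" for p in passages)
    return ("Answer ONLY from the context below. "
            "If the answer is not present, say so.\n"
            f"Context:\n{context}\nQuestion: {query}")
```

Constraining the model through the fused prompt, and filtering afterwards, is what keeps answers scoped to project-approved material.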


4.3 Speech Language Model (SLM) and Voice Generation

Once text output is generated from the RAG system, it is passed to the voice pipeline composed of:

  • SLM Optimization: The generated response is processed to ensure it adheres to speech norms — clarity, pause placement, removal of redundancy, and emotional tonality.

  • TTS Model: Output is converted to voice using OpenVoice, Bark, or other high-performance multilingual TTS engines. Custom voices (tone, pitch, accent, speaking speed) can be defined per project.

  • Multilingual Output: Vocalad selects the output language based on the query’s language or the user’s preference, enabling real-time multilingual voice responses.

Use in Vocalad: Whether a user asks a question in Hindi or Arabic, Vocalad can both understand the input and deliver a voice answer in the same (or a different) language — fully aligned with the source dataset.
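The SLM optimization step described above can be sketched as a lightweight text-normalization pass. This is a simplified assumption of what such a pass might do: the regex rules and the SSML-style `<break/>` pause markup are illustrative, and the downstream engine (OpenVoice, Bark, or another TTS model) would consume whatever markup format it actually supports.

```python
import re

def prepare_for_speech(text: str) -> str:
    """Normalize generated text for TTS: strip visual markup, tidy
    whitespace, and mark pauses at sentence boundaries."""
    text = re.sub(r"[*_`#]+", "", text)        # drop markdown markers that read badly aloud
    text = re.sub(r"\s+", " ", text).strip()   # collapse stray whitespace and newlines
    # insert a short pause marker after sentence-ending punctuation
    return re.sub(r"([.!?])\s+", r'\1 <break time="300ms"/> ', text)
```

Small transformations like these make the difference between text that reads well and text that sounds natural when spoken.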


4.4 Real-Time Output & Integration Layer

The final layer packages the voice output and delivers it to the intended platform with sub-second latency through secure, scalable APIs:

  • Platform Routing: Vocalad supports streaming to Telegram voice channels, X Spaces (via bridge integration), web frontends, and upcoming SIP endpoints.

  • API Interfaces: REST and WebSocket endpoints allow on-demand interaction from apps, dashboards, and chat-based environments.

  • Session Context Buffering: Within a single user session, recent queries and topics are remembered for improved coherence and reference handling.

  • Audio Playback Engine: Converts final audio into a streamable format compatible with platform-specific codecs (Opus, AAC, etc.).

Use in Vocalad: This allows Vocalad to act not just as a static responder but as an interactive AMA co-host, community mod, or support desk — with a voice that speaks in context, continuously, and naturally.
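The session context buffering described above can be sketched as a bounded per-session queue of recent turns. The buffer size and session-id scheme here are assumptions for illustration, not Vocalad's documented parameters.

```python
from collections import deque

class SessionBuffer:
    """Keep the last few (query, answer) turns per session so the next
    prompt can reference recent conversation context."""

    def __init__(self, max_turns: int = 5):
        self.max_turns = max_turns
        self.sessions: dict[str, deque] = {}

    def record(self, session_id: str, query: str, answer: str) -> None:
        # deque(maxlen=...) silently drops the oldest turn when full
        buf = self.sessions.setdefault(session_id, deque(maxlen=self.max_turns))
        buf.append((query, answer))

    def history(self, session_id: str) -> list[tuple[str, str]]:
        return list(self.sessions.get(session_id, []))
```

Bounding the buffer keeps prompt size (and therefore latency) predictable while still letting the agent resolve references like "what about that second tier you mentioned?".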
