RAG

Vector stores

The vector store is responsible for storing the embeddings of the chunks and retrieving the most relevant chunks for a given query. Every vector store in Dingo must be a subclass of the agent_dingo.rag.base.BaseVectorStore class. The vector store should implement two methods:

  • upsert_chunks: which takes a list of (embedded) chunks and stores them in the database.
  • retrieve: which takes an embedded query and returns a list of the most relevant chunks.

Currently, Dingo supports two major open-sorce vector databases: Qdrant and Chroma DB.

from agent_dingo.rag.vector_stores.qdrant import Qdrant

chunks = [...] # list of embedded chunks

query = "query"
query_embedding = embedder.embed([query])[0]

vector_store = Qdrant(collection_name="collection_name", embedding_size=384)
vector_store.upsert_chunks(chunks)
retrieved_chunks = vector_store.retrieve(k = 1, query = query_embedding)
print(retrieved_chunks)
#RetrievedChunk(content='...', document_metadata={'source': 'file.docx', 'paragraph': 7}, score=0.913)

Supported parameters:

  • agent_dingo.rag.vector_stores.qdrant.Qdrant:
    • collection_name: the name of the collection in Qdrant.
    • embedding_size: the size of the embeddings.
    • host: the host of the Qdrant server, by default is None.
    • port: the port of the Qdrant server, by default is None.
    • path: the path of the Qdrant database (in case of local instance), by default is None, in which case the database is stored in memory.
    • url: the full url of the Qdrant hosted qdrant service, by default None.
    • api_key: the api key for the Qdrant service, by default None.
    • recreate_collection: a boolean flag that determines whether to recreate the collection if it already exists, by default is False.
    • upsert_batch_size: the number of chunks to upsert in a single batch, by default is 32.
  • agent_dingo.rag.vector_stores.chromadb.ChromaDB:
    • collection_name: the name of the collection in ChromaDB.
    • path: the path of the ChromaDB database, by default is None.
    • host: the host of the ChromaDB server, by default is None.
    • port: the port of the ChromaDB server, by default is None.
    • recreate_collection: a boolean flag that determines whether to recreate the collection if it already exists, by default is False.
    • upsert_batch_size: the number of chunks to upsert in a single batch, by default is 32.
Previous
Embedder