RAG
Chunker
A chunker splits documents into smaller chunks. Every chunker in Dingo must subclass agent_dingo.rag.base.BaseChunker and implement the chunk method, which takes a list of documents and returns a list of chunks. Each chunk is an instance of agent_dingo.rag.base.Chunk, which holds the content as a string, a reference to the parent document, and the embedding as a list of floats (initially None).
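The contract described above can be sketched as a custom chunker. To keep the example runnable without Dingo installed, the Document, Chunk and BaseChunker classes below are local stand-ins modelled only on the description in this section; the real classes in agent_dingo.rag.base may have different fields or constructor signatures. ParagraphChunker is a hypothetical name used for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Optional


# Stand-ins for the classes in agent_dingo.rag.base, based only on the
# description above. The real classes may differ.
@dataclass
class Document:
    content: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Chunk:
    content: str
    document: Document
    embedding: Optional[List[float]] = None  # filled in later by an embedder


class BaseChunker(ABC):
    @abstractmethod
    def chunk(self, documents: List[Document]) -> List[Chunk]:
        ...


class ParagraphChunker(BaseChunker):
    """Hypothetical chunker that emits one chunk per non-empty paragraph."""

    def chunk(self, documents: List[Document]) -> List[Chunk]:
        chunks = []
        for doc in documents:
            for paragraph in doc.content.split("\n\n"):
                text = paragraph.strip()
                if text:
                    # Each chunk keeps a reference to its parent document.
                    chunks.append(Chunk(content=text, document=doc))
        return chunks


docs = [Document("First paragraph.\n\nSecond paragraph.", {"source": "file.docx"})]
chunks = ParagraphChunker().chunk(docs)
```

With the real library, the same subclass would presumably inherit from agent_dingo.rag.base.BaseChunker instead of the stand-in.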
Dingo provides a simple RecursiveChunker that splits the document recursively until the specified chunk size is reached.
from agent_dingo.rag.chunkers.recursive import RecursiveChunker
chunker = RecursiveChunker(chunk_size=512)
chunks = chunker.chunk(docs)
print(chunks)
# [Chunk(content='...', document=Document(content='...', metadata={'source': 'file.docx', 'paragraph': 1}), embedding=None), ...]
Supported parameters:
chunk_size: the maximum number of characters in a chunk.
separators: a list of strings used as separators to split the document. Defaults to None, in which case ["\n\n", "\n", " ", ""] is used.
keep_separator: a boolean flag that determines whether to keep the separators in the chunks. Defaults to False.
merge_separator: a string used to merge small chunks into larger ones. Defaults to " ".
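To illustrate how these four parameters might interact, here is a minimal sketch of recursive splitting. It is not Dingo's implementation: recursive_split is a hypothetical standalone function, and details such as the merge order or how kept separators are attached are assumptions.

```python
from typing import List, Optional


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Optional[List[str]] = None,
    keep_separator: bool = False,
    merge_separator: str = " ",
) -> List[str]:
    """Illustrative recursive splitter (an assumption, not Dingo's code)."""
    if separators is None:
        separators = ["\n\n", "\n", " ", ""]
    if len(text) <= chunk_size:
        return [text] if text else []

    sep, rest = (separators[0], separators[1:]) if separators else ("", [])
    if sep == "":
        # No separator left to try: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    parts = text.split(sep)
    if keep_separator:
        # Re-attach the separator to the end of every part except the last.
        parts = [p + sep for p in parts[:-1]] + [parts[-1]]

    # Recurse into each part with the remaining, finer-grained separators.
    pieces: List[str] = []
    for part in parts:
        if part:
            pieces.extend(
                recursive_split(part, chunk_size, rest, keep_separator, merge_separator)
            )

    # Greedily merge neighbouring pieces while the result still fits.
    merged: List[str] = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(merge_separator) + len(piece) <= chunk_size:
            merged[-1] = merged[-1] + merge_separator + piece
        else:
            merged.append(piece)
    return merged


result = recursive_split("alpha beta gamma delta", chunk_size=11)
print(result)  # ['alpha beta', 'gamma delta']
```

In this sketch the coarsest separator is tried first and finer ones are used only for parts that are still too long, so paragraph and line boundaries are preferred over mid-word cuts.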