RAG
Chunker
The chunker is responsible for splitting the documents into smaller chunks. Every chunker in Dingo must be a subclass of the agent_dingo.rag.base.BaseChunker
class. The chunker should implement the chunk
method, which takes a list of documents and returns a list of chunks. The chunk is an instance of the agent_dingo.rag.base.Chunk
class, which contains the content as a string, a reference to the parent document, and the embedding as a list of floats (initially None
).
Dingo provides a simple RecursiveChunker
that splits the document recursively until the specified chunk size is reached.
from agent_dingo.rag.chunkers.recursive import RecursiveChunker
chunker = RecursiveChunker(chunk_size=512)
chunks = chunker.chunk(docs)
print(chunks)
# [Chunk(content='...', document=Document(content='...', metadata={'source': 'file.docx', 'paragraph': 1}), embedding=None), ...]
Supported parameters:
chunk_size
: the maximum number of characters in a chunk.separators
: a list of strings that can be used as separators to split the document. By default is None, in which case["\n\n", "\n", " ", ""]
are used.keep_separator
: a boolean flag that determines whether to keep the separators in the chunks. By default is False.merge_separator
: a string that is used to merge small chunks into larger ones. By default" "
.