RAG
Reader
The reader is responsible for preparing the documents for ingestion. Every reader in Dingo is a subclass of the agent_dingo.rag.base.BaseReader class. A reader should implement the read method, which takes a source identifier (e.g. a URL or a file path) and returns a list of documents. Each document is an instance of the agent_dingo.rag.base.Document class, which contains the content as a string and the metadata as a (non-nested) dictionary.
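The reader contract described above can be illustrated with a minimal sketch of a custom reader. To keep it runnable without Dingo installed, Document and BaseReader are defined here as local stand-ins that only mirror the description above; the real classes in agent_dingo.rag.base may differ in detail. TextFileReader and its "line" metadata key are hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


# Stand-ins for agent_dingo.rag.base.Document and BaseReader (assumption:
# they only mirror the interface described in the text, not the real code).
@dataclass
class Document:
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class BaseReader(ABC):
    @abstractmethod
    def read(self, source) -> List[Document]:
        """Turn a source identifier into a list of documents."""


class TextFileReader(BaseReader):
    """Hypothetical reader: one document per non-empty line of a text file."""

    def read(self, source: str) -> List[Document]:
        with open(source, encoding="utf-8") as f:
            lines = [line.strip() for line in f]
        # Metadata stays a flat (non-nested) dictionary.
        return [
            Document(content=line, metadata={"source": source, "line": i})
            for i, line in enumerate(lines, start=1)
            if line
        ]
```

With the real library, the same class would presumably inherit from agent_dingo.rag.base.BaseReader and return agent_dingo.rag.base.Document instances instead of the stand-ins.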
List reader
The ListReader is a simple reader that reads the documents from a list of strings. It is useful when the content is already loaded into memory.
from agent_dingo.rag.readers.list import ListReader
reader = ListReader()
docs = reader.read(["Document 1", "Document 2"])
print(docs)
# [Document(content='Document 1', metadata={'source': 'memory'}), Document(content='Document 2', metadata={'source': 'memory'})]
PDF reader
The PDFReader reads the documents from a PDF file. It uses the PyPDF2 library to extract the text from the PDF file and returns a list of documents, one per page of the PDF.
from agent_dingo.rag.readers.pdf import PDFReader
reader = PDFReader()
docs = reader.read("path/to/pdf/file.pdf")
print(docs)
# [Document(content='...', metadata={'source': 'file.pdf', 'page': 1})]
Webpage reader
The WebpageReader reads the documents from a webpage. It uses the requests library to download the webpage and the beautifulsoup4 library to extract the text from the HTML content.
from agent_dingo.rag.readers.web import WebpageReader
reader = WebpageReader()
docs = reader.read("https://en.wikipedia.org/wiki/Python_(programming_language)")
print(docs)
# [Document(content='...', metadata={'source': 'https://en.wikipedia.org/wiki/Python_(programming_language)'})]
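To make the text-extraction step more concrete, here is a rough sketch of turning HTML into plain text. It is not the WebpageReader implementation: it deliberately swaps beautifulsoup4 for the standard-library html.parser so that it runs without extra dependencies, and it skips the download step entirely. The names html_to_text and _TextExtractor are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = "<html><head><style>p {}</style></head><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
print(html_to_text(page))
```

The extracted string would then become the content of a single document, with the URL stored under the source key of its metadata, as in the output shown above.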
Word reader
The WordDocumentReader reads the documents from a Word file. It uses the python-docx library to extract the text from the Word file and returns a list of documents, one per paragraph of the Word document.
from agent_dingo.rag.readers.word import WordDocumentReader
reader = WordDocumentReader()
docs = reader.read("path/to/word/file.docx")
print(docs)
# [Document(content='...', metadata={'source': 'file.docx', 'paragraph': 1})]