ZMongoRetriever

ZMongoRetriever is a Python library for retrieving, processing, and encoding documents stored in MongoDB collections. It is aimed at large documents that must be split into chunks and embedded before they can be used in machine learning applications. It supports document splitting, encoding with OpenAI embedding models (not yet fully implemented; see Advanced Usage), and direct connections to MongoDB databases.

Features

Document Retrieval: Seamlessly fetch documents from MongoDB collections.
Dynamic Chunking: Split documents into manageable chunks based on character count or embedding size.
Embedding Support: Encode document chunks using OpenAI's embedding models for deep learning tasks.
Flexible Configuration: Customize chunk sizes, chunk overlaps, and database connections to fit your project needs.
Metadata Conversion: Convert JSON to structured metadata for enhanced document handling.

Installation

Before you begin, ensure you have MongoDB and Python 3.6+ installed on your system. Clone this repository or download the ZMongoRetriever module directly. Dependencies can be installed via pip:

pip install -r requirements.txt

Environment Variable File

Create a file named '.env' that sets the following variable to your own value:

OPENAI_API_KEY=___
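
As a rough illustration of what the '.env' file provides, the sketch below parses KEY=VALUE lines and checks that OPENAI_API_KEY is present. It uses only the standard library; the helper names parse_env_text and require_api_key are hypothetical and are not part of ZMongoRetriever, which may load the file differently.

```python
import os


def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(values: dict) -> str:
    """Return the API key from parsed values, falling back to the process environment."""
    key = values.get("OPENAI_API_KEY") or os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set in .env or the environment")
    return key


# Hypothetical file contents; in practice, read them with open('.env').read().
settings = parse_env_text("# local secrets\nOPENAI_API_KEY=sk-example\n")
api_key = require_api_key(settings)
```

Failing early with a clear error is usually easier to debug than a rejected request later in the pipeline.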

Quick Start

To get started with ZMongoRetriever, follow these steps:

Initialize MongoDB Connection:

from pymongo import MongoClient
from zconstants import MONGO_URI

client = MongoClient(MONGO_URI)

Create an Instance of ZMongoRetriever:

from zmongo_retriever import ZMongoRetriever

retriever = ZMongoRetriever(mongo_uri=MONGO_URI, db_name='your_database', collection_name='your_collection')

Retrieve and Process Documents:

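# String forms of the ObjectId values of the documents to fetch
object_ids = ["65f28c8103fc21342e2dc04d", "65f28c8403fc21342e2dc064"]
# page_content_key is a dot-separated path to the text field inside each document
documents = retriever.invoke(object_ids=object_ids, page_content_key='report.details.content')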
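
The page_content_key above reads like a dot-separated path into a nested document. The sketch below shows one way such a path could be resolved against a plain dictionary. It is an assumption about the behavior, not the library's actual code, and get_by_dotted_key is a hypothetical helper.

```python
from typing import Any


def get_by_dotted_key(document: dict, dotted_key: str, default: Any = None) -> Any:
    """Walk a nested dict one path segment at a time; return default if any segment is missing."""
    current: Any = document
    for part in dotted_key.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        else:
            return default
    return current


# Hypothetical document shaped to match the key used in the Quick Start.
doc = {"report": {"details": {"content": "Full text of the report."}}}
content = get_by_dotted_key(doc, "report.details.content")
missing = get_by_dotted_key(doc, "report.summary.content")
```

Returning a default for a missing segment, rather than raising, lets a caller skip documents that lack the field.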

Advanced Usage

Encoding Document Chunks - Not Fully Implemented

Enable encoding to process document chunks with OpenAI's embeddings:

retriever.use_encoding = True
encoded_chunks = retriever.invoke(object_ids=object_ids, page_content_key='report.details.content')

Custom Chunking and Overlaps

Customize the chunk size (in characters) and the number of prior chunks repeated between Document lists for finer control over document processing:

retriever.chunk_size = 1024  # Characters
retriever.overlap_prior_chunks = 2  # Number of chunks repeated in a subsequent Document list
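
To make the two settings concrete, here is a minimal sketch of character-based chunking followed by grouping with repeated chunks. It reflects one reading of the comments above and is not the library's implementation. The function names and the group_size parameter are hypothetical and do not appear in ZMongoRetriever.

```python
def split_by_chars(text: str, chunk_size: int) -> list:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def group_with_overlap(chunks: list, group_size: int, overlap_prior_chunks: int) -> list:
    """Group chunks into lists; each list starts with the last overlap_prior_chunks of the previous one."""
    if not 0 <= overlap_prior_chunks < group_size:
        raise ValueError("overlap_prior_chunks must be >= 0 and smaller than group_size")
    step = group_size - overlap_prior_chunks
    groups = []
    for start in range(0, len(chunks), step):
        groups.append(chunks[start:start + group_size])
        if start + group_size >= len(chunks):
            break  # the last chunk has been included
    return groups


chunks = split_by_chars("abcdefghij", chunk_size=2)
groups = group_with_overlap(chunks, group_size=3, overlap_prior_chunks=1)
```

With these inputs, chunks is ['ab', 'cd', 'ef', 'gh', 'ij'] and groups is [['ab', 'cd', 'ef'], ['ef', 'gh', 'ij']], so the chunk 'ef' is repeated at the start of the second list.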

License

Distributed under the MIT License. See LICENSE for more information.
