This is a two-part system for creating and exploring a semantic search database using ChromaDB and Sentence Transformers, visualized with Streamlit.
- Script 1 (Embedding & Storage): Loads Markdown files, generates embeddings, and stores them in a ChromaDB collection.
- Script 2 (Streamlit Explorer): Provides a web interface to view ChromaDB collections, perform semantic searches, and preview documents.
Here's a `README.md` to explain both scripts:
This project provides a two-part system to build and interact with a semantic search engine powered by ChromaDB, Sentence Transformers for embeddings, and Streamlit for the user interface.
- `ingest_markdown.py`: Loads Markdown documents from specified site folders, generates text embeddings, and stores them in ChromaDB collections.
- `app.py` (Streamlit App): A web application to explore the created ChromaDB collections, perform semantic searches, and preview documents.
- Prepare Data (Manual Step):
  - Organize your Markdown (`.md`) files into sub-folders within a base directory (e.g., `./sites_markdown/`). Each sub-folder will become a separate ChromaDB collection.
- Ingestion (Script 1: `ingest_markdown.py`):
  - Scans the base directory for site-specific sub-folders containing Markdown files.
  - For each site:
    - Reads all `.md` files.
    - Uses `SentenceTransformer("all-distilroberta-v1")` to generate embeddings for each document's content.
    - Creates (or gets) a ChromaDB collection named after the site folder (e.g., `my_site_docs`).
    - Adds the documents, their embeddings, and unique IDs to the collection.
  - The ChromaDB data is persisted to disk (in `./chroma_db/`).
- Exploration (Script 2: `app.py`):
  - Starts a Streamlit web application.
  - Connects to the existing ChromaDB instance (`./chroma_db/`).
  - Displays Collection Overview: Shows a table of all available collections, their document counts, and maximum chunk sizes.
  - Semantic Search: Allows users to select a collection, enter a search query, and retrieve the most semantically similar documents.
  - Collection Preview: Lets users select a collection and preview a limited number of its documents.
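The ingestion step described above can be sketched roughly as follows. This is a minimal illustration, not the actual `ingest_markdown.py`: the helper names `load_site_documents`, `ingest_site`, and `main` are hypothetical, and the sketch assumes that only `.md` files directly inside each site folder are read (no recursion) and that each file is stored as a single document.

```python
from pathlib import Path
import uuid


def load_site_documents(base_dir):
    """Map each sub-folder name to a list of (path, text) pairs for its .md files."""
    sites = {}
    for site_dir in sorted(Path(base_dir).iterdir()):
        if not site_dir.is_dir():
            continue
        # Assumption: only files directly inside the site folder are read.
        docs = [(p, p.read_text(encoding="utf-8")) for p in sorted(site_dir.glob("*.md"))]
        if docs:
            sites[site_dir.name] = docs
    return sites


def ingest_site(collection, embedder, docs):
    """Embed the documents and add them to one collection; returns the count added."""
    texts = [text for _, text in docs]
    embeddings = embedder.encode(texts, show_progress_bar=True)
    collection.add(
        ids=[str(uuid.uuid4()) for _ in texts],  # a fresh UUID per document
        documents=texts,
        embeddings=[[float(x) for x in vector] for vector in embeddings],
    )
    return len(texts)


def main(base_dir="./sites_markdown", db_path="./chroma_db"):
    # Heavy imports are deferred so the helpers above can be used without them.
    import chromadb
    from sentence_transformers import SentenceTransformer

    client = chromadb.PersistentClient(path=db_path)
    embedder = SentenceTransformer("all-distilroberta-v1")
    for site, docs in load_site_documents(base_dir).items():
        collection = client.get_or_create_collection(name=site)
        added = ingest_site(collection, embedder, docs)
        print(f"{site}: added {added} documents")
```

Calling `main()` would run the whole pipeline against `./sites_markdown`. ChromaDB restricts collection names (length and allowed characters), so folder names may need sanitizing before they are used as collection names.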
This script processes Markdown files, generates embeddings, and stores them in ChromaDB.
Purpose:
- Populate a ChromaDB vector database with textual content from Markdown files, making it semantically searchable.
- Organize documents into collections based on their source site/folder.
Features:
- Folder-based Organization: Reads Markdown files from sub-folders within a base directory. Each sub-folder name becomes a collection name.
- Embedding Generation: Uses `SentenceTransformer("all-distilroberta-v1")` to create dense vector embeddings.
- Persistent Storage: Uses `chromadb.PersistentClient` to save the database to disk (`./chroma_db/`).
- Idempotent Collection Creation: Uses `get_or_create_collection` so that existing collections do not cause errors. Note that re-running the script still adds duplicate documents unless this is managed separately.
- Unique Document IDs: Assigns a UUID to each document added.
- Progress Indication: Shows a progress bar during embedding generation.
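Because each run assigns fresh UUIDs, re-running the script duplicates documents. One possible way to manage this, shown here only as a sketch and not as what the script does, is to derive a deterministic ID from the site name and file path and write with the collection's `upsert` method instead of `add`. The helpers `stable_doc_id` and `upsert_documents` are hypothetical names introduced for illustration.

```python
import uuid


def stable_doc_id(site, relative_path):
    """Derive the same ID for the same site/file pair on every run."""
    # Normalize Windows separators so IDs match across platforms.
    key = f"{site}/{relative_path}".replace("\\", "/")
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


def upsert_documents(collection, site, docs, embeddings):
    """Insert new documents or overwrite existing ones.

    docs is a list of (relative_path, text) pairs; embeddings is a list of
    vectors in the same order.
    """
    collection.upsert(
        ids=[stable_doc_id(site, path) for path, _ in docs],
        documents=[text for _, text in docs],
        embeddings=embeddings,
    )
```

With this approach an unchanged file maps to the same ID on every run, so its entry is overwritten rather than duplicated. Files that were deleted or renamed would still leave stale entries behind, which would need separate cleanup.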
Prerequisites:
- Python 3.x
- Python libraries: `chromadb`, `sentence-transformers`
  - Install with `pip install chromadb sentence-transformers`
Configuration:
- `BASE_DIRECTORY` (in `if __name__ == "__main__":`): Set this to the path of your root folder containing site-specific sub-folders with Markdown files. (Default: `./sites_markdown`)
How to Use:
- Save the script: e.g., as `ingest_markdown.py`.
- Prepare Data:
  - Create a base directory (e.g., `sites_markdown`).
  - Inside it, create sub-folders for each "site" or category of documents (e.g., `sites_markdown/blog_posts`, `sites_markdown/product_docs`).
  - Place your `.md` files within these sub-folders.
- Configure `BASE_DIRECTORY` in the script if it's not `./sites_markdown`.
- Execute the script: `python ingest_markdown.py`
  - This will create/populate the `./chroma_db` directory.
A web application to interact with the ChromaDB collections created by the ingestion script.
Purpose:
- Provide a user-friendly interface to perform semantic searches on the embedded documents.
- Allow exploration and preview of the content stored in ChromaDB.
Features:
- Collection Overview: Lists all ChromaDB collections with basic statistics (document count, max chunk size).
- Semantic Search:
  - Select a collection.
  - Enter a natural language query.
  - Specify the maximum number of results.
  - Displays results sorted by similarity, showing document snippets and similarity scores.
- Collection Preview: Allows browsing a sample of documents from a selected collection.
- Caching: Uses `@st.cache_resource` for efficient initialization of the ChromaDB client and embedder.
- User-Friendly UI: Built with Streamlit for an interactive experience.
Prerequisites:
- Python 3.x
- Python libraries: `streamlit`, `chromadb`, `sentence-transformers`, `pandas`
  - Install with `pip install streamlit chromadb sentence-transformers pandas`
- ChromaDB Data: Requires the `./chroma_db` directory to be populated by the `ingest_markdown.py` script first.
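A quick way to verify the data prerequisite before launching the app is to check that the database directory exists and is not empty. This is only a rough sketch: `chroma_db_ready` is a hypothetical helper, and a non-empty directory does not guarantee that it holds valid ChromaDB collections.

```python
from pathlib import Path


def chroma_db_ready(db_path="./chroma_db"):
    """Return True if the directory exists and contains at least one entry."""
    path = Path(db_path)
    # Assumption: any content in the directory means ingestion has been run.
    return path.is_dir() and any(path.iterdir())
```

An app could call this at startup and show a hint to run `ingest_markdown.py` when it returns `False`.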
How to Use:
- Save the script: e.g., as `app.py`.
- Ensure Data Exists: Make sure you have run `ingest_markdown.py` and the `./chroma_db` directory exists and contains data.
- Run the Streamlit App: Open a terminal or command prompt, navigate to the script's directory, and run: `streamlit run app.py`
  - This will open the web application in your default browser.
- Setup:
  - Install Python.
  - Install all required Python dependencies: `pip install streamlit chromadb sentence-transformers pandas`
  - Save both Python scripts (`ingest_markdown.py` and `app.py`) in the same directory.
- Prepare Your Markdown Data:
  - Create a base directory (e.g., `sites_markdown`).
  - Organize your `.md` files into sub-folders within this base directory (e.g., `sites_markdown/site_A`, `sites_markdown/site_B`).
- Run Ingestion Script:
  - Modify `BASE_DIRECTORY` in `ingest_markdown.py` if needed.
  - Execute: `python ingest_markdown.py`
  - This will create/update the `./chroma_db` folder.
- Launch Streamlit Explorer:
  - Execute: `streamlit run app.py`
  - Interact with your semantic search engine via the web browser.