Note-Splicer is a Python system that turns unstructured personal notes (lecture transcripts, study materials, research documents) into a structured, searchable knowledge base. It uses generative AI to summarize raw text into JSON notes, then applies Retrieval-Augmented Generation (RAG) so you can ask natural-language questions and receive answers grounded in your own notes.
- Note Processing: Converts raw text files into structured JSON summaries using generative AI
- Data Compilation: Splices individual processed notes into a unified master file
- Vector Database: Builds a persistent ChromaDB vector database for semantic search
- RAG Querying: Implements a RAG pipeline for natural language queries against your notes
- Multiple LLM Support: Works with several Large Language Models (e.g., Gemini, DeepSeek via Ollama)
- Streamlit UI: Interactive web interface for querying the knowledge base
- Modular Architecture: Organized into separate components for processing, database creation, and querying
note-splicer/
├── requirements.txt  # Python dependencies
├── Spliced_Notes.json  # Compiled master notes file
├── Spliced_Notes.txt  # (Empty) Alternative text format
├── note_summarizer/  # Core processing components
│   ├── README.md  # Detailed usage guide
│   ├── requirements.txt  # Component-specific dependencies
│   ├── extracted_notes.txt  # Raw extracted notes
│   ├── spliced_notes_schema.json  # JSON schema for notes structure
│   ├── Spliced_Notes.json  # Component-level compiled notes
│   ├── structure.json  # Additional structure definitions
│   ├── structure_example.json  # Example structure
│   ├── training_dataset.jsonl  # Training data for fine-tuning
│   ├── Generated_Notes/  # AI-generated structured notes
│   │   └── Module_1/
│   │       └── Module_1/
│   │           ├── Module_1_Part_1.json
│   │           ├── Module_1_Part_2.json
│   │           └── ...
│   └── note_summarizer/  # Scripts directory
│       ├── create_finetune_dataset.py  # Dataset creation for model fine-tuning
│       ├── create_vector_db.py  # Vector database builder
│       ├── extract_notes.py  # Note extraction utilities
│       ├── finetune_lora.py  # LoRA fine-tuning script
│       ├── note_processor.py  # Main note processing script
│       ├── prepare_dataset.py  # Dataset preparation
│       ├── query_rag.py  # RAG query interface
│       ├── run_poli_sci_qa.py  # Political science QA runner
│       └── splice_json.py  # JSON splicing utility
├── src/  # Source code directory
│   ├── format_notes.py  # Note formatting utilities
│   └── note_summarizer/  # Alternative script location
│       ├── create_finetune_dataset.py
│       ├── create_vector_db.py
│       ├── extract_notes.py
│       ├── finetune_lora.py
│       ├── note_processor.py
│       ├── prepare_dataset.py
│       ├── query_rag.py
│       ├── run_poli_sci_qa.py
│       └── splice_json.py
├── Lecture Transcripts/  # Raw input data
│   └── Organized_Notes/  # Organized raw transcripts
│       └── Module_1/
│           └── Module_1/
│               ├── Module_1_Part_1.txt
│               ├── Module_1_Part_2.txt
│               └── ...
├── final-notes/  # Final processed outputs
│   ├── Final Q1 Citations.md
│   └── Final Q2 Citations.md
├── Note_Training/  # Training data organization
│   └── Organized_Notes/  # Training transcripts
├── Simple_Notes_Organized/  # Simplified note organization
│   ├── Spliced_Notes.txt
│   └── Organized_Notes/  # Simplified organized notes
├── textbook_cleanup/  # Textbook processing utilities
│   ├── extracted_text cleaned.txt
│   ├── extracted_text.txt
│   └── textbook_cleaner.py
└── vector_db/  # Vector database storage
    ├── chroma.sqlite3  # ChromaDB database file
    └── [collection_id]/  # Collection-specific data
- Clone the repository:
    git clone https://github.com/Integer-Conversion-Error/note-splicer.git
    cd note-splicer
- Set up a virtual environment:
    python -m venv .venv
    .\.venv\Scripts\activate  # On Windows
    # or
    source .venv/bin/activate  # On macOS/Linux
- Install dependencies:
    pip install -r requirements.txt
- Set up environment variables (optional, for API keys): Create a .env file in the project root with your API keys:
    GROQ_API_KEY=your_groq_api_key_here
    GOOGLE_API_KEY=your_google_api_key_here
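As an illustration of how the optional .env file could be read, here is a minimal standard-library sketch. The helper name `load_env_file` is hypothetical, and the project's scripts may load their keys differently (for example through a dedicated dotenv package); this only shows the KEY=VALUE format described above being parsed.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file and return them as a dict.

    Hypothetical helper for illustration only. Blank lines and lines that
    start with '#' are skipped. Variables already present in the process
    environment are not overwritten.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded  # The .env file is optional, so a missing file is fine.
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")  # Drop optional quotes.
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


if __name__ == "__main__":
    keys = load_env_file()
    print("Loaded keys:", sorted(keys))
```

Keeping keys in .env rather than in the scripts themselves means the file can be excluded from version control.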
- Process Raw Notes:
    python src/note_summarizer/note_processor.py
- Compile Master JSON:
    python src/note_summarizer/splice_json.py
- Build Vector Database:
    python src/note_summarizer/create_vector_db.py
- Query Your Notes:
    python src/note_summarizer/query_rag.py "Your question here"
  For an interactive experience:
    streamlit run src/note_summarizer/query_rag.py
  This launches a web interface where you can ask questions about your notes and receive AI-generated answers.
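To make the "Compile Master JSON" step more concrete, the sketch below merges per-part note files into a single list. It is not the actual splice_json.py; the function name `splice_notes` and the flat master layout (one entry per note, tagged with its source file) are assumptions for illustration. It relies only on the per-part structure shown in this README (`source_file` plus a `notes` list of `id`/`note` objects).

```python
import json
from pathlib import Path


def splice_notes(generated_dir, output_path):
    """Merge every per-part JSON file under generated_dir into one master list.

    Hypothetical sketch: the real splice_json.py may use a different layout.
    """
    master = []
    # sorted() keeps the order stable across runs (lexicographic by path).
    for json_file in sorted(Path(generated_dir).rglob("*.json")):
        with json_file.open(encoding="utf-8") as handle:
            part = json.load(handle)
        for note in part.get("notes", []):
            master.append(
                {
                    "source_file": part.get("source_file", json_file.name),
                    "id": note["id"],
                    "note": note["note"],
                }
            )
    Path(output_path).write_text(
        json.dumps(master, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return master


if __name__ == "__main__":
    notes = splice_notes("note_summarizer/Generated_Notes", "Spliced_Notes.json")
    print(f"Spliced {len(notes)} notes")
```

Two caveats with this sketch: lexicographic ordering places a file such as Part_10 before Part_2, and the output file should live outside `generated_dir` so that a later run does not read it back in as input.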
Here's an example of raw lecture transcript text from Lecture Transcripts/Organized_Notes/Module_1/Module_1/Module_1_Part_1.txt:
Hi, everyone, and welcome to this first official course of Poll 2101, Introduction to Canadian Politics.
So, in this first model, what I want to talk about is, it's a broad theme of the Canadian constitution and the Canadian political community.
So, in many ways, what I want to see is the evolution of Canada over time from what wasn't called Canada at first, but that gradually became Canada in the 19th century.
But before we talk about some of these key moments, constitutional moments, I want to talk about a key element of when we talk about Canadian politics and the political life of Canada.
That might seem obvious, but it's important to always keep in mind, which is that geography matters a great deal in politics, especially when comparing different countries, and territory matters a great deal in Canadian politics.
This is a massive country, really large, you know, it takes the same time to go from Montreal to Vancouver, then Montreal to Paris on a plane.
And so, size matters, why?
Because it created different regional patterns of settlement, different regional interests, very different, just sort of physical geography, okay?
And it also matters because of the distribution of the population that is essentially close to the American border in the south, that also creates an important difference between people who live in the north and people who live in the south in many ways.
But I think really one of the main important differences is the economy, and we're going to get back to that frequently.
The resources are not the same in each region, they have very different economic interests, they have different economic markets, whether it's Asia, in British Columbia, or more closer to Europe, in Eastern Canada.
So that creates significantly different interests, okay?
And that's a great, really important.
And that was this notion of different interests, different territory, or different regions, sorry I should say, is very much related to the importance of resources, natural resources in the history of Canada and in Canadian politics.
And that was a key aspect, and one of the most important scholars of Canadian politics in the 20s, 30s, and 40s, Harold Dennis, who is an economic historian slash political economist, who developed this theory that is called the Staple Thesis Theory.
So what is the Staple Thesis Theory?
It is basically an approach that stipulates that Canada was economically dominated by the export of a series of staples to outside markets, whether first Great Britain, and then the United States.
So that includes fur, fish, later wheat, and today oil in many ways.
So you cannot understand the key aspect of Canadian politics without understanding the fact that the Western provinces, especially Alberta, is so dependent on oil.
After processing with the AI system, the raw text is transformed into structured, summarized notes in JSON format (from note_summarizer/Generated_Notes/Module_1/Module_1/Module_1_Part_1.json):
{
  "source_file": "Module_1_Part_1.txt",
  "notes": [
    {
      "id": 1,
      "note": "Geography is a critical factor in Canadian politics, shaping regional patterns of settlement, interests, and the distribution of its population."
    },
    {
      "id": 2,
      "note": "Canada's immense size contributes to distinct regional economic interests and markets, for example, Asia for British Columbia and Europe for Eastern Canada."
    },
    {
      "id": 3,
      "note": "Harold Dennis, a 20th-century economic historian and political economist, developed the Staple Thesis Theory to explain Canadian economic development."
    },
    {
      "id": 4,
      "note": "The Staple Thesis Theory posits that Canada's economy was dominated by the export of staples (like fur, fish, wheat, and oil) to external markets, initially Great Britain and later the United States."
    },
    {
      "id": 5,
      "note": "The Staple Thesis helps explain political conflicts, such as those between the West (dependent on oil) and the central power, and regional partisan dominance (Conservatives in the West, Liberals in the East)."
    },
    {
      "id": 6,
      "note": "Before European settlement, Canada was home to over 50 distinct Indigenous nations and cultures."
    },
    {
      "id": 7,
      "note": "Early European contact with Indigenous peoples was initially viewed as interactions between nations."
    },
    {
      "id": 8,
      "note": "The Royal Proclamation of 1763 is a pivotal document defining the relationship between the British/Canadian Crown and Indigenous peoples."
    },
    {
      "id": 9,
      "note": "The Royal Proclamation of 1763 forbade the purchase or settlement of land in areas designated as 'Indian Territory' without Crown approval, requiring treaties to be negotiated."
    },
    {
      "id": 10,
      "note": "Over time, as European settlement expanded, these treaties were often not respected, leading to the displacement, depopulation, and genocide of Indigenous peoples, whose numbers significantly declined in the 19th century."
    }
  ]
}

Using the RAG system to query the processed notes:
Query: "What is the Staple Thesis Theory?"
Response: The Staple Thesis Theory, developed by Harold Dennis, posits that Canada's economy was dominated by the export of staples (like fur, fish, wheat, and oil) to external markets, initially Great Britain and later the United States. This theory helps explain political conflicts, such as those between the West (dependent on oil) and the central power, and regional partisan dominance.
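The retrieve-then-generate flow behind a query like the one above can be sketched in a few lines. This is a deliberately simplified stand-in: it ranks notes with bag-of-words cosine similarity instead of the ChromaDB and Sentence Transformers embeddings the project uses, and it stops at building the prompt rather than calling an LLM. The function names and the prompt wording are assumptions, not taken from query_rag.py.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, notes, k=2):
    """Return up to k notes most similar to the query (toy retriever)."""
    query_vec = Counter(tokenize(query))
    scored = [(cosine_similarity(query_vec, Counter(tokenize(n))), n) for n in notes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [note for score, note in scored[:k] if score > 0]


def build_prompt(query, context_notes):
    """Assemble the retrieved notes and the question into one LLM prompt."""
    context = "\n".join(f"- {note}" for note in context_notes)
    return (
        "Answer the question using only the notes below.\n\n"
        f"Notes:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


if __name__ == "__main__":
    notes = [
        "The Staple Thesis Theory posits that Canada's economy was dominated by the export of staples.",
        "Geography is a critical factor in Canadian politics.",
        "The Royal Proclamation of 1763 is a pivotal document.",
    ]
    question = "What is the Staple Thesis Theory?"
    print(build_prompt(question, retrieve(question, notes, k=1)))
```

In a real pipeline the prompt would be sent to the configured model, and the retriever would compare dense embeddings rather than word counts, which lets it match paraphrases that share no exact words with the query.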
- Python: Core programming language
- ChromaDB: Vector database for semantic search
- Sentence Transformers: For generating text embeddings
- LiteLLM: Unified interface for multiple LLMs
- Streamlit: Web interface for querying
- PyTorch & Transformers: For AI/ML operations
- Hugging Face Libraries: Model management and fine-tuning
- Pandas & NumPy: Data processing
- JSON Schema: Data validation
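As a rough picture of what validating a processed note file involves, the sketch below checks only the fields visible in the example output in this README (`source_file`, and a `notes` list of objects with an integer `id` and a text `note`). It uses plain Python rather than a JSON Schema library, and it does not reproduce the contents of spliced_notes_schema.json, which are not shown here; the function name is hypothetical.

```python
def validate_notes_document(doc):
    """Return a list of problems found in a parsed note file (empty if valid).

    Hypothetical helper: checks only the structure shown in the example JSON.
    """
    if not isinstance(doc, dict):
        return ["document must be a JSON object"]
    problems = []
    if not isinstance(doc.get("source_file"), str):
        problems.append("source_file must be a string")
    notes = doc.get("notes")
    if not isinstance(notes, list):
        problems.append("notes must be a list")
        return problems
    seen_ids = set()
    for index, note in enumerate(notes):
        if not isinstance(note, dict):
            problems.append(f"notes[{index}] must be an object")
            continue
        note_id = note.get("id")
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(note_id, int) or isinstance(note_id, bool):
            problems.append(f"notes[{index}].id must be an integer")
        elif note_id in seen_ids:
            problems.append(f"notes[{index}].id duplicates id {note_id}")
        else:
            seen_ids.add(note_id)
        text = note.get("note")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"notes[{index}].note must be non-empty text")
    return problems


if __name__ == "__main__":
    sample = {
        "source_file": "Module_1_Part_1.txt",
        "notes": [{"id": 1, "note": "Geography is a critical factor."}],
    }
    print(validate_notes_document(sample))
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a generated file at once.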
The system is designed to work with personal notes, including:
- Lecture transcripts (as shown in examples)
- Study materials
- Research documents
- Textbook excerpts
- Personal annotations
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
This project is open-source. Please check the repository for licensing information.
For questions or support, please open an issue on the GitHub repository.