LLM-Powered Research Agent – SOC

[Image: LLM-Powered Research Agent infographic]

👋 Welcome!

Hey everyone!

This page will be your central resource hub throughout the Summer of Code program. I, Prince (Elec), along with Anushka (Civil) and Swaprabha (EP), will be your mentors for this journey. Together, we'll build an AI agent that reads research papers, extracts knowledge, and generates literature reviews or hypotheses using LLMs.

We'll update this page weekly with new goals, resources, and tasks. Let's learn, build, and ship something awesome!

Week 1-2: PDF Parsing + Embedding Basics

Goal of the Week:

Parse research papers (PDFs)
- Before doing anything smart, we need to extract the raw text from the PDFs of research papers.
We'll use tools like:
- PyMuPDF or
- pdfplumber
Goal: Convert a research paper into raw text (title, abstract, paragraphs, sections).
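Here is a minimal sketch of the parsing step with PyMuPDF, assuming it is installed (pip install pymupdf). The helper names extract_pages and pages_to_text and the file name paper.pdf are made up for illustration; treat this as a starting point, not the final parser.

```python
def extract_pages(pdf_path):
    """Return a list with the plain text of each page in the PDF."""
    import fitz  # PyMuPDF; imported lazily so the file loads without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def pages_to_text(pages):
    """Join per-page text into one string, skipping empty pages."""
    return "\n\n".join(p.strip() for p in pages if p.strip())
```

Typical use would be text = pages_to_text(extract_pages("paper.pdf")). Splitting the result into title, abstract, and sections is a separate job that this sketch does not attempt.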
Clean + chunk text
- Once we have the text:
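- Clean it (remove headers, footers, strange symbols, extra whitespace)
- Chunk it: break the long document into smaller parts (like paragraphs or 200-word sections)
This is important because LLMs can only read a limited amount of text at once, and the embedding of a short, focused chunk captures its meaning more precisely than the embedding of a whole document, which makes later search more accurate.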
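A rough sketch of the cleaning step using only Python's standard library. The rules here (drop bare page numbers, rejoin words hyphenated across lines, collapse whitespace) are assumptions about typical extraction noise, and clean_text is a hypothetical helper name. Real papers will likely need extra rules for their own headers and footers.

```python
import re


def clean_text(raw):
    """Apply a few simple cleanup rules to text extracted from a PDF."""
    lines = []
    for line in raw.splitlines():
        line = line.strip()
        # Drop lines that contain only a page number, e.g. "12".
        if re.fullmatch(r"\d{1,4}", line):
            continue
        lines.append(line)
    text = "\n".join(lines)
    # Rejoin words split across a line break: "trans-\nformer" -> "transformer".
    # Caveat: this also merges genuine compounds such as "self-\nattention".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces/tabs, then runs of 3+ newlines into one blank line.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```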

→Example chunk:

“In this study, we explore how transformer models can be used for document understanding...”

Use tools like:
- LangChain’s RecursiveCharacterTextSplitter
- LlamaIndex’s document chunkers
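To see what a chunker does under the hood, here is a small word-based sketch in plain Python. It is not how LangChain or LlamaIndex work internally (RecursiveCharacterTextSplitter, for example, splits on characters and prefers paragraph and sentence boundaries). The function name chunk_words and the overlap parameter are assumptions added for illustration; overlap repeats a few words between neighbouring chunks so a sentence cut at a boundary still appears whole in one of them.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most chunk_size words, sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

With the defaults, each chunk holds up to 200 words and starts 180 words after the previous one.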
Generate embeddings
- Now we convert each chunk of text into a vector (a long list of numbers) that captures its meaning. This vector is called an embedding.
You can use:
- OpenAI Embeddings (text-embedding-3-small)
- Sentence Transformers (all-MiniLM-L6-v2, BGE models)
Why embeddings? Texts with similar meaning get similar vectors, which lets us later search by meaning, not just keywords.
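A minimal sketch of this step with Sentence Transformers, assuming the package is installed (pip install sentence-transformers) and the all-MiniLM-L6-v2 model can be downloaded on first use. The helper names embed_chunks and cosine_similarity are made up for illustration. Cosine similarity is one common way to measure how "similar" two vectors are.

```python
import math


def embed_chunks(chunks, model_name="all-MiniLM-L6-v2"):
    """Return one embedding (a list of floats) per chunk."""
    from sentence_transformers import SentenceTransformer  # third-party, lazy import

    model = SentenceTransformer(model_name)
    return model.encode(chunks).tolist()


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Comparing the embeddings of two chunks on the same topic with cosine_similarity should generally give a higher score than comparing chunks on unrelated topics.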
Store them in a vector DB (FAISS or Weaviate)
- We now store the chunks and their embeddings in a vector database.
- FAISS: simple, fast, runs locally as a library inside your Python program
- Weaviate: more powerful, runs as a separate server (self-hosted or cloud), supports filters & metadata
These databases let us search for relevant text chunks later, based on a user's question.
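A small sketch of storing and searching embeddings with FAISS, assuming faiss-cpu and NumPy are installed. The helper names build_index, search, and ids_to_chunks are hypothetical. FAISS stores only the vectors, so we keep the chunk texts in a plain Python list in the same order and map result ids back to them.

```python
def build_index(embeddings):
    """Build an exact inner-product FAISS index from equal-length vectors."""
    import faiss  # third-party (pip install faiss-cpu), lazy import
    import numpy as np

    vectors = np.asarray(embeddings, dtype="float32")
    faiss.normalize_L2(vectors)  # unit vectors: inner product == cosine similarity
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index


def search(index, query_embedding, chunks, k=3):
    """Return up to k chunks whose embeddings are most similar to the query."""
    import faiss
    import numpy as np

    query = np.asarray([query_embedding], dtype="float32")
    faiss.normalize_L2(query)
    _scores, ids = index.search(query, k)
    return ids_to_chunks(ids[0], chunks)


def ids_to_chunks(ids, chunks):
    """Map FAISS row ids back to chunk text; FAISS pads missing results with -1."""
    return [chunks[i] for i in ids if i >= 0]
```

IndexFlatIP does an exact brute-force search, which is fine for a handful of papers. Larger collections may need one of FAISS's approximate index types.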

Later on:

When the user asks: “What methods did this paper use?” →

We convert the question to an embedding → find similar chunks in the DB → send those chunks, together with the question, to GPT to generate the answer.
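The last arrow in that flow, packing the retrieved chunks and the question into one prompt, might look like the sketch below. The prompt wording and the function name build_prompt are assumptions for illustration, and the actual call to GPT is left out on purpose since it depends on which API and model we end up using.

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved chunks and the user's question into one prompt string."""
    # Number each excerpt so the model (and we) can refer back to it.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the excerpts below. "
        "If the answer is not in them, say so.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Telling the model to rely only on the excerpts is a common way to keep answers tied to the paper instead of the model's general knowledge.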

🔧 Tools & Libraries (Week 1-2):

→Parse research papers (PDFs)

PyMuPDF – for PDF text extraction
- https://pymupdf.readthedocs.io/en/latest/tutorial.html
- https://youtube.com/playlist?list=PLHlrXTRZkTLBQ7k06CFoP3-gayAmfp-GG&si=B6WoVnDMqkx0w130
pdfplumber – alternative parser
- https://azhar-sayyad.medium.com/a-step-by-step-guide-to-parsing-pdfs-using-the-pdfplumber-library-in-python-c12d94ae9f07
- https://www.youtube.com/watch?v=nk7e-Id6O3E

→LangChain Text Splitters – for chunking