
MediSeek

A medical chatbot trained on Huberman Lab podcast transcripts — comparing a naïve DeepSeek-7B, an attention LSTM, and a fine-tuned DeepSeek-7B.

Duration: 3 weeks
Models Compared: 3


Project Overview

MediSeek answers health and wellness questions grounded in Huberman Lab episodes. It provides a choice of three model pathways: a naïve DeepSeek-7B (no tuning), an attention-based LSTM trained on podcast-derived QA pairs, and a fine-tuned DeepSeek-7B optimized via PEFT/LoRA. The goal is to evaluate quality/cost trade-offs and let users pick the best fit.

Problem

  • Users want concise, reliable answers sourced from long, noisy podcast transcripts.
  • General-purpose LLMs can hallucinate or miss topic-specific nuances without targeted tuning.
  • Different users (casual vs. expert) need different response depth and rigor.

Architecture & Solution

Data Collection

Episodes scraped and transcripts pulled via Selenium and the Podscribe API; raw text is stored under data/raw_transcripts.
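The sketch below shows one way the transcript pull could look with Selenium. The Podscribe URL pattern, CSS selector, and file naming are illustrative assumptions, not the project's actual values.

```python
# Transcript-scraping sketch. The episode URLs and the CSS selector for
# transcript text are hypothetical placeholders.
from pathlib import Path

from selenium import webdriver
from selenium.webdriver.common.by import By

RAW_DIR = Path("data/raw_transcripts")
RAW_DIR.mkdir(parents=True, exist_ok=True)

# Hypothetical list of episode transcript pages collected beforehand.
episode_urls = [
    "https://app.podscribe.ai/episode/12345",  # placeholder
]

driver = webdriver.Chrome()  # assumes a local Chrome/ChromeDriver setup
try:
    for i, url in enumerate(episode_urls):
        driver.get(url)
        # Selector is an assumption; inspect the real page to find it.
        blocks = driver.find_elements(By.CSS_SELECTOR, ".transcript-text")
        text = "\n".join(b.text for b in blocks)
        (RAW_DIR / f"episode_{i:03d}.txt").write_text(text, encoding="utf-8")
finally:
    driver.quit()
```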

Cleaning

Removed metadata, timestamps, and ads, then normalized the text with clean_podcasts.py; output is stored in data/clean_transcripts.
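A minimal cleaning pass in the spirit of clean_podcasts.py might look like the following; the exact regexes for timestamps and sponsor reads are assumptions.

```python
# Cleaning sketch (the real clean_podcasts.py may differ). Strips timestamps
# like "00:12:34" or "(12:34)", sponsor reads, and excess whitespace.
import re
from pathlib import Path

RAW_DIR = Path("data/raw_transcripts")
CLEAN_DIR = Path("data/clean_transcripts")
CLEAN_DIR.mkdir(parents=True, exist_ok=True)

TIMESTAMP = re.compile(r"\(?\b\d{1,2}:\d{2}(?::\d{2})?\b\)?")
AD_BLOCK = re.compile(r"(?is)today's episode is brought to you by.*?(?=\n\n|\Z)")

def clean(text: str) -> str:
    text = AD_BLOCK.sub(" ", text)          # drop sponsor reads (pattern is an assumption)
    text = TIMESTAMP.sub(" ", text)         # drop inline timestamps
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse blank-line runs
    return text.strip()

for path in RAW_DIR.glob("*.txt"):
    cleaned = clean(path.read_text(encoding="utf-8"))
    (CLEAN_DIR / path.name).write_text(cleaned, encoding="utf-8")
```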

QA Generation

GPT-4o generated ~30 QA pairs per episode across General, Specific, and Technical levels; stored in data/qa_pairs.
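A hedged sketch of the QA-generation step using the OpenAI chat-completions API is shown below; the prompt wording, context-length guard, and JSON output schema are illustrative assumptions.

```python
# QA-generation sketch: prompt text and output schema are assumptions.
import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CLEAN_DIR = Path("data/clean_transcripts")
QA_DIR = Path("data/qa_pairs")
QA_DIR.mkdir(parents=True, exist_ok=True)

PROMPT_HEADER = (
    "From the Huberman Lab transcript below, write roughly 30 question-answer pairs: "
    "10 General, 10 Specific, and 10 Technical. Return a JSON object with a "
    "'qa_pairs' list whose items have 'level', 'question', and 'answer' keys.\n\n"
    "Transcript:\n"
)

for path in sorted(CLEAN_DIR.glob("*.txt")):
    transcript = path.read_text(encoding="utf-8")[:60_000]  # crude context-length guard
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT_HEADER + transcript}],
        response_format={"type": "json_object"},
    )
    qa = json.loads(resp.choices[0].message.content)
    (QA_DIR / f"{path.stem}.json").write_text(json.dumps(qa, indent=2), encoding="utf-8")
```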

Model Suite

A naïve DeepSeek-7B (no tuning); an attention-based LSTM encoder-decoder trained on the QA pairs; and a fine-tuned DeepSeek-7B adapted via PEFT/LoRA using instruction-style prompts.
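The fine-tuned pathway can be sketched with the PEFT library roughly as follows; the checkpoint name, LoRA rank, target modules, and prompt template are assumptions rather than the project's exact configuration.

```python
# LoRA fine-tuning sketch. Checkpoint name and hyperparameters are assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

BASE = "deepseek-ai/deepseek-llm-7b-chat"  # assumed DeepSeek-7B checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable

# Instruction-style prompt built from one QA pair (format is an assumption).
def format_example(question: str, answer: str) -> str:
    return f"### Question:\n{question}\n\n### Answer:\n{answer}"
```

Training then proceeds with a standard causal-language-modeling loop (e.g., the Hugging Face Trainer) over the instruction-formatted QA pairs.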

Evaluation

Held-out test set (latest 5 episodes). Metrics: ROUGE-1/2/L, BLEU, METEOR, BERTScore; per-topic visualizations and model comparisons.
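These metrics can be computed with the Hugging Face evaluate library, for example as below; the project's own scripts may load them differently, and the prediction/reference strings here are placeholders.

```python
# Metric-computation sketch using the `evaluate` library.
import evaluate

# Placeholder prediction/reference pair for illustration only.
predictions = ["Morning sunlight helps anchor the circadian rhythm."]
references = ["Viewing sunlight in the morning helps set the circadian clock."]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

bert = bertscore.compute(predictions=predictions, references=references, lang="en")
scores = {
    **rouge.compute(predictions=predictions, references=references),  # rouge1/rouge2/rougeL
    "bleu": bleu.compute(predictions=predictions, references=[[r] for r in references])["bleu"],
    "meteor": meteor.compute(predictions=predictions, references=references)["meteor"],
    "bertscore_f1": sum(bert["f1"]) / len(bert["f1"]),
}
print(scores)
```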

Technology Stack

Core

  • Python · PyTorch · Transformers · PEFT/LoRA

Data

  • Selenium · Pandas · NumPy

Eval & Viz

  • ROUGE · BLEU · METEOR · BERTScore · Matplotlib

Key Features

  • Three selectable models for cost/quality trade-offs.
  • Topic-aware QA generation for varied expertise levels.
  • Reproducible evaluation pipeline with HTML report and plots.

Evaluation

We compare the naïve DeepSeek-7B, attention LSTM, and fine-tuned DeepSeek-7B on the latest five episodes not seen during training. Metrics include ROUGE-1/2/L, BLEU, METEOR, and BERTScore, with per-topic breakdowns and model-to-model comparisons. View the full interactive summary here: Evaluation Report (HTML).