Audivine

Transform Spotify listening data and lyrics into collage-style visual artwork for fans and artists.

Duration: 1 month

Team Size: 1

Project Overview

Audivine converts musical identity into visuals. It ingests Spotify top tracks and lyrics, summarizes the narrative with an LLM, and generates collage-style artwork with a fine-tuned Stable Diffusion XL model. For artists, the same pipeline turns lyrics/metadata into marketing-ready visuals that tell the story of a song or release.

Problem Statement

Artists struggle to turn songs/lyrics into compelling visual stories for promotion.
Listeners lack a personalized way to visualize their musical identity.

Project Details

Data & Captions

Custom collage-style dataset curated from Unsplash; 105 image–caption pairs. Initial BLIP captions refined manually and with GPT-4o-mini.

Model

SDXL (stabilityai/stable-diffusion-xl-base-1.0) fine-tuned via DreamBooth + LoRA; training on Google Colab A100 (~4 hours, ~3000 epochs).

Inference

Prompts from lyrics/metadata using GPT-4o-mini; generation via Replicate (fine-tuned SDXL) or Stability AI API; streamed back to the client.

Responsive Design

Spotify (top tracks), LyricsOVH (lyrics), WebSocket backend on EC2; negative prompts to avoid violent/undesired content.

Tech Stack

Frontend

React.js 18 with Hooks
TypeScript for type safety
Styled Components with Bootstrap

Backend

Flask/FastAPI on AWS EC2 · WebSockets

ML/AI

Stable Diffusion XL (fine-tuned with DreamBooth + LoRA)
Hugging Face Diffusers
BLIP + GPT-4o-mini for summaries & prompt generation

Services

Replicate (fine-tuned model hosting)
Stability AI API (naïve baseline)
Colab for Model Training
Spotify API + LyricsOVH

Key Features

1) Listener Mode

Fetch top tracks → fetch lyrics → summarize narratives → generate prompts → create collage-style art.

2) Artist Mode

Use lyrics + album artwork + form inputs to steer prompts toward marketing-ready visuals.

Results & Evaluation

Stylistic Shift

Fine-tuned SDXL yields noticeably more collage-like, distinctive outputs than the naïve baseline.

User Study

10 participants rated style difference between models: avg 4.2/5 (higher = more different).

Latency

Baseline: ~5.8s/image; Fine-tuned on Replicate: ~11s/image (cold starts + minimal optimization).

Ethics

Images sourced legally from Unsplash; user listening data is not persisted beyond the session. Generated artwork is intended for personal use or artist promotion, not resale.