Enhancing DistilBERT to predict legislative bill subjects from titles and metadata using factual legal knowledge.
The goal is to classify U.S. legislative bills into subject categories using only metadata, primarily bill titles, by fine-tuning DistilBERT and injecting factual legal knowledge into training. This supports analysts and researchers with fast, consistent tagging across large volumes of bills.
Titles and subject categories were collected from congress.gov (93rd–118th Congresses for train/val; 119th for test).
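A rough sketch of the congress-based split; the file name and column names (`congress`, `subject`) are illustrative placeholders, not the project's actual schema.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical flat export of the congress.gov metadata
bills = pd.read_csv("bills.csv")

# 93rd-118th Congresses for train/val, 119th held out as the test set
train_val = bills[bills["congress"].between(93, 118)]
test = bills[bills["congress"] == 119]

# Example 90/10 train/val split, stratified by subject label
train, val = train_test_split(
    train_val, test_size=0.1, stratify=train_val["subject"], random_state=42
)
```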
Label mapping is managed via label_mapping.json; data cleaning is handled in process_data.py.
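A minimal sketch of how the mapping and cleaning could fit together; the actual logic lives in process_data.py, and the shape of label_mapping.json (assumed here to map subject names to integer ids) and the cleaning rules may differ.

```python
import json
import re

# Assumed format: {"Agriculture and Food": 0, "Armed Forces and National Security": 1, ...}
with open("label_mapping.json") as f:
    label2id = json.load(f)
id2label = {v: k for k, v in label2id.items()}

def clean_title(title: str) -> str:
    """Lowercase the title, drop a hypothetical boilerplate prefix, and collapse whitespace."""
    title = title.lower()
    title = re.sub(r"^(a bill|to amend)\s+", "", title)
    return re.sub(r"\s+", " ", title).strip()
```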
Base model: distilbert-base-uncased with a custom classification head; end-to-end fine-tuning with cross-entropy loss.
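A sketch of the model setup using the standard Hugging Face sequence-classification head (a linear layer over the pooled DistilBERT output); the project's custom head may differ. label2id/id2label are the mappings loaded from label_mapping.json above.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id,
)
# When `labels` are passed in the forward call, the model returns a
# cross-entropy loss, so end-to-end fine-tuning needs no separate loss.
```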
Focal property: factual legal knowledge, which promotes better separation of near-neighbor subjects and improves generalization.
Evaluation: Accuracy, macro Precision/Recall/F1, and confusion matrices; auxiliary ROUGE, BLEU, and BERTScore for semantic similarity.
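One plausible way to compute the classification metrics with scikit-learn; the ROUGE/BLEU/BERTScore comparisons are computed separately on predicted versus gold subject strings and are not shown here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Scalar metrics (accuracy, macro P/R/F1) logged during evaluation."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_precision": precision,
        "macro_recall": recall,
        "macro_f1": f1,
    }

def per_class_confusion(labels, preds):
    """Full confusion matrix for per-class error analysis."""
    return confusion_matrix(labels, preds)
```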
Training runs on Colab (A100) and is orchestrated in run_legal_bert.ipynb; metrics are logged with Weights & Biases.
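An illustrative Trainer configuration with Weights & Biases logging; the hyperparameters are placeholders rather than the notebook's tuned values, and train_dataset/val_dataset stand in for the tokenized splits prepared earlier.

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=32,   # placeholder; an A100 fits large batches
    num_train_epochs=3,               # placeholder
    learning_rate=2e-5,               # placeholder
    report_to="wandb",                # stream metrics to Weights & Biases
    run_name="distilbert-bill-subjects",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,      # assumed: tokenized training split
    eval_dataset=val_dataset,         # assumed: tokenized validation split
    compute_metrics=compute_metrics,
)
trainer.train()
```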
We report Accuracy and macro-averaged Precision/Recall/F1 on the validation set and on an independent 119th-Congress test set. To interpret errors, we include per-class confusion matrices and additional semantics-focused scores (ROUGE, BLEU, BERTScore). Our results show that injecting factual legal knowledge substantially improves classification performance, especially on challenging near-neighbor subjects.