Transformer-Based Retrieval

Supervisor: Dr. Jian Zhu · UBC · Dec 2025 – Present

This project investigates how reasoning can be integrated into late-interaction retrieval models. Building on the ColBERT family of architectures, it aims to improve retrieval quality on complex, multi-hop, and reasoning-intensive queries, settings where standard bi-encoder and sparse models fall short. The work involves model implementation, training on HPC GPU clusters, and systematic evaluation on standard retrieval benchmarks.

ColBERT · SPLADE · Neural IR · PyTorch · HPC / H100 · BEIR · MS MARCO
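For readers unfamiliar with late interaction, the scoring rule used by ColBERT-style models can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the project's implementation: the function names and the toy 2-d embeddings are assumptions made up for the example, and real models use learned token embeddings computed in batches on GPU.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) score: each query token is matched to its
    most similar document token, and the per-token maxima are summed."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )

# Toy token embeddings, invented for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0]]  # has a good match for both query tokens
doc_b = [[1.0, 0.0], [3.0, 0.0]]  # matches only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Because every query token keeps its own embedding, a document is rewarded only when it covers each part of the query, which a single pooled vector (as in a bi-encoder) cannot express directly.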

Presentations

Sparse-to-Dense Retrieval on BRIGHT: SPLADE Retrieval with ColBERT Reranking

Canadian AI 2026 · Responsible AI Track · Poster / 3MT

NSERC CREATE Scholarship · May 2026

This work evaluates a sparse-to-dense retrieval pipeline on BRIGHT, a reasoning-intensive retrieval benchmark. We use SPLADE as the first-stage retriever on the biology domain, followed by ColBERT as a late-interaction reranker. We reproduce and stress-test this pipeline under realistic evaluation conditions, examining where sparse retrieval succeeds and where dense reranking is needed to close the performance gap. The study contributes a reproducibility perspective on the interaction between sparse and dense retrieval in complex, knowledge-intensive tasks.
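The retrieve-then-rerank structure described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the evaluated system: `sparse_score`, `retrieve_then_rerank`, the term weights, and the `dense_scores` lookup (a stand-in for ColBERT scores) are all hypothetical names and values chosen for the example.

```python
def sparse_score(query_weights, doc_weights):
    """Dot product of two sparse term-weight dicts (SPLADE-style vectors)."""
    return sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())

def retrieve_then_rerank(query_weights, sparse_index, rerank_fn, k=100):
    """Stage 1: keep the top-k documents by sparse score.
    Stage 2: reorder only those candidates using rerank_fn(doc_id) -> float."""
    first_stage = sorted(
        sparse_index,
        key=lambda doc_id: sparse_score(query_weights, sparse_index[doc_id]),
        reverse=True,
    )[:k]
    return sorted(first_stage, key=rerank_fn, reverse=True)

# Toy data, invented for illustration only.
sparse_index = {
    "d1": {"cell": 1.2, "membrane": 0.8},
    "d2": {"cell": 0.9},
    "d3": {"protein": 1.5},
}
query_weights = {"cell": 1.0, "membrane": 0.5}
dense_scores = {"d1": 0.4, "d2": 0.7, "d3": 0.9}  # stand-in for reranker scores

ranking = retrieve_then_rerank(query_weights, sparse_index, dense_scores.get, k=2)
```

In this toy run, `d3` has the highest reranker score but never reaches the second stage because it shares no terms with the query. A reranker can only reorder what the first stage returns, so the first stage's recall bounds the recall of the whole pipeline.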

Best Recall@10: 0.348 (ColBERT)

Best nDCG@10: 0.309 (SPLADE + ColBERT)

Best MRR: 0.419 (ColBERT)

Results — BRIGHT Biology · 103 Queries

Pipeline                                                     nDCG@10  Recall@10  MRR     MAP
SPLADE (first-stage only, no reranking)                      0.218    0.254      0.291   -
★ SPLADE + ColBERT (two-stage, sparse → dense reranker)      0.309*   0.345      0.411   0.255*
ColBERT dense only (dense baseline, full corpus, no filter)  0.308    0.348*     0.419*  0.255*

* marks the best value per metric (MAP is tied). ★ marks the proposed pipeline.
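For reference, the four reported metrics can be computed per query as in the sketch below and then averaged over all queries (MRR and MAP are the means of reciprocal rank and average precision). The source does not say which evaluation tool or relevance grading was used, so this sketch assumes binary relevance, and the function names and toy ranking are hypothetical.

```python
import math

def recall_at_k(ranked, relevant, k=10):
    """Fraction of relevant documents that appear in the top k."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """Mean of precision values taken at the rank of each relevant hit."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# Toy ranking for one query, invented for illustration only.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

metrics = {
    "recall@10": recall_at_k(ranked, relevant),
    "rr": reciprocal_rank(ranked, relevant),
    "ap": average_precision(ranked, relevant),
    "ndcg@10": ndcg_at_k(ranked, relevant),
}
```

Recall@10 ignores ordering within the top 10, while nDCG@10, MRR, and MAP reward placing relevant documents earlier, which is why two pipelines can have nearly identical recall yet differ on the rank-sensitive metrics.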

SPLADE · ColBERT · BRIGHT Benchmark · Sparse-to-Dense · Reranking · Reproducibility