Knowledge Graph Question Answering


What this is

Most “GraphRAG beats RAG” demos are confounded: the graph pipeline quietly also gets a reranker, a different corpus, or even a prompt that leaks the answer. This project runs a 4-arm ablation on PubMedQA in which every layer the arms share (corpus, chunking, embedder, reranker, prompt, LLM, top-k, seed) is held constant and each arm adds exactly one component to the previous one. The accuracy change between adjacent arms is therefore attributable to that single component, and each change is tested for significance with a paired McNemar test.

Architecture and 4-arm ablation

Results (n = 200, seed 42)

Arm	Accuracy	Macro F1	Adds
`plain`	30.0%	29.7%	baseline chunk RAG
`plain_rr`	37.0%	35.2%	+ cross-encoder reranker
`graph`	59.5%	50.5%	+ parent-paper expansion
`graph_concepts`	57.5%	50.0%	+ MeSH concept hop
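The one-component-per-step design in the Adds column can be made explicit in code. The following is a minimal sketch, not the project's actual configuration: the arm names come from the table, but `ArmConfig`, its flag names, and `check_single_change` are hypothetical, and the sketch assumes the arms are cumulative (each keeps everything the previous arm had).

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ArmConfig:
    """Hypothetical per-arm switches; shared layers (corpus, LLM, seed, ...) are not modeled."""
    name: str
    rerank: bool = False            # cross-encoder reranker
    parent_expansion: bool = False  # parent-paper expansion
    concept_hop: bool = False       # MeSH concept hop


# Assumption: arms are cumulative, as the "+ ..." entries in the Adds column suggest.
ARMS = [
    ArmConfig("plain"),
    ArmConfig("plain_rr", rerank=True),
    ArmConfig("graph", rerank=True, parent_expansion=True),
    ArmConfig("graph_concepts", rerank=True, parent_expansion=True, concept_hop=True),
]


def changed_flags(prev, curr):
    """Return the names of the switches that differ between two arms."""
    return [
        f.name
        for f in fields(ArmConfig)
        if f.name != "name" and getattr(prev, f.name) != getattr(curr, f.name)
    ]


def check_single_change(arms):
    """Raise if any adjacent pair of arms differs by anything other than one switch."""
    steps = []
    for prev, curr in zip(arms, arms[1:]):
        diff = changed_flags(prev, curr)
        if len(diff) != 1:
            raise ValueError(f"{prev.name} -> {curr.name} changes {diff}, expected exactly one")
        steps.append((prev.name, curr.name, diff[0]))
    return steps


STEPS = check_single_change(ARMS)
for prev_name, curr_name, flag in STEPS:
    print(f"{prev_name} -> {curr_name}: adds {flag}")
```

A check like this fails fast if a later edit introduces a second difference between adjacent arms, which is the kind of confound the ablation is meant to rule out.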

Figure: 4-arm ablation.

The honest finding: the graph’s decisive, statistically significant win comes from parent-paper expansion (`plain_rr` → `graph`: +22.5 pp, McNemar p < 0.0001). The reranker helps, but its gain is not significant at the 0.05 level (+7 pp, p = 0.08). MeSH concept-hop expansion does not help on this single-abstract dataset (−2 pp, p = 0.69) and costs roughly 5× the latency. The graph helps by deepening context (more of the parent paper), not by broadening it (hops to related concepts).
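For readers unfamiliar with the test behind these p-values: McNemar's test compares two arms on the same questions and looks only at the discordant pairs, the questions one arm answers correctly and the other does not. Below is a minimal sketch of the exact (binomial) two-sided variant using only the standard library. Whether the project uses the exact or the chi-square form is not stated here, so treat that choice as an assumption; `mcnemar_exact` is a hypothetical helper, and the correctness flags are toy data, not the project's results.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-question correctness flags.

    Returns (only_a, only_b, p_value), where only_a counts questions that
    arm A got right and arm B got wrong, and only_b the reverse.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both arms must be scored on the same questions")

    # Only discordant pairs carry information about which arm is better.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0

    # Under H0 each discordant pair favors either arm with probability 0.5,
    # so the smaller count follows Binomial(n, 0.5); double the lower tail.
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Toy data (NOT the project's results): 12 questions scored under two arms.
arm_a = [True, False, False, True, False, False, False, False, True, False, False, False]
arm_b = [True, True, True, True, True, True, True, True, False, True, True, False]

only_a, only_b, p_value = mcnemar_exact(arm_a, arm_b)
print(f"discordant pairs: {only_a} vs {only_b}, p = {p_value:.4f}")
```

Because the test is paired, two arms with a large accuracy gap can still yield a weak p-value if few questions flip between them, and vice versa. That is why the per-question predictions, not just the accuracy column, are needed to reproduce the significance results.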

Explore

Live results dashboard — interactive bars, significance tests, per-class breakdown
Source code & README — package, scripts, tests, CI
Project report (PDF) and slides