“Words are fossils in which the life of the past is embalmed.” — Max Müller
This project analyses semantic drift — the phenomenon where word meanings shift gradually over decades and centuries — using Stanford’s pre-trained historical word embeddings derived from the Google Books Ngram corpus.
| Field | Detail |
|---|---|
| Course | Natural Language Processing (B.Tech) |
| Dataset | Stanford/Google Ngram SGNS Embeddings |
| Time Span | 1800s – 2000s (per-decade granularity) |
| Vocabulary | ~65,000 words per decade |
| Vector Dims | 300-dimensional Skip-gram with Negative Sampling (SGNS) |
# Create virtual environment
python -m venv .venv
# Activate (Windows)
.venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
python main.py download
This downloads ~1.4 GB from Stanford and automatically unzips it into the sgns/ folder.
Manual download: http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip
python main.py serve
# OR just double-click start.bat
Then open frontend/index.html in your browser.
| Tab | Description |
|---|---|
| Word Explorer | Analyse a word — drift score, change type, neighbour shift, t-SNE map |
| Timeline | Drift across all available decades for a single word |
| Heat Map | Multi-word comparison heatmap across all decades |
| Top Drifted | Global ranking of most-changed words + word cloud |
| Concept Relations | Track how two concepts converge/diverge over time |
| Case Studies | 8 famous semantic change examples with one-click analysis |
| About | Methodology, tech stack, change type taxonomy |
# Analyse a single word between two decades
python main.py word network --from 1900 --to 1990
# Full timeline for a word
python main.py timeline virus --out results/virus.csv
# Top-30 most drifted words
python main.py top --from 1900 --to 1990 --n 30
# Save t-SNE plot to file
python main.py word india --from 1800 --to 1990 --plot
Hamilton et al. (2016) trained Skip-gram with Negative Sampling (SGNS) on decade-sliced Google Books Ngram data. Each decade has a 300-dimensional vector for every word in the vocabulary.
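As a rough illustration of how one decade might be read from disk, here is a minimal sketch. It assumes the layout shown in the project tree (`sgns/1900-vocab.pkl`, `sgns/1900-w.npy`), that the pickle holds a list of words, and that row `i` of the array is the vector for word `i`. The function names `load_decade` and `word_vector` are hypothetical and are not taken from `core/embeddings.py`.

```python
import pickle
from pathlib import Path

import numpy as np


def load_decade(sgns_dir, decade):
    """Return (vocab, vectors) for one decade.

    Assumption: <decade>-vocab.pkl is a pickled list of words and
    <decade>-w.npy is a (vocab_size, dims) array in the same order.
    """
    base = Path(sgns_dir)
    with open(base / f"{decade}-vocab.pkl", "rb") as fh:
        # Pickles written by Python 2 may need pickle.load(fh, encoding="latin1").
        vocab = list(pickle.load(fh))
    vectors = np.load(base / f"{decade}-w.npy")
    if len(vocab) != vectors.shape[0]:
        raise ValueError("vocabulary and matrix have different lengths")
    return vocab, vectors


def word_vector(word, vocab, vectors):
    """Look up the vector for `word`, or raise KeyError if it is absent."""
    try:
        return vectors[vocab.index(word)]
    except ValueError:
        raise KeyError(word) from None
```

For repeated lookups, building a `dict` from word to row index once per decade would avoid the linear scan in `vocab.index`.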
drift(w, t₁, t₂) = 1 − cosine_similarity(v_w^t₁, v_w^t₂)
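Typical range [0, 1]: 0 = stable, 1 = completely shifted. Cosine distance can in principle reach 2, but only when the two vectors point in opposite directions.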
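The drift formula can be sketched in a few lines of plain Python. This is an illustrative version, not the code in `analysis/drift_analysis.py`. The function names are hypothetical, and the vectors are assumed to be already aligned to a common space.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def drift(v_t1, v_t2):
    """drift(w, t1, t2) = 1 - cosine_similarity(v_w^t1, v_w^t2)."""
    return 1.0 - cosine_similarity(v_t1, v_t2)
```

Identical directions give a drift of 0, and orthogonal vectors give a drift of 1.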
Since embedding spaces trained independently are not inherently aligned, we use SVD-based Procrustes rotation to map the source space into the target space before computing drift.
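A minimal sketch of the SVD-based rotation, using the standard orthogonal Procrustes solution with NumPy. It assumes both matrices have been restricted to a shared vocabulary so that row `i` refers to the same word in each. The function name `procrustes_align` is hypothetical and may differ from what `core/embeddings.py` does.

```python
import numpy as np


def procrustes_align(source, target):
    """Rotate `source` so it best matches `target` in the least-squares sense.

    Both arrays must have shape (n_words, dims) with rows in the same word
    order. Returns the rotated source matrix and the rotation itself.
    """
    # SVD of the cross-covariance matrix: source^T @ target = U S V^T.
    u, _, vt = np.linalg.svd(source.T @ target)
    # The orthogonal matrix minimising ||source @ R - target||_F is R = U V^T.
    rotation = u @ vt
    return source @ rotation, rotation
```

Because the rotation is orthogonal, it preserves cosine similarities within the source space. Only the comparison across decades changes.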
We compare the top-K nearest-neighbour sets of a word across two eras and compute Jaccard overlap. This captures contextual change independently of vector direction.
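The neighbour comparison could look roughly like the sketch below. It assumes each era is given as a word list plus a NumPy matrix with matching row order. The names `top_k_neighbours` and `jaccard` are illustrative, and returning 1.0 for two empty sets is a choice made here, not something the project specifies.

```python
import numpy as np


def top_k_neighbours(word, vocab, vectors, k=10):
    """Return the set of the k words most cosine-similar to `word`."""
    index = vocab.index(word)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Avoid dividing by zero for all-zero rows.
    unit = vectors / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit[index]
    sims[index] = -np.inf  # exclude the query word itself
    best = np.argsort(-sims)[:k]
    return {vocab[i] for i in best}


def jaccard(a, b):
    """Jaccard overlap |a ∩ b| / |a ∪ b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

Calling `jaccard(top_k_neighbours(w, vocab_a, vecs_a), top_k_neighbours(w, vocab_b, vecs_b))` then gives the overlap for word `w` between two eras.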
| Type | Criterion |
|---|---|
| Stable | Jaccard > 0.7 |
| Broadening | gained ≫ lost (1.5× ratio) |
| Narrowing | lost ≫ gained (1.5× ratio) |
| Shifting | general context replacement |
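One way to read these criteria as a decision rule is sketched below. It is an interpretation, not the project's code. The order of the checks, the strict inequalities, and the function name `classify_change` are assumptions. The gained and lost counts are taken as inputs because how they are counted is not specified here.

```python
def classify_change(jaccard_score, gained, lost,
                    stable_threshold=0.7, ratio=1.5):
    """Map a Jaccard score and gained/lost neighbour counts to a change type."""
    if jaccard_score > stable_threshold:
        return "Stable"
    # Multiplying instead of dividing avoids a zero-division when a count is 0.
    if gained > ratio * lost:
        return "Broadening"
    if lost > ratio * gained:
        return "Narrowing"
    return "Shifting"
```

For example, a word with a Jaccard score of 0.3 that gained 6 neighbours and lost 2 would be labelled "Broadening" under this reading.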
2-D projection of the combined neighbourhood (1900 + 1990 contexts) showing which words were shared, gained, or lost.
| Word | Era | Type |
|---|---|---|
| broadcast | 1870→1930 | Broadening (seeds→media) |
| computer | 1900→1980 | Domain shift (human→machine) |
| virus | 1900→1990 | Metaphorical extension (bio→digital) |
| network | 1900→1990 | Broadening (wires→social/digital) |
| awful | 1800→1990 | Pejoration (awe-inspiring→very bad) |
| nice | 1800→1990 | Amelioration (foolish→pleasant) |
| artificial | 1900→1990 | Domain extension (+AI cluster) |
D:\NLP\
├── main.py # CLI entry-point
├── app.py # Flask REST API
├── requirements.txt
├── start.bat # One-click Windows launcher
│
├── core/
│ └── embeddings.py # Data loading, Procrustes alignment
│
├── analysis/
│ ├── drift_analysis.py # Core algorithms
│ ├── visualizations.py # All charts (returns base64 PNGs)
│ └── case_studies.py # Curated case study metadata
│
├── frontend/
│ ├── index.html # Dashboard UI
│ ├── style.css # Dark glassmorphism design system
│ └── app.js # Client-side logic
│
├── tests/
│ └── test_drift.py # 12 unit tests (no SGNS data needed)
│
├── sgns/ # ← Downloaded here (not in git)
│ ├── 1900-vocab.pkl
│ ├── 1900-w.npy
│ └── ...
│
└── results/ # Auto-created: saved plots / CSVs
© B.Tech NLP Major Project — Semantic Drift Analysis Over Time