Low-resource Nepali instruction-tuning corpora curation
INTERMEDIATE · ~9 months · NLP · Nepali · LLM · Instruction tuning
OVERVIEW
Build a reproducible pipeline that converts Nepali web text and glossed parallel data into high-quality instruction-tuning pairs.
RESEARCH GAP
Existing Nepali corpora are not instruction-formatted, and translation-based instruction sets suffer from cultural drift.
SUGGESTED METHODOLOGY
- Crawl Nepali web text and remove near-duplicates with MinHash.
- Bootstrap instruction pairs Self-Instruct style from a GPT-style model, with manual review for cultural grounding.
- Benchmark on Nepali Trivia plus a safety evaluation.
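The MinHash deduplication step can be sketched as below. This is a minimal illustration under stated assumptions, not the project's prescribed implementation: the signature length, shingle size, and similarity threshold are placeholder values, the function names are hypothetical, and the pairwise comparison is only practical for small samples (a real crawl would normally add locality-sensitive hashing on top of the signatures).

```python
import hashlib
import re

# All three values are illustrative assumptions, not settings from the project.
NUM_PERM = 64      # number of hash functions per signature
SHINGLE_SIZE = 5   # character n-gram length; works on Devanagari code points
THRESHOLD = 0.8    # estimated Jaccard similarity at which a document is dropped


def shingles(text, k=SHINGLE_SIZE):
    """Return the set of character k-grams after collapsing whitespace."""
    text = re.sub(r"\s+", " ", text.strip())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, shingle):
    """Seeded 64-bit hash; each seed acts as one MinHash hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=NUM_PERM):
    """Keep the minimum hash value over all shingles for each seed."""
    grams = shingles(text)
    return [min(_hash(seed, g) for g in grams) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(docs, threshold=THRESHOLD):
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, s) < threshold for s in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

Calling `deduplicate(pages)` on a list of crawled page texts returns the first occurrence of each near-duplicate cluster in its original order. Character shingles are used here so that no Nepali tokenizer is needed; word-level shingles are an equally reasonable choice.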
LITERATURE REVIEW
Reviews FLAN, Alpaca, and Bactrian-X corpus pipelines.
RELEVANT PAPERS
- Self-Instruct · Wang et al. · 2023