Research Idea Center

Low-resource Nepali instruction-tuning corpora curation

INTERMEDIATE · ~9 months · NLP · Nepali · LLM · Instruction tuning
OVERVIEW

Build a reproducible pipeline that converts Nepali web text and glossed parallel data into high-quality instruction-tuning pairs.
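
As an illustration of what one instruction-tuning pair could look like on disk, here is a minimal sketch that assumes Alpaca-style fields (instruction, input, output). The `InstructionPair` class, the `source` and `culturally_reviewed` fields, and the `to_jsonl_line` helper are hypothetical names introduced for this example, not part of any existing corpus format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class InstructionPair:
    """One training record; all field names here are assumptions for illustration."""
    instruction: str
    output: str
    input: str = ""                    # optional context, empty when not needed
    source: str = "web"                # hypothetical provenance tag
    culturally_reviewed: bool = False  # set once a human reviewer signs off


def to_jsonl_line(pair: InstructionPair) -> str:
    # ensure_ascii=False keeps Devanagari readable instead of \uXXXX escapes
    return json.dumps(asdict(pair), ensure_ascii=False)


example = InstructionPair(
    instruction="नेपालको राजधानी कुन हो?",
    output="नेपालको राजधानी काठमाडौं हो।",
)
line = to_jsonl_line(example)
print(line)
```

Writing one JSON object per line keeps the corpus easy to stream, diff, and filter, and an explicit review flag lets unreviewed pairs be excluded at training time.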

RESEARCH GAP

Existing Nepali corpora are not instruction-formatted, and instruction sets produced by translation suffer from cultural drift.

SUGGESTED METHODOLOGY
  1. Crawl Nepali web text and deduplicate it with MinHash.
  2. Generate instruction pairs with a Self-Instruct bootstrap using a GPT-style model, followed by manual cultural-grounding review.
  3. Benchmark on Nepali Trivia and a safety evaluation.
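
The deduplication step could be prototyped roughly as follows. This is a standard-library sketch of MinHash near-duplicate detection, not a production implementation: the character 5-gram shingling, the 64-hash signature, the 0.8 similarity threshold, and all function names are assumptions chosen for illustration. The greedy pairwise comparison is quadratic in the number of documents, so a real crawl would likely need LSH banding or a dedicated library instead.

```python
import hashlib
import re

NUM_HASHES = 64  # assumed signature length; more hashes give a tighter estimate


def shingles(text, n=5):
    """Character n-grams, so no Nepali tokenizer is required."""
    text = re.sub(r"\s+", " ", text.strip())  # normalize whitespace first
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(text, num_hashes=NUM_HASHES):
    """Keep the minimum digest per salted hash function over all shingles."""
    encoded = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # each salt acts as a distinct hash function
        signature.append(min(
            hashlib.blake2b(s, digest_size=8, salt=salt).digest() for s in encoded
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Greedy filter: keep a document only if it is not close to any kept one."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


corpus = [
    "नेपालको राजधानी काठमाडौं हो।",
    "नेपालको  राजधानी काठमाडौं हो। ",   # same sentence with extra whitespace
    "सगरमाथा विश्वको सर्वोच्च शिखर हो।",
]
unique_docs = deduplicate(corpus)
```

Character shingles are used here because they avoid depending on a Nepali word segmenter; word-level shingles would be a reasonable alternative if a reliable tokenizer is available.
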
LITERATURE REVIEW

Reviews FLAN, Alpaca, and Bactrian-X corpus pipelines.

RELEVANT PAPERS
  • Self-Instruct
    Wang et al. · 2023