Data Curation and Preprocessing

The model's expertise comes from a custom-curated dataset, data/emergency_dataset.jsonl. Each entry is a JSON object with two fields: instruction (an emergency-related question) and output (a high-quality, safe, step-by-step answer).
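
To make the entry layout concrete, here is a minimal sketch of reading and sanity-checking such a file. The helper name parse_jsonl and the sample record are hypothetical placeholders, not taken from the project; only the instruction and output field names come from the description above.

```python
import json

# Field names as described in the text; everything else here is illustrative.
REQUIRED_FIELDS = ("instruction", "output")


def parse_jsonl(lines):
    """Parse JSONL lines into dicts, rejecting entries with missing or empty fields."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between entries
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
        if missing:
            raise ValueError(f"line {line_no}: missing or empty fields {missing}")
        records.append(record)
    return records


# Placeholder entry (hypothetical, not real dataset content).
sample_lines = [
    json.dumps({
        "instruction": "Example emergency-related question?",
        "output": "Step 1: ... Step 2: ...",
    }),
]
records = parse_jsonl(sample_lines)
```

With a real file, the same helper could be fed an open file handle, for example parse_jsonl(open("data/emergency_dataset.jsonl", encoding="utf-8")), since iterating a text file yields one line at a time.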

Before training, each entry is formatted by the format_chat_template function in train_pipeline.py. The function applies the model's official chat template, structuring the data into the same conversational format (<start_of_turn>user...<end_of_turn>...) that the instruction-tuned base model was trained on. This alignment is critical for effective learning: the fine-tuning examples then use the turn markers the base model already recognizes.
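
The sketch below illustrates roughly what this formatting step produces. It is a hand-built approximation, not the project's format_chat_template: the function name format_example_sketch, the "model" role label, the newline placement, and the "text" output key are all assumptions based on the Gemma-style markers shown above. In a real pipeline the tokenizer's own chat template (for instance via Hugging Face's tokenizer.apply_chat_template) would be the authoritative source of the exact string.

```python
def format_example_sketch(example):
    """Wrap one dataset entry in Gemma-style turn markers (illustrative only).

    Assumptions: the assistant role is named "model", each turn ends with
    "<end_of_turn>" followed by a newline, and the result is stored under
    a "text" key. The real chat template may differ in details.
    """
    user_turn = f"<start_of_turn>user\n{example['instruction']}<end_of_turn>\n"
    model_turn = f"<start_of_turn>model\n{example['output']}<end_of_turn>\n"
    return {"text": user_turn + model_turn}


# Placeholder entry (hypothetical, not real dataset content).
entry = {
    "instruction": "Example emergency-related question?",
    "output": "Step 1: ... Step 2: ...",
}
formatted = format_example_sketch(entry)
print(formatted["text"])
```

Building the string by hand keeps the example free of model downloads, but it can silently drift from the official template, which is why applying the tokenizer's template directly is usually the safer choice.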
