Base Model: unsloth/gemma-3n-E2B-it
We selected Gemma, a family of lightweight, state-of-the-art open models from Google, as our foundation. Specifically, we chose the unsloth/gemma-3n-E2B-it variant; a minimal loading sketch is shown below.
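For concreteness, the following sketch shows how this checkpoint can be pulled in for fine-tuning. It assumes the Unsloth FastModel loader; the sequence length and 4-bit quantization settings are illustrative placeholders rather than our final training configuration.

```python
# Minimal sketch: loading the base checkpoint with Unsloth.
# The max_seq_length and quantization values below are illustrative
# placeholders, not our final training configuration.
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3n-E2B-it",  # instruction-tuned Gemma 3n E2B variant
    max_seq_length=2048,                   # context window used during fine-tuning
    load_in_4bit=True,                     # 4-bit quantization to fit consumer GPUs
)
```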
- Architecture: Gemma models are based on the same decoder-only Transformer architecture as the Gemini models. The "3n" variant is a multimodal model, equipped with a Vision Language Encoder, making it capable of processing both text and image inputs. While this project focuses on text-to-text fine-tuning, the multimodal foundation offers a clear path for future expansion (e.g., analyzing photos of injuries).
- Training and Capabilities: The -it suffix signifies that the model is Instruction Tuned. Following its extensive pre-training on a diverse corpus of up to 6 trillion tokens of text and code, it underwent supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to enhance its ability to follow instructions and engage in helpful dialogue; this dialogue format is applied through the model's chat template, as shown in the sketch after this list.
- Known Limitations: As with all LLMs, Gemma is not immune to generating factually incorrect or biased information. Google's official documentation notes that while safety filters are in place, the model may not always capture nuances and can sometimes produce responses that are plausible but incorrect. This limitation was a primary motivator for our targeted fine-tuning, aiming to instill domain-specific accuracy for emergency scenarios.
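To make the instruction-tuned dialogue format concrete, the short sketch below renders an example question with the checkpoint's chat template. It assumes the standard Hugging Face AutoTokenizer interface works for this checkpoint; the burn-care question is an invented placeholder, not drawn from our dataset.

```python
# Minimal sketch: rendering a user question with the instruction-tuning chat
# template. Assumes the checkpoint's chat template is accessible via
# AutoTokenizer; the question is an invented placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-3n-E2B-it")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What are the first steps for treating a minor burn?"}],
    tokenize=False,              # return the formatted string instead of token ids
    add_generation_prompt=True,  # append the marker that cues the model's reply turn
)
print(prompt)  # the formatted dialogue turns the instruction-tuned model expects
```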
References