
Google drops Gemma 4 12B: A game-changing AI that runs on your laptop

@indiablooms | Jun 07, 2026, at 05:18 pm

Google has announced the launch of Gemma 4 12B, a dense multimodal model featuring a unified, encoder-free architecture.

Gemma 4 12B marks several key milestones for local AI development. According to Google’s blog post, it introduces a multimodal encoder-free design, eliminating the need for heavy, multi-stage vision and audio encoders. Instead, multimodal inputs are fed directly into the LLM backbone, helping reduce latency in processing images, audio, and other data types.

The company also described it as its first medium-sized model with native audio input. Within the Gemma family, audio capabilities were previously limited to smaller edge-focused models such as E4B. With Gemma 4 12B, Google expands audio understanding to a more capable, general-purpose model.

Positioned as developer-friendly and locally deployable, the model is compact enough to run on laptops equipped with 16 GB of VRAM or unified memory. To further speed up local inference, Google is also releasing a dedicated multi-token prediction (MTP) model.
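
A rough back-of-the-envelope calculation helps put the 16 GB figure in context. The sketch below estimates the memory needed for the weights alone at different numeric precisions. It assumes the "12B" in the name means roughly 12 billion parameters, and it ignores activations, the attention cache, and the separate MTP model. The article does not say which precision Google has in mind, so the precisions listed are illustrative assumptions only.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 12e9    # assumption: "12B" read as about 12 billion parameters
BUDGET_GB = 16   # memory figure quoted in the article

# Illustrative precisions; the article does not state which one is used.
for bits in (16, 8, 4):
    gb = weight_memory_gb(PARAMS, bits)
    verdict = "fits" if gb < BUDGET_GB else "does not fit"
    print(f"{bits}-bit weights: {gb:.0f} GB ({verdict} in {BUDGET_GB} GB)")
```

Under these assumptions, 16-bit weights would need about 24 GB, while 8-bit and 4-bit weights would need about 12 GB and 6 GB, which suggests that a 16 GB laptop would most likely be running a reduced-precision version of the model.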

For the first time, Google is also introducing downloadable macOS desktop applications, enabling developers to experience fully local, real-time multimodal interaction, including voice and visual inputs, on consumer-grade devices.

In its technical overview, Google noted that traditional multimodal systems typically rely on separate, frozen encoders for different modalities, such as vision encoders (150M parameters for edge models and 550M for medium models) and audio encoders (around 300M parameters in smaller variants like E2B and E4B).

Google claims Gemma 4 12B delivers strong performance across a range of capabilities, including automatic speech recognition, agentic reasoning, speaker diarization, video understanding, and coding tasks.
