Over the past few weeks, I’ve been deep in the trenches with LLM Vision Integration in Home Assistant, paired with xAI’s Grok and Google Gemini, to supercharge my smart home’s security and dashboard. The results? A game-changer for eliminating false positives, delivering actionable alerts, and making my home feel like it’s reading my mind. This post dives into my experience, technical lessons, and some unique automation tricks for those of you who love getting under the hood. Buckle up—it’s a long one, but it’s packed with insights for advanced tinkerers!
Why LLM Vision? The Power of Contextual Intelligence
Large Language Model (LLM) Vision Integration brings multimodal AI to Home Assistant, letting it analyze images and videos from cameras to understand scenes, not just detect motion. Unlike traditional motion sensors or even AI camera detection, LLM Vision can parse context—like distinguishing a human from a pet or identifying carried items—making it ideal for precise security and automation. The integration offers three services: `llmvision.image_analyzer` for static images, `llmvision.video_analyzer` for recorded clips, and `llmvision.stream_analyzer` for live feeds. Each has its strengths, but as I learned, reliability varies.
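To ground that, here’s a minimal sketch of what calling one of these services from an automation’s action block looks like. The camera entity and provider ID are placeholders, `llm_message` and `response_variable` are the pieces I describe later in this post, and the other parameter names are assumptions rather than a definitive reference, so check the LLM Vision docs for your installed version.

```yaml
# Minimal sketch of an llmvision.image_analyzer call in an automation's action block.
# Parameter names other than llm_message are assumptions for illustration only.
- service: llmvision.image_analyzer
  data:
    provider: MY_PROVIDER_ID        # hypothetical provider config entry ID
    image_entity:
      - camera.entryway             # placeholder camera entity
    llm_message: >-
      Detect humans or animals, ignore static elements.
      Reply with 'Detected' or 'No activity'.
    max_tokens: 20                  # keep the reply short (assumed parameter)
  response_variable: vision         # response_text, title, etc. land here
```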
My goal was to enhance my entryway and indoor security in Ecuador, ensuring my dashboard only pops up with relevant camera feeds and notifications are accurate. False positives (empty images or pet triggers) were a constant annoyance, but LLM Vision has all but eliminated them. Here’s how I did it, the challenges I faced, and advice for your own experiments.
My Setup: Three Custom Automations
I built three automations to tackle specific needs, all leveraging LLM Vision for contextual analysis:
- Entryway TTS Notifications: Monitors my Reolink camera for motion and doorbell presses. It captures a 5-second video clip, analyzes it with `llmvision.video_analyzer` (using Gemini), and announces results via kitchen and master bedroom assistants. Doorbell triggers bypass a 60-second cooldown and include a “Doorbell pressed” prefix for clarity. When away, it sends mobile alerts with snapshots.
- Indoor Security – Living Area Analysis (Away/Bedtime): Watches my living room/kitchen camera when I’m away or asleep. Gemini Vision analyzes 7-second clips, ignoring my 22-lb dog and 20-lb cat. Confirmed human detections trigger persistent notifications (bedtime) or mobile alerts (away), with a 60-second cooldown and up to two retries on failure.
- Entryway Motion Dashboard Pop-Up: This one has to be fast. Triggers on motion or person detection, captures a snapshot, and uses `image_analyzer` (Grok) to detect humans or animals, ignoring static objects. If activity is confirmed, it displays the camera feed on my living room dashboard for a set time and sends notifications (direct image when away, URL-linked otherwise). A 60-second cooldown prevents spam. A simplified sketch of this automation follows the list.
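Here’s that trimmed sketch of the dashboard pop-up flow, just to show its shape: snapshot, analysis, keyword check, then pop-up and cooldown. Entity IDs, file paths, and the provider ID are placeholders; the `input_boolean` is a generic stand-in for whatever mechanism reveals the camera card on your dashboard, and parameter names beyond `llm_message` are assumptions.

```yaml
# Simplified sketch of the entryway pop-up flow (not the full automation).
# Entity IDs, paths, and the provider ID are placeholders; parameter names
# other than llm_message are assumptions for illustration.
alias: Entryway Motion Dashboard Pop-Up (sketch)
trigger:
  - platform: state
    entity_id: binary_sensor.entryway_motion   # placeholder motion/person sensor
    to: "on"
action:
  - service: camera.snapshot
    target:
      entity_id: camera.entryway               # placeholder camera
    data:
      filename: /config/www/snapshots/entryway.jpg
  - service: llmvision.image_analyzer
    data:
      provider: GROK_PROVIDER_ID               # hypothetical provider config entry
      image_file: /config/www/snapshots/entryway.jpg
      llm_message: >-
        Detect humans or animals, ignore static elements.
        Reply with 'Detected' or 'No activity'.
      max_tokens: 20
    response_variable: vision
  - condition: template
    value_template: "{{ 'Detected' in vision.response_text }}"
  - service: input_boolean.turn_on             # generic stand-in for showing the feed
    target:
      entity_id: input_boolean.entryway_popup  # a conditional dashboard card can watch this
  - delay: "00:01:00"                          # pop-up duration / cooldown window
  - service: input_boolean.turn_off
    target:
      entity_id: input_boolean.entryway_popup
mode: single
```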
These automations have transformed my smart home. The dashboard no longer shows empty images, voice announcements are spot-on (e.g., identifying our maid or elevator technicians by their tools), and indoor security alerts only fire for humans, not pets. The LLM Vision remember feature, trained with photos of me, my wife, and our dog, recognizes us over half the time, adding a personal touch.
Challenges and Solutions
Getting here wasn’t smooth sailing. Here are the biggest hurdles and how I overcame them:
- False Positives: Early on, my dashboard pop-up triggered on any motion, often showing empty images. My Reolink’s AI human detection helped, but it still misfired for shadows or wind. LLM Vision’s contextual analysis (e.g., “Detect humans or animals, ignore static elements”) fixed this. Gemini proved more reliable than Grok for following instructions precisely, especially for concise outputs like “Detected” or “No activity.”
- Stream Analyzer Woes: I initially tried `stream_analyzer` for its speed, feeding live camera streams directly to Gemini. It was simpler (no file creation) and faster, analyzing three frames over 5 seconds. But it failed 5-10% of the time, with errors like “Failed to fetch camera image.” Community forums confirmed this unreliability, so I switched to `video_analyzer`. It’s slower and more complex (requiring video file creation), but it’s rock-solid.
- Grok’s Limitations: Grok is fast—shaving a few seconds off snapshot analysis, which is great for pop-ups—but it can only process images, not videos or streams. I also found it ignored detailed prompts, adding irrelevant details like environmental descriptions. Gemini, by contrast, nailed concise outputs, making it my go-to for voice notifications and video analysis. Using both LLMs split the load, avoiding rate limits during testing.
- Blueprint Limitations: The LLM Vision Blueprint is great for testing, but it’s restrictive for advanced use. It lacks access to the `response_variable`, which is crucial for custom logic. I used the Blueprint to validate my setup (ensuring LLM Vision and AI providers worked), then reverse-engineered it with help from Grok and Gemini to build tailored automations. Their explanations of `response_variable` (containing `response_text`, `title`, and image paths) unlocked possibilities like keyword processing and conditional notifications. A sketch of how those fields get used appears after this list.
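To make that concrete, here’s a minimal sketch of a `video_analyzer` call and how its response fields can drive a notification. The provider ID and entities are placeholders, and parameter names such as `duration` and `max_frames` are assumptions from my setup; only `llm_message`, `response_text`, and `title` are the pieces described above.

```yaml
# Sketch: analyze a short clip, then reuse the response fields downstream.
# Provider and entity IDs are placeholders; several parameter names are assumed.
- service: llmvision.video_analyzer
  data:
    provider: GEMINI_PROVIDER_ID
    image_entity:
      - camera.living_room        # placeholder; your version may take a clip path instead
    duration: 7                   # seconds of video to analyze (assumed parameter name)
    max_frames: 3                 # frames sampled from the clip (assumed parameter name)
    llm_message: >-
      Detect humans only. Ignore a small dog and a cat.
      Reply with 'Human detected' or 'No activity'.
  response_variable: vision

# response_variable is a dict; response_text and title are the useful fields.
- condition: template
  value_template: "{{ 'Human detected' in vision.response_text }}"

- service: notify.mobile_app_my_phone   # placeholder notify target
  data:
    title: "{{ vision.title | default('Indoor security') }}"
    message: "{{ vision.response_text }}"
```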
Unique Techniques
Here are a few tricks I’ve developed that might inspire your own projects:
- Conditional LLM Instructions: I tailor prompts based on the trigger type. For example, doorbell presses get this prompt (see the template sketch after this list):

  llm_message: Doorbell button pressed. Our entryway door is left, an elevator is center, and neighbor’s door is right. State: ‘The doorbell has been pressed’. Briefly describe the person and activity of moving subjects, noting number of people and identifiable items carried. Exclude static elements. Provide a concise narrative across all frames, without mentioning frame numbers.

  Motion triggers, however, account for elevator mirror reflections to avoid double-counting people. This ensures voice notifications are contextually relevant and actionable.
- Dual LLM Redundancy: I split tasks between Grok (snapshots for pop-ups) and Gemini (video analysis for notifications) to avoid rate limits and add failover. If one LLM hits a snag, the other keeps the system running.
- Keyword-Driven Logic: I process `response_text` for keywords like “human” or “package” to filter notifications. I’m exploring reprocessing videos with new instructions if critical items (e.g., “weapon”) are detected, adding layered intelligence.
- Smart Voice Announcement Logic: My kitchen assistant skips announcements if the entryway door was opened recently (60-second window), assuming I caused the trigger. The master bedroom assistant only announces doorbell presses when we’re asleep, reducing nighttime disturbances.
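Below is a rough sketch of how conditional prompts and keyword-driven logic can live in one automation: a template variable picks the `llm_message` based on which trigger fired, and a second template gates the notification on keywords in `response_text`. Trigger IDs, entities, and the provider ID are placeholders, and parameter names beyond `llm_message` are assumptions.

```yaml
# Sketch: choose the prompt per trigger, then gate notifications on keywords.
# All IDs are placeholders; parameter names besides llm_message are assumed.
trigger:
  - platform: state
    entity_id: binary_sensor.doorbell_button   # placeholder
    to: "on"
    id: doorbell
  - platform: state
    entity_id: binary_sensor.entryway_motion   # placeholder
    to: "on"
    id: motion
action:
  - variables:
      prompt: >-
        {% if trigger.id == 'doorbell' %}
          Doorbell button pressed. Our entryway door is left, an elevator is
          center, and neighbor's door is right. Briefly describe the person
          and activity of moving subjects, noting items carried.
        {% else %}
          Describe moving subjects only. The elevator has a mirror; do not
          double-count people reflected in it. Exclude static elements.
        {% endif %}
  - service: llmvision.video_analyzer
    data:
      provider: GEMINI_PROVIDER_ID
      image_entity:
        - camera.entryway          # placeholder
      duration: 5                  # assumed parameter name
      llm_message: "{{ prompt }}"
    response_variable: vision
  - condition: template
    value_template: >-
      {{ 'human' in (vision.response_text | lower)
         or 'package' in (vision.response_text | lower) }}
  - service: notify.mobile_app_my_phone        # placeholder notify target
    data:
      message: "{{ vision.response_text }}"
```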
Advice for Advanced Tinkerers
If you’re diving into LLM Vision, here’s what I’ve learned:
- Skip the Blueprint for Complex Needs: Use it to test your setup, but go manual for full control. Study the Blueprint’s YAML, then use Grok or Gemini to explain `response_variable` and build custom automations.
- Choose Your LLM Wisely: Gemini excels at following precise instructions for notifications or video analysis. Grok is faster for snapshots but limited to images. Test both to find what suits your use case.
- Avoid `stream_analyzer` for Now: Its speed is tempting, but community reports and my experience show it’s prone to errors (e.g., fetch failures). Stick with `video_analyzer` for reliability, even if it’s slower.
- Embrace Iteration: My automations evolved through trial and error. Start simple, use debug notifications to spot logic flaws (see the sketch after this list), and don’t be afraid to rewrite conditions for cleaner results.
- Think Big: If you can dream it, Home Assistant, LLM Vision, and the right sensors can probably do it. My next idea? Reprocessing videos based on initial results (e.g., for specific people or items) to add deeper intelligence.
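On the debug-notification point above: a persistent notification that dumps the raw `response_text` is the simplest way I’ve found to see what the model actually returned before tightening conditions. This snippet assumes the previous action used `response_variable: vision`.

```yaml
# Sketch: temporary debug step dropped in right after the LLM Vision call.
# Assumes the previous action set `response_variable: vision`.
- service: persistent_notification.create
  data:
    title: LLM Vision debug
    message: >-
      response_text: {{ vision.response_text | default('(empty)') }}
      title: {{ vision.title | default('(none)') }}
```

Once the prompts behave, delete the step or wrap it in a condition so it’s easy to toggle off.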
The Wow Factor
Beyond the tech, LLM Vision has added a spark to my home. Guests are blown away when the dashboard pops up and the voice assistant announces a visitor’s actions—it’s a conversation starter that screams “smart home.” My wife loves the system as an extra set of eyes, catching details I’d miss. The “remember” feature, spotting us or our dog over half the time, feels like the home knows us. It’s not just automation; it’s a home that anticipates needs, like lights adjusting to LUX levels without a touch.
Learn More
Check out the LLM Vision documentation and Home Assistant community forums for more examples and troubleshooting. The integration is actively developed, so watch for updates. Got questions or cool LLM Vision tricks of your own? Share them below—I’d love to hear how you’re pushing the boundaries!