A couple months ago I wrote about Moltbot—how it's essentially Claude Code with a Telegram wrapper and a security posture that would make a pentester cry. No sandboxing, plaintext credentials, exposed admin panels, and supply chain attacks via its skills library. I ended that post saying I'd keep my AI agents on a short leash.
That wasn't idle commentary. I immediately started building something that actually did the job properly. This is that build.
What I actually wanted was simple to describe, hard to build: an AI assistant I can talk to through Discord, that runs inference locally on my GPU, that can't touch files or network resources I haven't explicitly allowed, and that doesn't phone home to a cloud API with my conversations.
Moltbot's failure wasn't the concept—it was the execution. An always-on AI assistant that can take action is genuinely useful. But those exact attributes demand a serious security model, and Moltbot treats them as an afterthought. The irony is that NVIDIA already had a proper solution. I just had to actually deploy it.
NemoClaw is NVIDIA's OpenClaw agent running inside their OpenShell sandbox framework. OpenClaw is an agentic AI assistant—it can write code, run shell commands, browse the web, manage files. OpenShell is the containment layer: a k3s-based runtime that enforces network policies and Landlock filesystem isolation at the kernel level.
Landlock is a Linux Security Module that restricts filesystem access independent of process permissions. Even if the agent runs as a privileged user, Landlock enforces what directories it can actually touch. This is meaningfully different from "I put it in a Docker container and called it sandboxed." It's kernel-enforced isolation.
The network side uses declarative policy files. You define exactly which external hosts the sandbox can reach; the OpenShell gateway enforces it. For inference, NemoClaw integrates with NVIDIA's NIM microservices—optimized model containers with TensorRT-LLM backends. Instead of paying per token and sending queries to someone else's servers, the model runs on my RTX 5000 Ada under my desk.
Here's what the full system looks like:
Discord #tpk-agent
        │
        ▼
discord-bridge/bot.py (Python, discord.py)
        │  SSH subprocess per message
        ▼
openclaw agent (inside tpk-agent sandbox)
        │
        ▼
openclaw gateway (ws://127.0.0.1:18789)
        │
        ▼
nim-proxy.py (request rewriter, port 8009)
        │
        ▼
NIM container (Nemotron Nano 4B v1.1, TRT-LLM, RTX 5000 Ada)
Discord messages hit a Python bot that SSHes into the sandbox and runs openclaw agent --message "..." --json. The response comes back as JSON and gets posted to the channel. The sandbox routes inference through the openclaw gateway to the NIM container via a compatibility proxy I wrote—and that proxy is where most of the actual work happened.
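The bridge logic above can be sketched in a few lines. This is illustrative, not the actual bot.py: the SSH alias and the shape of the agent's JSON output (a "reply" field) are assumptions; the openclaw flags are the ones shown above.

```python
import json
import subprocess

SANDBOX_HOST = "tpk-agent"  # assumed SSH alias for the OpenShell sandbox


def build_agent_command(message: str) -> list[str]:
    # One SSH subprocess per Discord message, as described above.
    return ["ssh", SANDBOX_HOST, "openclaw", "agent",
            "--message", message, "--json"]


def ask_agent(message: str) -> str:
    # Run the agent inside the sandbox and parse its JSON response.
    result = subprocess.run(build_agent_command(message),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)["reply"]  # "reply" field is assumed
```

The bot then posts the returned string back to the channel, showing a typing indicator while the subprocess runs.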
OpenClaw is designed for OpenAI-compatible APIs, but it sends several parameters that NIM's TRT-LLM backend rejects with HTTP 400s. The format issues were mechanical: tool_choice needs to be a plain string, not an object; message.content needs to be a plain string, not a typed array; max_completion_tokens must become max_tokens; store: false needs stripping entirely. The proxy handles all of this on the fly.
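A minimal sketch of those rewrites, assuming standard OpenAI chat-completions field shapes; the exact payloads OpenClaw emits may differ in detail:

```python
import copy


def rewrite_request(body: dict) -> dict:
    body = copy.deepcopy(body)  # never mutate the caller's request
    # tool_choice: {"type": "auto", ...} -> plain string "auto"
    tc = body.get("tool_choice")
    if isinstance(tc, dict):
        body["tool_choice"] = tc.get("type", "auto")
    # message.content: typed part array -> plain concatenated string
    for msg in body.get("messages", []):
        if isinstance(msg.get("content"), list):
            msg["content"] = "".join(
                part.get("text", "") for part in msg["content"]
                if isinstance(part, dict))
    # max_completion_tokens -> max_tokens
    if "max_completion_tokens" in body:
        body["max_tokens"] = body.pop("max_completion_tokens")
    # store: the backend rejects it, so strip it entirely
    body.pop("store", None)
    return body
```

Each rewrite is a pure transformation, so the proxy can apply them to every request without tracking state.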
The harder problem was tool schemas. OpenClaw sends full function schemas with every request. NIM's backend compiles these into a Finite State Machine for structured output—which takes 10–15 minutes per unique schema, blocks the GPU completely, and resets every time NIM restarts. I found this by watching GPU utilization sit at 100% for fifteen minutes and checking what TensorRT-LLM was actually doing.
The fix: strip tool definitions at the proxy layer. With STRIP_TOOLS=1, the proxy removes tools and tool_choice before forwarding, NIM sees a plain completion request, and it responds in under two seconds. The tradeoff is the model can't use tools—acceptable for my use case of a general Discord assistant.
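The escape hatch itself is small. The STRIP_TOOLS variable name comes from the post; the function is an illustrative sketch:

```python
import os


def maybe_strip_tools(body: dict) -> dict:
    # With STRIP_TOOLS=1, tool schemas never reach NIM, so no FSM
    # compilation is triggered and the GPU stays free for inference.
    if os.environ.get("STRIP_TOOLS") != "1":
        return body
    stripped = dict(body)
    stripped.pop("tools", None)        # the full function schemas
    stripped.pop("tool_choice", None)  # meaningless without tools
    return stripped
```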
The original NemoClaw documentation is written around the 120B Nemotron Super model on NVIDIA cloud infrastructure. I'm running the 4B Nemotron Nano locally. The 8B model OOM'd—TensorRT-LLM at bf16 precision needs more VRAM headroom than my 16GB card allows once WSL2 and the OS take their share. The 4B fits comfortably.
The 4B model handles general Q&A well, but it struggled with OpenClaw's full system prompt—approximately 27,000 characters of agent instructions, workspace context, and tool documentation. It would occasionally output NO_REPLY, OpenClaw's special silent response token, instead of actually responding. The model was getting confused by conditional instructions in the tool descriptions that referenced the token in edge cases.
The fix was to deny all tools at the OpenClaw config level (tools.deny: ["*"]), which removes their descriptions from the system prompt entirely. The model handles the resulting ~16,000-character prompt correctly, and responses typically arrive in under two seconds from Discord message to reply.
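For concreteness, the relevant config fragment looks something like this; the surrounding keys and file layout are assumptions, but the deny rule is the one described above:

```yaml
# OpenClaw agent config (sketch; exact location/keys assumed)
tools:
  deny: ["*"]   # drop every tool description from the system prompt
```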
Credential storage: Moltbot stores API keys in plaintext under ~/.clawdbot/. NemoClaw's sensitive config lives inside the OpenShell sandbox container, and the host keeps only a .gitignored .env for startup variables. Commodity infostealers already know Moltbot's key paths; the OpenShell container is not a standard target.
Sandboxing: Moltbot runs with your full user permissions. OpenShell enforces Landlock filesystem restrictions at the kernel level—the agent can only reach directories explicitly granted in its policy. My SSH keys, browser cookies, and the rest of the host are off-limits by default.
Network access: Moltbot's network access is whatever your OS allows. OpenShell uses a YAML network policy file I commit to the project repo. Nothing else reaches out. If the model tries to exfiltrate data, the policy blocks it at the network layer.
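A hypothetical shape for that policy file; the key names are assumptions, but the model is the one described, an explicit egress allow-list with default deny:

```yaml
# OpenShell network policy (sketch; schema assumed)
egress:
  default: deny            # anything not listed is blocked at the gateway
  allow:
    - host: 127.0.0.1
      port: 8009           # nim-proxy, the only inference path
```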
Inference: Moltbot calls out to Claude, GPT, or whatever API you've configured—every query leaves your machine. With NIM, inference runs locally on my GPU. My conversations don't leave the host.
Supply chain: OpenClaw's skills library has significantly less community contribution than Moltbot's ClawdHub. That's sometimes frustrating, but it's the right tradeoff. I'm not installing skills from anonymous contributors until the ecosystem matures.
Is it perfect? No. The sandbox configuration took real work to get right, and if you grant the agent permissions, it can still act within them. Security is about reducing attack surface, not eliminating it. But it's meaningfully better than Moltbot's "use at your own risk" posture.
The scenarios below are speculative. Any use of agentic AI in a real investigative or legal context should be reviewed by qualified legal and forensic professionals. AI output is not a substitute for trained human analysis.
I work in digital forensics and incident response. One workflow that comes to mind: warrant returns from electronic service providers. These returns typically unpack into sprawling directory trees of JSON, SQLite databases, CDR spreadsheets, and account metadata. A local agent with shell access could automate the unpack chain, inventory file types, and surface data categories before an investigator starts digging. Natural language queries against call records, schema discovery on unfamiliar databases, cross-referencing device identifiers across returns—these become conversational rather than manual.
The critical word is local. ESP returns contain victim and suspect data that cannot go to a cloud API—full stop. A model running on your hardware, with no network egress, scoped to only the files relevant to the case, is a fundamentally different proposition than pasting evidence into a chat interface backed by someone else's servers. For classroom use, the full visible pipeline—model, sandbox, network policy—also makes data sovereignty concrete rather than abstract. You can show students what the model can't reach, which is often the more important lesson.
After all that, what I have is an AI assistant—I named it Ahab—that runs entirely on hardware I own, responds in under two seconds, and can't touch anything on my system I haven't explicitly granted it access to. Message to #tpk-agent, typing indicator while the SSH subprocess runs, response appears. Context maintained within sessions.
The security model is one I can actually defend: kernel-enforced filesystem isolation, declarative network policy as code, local inference with no data leaving the host.
Is a 4B model going to replace Claude Code for complex engineering tasks? No—and that's not what this is for. It's an always-on assistant that can answer questions, help me think through problems, and run simple tasks without handing my credentials to an unsandboxed process or trusting a cloud API not to log my queries. When I have a larger GPU budget, I'll point this at a 70B model through Ollama. The architecture is ready—swap the model ID and the NIM container image, and everything else stays the same.
Until then, Ahab runs on four billion parameters and answers in two seconds, and I know exactly what it can and can't access.
That's the short leash I was talking about.