Sorry, but this is not really a confidence-inspiring response. Accepting the mistake and fixing the leak altogether would have been the better way to handle this. This is a developer forum; we all make mistakes. Framing it as bait just sounds like bad PR management.
How can we trust your product if you can't handle security 101? Not to be harsh, but this kind of lax response to a serious mistake is not acceptable to me. Imagine I recommend you to my company and you end up leaking our credentials and respond with something like this.
I might be being picky about this, but long-term trust starts with accountability.
My earlier reply was too glib. Even though the key had no usable balance, it still should not have been exposed. We're removing it now and fixing the demo flow so this doesn't happen again. Thanks for calling it out.
Cheers!
This is pretty far off from being an intelligible sentence. I wonder if it’s a symptom of people getting used to LLMs being able to parse intent and meaning from fragmentary, disjointed text such as this.
Hey Shubham, I can still see the API keys in https://www.runanywhere.ai/web-demo, FWIW. A simple proxy of the request from the frontend to your own API and then to the vendor API would solve this. I'd also recommend rate limiting on that endpoint. Happy to help if you need further assistance.
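Something along these lines would do it — a minimal Python (stdlib-only) sketch, where `VENDOR_URL` and the `VENDOR_API_KEY` env var are placeholders, not RunAnywhere's actual endpoints. The point is just that the key stays server-side and requests are rate-limited per client IP:

```python
import os
import time
import urllib.request
from collections import defaultdict, deque
from http.server import BaseHTTPRequestHandler, HTTPServer

VENDOR_URL = "https://api.vendor.example/v1/generate"  # placeholder
API_KEY = os.environ.get("VENDOR_API_KEY", "")         # never sent to the browser

class RateLimiter:
    """Sliding-window limiter: at most max_calls per window_s, per client."""
    def __init__(self, max_calls=10, window_s=60.0):
        self.max_calls = max_calls
        self.window_s = window_s
        self.calls = defaultdict(deque)

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.calls[client_id]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] > self.window_s:
            q.popleft()
        if len(q) >= self.max_calls:
            return False
        q.append(now)
        return True

limiter = RateLimiter(max_calls=10, window_s=60.0)

class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if not limiter.allow(self.client_address[0]):
            self.send_response(429)  # Too Many Requests
            self.end_headers()
            return
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        # Forward the request to the vendor, attaching the key server-side.
        req = urllib.request.Request(
            VENDOR_URL,
            data=body,
            headers={
                "Authorization": f"Bearer {API_KEY}",
                "Content-Type": "application/json",
            },
        )
        with urllib.request.urlopen(req) as resp:
            self.send_response(resp.status)
            self.end_headers()
            self.wfile.write(resp.read())

# To run: HTTPServer(("127.0.0.1", 8080), ProxyHandler).serve_forever()
```

The frontend then calls your `/api` route with no key at all; in production you'd swap in whatever framework you already use, but the shape is the same.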
Yeah wow. These responses to constructive feedback show an immature team full of hubris. This whole thing is DOA to me. Thank you HN for showing me this.
RunAnywhere builds software that makes AI models run fast locally on devices instead of sending requests to the cloud.
Right now, our focus is Apple Silicon.
Today there are two parts:
MetalRT - our proprietary inference engine for Apple Silicon. It speeds up local LLM, speech-to-text, and text-to-speech workloads. We’re expanding model coverage over time, with more modalities and broader support coming next.
RCLI - our open-source CLI that shows this in practice. You can talk to your Mac, query local docs, and trigger actions, all fully on-device.
So the simplest way to think about us is:
we’re building the runtime / infrastructure layer for on-device AI, and RCLI is one example of what that enables.
Longer term, we want to bring the same approach to more chips and device types, not just Apple Silicon.
uzu is a strong engine; it beat us on Llama-3.2-3B (222 vs. 184 tok/s), and we reported that honestly in our benchmarks.
But looking at the full picture across all four models tested:
Qwen3-0.6B: MetalRT 658 tok/s, uzu 627 tok/s
Qwen3-4B: MetalRT 186 tok/s, uzu 165 tok/s
Llama-3.2-3B: MetalRT 184 tok/s, uzu 222 tok/s
LFM2.5-1.2B: MetalRT 570 tok/s, uzu 550 tok/s
MetalRT wins 3 of 4. The bigger difference is that MetalRT also handles STT and TTS natively; uzu is LLM-only. For a voice pipeline where you need all three modalities running on one engine with shared memory management, that matters.
That said, uzu is great open-source software and worth checking out if you're looking for an OSS LLM-only engine on Apple Silicon.
How does it compare for models of any meaningful size?
These 0.6B-4B models are, frankly, just amusing curiosities, commonly regarded as too error-prone for any non-demo work.
The reason people are buying Apple Silicon today is that the unified memory allows them to run larger models that are cost-prohibitive to run otherwise (usually requiring Nvidia server GPUs). It would be much more interesting to see benchmarks for things like Qwen3.5-122B-A10B, GLM-5, or any dense model in the 20B+ range. Thanks.
Agreed. The real value proposition of Apple Silicon for local inference is running models that won't fit on consumer GPUs. I run Qwen 70B 4-bit on an M2 Max 96GB through llama.cpp and it's usable — not fast, but the unified memory means it actually loads. Would be interested to see MetalRT benchmarks at that scale, since the architectural advantages (fused kernels, reduced dispatch overhead) should matter more as models get memory-bandwidth-bound.
Fair criticism. Our benchmarks are on small models because MetalRT was built for the voice-pipeline use case, where decode latency on 0.6B-4B models is the bottleneck.
You're right that the bigger opportunity on Apple Silicon is large models that don't fit on consumer GPUs. Expanding MetalRT to 7B, 14B, and 32B+ is on the roadmap. The architectural advantages MetalRT has should matter even more at that scale, where everything becomes memory-bandwidth-bound.
We'll publish benchmarks on larger models as we add support. If you have a specific model/size you'd want to see first, that helps us prioritize.
Sorry about that, but this is what's currently on GitHub: Apple M3 or later required. MetalRT uses Metal 3.1 GPU features available on M3, M3 Pro, M3 Max, M4, and later chips. M1/M2 support is coming soon. On M1/M2, RCLI automatically falls back to the open-source llama.cpp engine.
Cool project — been looking for something like this.
Just opened a PR with a couple of new macOS actions (empty_trash + toggle_do_not_disturb). Happy to contribute more, and happy to have a quick chat if you're open to it.