Sorry, but this is not really a confidence-inspiring response. Accepting the mistake and fixing the leak altogether would have been the better way to handle this. This is a developer forum; we all make mistakes. Framing it as bait just sounds like bad PR management.
How can we trust your product if you can't handle security 101? Not to be harsh, but this kind of lax response to a serious mistake is not acceptable to me. Imagine I recommend you to my company and you end up leaking our credentials and respond with something like this.
I might be being picky about this, but long-term trust starts with accountability.
My earlier reply was too glib. Even though the key had no usable balance, it still should not have been exposed. We're removing it now and fixing the demo flow so this doesn't happen again. Thanks for calling it out.
Cheers!
This is pretty far off from being an intelligible sentence. I wonder if it’s a symptom of people getting used to LLMs being able to parse intent and meaning from fragmentary, disjointed text such as this.
Hey Shubham, I can still see the API keys in https://www.runanywhere.ai/web-demo, FWIW. A simple proxy of the request from the frontend to your own API and then to the vendor API would solve this. I'd also recommend rate limiting on that endpoint. Happy to help if you need further assistance.
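Something along these lines would do it — a minimal Python (stdlib-only) sketch, where `VENDOR_URL` and the `VENDOR_API_KEY` env var are placeholders, not RunAnywhere's actual endpoints. The point is just that the key stays server-side and requests are rate-limited per client IP:

```python
import os
import time
import urllib.request
from collections import defaultdict, deque
from http.server import BaseHTTPRequestHandler, HTTPServer

VENDOR_URL = "https://api.vendor.example/v1/generate"  # placeholder
API_KEY = os.environ.get("VENDOR_API_KEY", "")         # never sent to the browser

class RateLimiter:
    """Sliding-window limiter: at most max_calls per window_s, per client."""
    def __init__(self, max_calls=10, window_s=60.0):
        self.max_calls = max_calls
        self.window_s = window_s
        self.calls = defaultdict(deque)

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.calls[client_id]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] > self.window_s:
            q.popleft()
        if len(q) >= self.max_calls:
            return False
        q.append(now)
        return True

limiter = RateLimiter(max_calls=10, window_s=60.0)

class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if not limiter.allow(self.client_address[0]):
            self.send_response(429)  # Too Many Requests
            self.end_headers()
            return
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        # Forward the request to the vendor, attaching the key server-side.
        req = urllib.request.Request(
            VENDOR_URL,
            data=body,
            headers={
                "Authorization": f"Bearer {API_KEY}",
                "Content-Type": "application/json",
            },
        )
        with urllib.request.urlopen(req) as resp:
            self.send_response(resp.status)
            self.end_headers()
            self.wfile.write(resp.read())

# To run: HTTPServer(("127.0.0.1", 8080), ProxyHandler).serve_forever()
```

The frontend then calls your `/api` route with no key at all; in production you'd swap in whatever framework you already use, but the shape is the same.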
Yeah wow. These responses to constructive feedback show an immature team full of hubris. This whole thing is DOA to me. Thank you HN for showing me this.
RunAnywhere builds software that makes AI models run fast locally on devices instead of sending requests to the cloud.
Right now, our focus is Apple Silicon.
Today there are two parts:
MetalRT - our proprietary inference engine for Apple Silicon. It speeds up local LLM, speech-to-text, and text-to-speech workloads. We’re expanding model coverage over time, with more modalities and broader support coming next.
RCLI - our open-source CLI that shows this in practice. You can talk to your Mac, query local docs, and trigger actions, all fully on-device.
So the simplest way to think about us is:
we’re building the runtime / infrastructure layer for on-device AI, and RCLI is one example of what that enables.
Longer term, we want to bring the same approach to more chips and device types, not just Apple Silicon.
uzu is a strong engine; it beat us on Llama-3.2-3B (222 vs. 184 tok/s), and we reported that honestly in our benchmarks.
But looking at the full picture across all four models tested:
Qwen3-0.6B: MetalRT 658 tok/s, uzu 627 tok/s
Qwen3-4B: MetalRT 186 tok/s, uzu 165 tok/s
Llama-3.2-3B: MetalRT 184 tok/s, uzu 222 tok/s
LFM2.5-1.2B: MetalRT 570 tok/s, uzu 550 tok/s
MetalRT wins 3 of 4. The bigger difference is that MetalRT also handles STT and TTS natively; uzu is LLM-only. For a voice pipeline where you need all three modalities running on one engine with shared memory management, that matters.
That said, uzu is great open-source software and worth checking out if you're looking for an OSS LLM-only engine on Apple Silicon.
How does it compare for models of any meaningful size?
These 0.6B-4B models are, frankly, just amusing curiosities, commonly regarded as too error-prone for any non-demo work.
The reason people are buying Apple Silicon today is that the unified memory allows them to run larger models that are cost-prohibitive to run otherwise (usually requiring Nvidia server GPUs). It would be much more interesting to see benchmarks for things like Qwen3.5-122B-A10B, GLM-5, or any dense model in the 20B+ range. Thanks.
Agreed. The real value proposition of Apple Silicon for local inference is running models that won't fit on consumer GPUs. I run Qwen 70B 4-bit on an M2 Max 96GB through llama.cpp and it's usable — not fast, but the unified memory means it actually loads. Would be interested to see MetalRT benchmarks at that scale, since the architectural advantages (fused kernels, reduced dispatch overhead) should matter more as models get memory-bandwidth-bound.
Fair criticism. Our benchmarks are on small models because MetalRT was built for the voice-pipeline use case, where decode latency on 0.6B-4B models is the bottleneck.
You're right that the bigger opportunity on Apple Silicon is large models that don't fit on consumer GPUs. Expanding MetalRT to 7B, 14B, and 32B+ is on the roadmap. The architectural advantages MetalRT has should matter even more at that scale, where everything becomes memory-bandwidth-bound.
We'll publish benchmarks on larger models as we add support. If you have a specific model/size you'd want to see first, that helps us prioritize.
Sorry about that, but this is what's currently on GitHub: Apple M3 or later required. MetalRT uses Metal 3.1 GPU features available on M3, M3 Pro, M3 Max, M4, and later chips. M1/M2 support is coming soon. On M1/M2, RCLI automatically falls back to the open-source llama.cpp engine.
Cool project — been looking for something like this.
Just opened a PR with a couple of new macOS actions (empty_trash + toggle_do_not_disturb). Happy to contribute more, and happy to have a quick chat if you're open to it.