
Awesome feedback. I’ll think through this.

The sandbox idea is spot on: control what the server can do. That's especially important when running locally.


Thanks!!

Hrm, re that error: What does “god --version” say?

The log might not show up until you get a successful connection. I’ll look into that.

Thanks for trying it out!


I removed the original config and ran the following:

  19:28:36 ~ $ npx -y mcpgod --version
  mcpgod/0.0.2 darwin-arm64 node-v23.9.0

  19:28:47 ~ $ npx -y mcpgod add @modelcontextprotocol/server-everything --client claude --tools=echo,add
The same error and log occurred.


I’ve made the mistake of typing “--tools echo” instead of “--tools=echo” before, just in case that was your error too.


What a polite way to point out that their error was a typo.


Good question! Uploaded documents get converted to embeddings within the Nitro Enclave (NE), and then the embeddings are encrypted with a key that only the NE has access to.

When the search endpoint is called, the encrypted embeddings are pulled into the NE and decrypted. They are then loaded into an in-memory vector db and the search is executed entirely within the NE. This adds some latency, but it’s more secure because the embeddings are only ever accessible to the enclave.
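
Conceptually the search path looks something like this (a rough sketch, not our actual implementation; the key handling and helper names are just for illustration):

  # Illustrative only: in the real system the key never leaves the
  # Nitro Enclave, and everything in ingest()/search() runs inside it.
  import numpy as np
  from cryptography.fernet import Fernet

  enclave_key = Fernet(Fernet.generate_key())  # stand-in for the NE-held key

  def ingest(embedding):
      # Encrypt the embedding before anything leaves the enclave.
      return enclave_key.encrypt(embedding.astype(np.float32).tobytes())

  def search(query, encrypted_store):
      # Pull the encrypted embeddings in, decrypt, build the in-memory
      # index, and run the similarity search, all inside the enclave.
      vectors = np.stack([np.frombuffer(enclave_key.decrypt(blob), dtype=np.float32)
                          for blob in encrypted_store])
      scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
      return int(np.argmax(scores))  # index of the best-matching document

  store = [ingest(np.random.rand(8)) for _ in range(3)]
  print(search(np.random.rand(8).astype(np.float32), store))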

Chat history is never stored by the API; developers can handle it however they like on the client side. With CapeChat we keep chat history local on the device.


Yikes, we'll have to remove that. It's a really old course on privacy-preserving machine learning from 4 years ago and has nothing to do with this product despite the generic name.

Please see https://api.capeprivacy.com/v1/docs#/ for more info.


I can confirm it was forked in the Capeprivacy GitHub repos list. It has the same name as the PII remover mentioned in the link, and I wanted to see how it worked!


Entirely local with zero sub-processors is the ideal! I hope we are trending that way as an industry.


Good points. I think the rabbit hole of OpenAI sub-processors is not commonly understood.

The humans at TaskUs are moderating prompts, and then you have Azure, Cloudflare, and Snowflake as sub-processors, each with their own list of sub-processors, and on and on.

https://platform.openai.com/subprocessors

Data breaches can happen, so for any data you throw over the wall to OpenAI, you must be willing to accept that it could become public.


Good question! Some developers implement a manual approval step so you can review the redacted prompt before submitting it, rather than making it automatic. It depends on their product requirements.

Re mechanism, the redactions themselves are powered by a language model.
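
Something like this on the client side (a sketch; the helper names are made up, and deidentify() just stands in for the redaction API call):

  def deidentify(prompt):
      # Stand-in for the redaction API call; in our service this step
      # is itself powered by a language model, not a regex.
      return prompt.replace("Wayne Gretzky", "[NAME_1]")

  def send_to_llm(prompt):
      return f"(model response to: {prompt})"  # stand-in for the real LLM call

  def submit_with_approval(prompt):
      redacted = deidentify(prompt)
      print("Redacted prompt:\n" + redacted)
      if input("Send to the LLM? [y/N] ").strip().lower() != "y":
          return None  # rejected: nothing leaves the machine
      return send_to_llm(redacted)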


Yep! The more you can do locally the better. An entirely local LLM is the best for data privacy and security. Any time data leaves the device, it poses some risk.

The de-identification itself requires a complex language model, which has its own operational complexity and costs. At Cape we're going as far as we can to offer a secure API that's self-serve and easy to use, to make these features accessible to developers, but it does require trust in Cape and the underlying AWS Nitro Enclaves that we use. Client-side attestation is a security feature that can give the client cryptographic verification of the secure enclave. But local is always better when possible!


I will add that running your own private LLM is complicated and costly, and a private LLM (at this point) will not be as capable as GPT-4. So while running a private LLM will certainly be the right solution for some, Cape's offering makes improved privacy available to many.


That's right. In the case of credit card numbers, we redact them as [CREDIT_CARD_NUMBER_1], [CREDIT_CARD_NUMBER_2], etc., so the LLM can still answer prompts like "how many", but it can't answer prompts like "sort". But you can use the OpenAI function-calling API to do the sort, where your function re-identifies, sorts, and then de-identifies again.
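
As a sketch (the mapping and function name are illustrative, not a real API, and the card numbers are fake):

  # Your function, invoked via function calling: re-identify, sort the
  # real values, and hand back placeholders in sorted order, so the raw
  # numbers never reach the LLM.
  redaction_map = {
      "[CREDIT_CARD_NUMBER_1]": "4532 9912 0041 7702",
      "[CREDIT_CARD_NUMBER_2]": "4024 0071 5336 1885",
  }

  def sort_cards(placeholders):
      return sorted(placeholders, key=lambda p: redaction_map[p])

  print(sort_cards(["[CREDIT_CARD_NUMBER_1]", "[CREDIT_CARD_NUMBER_2]"]))
  # ['[CREDIT_CARD_NUMBER_2]', '[CREDIT_CARD_NUMBER_1]'] -- sorted by real value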


It's a great question. Redaction limits the LLM's ability to draw on its underlying training data about the subject. This can work to the developer's benefit in many cases, like asking questions about your own provided context.

Many developers have moved away from relying on LLMs for facts, toward providing LLMs with facts and having those facts repurposed.

For example, if you ask an LLM about a famous person, like Wayne Gretzky, it may give you a good answer but there is a chance it may hallucinate key details like the number of points he had in his NHL career.

To combat this, you can provide the LLM with a biography of Wayne Gretzky and you may get more factual answers, but the LLM may still hallucinate if you probe for facts that were not provided.

If you redact his name instead, for example asking “Who is [Name1]?”, the LLM will be unable to answer the question without further context. But now, if you provide the redacted biography, the LLM can answer the question while relying only on the provided context (the biography will contain information about [Name1]). If the question falls outside the context, the LLM will be unable to answer, which is often the desired result. In other words, the LLM cannot rely on its training data about Wayne Gretzky because it is only dealing with [Name1], along with redacted locations, organizations, occupations, etc. from the biography about [Name1]. You force the model to rely on the provided facts.
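
In code it's as simple as assembling the prompt from redacted pieces (a toy illustration; the prompt wording is mine, and the redacted strings would come from the API):

  redacted_bio = "[Name1] is the leading point scorer in [Organization1] history."

  prompt = (
      "Answer using only the context below. If the answer is not in "
      "the context, say you don't know.\n\n"
      f"Context: {redacted_bio}\n\n"
      "Question: Who is [Name1]?"
  )
  # The model can describe [Name1] only via the supplied facts; its
  # training data about the real person is out of reach behind the redaction.
  print(prompt)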

The use cases we see are people providing legal contracts and financial statements where names and currencies get redacted, and the LLM must work with the redacted values and any other context provided.


That's actually pretty brilliant. I can imagine this also being useful for adding a chatbot over a website's content, where you really want to limit the responses to that content as much as possible.


Damn, that is actually a really cool approach.

I suppose most LLMs are not smart enough to make the connection, and can probably be told to avoid doing it, but I would imagine it's not impossible for one to figure out that Name1 is likely Wayne Gretzky from context?

Edit: Yep, it's definitely a problem unless the facts are also anonymized I guess: https://chat.openai.com/share/84dbe124-dca7-46e3-be73-79b194...


I redacted the full Wikipedia paragraph with the API. Like, the nickname "The Great One" is a pretty major tell!

[NAME_GIVEN_1] [NAME_FAMILY_1] CC ([NAME_GIVEN_2] [NAME_FAMILY_2]; born [DOB_1]) is a [ORIGIN_1] [OCCUPATION_1] and [OCCUPATION_2]. He played 20 seasons in the [ORGANIZATION_1] ([ORGANIZATION_2]) for four teams from [DATE_INTERVAL_1] to [DATE_INTERVAL_2]. Nicknamed "the Great One",[1] he has been called the greatest [OCCUPATION_3] ever by many [OCCUPATION_4], [OCCUPATION_5], The Hockey News, and by the [ORGANIZATION_2] itself,[2] based on extensive surveys of [OCCUPATION_6], [OCCUPATION_7], [OCCUPATION_8] and [OCCUPATION_9].[3] [NAME_FAMILY_1] is the leading goal scorer, assist producer and point scorer in [ORGANIZATION_2] history,[4] and has more career assists than any other [OCCUPATION_10] has total points. He is the only [ORGANIZATION_2] [OCCUPATION_10] to total over 200 points in one season, a feat he accomplished four times. In addition, [NAME_FAMILY_1] tallied over 100 points in 15 professional seasons, 13 of them consecutive. At the time of his retirement in [DATE_INTERVAL_2], he held 61 [ORGANIZATION_2] records: 40 regular season records, 15 playoff records, and 6 All-Star records.[2]


> Based on the information provided, NAME_GIVEN_1 NAME_FAMILY_1, also known as NAME_GIVEN_2 NAME_FAMILY_2, played in the ORGANIZATION_1, which is also referred to as ORGANIZATION_2. He played for four teams within this organization over the course of 20 seasons, from DATE_INTERVAL_1 to DATE_INTERVAL_2.

Hey that's actually pretty good.


You can use CapeChat UI to mess around with it: https://chat.capeprivacy.com/

Or you can create a free API key here: https://app.capeprivacy.com/api-keys and use the interactive API directly: https://api.capeprivacy.com/v1/docs#/Privacy/DeidentifyText

Click "Authorize" on the top right to add the key, and then click "Try it out" on any of the endpoints.


Exactly. A super famous person like Wayne Gretzky is really hard to protect.

For fun, you can try to tease ChatGPT with information like: "Who is [Name1]?" It won't know, but then add "[Name1] is considered the greatest [Occupation1] in the history of the [Organization1]". "Greatest" is now a big clue. Add "[Name1] has the most points in history". Points is a big clue that it's some kind of game or sport, etc. It will eventually figure it out, but I've seen it guess wrong, like Michael Jordan instead.

