Eh, joining these datasets can be challenging. Names can be spelled differently or changed, dates of birth can be off, people can share names and dates of birth, addresses change and can be written in multiple ways, databases may store names as a single string or as separate fields, middle names may be missing or reduced to initials, databases might not share IDs, etc. So it's kinda hard to do well, although there's nothing really exciting about it technology-wise.
This, incidentally, is why the "confidence score" is needed. And why the app frequently gets data (including citizenship) wrong.
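To make the "confidence score" idea concrete, here is a minimal sketch of fuzzy record matching using only the Python standard library. Everything in it is an assumption for illustration: the record fields (`name`, `dob`), the weights, and the scoring formula are hypothetical and are not taken from the app being discussed.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so word order does not matter."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(sorted(cleaned.split()))


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_confidence(rec_a: dict, rec_b: dict,
                     name_weight: float = 0.6, dob_weight: float = 0.4) -> float:
    """Combine name similarity and exact date-of-birth agreement into one score.

    The weights are arbitrary illustrative values, not tuned on real data.
    """
    name_score = name_similarity(rec_a["name"], rec_b["name"])
    dob_score = 1.0 if rec_a["dob"] == rec_b["dob"] else 0.0
    return name_weight * name_score + dob_weight * dob_score


if __name__ == "__main__":
    a = {"name": "Smith, John", "dob": "1980-02-01"}
    b = {"name": "Jon Smith", "dob": "1980-02-01"}
    print(round(match_confidence(a, b), 3))
```

Even a toy like this shows why wrong matches happen: two different people with similar names and the same date of birth score high, and a single typo in the date of birth drags down a true match.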
When we stop tying our health insurance to our employment, we'll see a drastic uptick in people starting their own businesses. Choosing company Z because the employer fully pays for health insurance over company Y, where an HDHP costs you 1,400 a month, when the salaries are the same, shouldn't be a thing.
Because the “PII Map” (the link between ID:1 and John Smith) effectively is the PII, we treat it as sensitive material.
The library includes a crypto module that forces AES-256-GCM encryption for the mapping table. The raw PII never leaves the local memory space, and the state object that persists between the masking and rehydration steps is encrypted at rest.
I've bookmarked this as inspiration for a medium/long-term project I'm considering building. I'd like to be able to take dumps of our production database and automatically (one-way) anonymize them, replacing all names with meaningless but semantically representative placeholders (gender-matching where obvious: Alice, Bob, Mallory, Eve, Trent perhaps, and gender-neutral names like Jamie or Alex when suitable). I'd use similar techniques to rewrite email addresses ([email protected], [email protected], [email protected]) and addresses, place names, and whatever else can be pulled out with Named Entity Recognition. I suspect I'll generally be able to do a higher-accuracy version of this, since I'll have an understanding of the database structure and we're already in the process of adding metadata about table and column data sensitivity. I will definitely be checking out the regexes and NER models used here.
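One way the one-way placeholder idea could be sketched is with a keyed hash (HMAC), so the same input always maps to the same placeholder within a dump but cannot be reversed without the secret. This is only an illustration under stated assumptions: the function names, the `example.com` domain, and the name-plus-suffix format are hypothetical, and gender matching is left out entirely.

```python
import hashlib
import hmac

# Placeholder pool taken from the names suggested in the comment above.
PLACEHOLDER_NAMES = ["Alice", "Bob", "Mallory", "Eve", "Trent", "Jamie", "Alex"]


def pseudonym(value: str, secret: bytes, kind: str = "name") -> str:
    """Map a value to a stable placeholder such as 'Alice-3fa9c1'.

    The HMAC key keeps the mapping one-way for anyone without the secret.
    The short hex suffix reduces, but does not eliminate, collisions.
    """
    message = f"{kind}:{value}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(PLACEHOLDER_NAMES)
    suffix = digest[8:11].hex()  # 3 bytes -> 6 hex characters
    return f"{PLACEHOLDER_NAMES[index]}-{suffix}"


def pseudonymize_email(email: str, secret: bytes) -> str:
    """Replace an email address with a placeholder at a reserved example domain."""
    local_part = pseudonym(email, secret, kind="email").lower()
    return f"{local_part}@example.com"


if __name__ == "__main__":
    key = b"per-dump secret, discarded afterwards"  # hypothetical key handling
    print(pseudonym("John Smith", key))
    print(pseudonymize_email("john.smith@corp.example", key))
```

Using a fresh secret per dump and discarding it afterwards would make the mapping unrecoverable in practice, while foreign keys that repeat the same value still line up inside that dump.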
That sounds interesting! I've been thinking about using representative placeholders as well, but while they have their strengths, there are also some downsides. We decided to go with an XML tag, partly because it clearly identifies the anonymized text as anonymized (for humans), so mix-ups don't happen.
After reading your comment, I think it would also be really interesting to be able to add custom metadata to the tags. For example, if you have a username that you want to anonymize and your database has additional (deterministic) information such as gender, we could add a callback that lets you, as the user, add this information to the tag.
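A rough sketch of what such a metadata callback could look like follows. This is not the library's actual API: the `pii` tag name, the `make_tag` function, and the callback signature are all made up for illustration.

```python
from typing import Callable, Dict, Optional
from xml.sax.saxutils import quoteattr

# Hypothetical callback type: receives the entity type and the original value,
# returns extra attributes to attach to the tag.
MetadataCallback = Callable[[str, str], Dict[str, str]]


def make_tag(entity_type: str, entity_id: int, original: str,
             metadata_cb: Optional[MetadataCallback] = None) -> str:
    """Build a self-closing placeholder tag that never contains the original value."""
    attrs = {"type": entity_type, "id": str(entity_id)}
    if metadata_cb is not None:
        for key, value in metadata_cb(entity_type, original).items():
            # Built-in attributes win; the callback cannot overwrite type or id.
            attrs.setdefault(key, value)
    # quoteattr escapes the value and wraps it in quotes.
    # Attribute names are not validated in this sketch.
    rendered = " ".join(f"{key}={quoteattr(value)}" for key, value in attrs.items())
    return f"<pii {rendered}/>"


if __name__ == "__main__":
    # Hypothetical lookup table standing in for deterministic database metadata.
    known_gender = {"jsmith": "male"}

    def add_gender(entity_type: str, original: str) -> Dict[str, str]:
        gender = known_gender.get(original)
        return {"gender": gender} if gender else {}

    print(make_tag("USERNAME", 1, "jsmith", add_gender))
```

The callback sees the original value but only its returned attributes end up in the tag, so it is on the caller to avoid leaking identifying details through the metadata.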
My hope is that this means it assigns coded identifiers while the key remains local. When the document returns, the identifiers can be restored, so the PII itself never leaves the premises.
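The mask-then-restore flow described here can be sketched in a few lines. This is a minimal illustration only: it handles emails with a simple regex, the `[EMAIL_n]` token format is invented, and the mapping is kept in a plain dict rather than encrypted as the library is said to do.

```python
import re
from typing import Dict, Tuple

# Deliberately simple email pattern; real detection would need more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
TOKEN_RE = re.compile(r"\[EMAIL_\d+\]")


def mask(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each distinct email with a coded identifier.

    Returns the masked text plus the token -> original mapping,
    which is the part that must stay local.
    """
    token_for: Dict[str, str] = {}  # original -> token
    mapping: Dict[str, str] = {}    # token -> original

    def replace(match: re.Match) -> str:
        original = match.group(0)
        if original not in token_for:
            token = f"[EMAIL_{len(token_for) + 1}]"
            token_for[original] = token
            mapping[token] = original
        return token_for[original]

    return EMAIL_RE.sub(replace, text), mapping


def rehydrate(text: str, mapping: Dict[str, str]) -> str:
    """Restore originals; unknown tokens are left untouched."""
    return TOKEN_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)


if __name__ == "__main__":
    source = "Contact john@example.com or jane@example.org; john@example.com is preferred."
    masked, local_map = mask(source)
    print(masked)  # only this would be sent off-premises
    print(rehydrate(masked, local_map))
```

Only the masked text would leave the machine. The mapping is effectively the PII itself, which is why it would need to be protected as carefully as the raw data.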