Protecting patient privacy is of utmost importance. The Health Insurance Portability and Accountability Act (HIPAA) sets the standards for the appropriate use and protection of patient information.
Organizations that use patient data for internal or external research need to take steps to prevent the exposure of PHI to those who are not authorized to view it. This can be accomplished by de-identifying health information, which entails masking specific categories of identifiers from the document. Once the identifiers are masked, the risk profile of these datasets is significantly reduced.
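At its core, de-identification replaces detected identifier spans with category placeholders. Here is a minimal sketch of that masking step; the span format, categories, and example note are illustrative, not the output of any particular system.

```python
def mask_phi(text, spans):
    """Replace each (start, end, category) span with a [CATEGORY] tag.

    Spans are applied right-to-left so earlier character offsets stay valid.
    """
    for start, end, category in sorted(spans, reverse=True):
        text = text[:start] + f"[{category}]" + text[end:]
    return text

note = "John Smith was seen on 01/02/2020 at Mercy Hospital."
spans = [(0, 10, "NAME"), (23, 33, "DATE"), (37, 51, "HOSPITAL")]
print(mask_phi(note, spans))  # → [NAME] was seen on [DATE] at [HOSPITAL].
```

Once identifiers are masked this way, the remaining text keeps its clinical content while the direct identifiers are gone.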
There are two approaches to redacting PHI from medical documents. The first is manual redaction by trained clinicians, but this process is slow, expensive, and does not scale: humans get tired and are prone to error. The second is to use computers to identify and mask protected health information, but that comes with its own set of challenges.
Most commonly available de-identification systems leverage machine learning (ML) approaches to identify, tag, and mask PHI in text. These systems often tout that they are able to redact above the HIPAA-required level of 99%. In reality, most of these systems have only been tested and evaluated against publicly available datasets such as i2b2 or MIMIC. While useful, these datasets are not representative of the complexity and heterogeneity of unstructured data that we see in the real world. Systems tested against these datasets typically do not perform well on novel datasets and struggle to consistently meet the 99% threshold. In addition, these datasets were constructed from scratch as clean, well-formatted, machine-readable text. In the real world, patient records are messier: a collection of faxes and scans with tables and columns that are far more likely to confuse de-identification systems.
An improvement on a purely ML approach is to layer on rule-based systems (e.g. whitelists, blacklists, and regular expressions) to identify additional PHI elements in an attempt to bring performance above the 99% level. However, these rule-based overlays leave room for error. Let's say you want to redact zip codes from patient documentation. To do this you can create a blacklist of zip codes. Easy, right? Unfortunately, no. Zip codes are easily confused with lab test codes, and if a lab code is redacted, important contextual information is unnecessarily masked. Another example is disease names, which are often named after real people (e.g. Parkinson's, Stevens-Johnson syndrome). Commonly available systems have trouble distinguishing between the two. The 99% threshold is unforgiving, and these edge cases often result in subpar performance.
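The zip-code collision is easy to demonstrate. A naive five-digit pattern, sketched below purely for illustration, will happily match a lab test code as well as a real zip code:

```python
import re

# A naive rule: any standalone five-digit number is treated as a zip code.
ZIP_RE = re.compile(r"\b\d{5}\b")

lines = [
    "Patient resides at 90210.",           # a real zip code
    "Ordered CPT 80053 metabolic panel.",  # a lab test code, not a zip
]
for line in lines:
    print(ZIP_RE.sub("[ZIP]", line))
# Both lines are masked, wiping out the lab code along with the zip.
```

Making the pattern stricter helps at the margins, but any purely lexical rule faces the same problem: the two token types are indistinguishable without context.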
Redact is Mendel's de-identification module. Mendel has developed a proprietary "symbolic learning" architecture that combines the best of the machine learning and symbolic AI worlds: deep learning algorithms paired with several rule-based systems, including our proprietary medical ontology. This architecture trains Redact to de-identify a document without relying solely on ML-based approaches or having a scientist hard-code all the rules, while preserving as much text as possible and generalizing to new data.
In a nutshell, Mendel developed what we call a multi-teacher-single-student neuro-symbolic system.
The student (Redact) is a neuro-symbolic network that learns how to manipulate tokens and rules (or components of rules) to de-identify clinical text. The teachers are also AI systems, each pursuing a different (sometimes competing) objective, and the training objective is to teach the student to satisfy all teachers at once.
For example, one teacher is a "re-identification" system that tries to re-identify the patient after redaction; if it succeeds, a penalty is back-propagated through the student's neural network. Another teacher is a clinical NLU system that tries to reconstruct the patient's journey; if the student redacts useful clinical information, it is penalized. This proprietary architecture gives Mendel Redact its edge: it learns to prevent re-identification while keeping the medical text intact.
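The training signal described above can be sketched as a weighted sum of teacher penalties. Everything here is a hypothetical toy, not Mendel's actual system: the teachers are stand-in functions that score a redacted document in [0, 1].

```python
def student_loss(redacted_doc, teachers, weights):
    """Combined penalty the student is trained to minimize.

    Each teacher maps a redacted document to a penalty in [0, 1]:
    a re-identification teacher penalizes leaked identifiers, while a
    clinical-NLU teacher penalizes lost clinical information.
    """
    return sum(w * teacher(redacted_doc) for teacher, w in zip(teachers, weights))

# Hypothetical stand-in teachers for illustration only.
reid_teacher = lambda doc: 1.0 if "John" in doc else 0.0       # leaked a name?
clinical_teacher = lambda doc: 0.0 if "diabetes" in doc else 1.0  # kept the diagnosis?

good = "[NAME] has diabetes."   # identifier masked, diagnosis preserved
bad = "John has [CONDITION]."   # the exact opposite failure mode
print(student_loss(good, [reid_teacher, clinical_teacher], [1.0, 1.0]))  # → 0.0
print(student_loss(bad, [reid_teacher, clinical_teacher], [1.0, 1.0]))   # → 2.0
```

The competing objectives are visible in the two examples: over-redacting satisfies the re-identification teacher but angers the clinical teacher, and vice versa.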
As an output from our Redact engine, Mendel provides you with:
Mendel worked with Mirador Analytics, a widely recognized expert in statistical disclosure risk analysis, to assess the performance of Redact. Across multiple assessments and heterogeneous datasets, Redact performed well above the HIPAA threshold, providing confidence that the processed datasets are sufficiently de-identified.
In this example, a total of 1,285 records were reviewed to determine the proportion of identifiers that were correctly masked from the processed records. To be considered compliant with HIPAA Privacy Rule requirements, the proportion of identifiers masked from all documents must exceed 99%. For this assessment, the proportion of identifiers that were successfully redacted was 99.85% – well above the standard for HIPAA compliance.
Mendel's Redact is part of an end-to-end solution that uses the power of a machine and the nuanced understanding of a clinician to structure unstructured patient data at scale. Want to learn about Mendel's process and modules? Contact hello@mendel.ai.