Introduction 

The VIDIZMO Audio Indexer App uses AI models to detect Personally Identifiable Information (PII) entities in the audio and video files on your Portal. You can also use the application to redact these PII entities if Redaction is part of your VIDIZMO package. It provides advanced processing options that make the redaction more accurate. The AI models used by the application support multiple languages and a set of predefined PII entities for detection.

 

This functionality also benefits on-premises customers, as it allows them to process their content using the VIDIZMO application within their own system. This keeps sensitive data secure by avoiding the public cloud or external storage that services like AWS and Azure require.

 

Concept 

To redact PII from your audio or video, you need to have a transcription generated for it. The VIDIZMO Audio Indexer app automatically generates the transcription for your audio or video if any PII entity is added to its Insights. You can also use the application itself to generate transcriptions separately, or opt for other indexing applications provided by VIDIZMO, such as Azure Video Analyzer ARM or AWS Indexer. You can even upload your own closed caption or transcription file for the content you want to process for PII; visit How to Add Closed Captions for more information.

  

Once you have the transcriptions, you can redact the PII entities that you have added to the VIDIZMO Audio Indexer's Insights. You can also detect these PII entities in any of the supported languages listed below.

 

During the detection process, the application will also factor in the rest of your configurations, such as the minimum confidence score for the PII detections, context keywords, excluded words, time interval threshold and original file handling. See Configuring VIDIZMO Audio Indexer for PII Detection and Redaction for more information. 

  

After the processing is done, the VIDIZMO Audio Indexer creates a Media Culture attribute for your audio or video file. The Media Culture attribute indicates the language or languages that the Media or Evidence contains or relates to.

 

Note: PII detection and redaction by the VIDIZMO Audio Indexer app utilizes AI processing as a consumption metric for your VIDIZMO Account. To learn how you can view consumption reports, refer to Consumption Reports for SaaS Deployment Overview.


Supported Languages

Here is a list of languages supported by the VIDIZMO Audio Indexer for PII detection and redaction.  

 

Note: If support for a language is unavailable, contact VIDIZMO support. 

  • Catalan 
  • Chinese 
  • Croatian 
  • Danish 
  • Dutch 
  • English 
  • Finnish 
  • French 
  • German 
  • Italian 
  • Japanese  
  • Korean 
  • Lithuanian 
  • Macedonian 
  • Norwegian Bokmål
  • Polish 
  • Portuguese 
  • Romanian 
  • Russian 
  • Slovenian 
  • Spanish 
  • Swedish 
  • Ukrainian 


PII Entities 

Here is the list of predefined PII Entities available by default. 

 

Note: To add additional or custom PII entities for detection, contact VIDIZMO support.

  • Address 
  • Age  
  • Australian Business Number 
  • Australian Company Number 
  • Australian Medicare 
  • Australian Tax File Number  
  • Credit Card Number 
  • Crypto Wallet Number 
  • Date Time  
  • Email Address 
  • IBAN Code 
  • Indian AADHAAR  
  • Indian Permanent Account Number 
  • IP Address  
  • Italian Driver License 
  • Italian Fiscal Code 
  • Italian Identity Card
  • Italian Passport Number 
  • Italian VAT Code
  • Medical License 
  • NRP (Nationality, Religious or Political Group)  
  • Organization 
  • Person’s Name  
  • Phone Number  
  • Polish National Identification Number 
  • Profession
  • Singaporean Unique Registered Entity Number 
  • Spanish Personal Tax ID (Número de Identificación Fiscal) 
  • UK National Health Service Number 
  • Unique Identifier 
  • URL  
  • US Bank Number  
  • US Driver License 
  • US Individual Taxpayer Identification Number 
  • US Passport Number 
  • US Social Security Number 
  • User Name 
  • Zip Code 


Confidence Score  

The confidence score is a value that the AI model assigns to a detected object or word to indicate how likely it is to be PII. The confidence threshold, also called the score threshold, is the cutoff used to decide whether that detection counts as PII. When the input text is analyzed for PII detection, the model breaks the text down into individual components called tokens and assigns a confidence score to each of them. The model then compares the score of each token with the score threshold. A word or token is classified as a PII entity if its confidence score is higher than the score threshold.

  

Increasing the score threshold means that the model will classify fewer detections as PII, but the ones it keeps are more likely to be accurate. Keep the score threshold at a high value if you want the model to pick out only the tokens that it is very confident are PII. On the other hand, lowering the score threshold means that the model is likely to classify more tokens as PII; this is useful when you want to minimize the chance of the model leaving out a token that might be a PII entity.

   

A high confidence score threshold is suitable for text that may contain fewer instances of PII. In comparison, a low confidence score threshold is ideal for text that may contain more instances of PII. The threshold can be set to any value from 0 to 100, but a value of 45 is highly recommended for the best results.
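The thresholding rule described above can be illustrated with a minimal sketch. This is not the VIDIZMO implementation: the `Detection` class, the `filter_by_threshold` function, and the sample scores are hypothetical and exist only to show how a "higher than the threshold" comparison behaves at different settings.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical record for one candidate PII detection."""
    text: str
    entity: str
    score: float  # confidence score on the 0 to 100 scale


def filter_by_threshold(detections, threshold=45):
    """Keep only detections whose score is higher than the threshold."""
    return [d for d in detections if d.score > threshold]


# Made-up scores, for illustration only.
detections = [
    Detection("555-555-5555", "PHONE_NUMBER", 85),
    Detection("VIDIZMO", "ORGANIZATION", 60),
    Detection("May", "PERSON", 30),
]

kept_default = filter_by_threshold(detections)               # recommended value of 45
kept_strict = filter_by_threshold(detections, threshold=80)  # higher threshold, fewer results
```

With these sample values, the default threshold keeps the phone number and the organization, while the stricter threshold keeps only the phone number.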


Excluded Words    

The VIDIZMO Audio Indexer also provides a field where you can enter a list of words that will not be classified as PII entities. For example, if you have configured the application to detect 'Organization' as PII, then the word 'VIDIZMO' will be detected and redacted. However, if 'VIDIZMO' is present in the excluded words list, the application skips this word and does not identify it as PII.

 

Make sure that you capitalize words correctly: this field is case-sensitive, and a word is excluded from PII detection only on an exact match. Words that are spelled the same but capitalized differently are treated as different words. For example, if you do not want 'MARCORP' to be identified as PII, add 'MARCORP' to the excluded words field. Do not enter 'marcorp' or 'Marcorp', as the AI model identifies these as separate entities.
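The exact-match behavior can be pictured with a short sketch. The `apply_exclusions` function is hypothetical and only mimics the case-sensitive comparison described above; it is not how the application is actually implemented.

```python
def apply_exclusions(detected_words, excluded_words):
    """Drop detected words that exactly match an excluded word (case-sensitive)."""
    excluded = set(excluded_words)
    return [word for word in detected_words if word not in excluded]


# Three capitalizations of the same name, all detected as 'Organization'.
detected = ["MARCORP", "Marcorp", "marcorp"]

# Only the exact spelling 'MARCORP' is on the excluded words list.
remaining = apply_exclusions(detected, ["MARCORP"])
```

Here only 'MARCORP' is skipped; 'Marcorp' and 'marcorp' remain as detections because their capitalization differs from the excluded entry.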


Context Keywords  

You can add context keywords that raise the confidence score of PII entities found near them. Using context words to increase the confidence score makes PII detection more accurate. For the score enhancement to happen, a context keyword needs to be present within a range of approximately 5 to 10 words (before or after) of the PII entity. You can provide a list of relevant context words in the Audio Indexer application, and they will be used to enhance the relevant PII entities.

  

For the most precise and efficient detection, it is recommended that you enter words relevant to the PII entities you have configured. For instance, if you want to detect 'PHONE_NUMBER' as a PII entity, you need relevant context words such as phone, number, or contact. Take a look at the sentence below:

  

"Can you write down my contact? It is 555-555-5555."   

  

In this sentence, the context word that enhances the PII detection is 'contact'. It appears a few words before the phone number and is relevant to that PII entity.
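A rough sketch of this idea, applied to the sentence above, might look like the following. Everything here is an assumption made for illustration: the function names, the 5-word window, the boost of 35 points, the base score of 40, and the case-insensitive keyword matching are not taken from the VIDIZMO application.

```python
import re


def has_nearby_context(text, pii_text, context_keywords, window=5):
    """Return True if a context keyword lies within `window` words of the PII token."""
    # Split into lowercase word tokens, keeping hyphens so '555-555-5555' stays whole.
    words = re.findall(r"[\w-]+", text.lower())
    keywords = {keyword.lower() for keyword in context_keywords}
    try:
        position = words.index(pii_text.lower())
    except ValueError:
        return False  # the PII token does not appear in the text
    start = max(0, position - window)
    nearby = words[start:position] + words[position + 1:position + 1 + window]
    return any(word in keywords for word in nearby)


def enhanced_score(base_score, context_found, boost=35):
    """Raise the score by an assumed boost when context is found, capped at 100."""
    return min(100, base_score + boost) if context_found else base_score


sentence = "Can you write down my contact? It is 555-555-5555."
keywords = ["phone", "number", "contact"]

found = has_nearby_context(sentence, "555-555-5555", keywords)
score = enhanced_score(40, found)  # 40 is a made-up base score
```

In this sketch, 'contact' sits three words before the phone number, so the assumed base score of 40 is raised to 75. Without a nearby keyword, the score would stay at 40, which is below the recommended threshold of 45.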


Note: The relevance of the context keywords is essential. Avoid overloading the list with words that are not relevant, as this can interfere with the detection process and prevent any confidence score enhancement.

To see how you can perform PII detection and redaction on your Portal, visit How to Perform PII Redaction using VIDIZMO Audio Indexer.