hoakley January 20, 2024 General, Macs, Technology

The revenge of Unicode

Unicode is an epitome of human achievement: a brilliant idea that has grown out of control to the point where no human can grok it all any more. I sometimes wonder how many of its 149,813 ‘characters’ any one human is likely to use, and suspect for most that’s in the low hundreds or less. All those ‘characters’ enable deliberate misuse, where visual similarities are exploited to spoof people over identity or worse. Let me explain how you can get Unicode revenge without harming a soul.

We still do a great deal in life using text that can be searched rapidly and readily. Sometimes it pays to obfuscate that so that only humans reading it will understand what it says. Whether it’s an eavesdropper bulk-scanning emails, or someone’s AI crawler building your words into its next Large Language Model (LLM), you can make their task inconveniently difficult by recasting its Unicode. For example, the following obfuscated version of a paragraph from one of my recent articles reads clearly to the human eye:

But look more closely at those characters, like
Αlthоugh thе shір's bоаts hаd оrіgіnаllу іntеndеd tо tоw thе оvеrlоаdеd аnd раrtіаllу submеrgеd rаft
Those aren’t what they seem, and on ordinary text searches will draw a blank.
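To see why an ordinary search draws a blank, here is a minimal Python sketch of this light level of obfuscation. The `HOMOGLYPHS` table and the `obfuscate` function are hypothetical names invented for illustration, and the table is only a small assumed sample of Cyrillic lookalikes, not the mapping that any particular tool uses.

```python
# Hypothetical lookalike table: a few Latin letters and Cyrillic twins.
# This is an assumed sample for illustration, not a complete mapping.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
    "i": "\u0456",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "y": "\u0443",  # CYRILLIC SMALL LETTER U
}


def obfuscate(text: str) -> str:
    """Swap each mapped Latin letter for its lookalike; leave the rest alone."""
    return text.translate(str.maketrans(HOMOGLYPHS))


plain = "the overloaded and partially submerged raft"
hidden = obfuscate(plain)

print(hidden)            # looks much the same on screen
print("raft" in plain)   # True
print("raft" in hidden)  # False: the 'a' is now a different code point
```

The two strings have the same length and look alike in most fonts, but a plain substring search compares code points, so it no longer matches.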

Apparently, some searches now make allowance for that degree of light obfuscation. To make things far harder for them, try the more extreme
Αⅼ𝚝𝚑о𝚞ɡ𝚑 𝚝𝚑е 𝚜𝚑ірᛌ𝚜 bоа𝚝𝚜 𝚑аⅾ о𝚛іɡі𝚗аⅼⅼу і𝚗𝚝е𝚗ⅾеⅾ 𝚝о 𝚝о𝚠 𝚝𝚑е о𝚟е𝚛ⅼоаⅾеⅾ а𝚗ⅾ ра𝚛𝚝іаⅼⅼу 𝚜𝚞b𝚖е𝚛ɡеⅾ 𝚛а𝚏𝚝
which remains thoroughly understandable to humans, but makes most machines give up in confusion.
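One plausible way a search could make allowance for this is Unicode compatibility normalisation (NFKC). That is my assumption for the sake of illustration, as I don't know how any given search engine does it. The sketch below builds a short sample mixing Mathematical Monospace letters with Cyrillic lookalikes, lists what each code point really is, then folds it with NFKC. The helper `to_monospace` is a hypothetical name.

```python
import unicodedata


def to_monospace(ch: str) -> str:
    """Map an ASCII lowercase letter to its Mathematical Monospace twin."""
    return chr(0x1D68A + ord(ch) - ord("a"))  # U+1D68A is monospace small a


# "the raft" with t, h, r, f as monospace letters and Cyrillic ie and a.
sample = "".join([
    to_monospace("t"), to_monospace("h"), "\u0435", " ",
    to_monospace("r"), "\u0430", to_monospace("f"), to_monospace("t"),
])

# Show what each visible 'letter' really is.
for ch in sample:
    if ch != " ":
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# NFKC folds the monospace letters back to ASCII, but the Cyrillic
# lookalikes have no compatibility mapping and survive unchanged.
folded = unicodedata.normalize("NFKC", sample)
print(folded == "th\u0435 r\u0430ft")  # True
print("raft" in folded)                # False
```

So normalisation alone only undoes part of the disguise; matching the remaining lookalikes would need a separate table of confusable characters.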

There are now ways around this obfuscation. Apple’s Live Text does an excellent job of recognition on both those screenshots, but going the extra mile of converting all your obfuscated text into images, then running text recognition on them, isn’t something that many will try, and it imposes a significant computational burden on the eavesdropper or crawler.

Obfuscation is of course no substitute for encryption: if the text contains secrets that you don’t want others to see at all, then you must encrypt it using a robust method. But for holding off those who are just going to use normal text searching, it should be effective.

Almost seven years ago, I wrote a little utility for obfuscating Latin text at the two levels shown above. Dystextia is fairly basic, but runs a treat in macOS from Sierra to Sonoma. You can also use it to obfuscate shorter sections of text. While Internet domains that include non-standard characters are converted into ‘Punycode’ that makes them difficult to spoof, the rest of the URL is left in its original Unicode, thus preserving any obfuscation.
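The different treatment of domain names and the rest of a URL can be sketched in Python using its built-in `idna` codec, which implements the older IDNA 2003 rules. The host and path here are made up for illustration, and browsers may apply their own, stricter rules.

```python
# Hypothetical host and path, each containing CYRILLIC SMALL LETTER A.
host = "\u0430pple.com"
path = "/r\u0430ft"

# Non-ASCII labels in the domain become an ASCII 'xn--' Punycode form...
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)

# ...while nothing here touches the path, which keeps its lookalike.
print(ascii_host + path)
print(ascii_host.isascii(), path.isascii())  # True False
```

The converted host makes the substitution obvious to anyone who looks, whereas the path still reads as ordinary text.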

Perhaps it’s time to see whether you can use Unicode’s code points to conceal other text in steganography.

17 Comments


1

Lukas on January 20, 2024 at 11:06 am

Great idea, but both Unicode texts are easily recognized by GPT-4. It doesn’t struggle at all.

- 2
  
  hoakley on January 20, 2024 at 11:46 am
  
  Thank you. I’d be interested to know how that would scale up, for example to searching terabytes of text per minute. What would the computational burden be?
  Howard
  
3

tempelmann@gmail.com on January 20, 2024 at 12:48 pm

When I tried to add file content search to my Find Any File, I initially tried to support searching in PDFs as well. I took some PDFs from my disk and looked into them, seeing that their text sections usually were encoded/compressed in a format I could easily decode.

So I started writing code to do that and it seemed to work well. That is, until I tried to parse PDFs generated by macOS from plain text documents. There, you’d not find the text in plain ASCII (or Unicode) as one might hope.

Instead, every single letter was individually positioned somewhere on the page. And it could even be that they’re not in the order of the original letters!

That’s when I gave up. I would have had to “render” the letters into a virtual page, and then figure out which letters were in a line, and in which order. Doable, but terrible.

Have you ever had the effect that when you copy text from a PDF, you’d end up with garbled plain text, or with duplicate letters? That is probably a result of this, too, where the virtual renderer failed to figure out (and ignore) overlapping letters that might render as bold on a printed page.

I hope that the Spotlight importer as well as Preview’s “Copy” command for PDFs will soon make use of the new ability and then do a better (i.e. failproof) job, reading or copying exactly the visible text and nothing else.

VoiceOver should also benefit from it.

- 4
  
  hoakley on January 20, 2024 at 2:04 pm
  
  Thank you.
  One way of extracting the text effectively is by rendering the PDF into a PDFView, from where you can access the text in the document via PDFView.document?.string!. I think the PDF Spotlight importer should do as well now.
  Howard.
  
5

tempelmann@gmail.com on January 20, 2024 at 12:50 pm

Huh, thinking about this – I should look into adding support for searching PDFs in Find Any File by using the new text recognition feature, too. I wonder what the performance is.

- 6
  
  hoakley on January 20, 2024 at 2:05 pm
  
  I think that’s only accessible via Spotlight, but search there should be productive.
  Howard.
  
  - 7
    
    tempelmann@gmail.com on January 20, 2024 at 2:11 pm
    
    Even in Sonoma, the Spotlight PDF importer (`mdimport -d3`) is not able to extract text if the pages are made from images. OTOH, in Preview I can select the text in the same PDF and copy it. Which proves that mdimport lags behind using the new features Apple offers.
    
    - 8
      
      hoakley on January 20, 2024 at 2:15 pm
      
      Ah: you’re referring to PDFs which haven’t performed any OCR to generate text content, and only contain scanned images. The only way to convert those to text using macOS is via Live Text on the rendered image, AFAIK, and that isn’t saved into the PDF file, or its extended attributes, so has to be performed fresh each time.
      Howard.
      
    - 9
      
      tempelmann@gmail.com on January 20, 2024 at 2:19 pm
      
      Huh, okay, then that might be an unusual PDF (I just ran into it when I did some quick testing earlier) for a better test. OTOH, wouldn’t it be cool if Spotlight could find text even in image files? The problem is that many of Apple’s various Spotlight importers would need to be updated, i.e. all those that may handle non-textual content, to make use of the new text scanner. That would be true for the PDF importer as well as the image importer (wait – is there even one yet? Probably yes, in order to learn metadata from images), and maybe others, too. I wonder whether they’re even aware of this need at Apple.
      
    - 10
      
      hoakley on January 20, 2024 at 8:27 pm
      
      They’re not unusual. Most people, though, want the scanned pages in that PDF converted into text using OCR so the document becomes searchable. These days I hardly ever come across PDFs that have been created from scans, though.
      I don’t think this is the sort of task to delegate to a Spotlight importer: I’m not sure whether it might be done for some images by background services like photoanalysisd. But it’s also controversial, as it involves performing image analysis on files without explicit consent, and you never know what could become searchable!
      Howard.
      
    - 11
      
      Ralf on January 20, 2024 at 10:46 pm
      
      A cautious remark to the experts: when I save pictures (not PDFs) of business cards in Apple Notes, Spotlight easily finds them based on the text shown on the card.
      
    - 12
      
      hoakley on January 21, 2024 at 9:51 am
      
      Thank you. For images in folders covered by Spotlight, I believe that’s been true for a good while now. However, I don’t know whether that applies to images embedded in PDFs, and I don’t think that’s the case at present.
      Howard.
      
    - 13
      
      tempelmann@gmail.com on January 20, 2024 at 11:31 pm
      
      Well, the PDF with all the scanned pages from a manual was one I created myself – with Apple’s Preview. So, it’s still Preview that should have had the task of converting the text in the image into readable text when creating the PDF, isn’t it? My point is: Apple should do it either when creating the PDF or when later scanning the PDF with a Spotlight importer. They do neither.
      Besides, which tools would do the job? I hope the (non-free) Adobe Acrobat tool does?
      
    - 14
      
      hoakley on January 21, 2024 at 9:55 am
      
      Preview doesn’t perform OCR when creating PDFs. For that you’ll need a proper PDF app. Adobe Acrobat paid-for (formerly Pro) certainly will do so if you wish. The text is then saved in the PDF itself, and becomes accessible to Spotlight.
      Preview is not a proper PDF editor, and you should always use a proper PDF editor when you need one.
      Howard.
      
15

chasbelov on January 20, 2024 at 9:27 pm

Bad for accessibility to screen readers.

- 16
  
  hoakley on January 20, 2024 at 9:27 pm
  
  Of course.
  Howard
  
- 17
  
  hoakley on January 20, 2024 at 9:53 pm
  
  If you think about it, obfuscation is pretty well an antonym of accessibility, having opposite intent. What’s good for screen readers is also excellent for text search, isn’t it?
  BTW, if you were to take a look at Dystextia, you’d discover that it can take obfuscated text and return it to its original encoding, so if you did want to use a screen reader, it will facilitate that.
  Howard.
  
