Finally, A Way To Send Malicious PDFs With URLs In Russian Letters

SophosLabs UncutPDF malwarepunycodeSpam

Spammers increasingly use PDF’s ability to embed hyperlinks into documents to direct recipients of malicious PDFs to open malicious Web sites. We’ve discovered a spam campaign using internationalized URLs embedded in PDF document attachments.

SophosLabs Uncut

By Jason Zhang

In the course of investigating a spam campaign involving malicious PDF attachments, I found something I hadn’t seen in a malicious PDF before: an embedded link to an external Web site written in a foreign character set (in this case, Russian Cyrillic)

Distributed widely in a spam campaign, the message unabashedly promotes an “email newsletter/marketing” business; It includes a PDF attachment. But there were no elaborate exploits in these PDFs, nor were they particularly complex. They’re usually a single page, no more complicated than what could generously be called a background photo — let’s call it B-grade clip art — overlaid with marketing text and an enticement to click a link embedded in the document.

For the mere cost of 3500 rubles, I could spam 1088958 users (an oddly specific number if you ask me). Still, it’s a bargain. You can’t buy anything for a single penny (or ruble) anymore, except spam targets. At today’s exchange rate, this spammer’s services promise to hit about 311 victims for each ruble, or cent and a half, you spend.

In the samples we analyzed, we found two Punycode URLs in each PDF: one linked to the picture, and the other linked to the cyrillic text at the bottom of the page.

Why Is This Bad?

Spammers have toyed with what are known as Internationalized Domain Name (IDN) homograph attacks for years. While the letter “p” for example appears in both the Latin and Cyrillic alphabet, to a computer, the Cyrillic Р (pronounced like a Latin R) is completely different from the Latin P. That means I could mix up the character sets and send you a link to a malicious РАYМENТ-SΙTЕ.com that, when pasted in the browser address bar, looks indistinguishable to the human eye like the real

Punycode has as ASCII component to it, so if you view that malicious URL in a plain text editor instead of in an application that renders the Punycode URL in its character set (like an email client or in the browser address bar), it might look like this:

Easy to spot, right? Except when it isn’t. Many programs automatically recognize Punycode text and render it the way it was intended, without regard to whether that intention was good or evil.

The bottom line is that IDN homograph attacks made it much harder for casual users to detect phishing attacks, since the visual appearance of the link itself is deceptive.

Adobe Reader does throw a warning dialog (in its default settings) if you try to visit a link from within a PDF. You can only proceed if you click ‘Allow,’ which is not much of a barrier. The dialog box in Adobe Reader doesn’t render the Punycode, so you see the ugly ASCII rendition and, if you’re paying the least bit of attention, hopefully you notice the peculiar link and don’t just semiconsciously click the Allow button anyway.

And yes, this is in whole fact and without a shred of irony, a spammed PDF with a shady, embedded link to a Web site advertising its wink-wink-nudge-nudge email newsletter/marketing business calling itself Russian Post Office. You can strip out the Punycode window dressing using Google or any number of converter sites, but some stains just don’t come out.

Why Punycode?

Punycode isn’t puny at all. In fact, it’s huge. Its unassuming name masks the fact that it is a method for sending written words in character sets that don’t exist in ASCII text, mainly the Latin alphabet, plus all the other normal punctuation and numbers. Punycode’s capabilities span from ancient Greek to modern Arabic, from Chinese to Cyrillic. It is a linguistic giant.

Using Punycode, host names containing non-ASCII Unicode characters are transcoded to a subset of ASCII consisting of letters, digits, and hyphen, which is called the Letter-Digit-Hyphen (LDH) subset. Before displaying a non-ASCII URL in a browser address bar, all browsers first translate non-ASCII URLs into Punycode in the background before performing a DNS lookup. Punycode encoded URLs are prefixed with “xn--“. For example, a book related website in Russian might have a domain name like книги.tld where книги means “books” in Russian and tld is some top-level domain (TLD). This domain name has two parts, книги and tld. The second part is pure ASCII, and is left unchanged. The first part will be converted to Punycode which results in c1ajbfp. It is then prefixed with “xn--” to produce “xn-- c1ajbfp“, and the full Punycode encoded URL becomes xn-- c1ajbfp.tld. It is no surprise that internationalized domain names (IDNs) become widely used as IDNs allow people around the world to use domains in local languages and scripts. For spammers, using a non-ASCII domain name could possibly increase their attack success rate as well when targeting people speaking a specific language.

There also exists a problem with Punycode, such as Homograph phishing attacks (where the website address looks legitimate, but is not, because one or more characters have been replaced deceptively with non-ASCII Unicode characters). There are quite a few Unicode characters represented in alphabets such as Greek, Cyrillic, and Armenian, which look almost identical to Latin letters at a glance, but are treated very differently by computers when resolving the different web addresses. For example, the Cyrillic “a” looks very similar to the ASCII “a”. A good example on how to launch a Punycode phishing attack by displaying a Cyrillic domain name similar to apple[.]com can be found at . If a non-ASCII URL is detected as a phishing website, many modern web browsers use Punycode URL to represent non-ASCII Unicode characters in the URL to defend against the Homograph phishing attacks.

The Stats of the Campaign

Fig. 4 illustrates the breakdown of the monthly PDF hits we recorded over the last few months since January this year. As it shows, the campaign started in February with less than 50% PDF documents embedded with Punycode URLs, it then dominated the bars from March to May. It is worth noting that the stats shown in July bar is up to July 26 only (at the time of writing this article). As the figure shows, the campaign has been going on for six months, though it started to cool down in the past two months.

Fig. 4 – Monthly stats of PDF with/out Punycode URLs.

Though there are over 25,000 hits recorded in total from this campaign, the number of domains detected is relatively small. Our investigation shows that only four websites are associated with the campaign so far, as listed below:

Sophos Detection

All Sophos customers are protected from this campaign.

1 Comment

It’s interesting to me that I had ever explored punycode during preparing a paper, -NatureDNS. Yes, it is weird and error-prone to include symbols in IDN.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.