Unicode (Sep 2)

On a whim, our staff pasted some emojis into the new accounting software last week. It worked! The text saved fine to the database, and displayed properly. Much easier than expected. It’s all thanks to Qt’s robust support for Unicode.

Unicode is the standard format for all kinds of text these days. It includes Ελληνικά (Greek), Українська (Ukrainian), 한국어 (Korean), and just about every other written language on this planet. Plus ♔♕♖♗♘♙, 😝🍑🏈🔥💩 and more.

This is one area where it really paid for us to procrastinate. Computer text has had some difficult times, but we missed most of the pain.

A typical English-language typewriter or keyboard has 47 printable keys, give or take a few. With shift and space bar: 95 characters. ASCII is a 7-bit system (128 choices) that’s big enough to handle it. Leftover slots are used for line feeds, tabs and the like.
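
The arithmetic is easy to check. Here is a minimal C++ sketch of it; the function names are made up for illustration and are not from our code.

```cpp
#include <cassert>

// Characters reachable from a keyboard: every key plain and shifted,
// plus the space bar. (Hypothetical helper, for illustration only.)
int keyboardCharacters(int printableKeys) {
    assert(printableKeys >= 0);
    return printableKeys * 2 + 1;
}

// Count the 7-bit ASCII codes (0 to 127) that are printable.
// The printable range runs from space (0x20) through tilde (0x7E).
int countPrintableAscii() {
    int count = 0;
    for (int code = 0; code < 128; ++code) {
        if (code >= 0x20 && code <= 0x7E) {
            ++count;
        }
    }
    return count;
}

// Everything else in the 128 slots is a control code (line feed, tab, etc).
int countControlAscii() {
    return 128 - countPrintableAscii();
}
```

With 47 keys, keyboardCharacters gives 95, which matches the printable ASCII count. That leaves 33 slots for control codes.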

Apple expanded ASCII to 8 bits. That’s what Goldenseal uses. The extra 128 characters include diacritics (ßàæçñòóôõö etc), Greek letters, extra currency symbols, and fancier punctuation. Other companies also expanded ASCII, but everyone used different setups. It’s why you’ll sometimes get emails with weird symbols.

8 bits is sufficient to cover most European languages. But move east, and there are whole new alphabets. Too much for just one byte.

For a while in the 90s and early Aughts, the solution was wchar_t. On Windows it’s 16-bit text, with 65,536 possible characters. That’s enough to hold alphabets for Britain, Thailand, and all points between. Wide characters support العربية (Arabic), کوردی (Kurdish), हिंदी (Hindi) etc.

That era was not fun. Some things still needed ASCII, some needed wchar_t. Use the wrong one and you’d get gibberish or worse.

Microsoft created a special hell, with LPSTR, LPCSTR, LPWSTR, LPCWSTR, LPTSTR, LPCTSTR, CStringA, CStringW, _bstr_t, CComBSTR, WCHAR and TCHAR, all for different types of 8 or 16-bit text. There were equally obscure ways to convert between them: CW2A, CW2AEX, etc. Use the wrong one and code would crash: sometimes suddenly, sometimes randomly later.

Move on to East Asia and wchar_t had another problem. China, Japan, and Korea use writing systems with many thousands of symbols each, including ideographs that stand for a whole word. 2-byte text lacked space for all of them, so it wasn’t good enough for global use. It also was twice as bulky for regular Latin-language text.

Unicode fixed all that. It’s a clever system that can mix characters of different sizes: anything from 1-byte Latin to 4-byte 𒁎 (Ancient Sumerian). With over a million possible code points, there’s room for every human language, current or extinct. Plus music notation. Emojis. Dingbats. Weird math symbols. And plenty more.
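
The variable sizes come from UTF-8, the most common way to store Unicode. Here is a minimal sketch of how one character number (a code point) turns into 1 to 4 bytes. The function name encodeUtf8 is ours, invented for this example; it is not a Qt call, and real apps would normally let the library do this.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Hypothetical helper, for illustration only.
std::string encodeUtf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);              // highest code point Unicode allows
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogate range is not encodable
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (same as ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encodeUtf8(0x41) for the letter A is 1 byte, encodeUtf8(0x395) for Ε is 2 bytes, encodeUtf8(0xD55C) for 한 is 3 bytes, and encodeUtf8(0x1F525) for 🔥 is 4 bytes.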

Even better, Unicode is usually stored as UTF-8. That’s an 8-bit format in which plain ASCII text stays unchanged, so programmers can handle Unicode much like simple Latin text. Thanks to UTF-8, text-handling code inside our new accounting software accepts Unicode with no need to rewrite anything.
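
To show why old byte-based code keeps working, here is a small sketch using only std::string. It assumes the text is valid UTF-8. The function names are hypothetical and this is not the code in our app.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Plain byte search. It works on UTF-8 because a character's byte
// sequence never starts in the middle of another character.
bool containsUtf8(const std::string& haystack, const std::string& needle) {
    assert(!needle.empty());  // an empty needle would match everything
    return haystack.find(needle) != std::string::npos;
}

// Count characters (code points) rather than bytes: skip the
// continuation bytes, which always look like 10xxxxxx.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

The one catch is length. The text "Paid 🔥 invoice" is 17 bytes but only 14 characters, so anything that counts or slices by byte needs a second look.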

The new accounting app will have a few problems with Unicode, but nothing serious. Find works OK with special characters or emojis, but sorting won’t know what to do with non-Latin text. You’ll need to be careful with fonts: none support the entire gamut of Unicode, and some may be Latin-only. Other quirks may arise.
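
Here is a rough sketch of the sorting problem. It assumes names are stored as UTF-8 and sorted by raw byte value, which is what std::sort does with std::string. That may not be exactly how our app sorts, and sortByBytes is a made-up name.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Sort names by raw byte value. No language rules are applied.
// Hypothetical helper, for illustration only.
std::vector<std::string> sortByBytes(std::vector<std::string> names) {
    std::sort(names.begin(), names.end());
    assert(std::is_sorted(names.begin(), names.end()));
    return names;
}
```

Given zebra, Émile, apple and Zoe, the result is Zoe, apple, zebra, Émile. Capitals land before lowercase, and the accented É lands after z because its UTF-8 bytes are all above 127. A proper fix needs language-aware comparison, such as Qt’s QCollator class.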

The next thing for us to test is rich text: with multiple fonts and formats inside. The Qt class we use for multi-line fields supports it. If we’re lucky, that also will be easy.

Dennis Kolva
Programming Director
TurtleSoft.com
