[Interest] Splitting a unicode text into "characters" for RFC 2047 encoding

Tue Dec 25 17:25:07 CET 2012

Hi,
RFC 2047 is a standard which defines the proper way of encoding Unicode data in the context of MIME message headers (i.e. it describes how to use non-ASCII text in e-mail subjects, human-readable names in the From/To headers, etc.). There's no support for that in Qt, so I wrote my own implementation (after trying the one from the Qt Messaging Framework, which unfortunately did not work so well in the real world). It's available at [1].

One step in the process involves splitting the Unicode string into a series of "chunks" where the size of each encoded chunk is limited by some constant. Each chunk has to be encoded in some character encoding (UTF-8, for example), and the resulting byte array has to be encoded again via either the Base64 or the Quoted-Printable scheme. It is the size of the result of all these transformations which counts.
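To make the size constraint concrete, here is a minimal sketch of the arithmetic for the Base64 case. It assumes the 75-character limit which RFC 2047 places on a whole encoded-word (of the form =?charset?B?text?=); the function names are hypothetical and are not taken from [1].

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Base64 turns every started group of 3 raw bytes into 4 output characters.
std::size_t base64Length(std::size_t rawBytes) {
    return 4 * ((rawBytes + 2) / 3);
}

// Total length of "=?" charset "?B?" text "?=" for a payload of rawBytes bytes.
// (Hypothetical helper; the "B" marks the Base64 variant of RFC 2047.)
std::size_t encodedWordLength(std::size_t rawBytes, const std::string& charset) {
    return 2 + charset.size() + 3 + base64Length(rawBytes) + 2;
}

// Largest number of raw (already charset-encoded) bytes that still fits into
// an encoded-word of at most `limit` characters.
std::size_t maxRawBytes(std::size_t limit, const std::string& charset) {
    const std::size_t overhead = charset.size() + 7;  // "=?" + "?B?" + "?="
    assert(limit > overhead);
    // Only whole groups of 4 Base64 characters are usable, 3 bytes each.
    return ((limit - overhead) / 4) * 3;
}
```

With charset "utf-8" and a limit of 75, this leaves room for 45 bytes of UTF-8 per chunk. The Quoted-Printable variant needs a per-byte count instead, because each byte takes either one or three characters.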

When decoding these chunks, the decoder takes each chunk, reverses the Base64 or Q-P encoding and then uses an appropriate decoder (like the UTF-8 one, or the one for Latin-1, ...) to convert the array of bytes back into a Unicode string. This means that each chunk has to be "self-contained" (I've seen buggy implementations which are happy to split between the two bytes required for the UTF-8 representation of the "á" character, for example).
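The buggy behavior described above can be avoided at the byte level with a simple test. The following is a minimal sketch, assuming the chunk is already UTF-8 encoded; isUtf8Boundary is a hypothetical name. It relies on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if cutting `bytes` right before index `pos` does not fall
// inside a multi-byte UTF-8 sequence.
bool isUtf8Boundary(const std::string& bytes, std::size_t pos) {
    assert(pos <= bytes.size());
    if (pos == 0 || pos == bytes.size())
        return true;  // the ends of the buffer are always fine
    // Continuation bytes look like 10xxxxxx; a cut before one of them
    // would separate it from its lead byte.
    return (static_cast<unsigned char>(bytes[pos]) & 0xC0) != 0x80;
}
```

For the two bytes 0xC3 0xA1 which encode "á", positions 0 and 2 are valid cut points while position 1 is not. This only protects individual code points; it says nothing about combining sequences.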

I've tried to do my best in this area, iterating over the individual QChars which together make up the string (see lines 252..268 of [1]). However, I know that there are certain "combining characters" in Unicode, i.e. that the "á" character I used earlier can actually be created as an ordinary "a" followed by a special symbol. I also suspect that certain combinations can only be expressed through the combining syntax, which means that QString::normalized() won't help me.

I think that when testing whether a string can be split at a particular index, my code should check whether the next symbol is a "combining character". However, I know nothing about the various non-Latin scripts, and I wasn't able to tell which method of QChar I should use in this context. It looks like QChar::isHighSurrogate() and QChar::isLowSurrogate() will be part of the solution, but they apparently only matter for non-BMP characters.
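The check I have in mind could look roughly like the sketch below. It is plain C++ over UTF-16 code units (the same units a QString stores), and all names are hypothetical. It is deliberately incomplete: the combining test is an assumption which covers only the "Combining Diacritical Marks" block U+0300..U+036F, whereas a real implementation would need the full Unicode character data (in Qt, presumably something based on QChar::category()).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Low (trailing) surrogate: second half of a non-BMP character in UTF-16.
bool isLowSurrogate(char16_t c) {
    return c >= 0xDC00 && c <= 0xDFFF;
}

// ASSUMPTION: only U+0300..U+036F is treated as "combining" here.
// Many other combining marks exist; this range is just for illustration.
bool isCombiningMark(char16_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// A split right before index i is unsafe if s[i] continues what precedes it.
bool canSplitBefore(const std::u16string& s, std::size_t i) {
    if (i == 0 || i >= s.size())
        return true;
    return !isLowSurrogate(s[i]) && !isCombiningMark(s[i]);
}

// Walks back from `limit` to the nearest safe split position.
// Returns 0 if no safe position exists; the caller has to handle that case.
std::size_t lastSafeSplit(const std::u16string& s, std::size_t limit) {
    assert(limit > 0);
    std::size_t i = limit < s.size() ? limit : s.size();
    while (i > 0 && !canSplitBefore(s, i))
        --i;
    return i;
}
```

For the string "a" + U+0301 + "b", a split before index 1 is rejected because it would detach the accent from the "a", while a split before index 2 is accepted.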

In short, I know nothing about Unicode details, but want to split the string at offsets where it is "safe". How do I tell where to split?

(As a side note, is there any interest in having an RFC 2047 encoder/decoder in Qt? I'll be happy to make this part of Qt5 if people are interested and I have some free time.)

With kind regards,
Jan

[1] http://quickgit.kde.org/?p=trojita.git&a=blob&f=src/Imap/Encoders.cpp

-- 
Trojitá, a fast Qt IMAP e-mail client -- http://trojita.flaska.net/