[Interest] Splitting a unicode text into "characters" for RFC 2047 encoding

Thiago Macieira thiago.macieira at intel.com
Tue Dec 25 22:11:51 CET 2012


On Tuesday, 25 December 2012 17.25.07, Jan Kundrát wrote:
> I think that when testing whether a string can be split at a particular
> index, my code shall check whether the next symbol is a "combining
> character". However, I know nothing about various non-latin scripts, and I
> wasn't able to tell which method of a QChar shall I use in this context. It
> looks like QChar::isHighSurrogate() and QChar::isLowSurrogate() will be
> part of the solution, but they apparently only work for non-BMP characters.
> 
> In short, I know nothing about Unicode details, but want to split the string
> at offsets where it is "safe". How do I tell where to split?

Are you sure you need to keep the combining characters together in the same 
RFC 2047 chunk?

If you do, you can use QChar::category [1] and check for the mark categories: 
QChar::Mark_NonSpacing, QChar::Mark_SpacingCombining and QChar::Mark_Enclosing 
(most combining accents are Mark_NonSpacing). If you run into a surrogate, then 
get the two surrogates, calculate the UCS-4 value (see QChar::surrogateToUcs4 
[2]) and try again.
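
A rough sketch of that check, written in plain C++ without Qt so that it is 
self-contained. The names isCombiningMark() and canSplitAt() are made up for 
this example, and isCombiningMark() is only a stand-in that tests a few 
well-known combining blocks; real code would ask QChar::category() for the 
UCS-4 value instead. The surrogate helpers mirror QChar::isHighSurrogate(), 
QChar::isLowSurrogate() and QChar::surrogateToUcs4().

```cpp
#include <cstddef>
#include <string>

// UTF-16 surrogate tests; QChar::isHighSurrogate()/isLowSurrogate() do the same.
static bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
static bool isLowSurrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Same arithmetic as QChar::surrogateToUcs4(high, low).
static char32_t surrogateToUcs4(char16_t high, char16_t low)
{
    return (char32_t(high - 0xD800) << 10) + char32_t(low - 0xDC00) + 0x10000;
}

// ASSUMPTION: hypothetical stand-in for a real Unicode category lookup.
// It only knows a few combining blocks; with Qt, check QChar::category(ucs4)
// against the Mark_* categories instead.
static bool isCombiningMark(char32_t ucs4)
{
    return (ucs4 >= 0x0300 && ucs4 <= 0x036F)   // Combining Diacritical Marks
        || (ucs4 >= 0x1AB0 && ucs4 <= 0x1AFF)   // ... Extended
        || (ucs4 >= 0x1DC0 && ucs4 <= 0x1DFF)   // ... Supplement
        || (ucs4 >= 0x20D0 && ucs4 <= 0x20FF)   // ... for Symbols
        || (ucs4 >= 0xFE20 && ucs4 <= 0xFE2F);  // Combining Half Marks
}

// Returns true if the string may be cut so that s[index] starts the next chunk.
static bool canSplitAt(const std::u16string &s, std::size_t index)
{
    if (index == 0 || index >= s.size())
        return true;

    const char16_t next = s[index];

    // Never cut between the two halves of a surrogate pair.
    if (isLowSurrogate(next) && isHighSurrogate(s[index - 1]))
        return false;

    // Decode a full code point if the next unit starts a surrogate pair.
    char32_t ucs4 = next;
    if (isHighSurrogate(next) && index + 1 < s.size() && isLowSurrogate(s[index + 1]))
        ucs4 = surrogateToUcs4(next, s[index + 1]);

    // Keep a combining mark in the same chunk as the character before it.
    return !isCombiningMark(ucs4);
}
```

For the string "e", U+0301, "a", canSplitAt() rejects index 1 (the accent 
would be separated from its base letter) and accepts index 2.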

My currently-experimental QStringIterator class [3] would return UCS-4 values 
when iterating over a string.

[1] http://qt-project.org/doc/qt-5.0/qtcore/qchar.html#category-2
[2] http://qt-project.org/doc/qt-5.0/qtcore/qchar.html#surrogateToUcs4
[3] https://codereview.qt-project.org/669
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center