[Development] Qt::CaseInsensitive comparison is not the same as toLower() comparison

Wed Feb 10 20:46:49 CET 2016

Hi Thiago,

On 10/02/16 19:27, "Development on behalf of Thiago Macieira" <development-bounces at qt-project.org on behalf of thiago.macieira at intel.com> wrote:

>Hi all
>
>(especially Konstantin!)
>
>When reviewing a change, I noticed that QString::startsWith with 
>CaseInsensitive compares things like this:
>
>            if (foldCase(data[i]) != foldCase((ushort)latin[i]))
>                return false;
>
>with foldCase() being convertCase_helper<QUnicodeTables::CasefoldTraits>(ch), 
>whereas toLower() uses QUnicodeTables::LowercaseTraits.
>
>There's a slight but important difference in a few character pairs; see below. 
>The code has been like that since forever. So I have to ask:
>
>	=> Is this intended?

Yes, CaseInsensitive comparisons should compare case-folded strings. And you're right: in some cases that is not the same as comparing the lowercased versions of both strings.
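
To make the difference concrete, here is a minimal standalone sketch in plain C++ (no Qt). The helpers simpleFold, simpleLower and equalUnder are hypothetical names invented for this illustration, and the tables cover only ASCII plus the two pairs quoted below; this is not how QUnicodeTables is implemented.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: simple case folding for ASCII plus the two
// code points discussed in this thread. Real data comes from the
// Unicode Character Database (CaseFolding.txt).
char32_t simpleFold(char32_t c)
{
    if (c >= U'A' && c <= U'Z')
        return c + 32;              // ASCII upper -> lower
    switch (c) {
    case 0x00B5: return 0x03BC;     // MICRO SIGN -> GREEK SMALL LETTER MU
    case 0x017F: return 0x0073;     // LATIN SMALL LETTER LONG S -> 's'
    default:     return c;
    }
}

// Hypothetical helper: lowercase mapping for the same range. U+00B5 and
// U+017F are already lowercase letters, so they map to themselves.
char32_t simpleLower(char32_t c)
{
    if (c >= U'A' && c <= U'Z')
        return c + 32;
    return c;
}

// Compare two strings code point by code point after applying 'map'.
template <typename Map>
bool equalUnder(const std::u32string &a, const std::u32string &b, Map map)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (map(a[i]) != map(b[i]))
            return false;
    }
    return true;
}
```

With these toy tables, equalUnder(U"\u00B5", U"\u03BC", simpleFold) is true, while the same call with simpleLower is false, which mirrors the startsWith() versus toLower() discrepancy described in the quoted text.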
>
>If you write code like:
>
>	qDebug() << a.startsWith(b, Qt::CaseInsensitive)
>		<< (a.toLower() == b.toLower());
>
>You'll get a different result for the following pairs (for example, see
>util/unicode/data/CaseFolding.txt for more):
>
>µ U+00B5 MICRO SIGN
>μ U+03BC GREEK SMALL LETTER MU
>
>s U+0073 LATIN SMALL LETTER S
>ſ U+017F LATIN SMALL LETTER LONG S
>
>And then there are the differences between toUpper and toLower. The following 
>pairs compare false with toLower(), compare true with toUpper(), currently 
>compare false with CaseInsensitive/toCaseFolded() but *should* compare 
>true[1]:

They should only compare true with full case folding rules. This is something we have so far not implemented in Qt; as you noted below, we're still using simple case folding rules.
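
As a rough illustration of what full case folding means, the following standalone C++ sketch (no Qt) folds a string where one input character may expand to several. fullFold is a hypothetical name, and it handles only ASCII and the characters from the quoted list; the expansions are the "F" entries from CaseFolding.txt. It is a sketch under those assumptions, not a proposal for the Qt implementation.

```cpp
#include <string>

// Hypothetical full case folding: unlike simple folding, the result
// may be longer than the input.
std::u32string fullFold(const std::u32string &in)
{
    std::u32string out;
    for (char32_t c : in) {
        if (c >= U'A' && c <= U'Z') {
            out += static_cast<char32_t>(c + 32);   // ASCII upper -> lower
            continue;
        }
        switch (c) {
        case 0x00DF:                        // LATIN SMALL LETTER SHARP S
        case 0x1E9E:                        // LATIN CAPITAL LETTER SHARP S
            out += U"ss";
            break;
        case 0x0149:                        // N PRECEDED BY APOSTROPHE
            out += U"\u02BCn";              // MODIFIER LETTER APOSTROPHE + 'n'
            break;
        case 0xFB00:                        // LATIN SMALL LIGATURE FF
            out += U"ff";
            break;
        default:
            out += c;                       // everything else is left as is
        }
    }
    return out;
}
```

Here fullFold(U"MASSE") and fullFold(U"Ma\u00DFe") both yield "masse", so the two compare equal. A length-preserving simple fold cannot do that, because the inputs have different lengths to begin with.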

This is actually somewhat similar to the case of comparing strings in different normalisation forms (composed vs decomposed), something we also don't do out of the box (i.e. 'Ä' doesn't compare true with the combination of 'A' with the diacritical mark for umlaut).
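
The normalisation case can be sketched the same way. The snippet below is plain C++ with a toy composition step, composeToy (a hypothetical name), that knows only the single pair 'A' + U+0308 COMBINING DIAERESIS; real normalisation needs the full Unicode tables. In Qt the caller can normalise explicitly with QString::normalized() before comparing.

```cpp
#include <string>

// Toy composition: merges 'A' followed by U+0308 into U+00C4 and leaves
// every other code point untouched. For illustration only.
std::u32string composeToy(const std::u32string &in)
{
    std::u32string out;
    for (char32_t c : in) {
        if (c == 0x0308 && !out.empty() && out.back() == U'A')
            out.back() = 0x00C4;    // LATIN CAPITAL LETTER A WITH DIAERESIS
        else
            out += c;
    }
    return out;
}
```

Compared code point by code point, U"\u00C4" and U"A\u0308" differ (they do not even have the same length), but they are equal once both have gone through composeToy.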

At some point it would probably be nice to implement support for comparing these correctly. This does, however, have a performance impact, and many use cases might not want this comparison by default.

Cheers,
Lars

>
>ß U+00DF LATIN SMALL LETTER SHARP S
>ẞ U+1E9E LATIN CAPITAL LETTER SHARP S
>SS
>
>ŉ U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
>ʼN
>
>ﬀ U+FB00 LATIN SMALL LIGATURE FF
>FF
>
>[1] CaseFolding.txt says:
># The data supports both implementations that require simple case foldings
># (where string lengths don't change), and implementations that allow full case folding
># (where string lengths may grow). Note that where they can be supported, the
># full case foldings are superior: for example, they allow "MASSE" and "Maße" to match.
>
>-- 
>Thiago Macieira - thiago.macieira (AT) intel.com
>  Software Architect - Intel Open Source Technology Center