[Development] Qt::CaseInsensitive comparison is not the same as toLower() comparison

Wed Feb 10 22:05:28 CET 2016

2016-02-10 23:46 GMT+04:00 Knoll Lars <Lars.Knoll at theqtcompany.com>:

> Hi Thiago,
>
> On 10/02/16 19:27, "Development on behalf of Thiago Macieira" <
> development-bounces at qt-project.org on behalf of thiago.macieira at intel.com>
> wrote:
>
> >Hi all
> >
> >(especially Konstantin!)
> >
> >When reviewing a change, I noticed that QString::startsWith with
> >CaseInsensitive compares things like this:
> >
> >            if (foldCase(data[i]) != foldCase((ushort)latin[i]))
> >                return false;
> >
> >with foldCase() being convertCase_helper<QUnicodeTables::CasefoldTraits>(ch),
> >whereas toLower() uses QUnicodeTables::LowercaseTraits.
> >
> >There's a slight but important difference in a few character pairs; see below.
> >The code has been like that since forever. So I have to ask:
> >
> >       => Is this intended?
>
> Yes, CaseInsensitive comparisons should compare case folded strings. And
> you're right, in some cases that is something else than comparing the lower
> cased versions of both strings.
>
>
> >If you write code like:
> >
> >       qDebug() << a.startsWith(b, Qt::CaseInsensitive)
> >               << (a.toLower() == b.toLower());
> >
> >You'll get a different result for the following pairs (for example; see
> >util/unicode/data/CaseFolding.txt for more):
> >
> >µ U+00B5 MICRO SIGN
> >μ U+03BC GREEK SMALL LETTER MU
> >
> >s U+0073 LATIN SMALL LETTER S
> >ſ U+017F LATIN SMALL LETTER LONG S
> >
> >And then there are the differences between toUpper and toLower. The following
> >pairs compare false with toLower(), compare true with toUpper(), currently
> >compare false with CaseInsensitive/toCaseFolded() but *should* compare
> >true[1]:
>
> They should only compare true with full case folding rules. This is
> something we have so far not implemented in Qt; as you noted below we're
> still using simple case folding rules.
>
> This is actually somewhat similar to the case of comparing strings in
> different normalisation forms (composed vs decomposed), something we also
> don't do out of the box (i.e. 'Ä' doesn't compare true with the combination
> of 'A' with the diacritical mark for umlaut).
>
> At some point it would probably be nice to implement support for comparing
> these correctly. This does however have a performance impact and many use
> cases might not want this comparison by default.
>

Right, we still don't have "full" case folding support in Qt; but even if
we did, it requires a locale (or at least language) to be passed as a
comparison context, which means we'll have to review our string comparison
API -- so maybe Qt6?
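
To make the difference between the two comparisons concrete, here is a minimal sketch in plain C++. It is not Qt's implementation: toyToLower(), toyFoldCase() and equalsWith() are hypothetical helpers, and the tables cover only the two pairs Thiago listed, whereas real code would take the mappings from the Unicode data files.

```cpp
#include <cstddef>
#include <string>

// Toy lowercase mapping covering only ASCII and Greek capital MU.
// (Assumption: a real implementation reads the Unicode character database.)
char32_t toyToLower(char32_t ch)
{
    if (ch >= U'A' && ch <= U'Z')
        return ch + (U'a' - U'A');
    if (ch == U'\u039C')               // GREEK CAPITAL LETTER MU
        return U'\u03BC';              // GREEK SMALL LETTER MU
    return ch;                         // U+00B5 and U+017F are already lowercase
}

// Toy simple case folding: lowercase mapping plus the two extra
// CaseFolding.txt entries discussed in this thread.
char32_t toyFoldCase(char32_t ch)
{
    switch (ch) {
    case U'\u00B5': return U'\u03BC';  // MICRO SIGN -> GREEK SMALL LETTER MU
    case U'\u017F': return U's';       // LATIN SMALL LETTER LONG S -> s
    default:        return toyToLower(ch);
    }
}

// Compares two strings code point by code point after applying `map`,
// in the same spirit as the foldCase() loop quoted above.
bool equalsWith(char32_t (*map)(char32_t),
                const std::u32string &a, const std::u32string &b)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (map(a[i]) != map(b[i]))
            return false;
    }
    return true;
}
```

With these helpers, equalsWith(toyFoldCase, ...) treats U+00B5 and U+03BC as equal, while equalsWith(toyToLower, ...) does not, because lowercasing leaves both code points unchanged.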

The W3C says:

    If your application or specification needs to consider case folding,
    here are some general recommendations to follow:

   1. *Consider Unicode Normalization* in addition to case folding. If you
   mean to find text that is semantically equal, you may need to normalize the
   text beyond just case folding it. Note that Unicode Normalization does
   *not* include case folding: these are separate operations.
   2. *Always use the language (locale) when case folding.* Some languages
   have specific case folding idiosyncrasies. In particular, if you do not
   pass the language or locale to your case folding routine, you may get a
   default locale which might be Turkish (for example).
      1. *Specify US English or "empty" (root) locale if you need
      consistent (internal, not for presentation) comparisons* If your
      comparison should be the same regardless of language or locale,
      always pass the US English or empty (root, C, POSIX, null) locale to
      your case-folding function. This does not disable caseless comparison
      or case folding. It merely limits the effects to a well-known set of
      rules.
      2. *Use case-less compare functions if provided* If your application
      is comparing internal values for equality (as opposed to sorting lists or
      comparing values linguistically), you should use a consistent caseless
      compare function. For example, Java's java.lang.String class provides
      an equalsIgnoreCase function that is more convenient than using
      toLowerCase(#locale)... which provides consistent results across
      languages, although not consistent with the rules for any given language.
   3. *For presentation, normalize case in a language sensitive manner* The
   rules that one language uses for case will not necessarily match those used
   by another language. For example, the French novel by Marcel Proust *À
   la recherche du temps perdu* contains only the single, introductory
   capital letter, whereas the English title uses "titlecase": *In Search
   of Lost Time*. Code that assumes one form of capitalization is
   appropriate for another language may cause problems.

In Qt, we usually don't have to do 1 ourselves, as we leave that
responsibility to the caller; that said, we always assume both
texts are in normalized form (usually composed).
2 relates to "full" case folding and clearly describes *why* it needs a
context. IIRC, it is similar to QLocale::toLower(). And we have the even more
powerful (but more expensive) QCollator & Co.
3 is out of scope; our title casing algorithm doesn't respect the locale
(actually, I doubt anyone ever uses our title casing algorithm, so
it is probably dead weight).
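
As a rough illustration of why point 2 needs a context, here is a small sketch in plain C++ (again not Qt code). lowerChar() and lowerForLanguage() are hypothetical names, the language tag is a bare string, and only ASCII plus the Turkish/Azerbaijani dotless i rule is handled.

```cpp
#include <string>

// Lowercases one code point. For the "tr" and "az" language tags,
// capital I maps to U+0131 (dotless i) instead of ASCII 'i'.
// (Assumption: everything outside ASCII is passed through unchanged.)
char32_t lowerChar(char32_t ch, const std::string &language)
{
    const bool turkic = (language == "tr" || language == "az");
    if (turkic && ch == U'I')
        return U'\u0131';              // LATIN SMALL LETTER DOTLESS I
    if (ch >= U'A' && ch <= U'Z')
        return ch + (U'a' - U'A');
    return ch;
}

// Lowercases a whole string under the given language tag; an empty tag
// stands in for the "root" locale mentioned in the W3C text.
std::u32string lowerForLanguage(const std::u32string &s,
                                const std::string &language)
{
    std::u32string out;
    out.reserve(s.size());
    for (char32_t ch : s)
        out.push_back(lowerChar(ch, language));
    return out;
}
```

Lowercasing "TITLE" with the empty tag gives "title", while the "tr" tag gives "tıtle", so two strings that match under one context may not match under another.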

Regards,
Konstantin