I can’t believe nobody has done this list yet. I mean, there is one about names, one about time and many others on other topics, but not one about languages yet (except one honorable mention that comes close). So, here’s my attempt to list all the misconceptions and prejudices I’ve come across in the course of my long and illustrious career in software localisation and language technology. Enjoy – and send me your own!

  • 2xsaiko@discuss.tchncs.de
    13 hours ago

    “Segmenting a text into sentences is as easy as splitting on end-of-sentence punctuation.”

    Is there a language this actually isn’t true for? It seems oddly specific like a lot of the others and I don’t think I know of one that does this. Except maybe some wack ass conlangs of course.

    • Giooschi@lemmy.world
      10 hours ago

      Even in English this isn’t true. For example, dots can appear inside a sentence for multiple reasons (a decimal number, an abbreviation, a quotation, an ellipsis, etc.), so splitting on them would break one sentence into more than one piece.

    • TehPers@beehaw.org
      12 hours ago

      English. I can go to the store and buy a sandwich for $8.99 all in one sentence, but splitting it on periods gives you two sentences.

        • Björn Lindström@social.sdfeu.org
          8 hours ago

          @2xsaiko @TehPers there are other examples too. E.g. Thai has no spaces between words, but it does have spaces between phrases/sentences. However, the spaces between phrases involve style choices, similar to the comma in English and many other Latin-script writing systems. Also, Thai may have spaces around abbreviations and special characters.

          I’m quite familiar with Thai so that’s close at hand but I guess it’s the same in a lot of other writing systems based on Brahmic scripts.
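
The cases raised in this thread can be sketched in a few lines of Python. This is only an illustration, not a real sentence segmenter: `naive_split` is a hypothetical helper written for this example, the abbreviation sentence is made up, and the Thai string is an illustrative phrase (roughly "I go to the market, buy rice") rather than anything quoted above.

```python
import re
from typing import List


def naive_split(text: str) -> List[str]:
    """Hypothetical helper: split on '.', '!' or '?' and drop empty pieces."""
    pieces = re.split(r"[.!?]", text)
    return [p.strip() for p in pieces if p.strip()]


# One English sentence with a price: the decimal point is treated as a boundary.
english = "I can go to the store and buy a sandwich for $8.99."
print(naive_split(english))  # two pieces instead of one sentence

# One sentence with abbreviations (made-up example): four pieces.
abbrev = "Dr. Smith arrived at 5 p.m. sharp."
print(naive_split(abbrev))

# Illustrative Thai phrase: no end-of-sentence punctuation at all, so the
# naive splitter returns it unchanged, and splitting on whitespace yields
# phrase-like chunks rather than individual words.
thai = "ผมไปตลาด ซื้อข้าว"
print(naive_split(thai))
print(thai.split())
```

Practical segmenters usually rely on language-specific rules or dictionaries (for example the sentence-boundary rules in Unicode UAX #29, as implemented by ICU) instead of a single punctuation pattern.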