• notabot
    4 months ago

    There’s a difference between ‘processing’ the text and ‘parsing’ it. The processing described in the section you posted is fine, and you can manage a similar level of processing on HTML. The tricky/impossible bit is parsing the languages. For instance, you can’t write a regex that’ll reliably find the subject, object and verb in any English sentence, and you can’t write a regex that’ll break an HTML document down into a hierarchy of tags, as regexes don’t support counting depth of recursion. HTML isn’t a regular language anyway, so a parser limited to regular languages can’t reliably handle it.
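
    To make the depth-counting point concrete, here is a minimal sketch in Python. The sample string and the `DepthTracker` class are made up for illustration; only `re` and `html.parser` from the standard library are used. The regex pairs the outer opening tag with the inner closing tag, while a parser that keeps a stack can track nesting.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input: one <div> nested inside another.
html = "<div>outer <div>inner</div> tail</div>"

# A non-greedy regex stops at the first closing tag it finds, so it
# pairs the outer <div> with the inner </div>.
match = re.search(r"<div>(.*?)</div>", html)
regex_result = match.group(1)


class DepthTracker(HTMLParser):
    """Illustrative helper: records (depth, tag) for each opening tag."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, innermost last
        self.seen = []   # (depth, tag) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.seen.append((len(self.stack), tag))

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


tracker = DepthTracker()
tracker.feed(html)
tracker.close()

print(regex_result)   # the regex capture, cut off at the inner </div>
print(tracker.seen)   # nesting depth recorded for each opening tag
```

    The stack is the part a plain regular expression has no equivalent for: it is what remembers how many tags are still open.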

    • Blue_Morpho@lemmy.world
      4 months ago

      For instance, you can’t write a regex that’ll reliably find the subject, object and verb in any English sentence

      Identifying parts of speech isn’t a requirement of the word ‘parse’. That’s the linguistic definition. In computer science, identifying tokens is parsing.

      https://en.m.wikipedia.org/wiki/Parsing

      • notabot
        4 months ago

        That’s certainly one level of parsing, and sometimes all you need, but as the article you posted says, it more usually refers to generating a parse tree. Doing that for a natural language isn’t happening with a regex.
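
        As a rough illustration of the two levels, the sketch below uses a toy bracketed notation rather than natural language (an assumption made purely to keep it short). The `TOKEN` pattern and the `tokenize` and `parse` helpers are hypothetical names. The regex produces a flat list of tokens; turning that list into a tree takes a recursive step that the regex alone does not provide.

```python
import re

# Token pattern for the toy notation: "(", ")", or a run of other characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def tokenize(text):
    """Level one: split the input into a flat list of tokens."""
    return TOKEN.findall(text)


def parse(tokens):
    """Level two: build a nested list (a parse tree) by recursive descent.

    Sketch only: assumes balanced brackets and raises IndexError otherwise.
    """
    pos = 0

    def parse_node():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            children = []
            while tokens[pos] != ")":
                children.append(parse_node())
            pos += 1  # consume the closing ")"
            return children
        return tok

    return parse_node()


tokens = tokenize("(a (b c) d)")
tree = parse(tokens)

print(tokens)  # flat: brackets and words in order
print(tree)    # nested: the inner group becomes a sub-list
```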

        • uranibaba@lemmy.world
          4 months ago

          Thanks for all the explaining. Ever since I first saw the Stack Overflow post, I’ve wondered why you can’t parse HTML with a regex, when you can take any HTML code you find and write an expression that works against that particular set of data.

          I never understood the word ‘parse’ to mean understanding and building a structure based on any input.