Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike

Bart Jongejan, Hercules Dalianis

    Publication: Contribution to book/anthology/report, conference contribution in proceedings, research, peer-reviewed

    58 citations (Scopus)
    5354 Downloads (Pure)

    Abstract

    We propose a method to automatically train lemmatization rules that handle prefix, infix and suffix changes to generate the lemma from the full form of a word. We explain how the lemmatization rules are created and how the lemmatizer works. We trained this lemmatizer on full form-lemma pairs for Danish, Dutch, English, German, Greek, Icelandic, Norwegian, Polish, Slovene and Swedish.
    We obtained significant improvements of 24 percent for Polish, 2.3 percent for Dutch, 1.5 percent for English, 1.2 percent for German and 1.0 percent for Swedish compared to a suffix-only lemmatizer.
    Results for Icelandic deteriorated by 1.9 percent. We also made an observation regarding the number of produced lemmatization rules as a function of the number of training pairs.
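
As a rough illustration of what a rule covering prefix, infix and suffix changes might look like, the Python sketch below uses a hypothetical notation in which `*` matches any run of characters. The rule format, the German example rules and the function names are assumptions made for illustration; they are not taken from the paper, and the sketch does not cover how rules are trained.

```python
import re


def compile_rule(pattern, replacement):
    """Turn a wildcard pattern such as '*ge*t' into a compiled regex.

    Each '*' becomes a capture group; literal parts are escaped.
    The notation is hypothetical, chosen only for this sketch.
    """
    parts = pattern.split("*")
    regex = "(.*)".join(re.escape(part) for part in parts)
    return re.compile(regex), replacement


def apply_rule(rule, word):
    """Return the rewritten word, or None if the rule does not match."""
    regex, replacement = rule
    match = regex.fullmatch(word)
    if match is None:
        return None
    groups = iter(match.groups())
    # Each '*' in the replacement is filled with the next captured group.
    return "".join(next(groups) if ch == "*" else ch for ch in replacement)


def lemmatize(word, rules):
    """Apply the first matching rule; leave the word unchanged otherwise."""
    for rule in rules:
        lemma = apply_rule(rule, word)
        if lemma is not None:
            return lemma
    return word


# Hand-written example rules (assumed, not learned from data).
RULES = [
    compile_rule("*ge*t", "**en"),  # removes 'ge' inside or at the start, swaps the ending
    compile_rule("*te", "*en"),     # plain suffix change
]

for form in ["gemacht", "aufgemacht", "machte", "Haus"]:
    print(form, "->", lemmatize(form, RULES))
```

The first rule handles both a word-initial `ge` ("gemacht") and a word-internal one ("aufgemacht"), which a rule restricted to suffixes cannot express.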
    Original language: English
    Title: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP
    Number of pages: 9
    Volume: 1
    Publisher: Association for Computational Linguistics
    Publication date: 2009
    Pages: 145-153
    ISBN (Print): 978-1-932432-61-9
    ISBN (Electronic): 1-932432-61-2
    Status: Published - 2009
    Event: ACL-IJCNLP 2009 - Singapore, Singapore
    Duration: 2 Aug 2009 to 7 Aug 2009
    Conference number: 47

    Conference

    Conference: ACL-IJCNLP 2009
    Number: 47
    Country/Territory: Singapore
    City: Singapore
    Period: 02/08/2009 to 07/08/2009

    Keywords

    • Faculty of Humanities
    • lemmatization, morphology, affix
