apertium/apertium-weights-learner

Name: apertium-weights-learner

Owner: Apertium

Created: 2018-03-28 09:47:45.0

Updated: 2018-03-28 10:00:34.0

Pushed: 2018-04-17 15:23:06.0

Size: 59

Language: Python


README

apertium-weights-learner

This is a Python 3 script that can be used for transfer weights training (see http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Weighted_transfer_rules). For now, it only extracts fully lexicalized patterns (i.e., sequences of tokens with lemmas and full sets of tags).

Prerequisites

To run this version of transfer weights training for a given language pair, you need:

Prepare language model

In order to run the training, you need to make a language model for your target language.
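The learner treats the language model as a black box that assigns scores to target-language strings, so that the translation variants produced by competing rules can be compared. As a toy sketch of that idea, here is an add-one-smoothed bigram model in pure Python (the actual toolkit, model order, and file format are whatever your setup uses):

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def score(sentence, unigrams, bigrams):
    """Add-one-smoothed bigram log10 probability of a sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    vocab = len(unigrams)
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        logp += math.log10((bigrams[(prev, cur)] + 1) /
                           (unigrams[prev] + vocab))
    return logp

corpus = ["el software nuevo", "el nuevo software", "el nuevo software es bueno"]
uni, bi = train_bigram_lm(corpus)
# The variant seen more often in the corpus scores higher:
print(score("el nuevo software", uni, bi) > score("el software nuevo", uni, bi))  # True
```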

Run training
Sample run

In order to ensure that everything works fine, you may perform a sample run using the prepared corpus:

The sample file new-software-sample.txt contains three selected lines with 'new software' and 'this new software' patterns, each of which triggers a pair of ambiguous rules from the apertium-en-es.en-es.t1x file, namely ['adj-nom', 'adj-nom-ns'] and ['det-adj-nom', 'det-adj-nom-ns']. Informally speaking, these rules transfer sequences of (adjective, noun) and (determiner, adjective, noun). The first rule in each ambiguous pair specifies that the translations of the adjective and the noun are to be swapped, which is usual for Spanish; hence these rules are listed before their '-ns' counterparts, marking them as the default rules. The second rule in each pair specifies that the translations of the adjective and the noun are not to be swapped, which sometimes happens and depends on the lexical units involved.
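The difference between each rule and its '-ns' counterpart can be illustrated with a toy sketch (the real rules are t1x transfer rules, not Python; the function names here just mirror the rule ids):

```python
def adj_nom(adj, noun):
    """Default rule: swap adjective and noun, as is usual in Spanish."""
    return f"{noun} {adj}"

def adj_nom_ns(adj, noun):
    """'-ns' (no-swap) variant: keep the source order."""
    return f"{adj} {noun}"

print(adj_nom("nuevo", "software"))     # software nuevo
print(adj_nom_ns("nuevo", "software"))  # nuevo software
```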

Without pattern generalization, the contents of the unpruned w1x file should look like the following:

<?xml version='1.0' encoding='UTF-8'?>
<transfer-weights>
<rule-group>
<rule comment="REGLA: ADJ NOM" id="adj-nom" md5="72e0f329e4cb29910163fa9c9d617ec4">
  <pattern weight="0.2940047506474463">
    <pattern-item lemma="new" tags="adj.sint"/>
    <pattern-item lemma="software" tags="n.sg"/>
  </pattern>
</rule>
<rule comment="REGLA: ADJ NOM no-swap-version" id="adj-nom-ns" md5="7df4382f378bae45d951c79e287a31e6">
  <pattern weight="1.7059952493525534">
    <pattern-item lemma="new" tags="adj.sint"/>
    <pattern-item lemma="software" tags="n.sg"/>
  </pattern>
</rule>
</rule-group>
<rule-group>
<rule comment="REGLA: DET ADJ NOM" id="det-adj-nom" md5="897a67e4ffadec9b7fd515ce0a8d453b">
  <pattern weight="0.262703645221423">
    <pattern-item lemma="its" tags="det.pos.sp"/>
    <pattern-item lemma="own" tags="adj"/>
    <pattern-item lemma="code" tags="n.sg"/>
  </pattern>
  <pattern weight="0.05124922803710481">
    <pattern-item lemma="this" tags="det.dem.sg"/>
    <pattern-item lemma="new" tags="adj.sint"/>
    <pattern-item lemma="software" tags="n.sg"/>
  </pattern>
</rule>
<rule comment="REGLA: DET ADJ NOM no-swap-version" id="det-adj-nom-ns" md5="13f1c5ed0615ae8f9d3142aed7a3855f">
  <pattern weight="0.737296354778577">
    <pattern-item lemma="its" tags="det.pos.sp"/>
    <pattern-item lemma="own" tags="adj"/>
    <pattern-item lemma="code" tags="n.sg"/>
  </pattern>
  <pattern weight="0.9487507719628953">
    <pattern-item lemma="this" tags="det.dem.sg"/>
    <pattern-item lemma="new" tags="adj.sint"/>
    <pattern-item lemma="software" tags="n.sg"/>
  </pattern>
</rule>
</rule-group>
</transfer-weights>

This means that the '-ns' version of each rule is preferred for each of these patterns, which tells the transfer module that the translations of 'new' and 'software' should not be swapped (as specified in the '-ns' versions of both rules): in Spanish the adjective 'nuevo' is usually put before the noun, even though most adjectives are put after it.
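Read this way, the transfer module's choice amounts to picking, within a rule group, the rule whose entry for the matched pattern carries the largest weight. A minimal sketch of that selection, using the weights for the 'new software' pattern from the w1x file above (a simplified reading, assuming higher weight means preferred):

```python
# Weights for the 'new software' pattern, taken from the w1x file above.
weights = {
    "adj-nom": 0.2940047506474463,
    "adj-nom-ns": 1.7059952493525534,
}

def pick_rule(weights):
    """Return the rule id with the highest weight for a pattern."""
    return max(weights, key=weights.get)

print(pick_rule(weights))  # adj-nom-ns
```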

Generalizing the patterns

Setting the parameter generalize to yes in the config file allows the learning script to also learn partially generalized patterns, i.e., lemmas are removed from the pattern in all possible combinations, and the resulting patterns are stored with the same scores as the full pattern.
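A sketch of what "all possible combinations" of lemma removal means for a three-token pattern (the `generalize` helper here is hypothetical, not the script's actual code; an empty lemma stands for a delexicalized position):

```python
from itertools import product

def generalize(pattern):
    """Yield all variants of a pattern with each lemma kept or removed."""
    variants = []
    for keep_mask in product([True, False], repeat=len(pattern)):
        variants.append(tuple(
            (lemma if keep else "", tags)
            for (lemma, tags), keep in zip(pattern, keep_mask)
        ))
    return variants

pattern = (("this", "det.dem.sg"), ("new", "adj.sint"), ("software", "n.sg"))
for variant in generalize(pattern):
    print(variant)
# 2**3 = 8 variants, from fully lexicalized to fully delexicalized
```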

Pruning

You can also prune the obtained weights file with the prune.py script from the 'tools' folder. Pruning is the process of eliminating redundant weighted patterns, i.e., for each rule group, for each pattern that is present in more than one rule:

The idea behind the pruning process is that, in fact, we only want to weight exceptions to the default rule. The pruned weights file doesn't offer any significant speed advantage with the current implementation, but it still reduces the memory footprint at translation time, which allows weights to be learned from bigger corpora.
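Under that idea, a pattern only needs to stay in the weights file when it overrides the default rule of its group. A simplified sketch of such pruning (assuming the first rule listed is the default, as in the t1x file; this is an illustration, not prune.py's actual logic):

```python
def prune(rule_group):
    """rule_group: list of (rule_id, {pattern: weight}); first rule is default.
    Keep a pattern only in its best-scoring rule, and drop it entirely when
    the default rule wins, since the default needs no weighting."""
    default_id = rule_group[0][0]
    all_patterns = set()
    for _, patterns in rule_group:
        all_patterns.update(patterns)
    pruned = {rule_id: {} for rule_id, _ in rule_group}
    for pat in all_patterns:
        best_id = max(rule_group, key=lambda r: r[1].get(pat, float("-inf")))[0]
        if best_id != default_id:
            pruned[best_id][pat] = dict(rule_group)[best_id][pat]
    return pruned

group = [
    ("adj-nom", {("new", "software"): 0.294}),
    ("adj-nom-ns", {("new", "software"): 1.706}),
]
print(prune(group))
# {'adj-nom': {}, 'adj-nom-ns': {('new', 'software'): 1.706}}
```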

Removing generalized patterns

If you just spent 5 hours of your machine's time obtaining a weights file with generalized patterns and then suddenly realized that you also want a file without them, you can use remgen.py from the 'tools' folder to achieve exactly that.

Testing

Once the weights are obtained, their impact can be tested on a parallel corpus using the weights-test.sh script from the 'testing' folder, which contains a simple config akin to that of the weights learning script. If you want to test your weights specifically on the lines containing ambiguous chunks, first run your test corpora through the condense.py script from the 'tools' folder.
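The purpose of that condensing step, keeping only the corpus lines that contain ambiguous chunks, can be sketched as a simple filter (the predicate here is a hypothetical stand-in for condense.py's actual detection logic):

```python
def condense(lines, is_ambiguous):
    """Keep only corpus lines for which the predicate detects ambiguity."""
    return [line for line in lines if is_ambiguous(line)]

# Toy predicate: a line is 'ambiguous' if it contains a known ambiguous pattern.
ambiguous_patterns = ["new software"]
kept = condense(
    ["I like this new software", "I like cats"],
    lambda line: any(p in line for p in ambiguous_patterns),
)
print(kept)  # ['I like this new software']
```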


This work is supported by the National Institutes of Health's National Center for Advancing Translational Sciences, Grant Number U24TR002306. This work is solely the responsibility of the creators and does not necessarily represent the official views of the National Institutes of Health.