Note: If you are using a server outside of China, you should delete the two tsinghua mirrors in environment.yml (lines 3-4) and setup_env.sh (line 9) for better download speed.
The hyperparameter set is transformer_big_single_gpu.
We will use only 1 GPU.
The model will evaluate the dev loss and save the checkpoint every 1000 steps.
If the dev loss doesn’t decrease for 14500 steps, the training will stop.
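The stopping rule above can be sketched as follows. This is only an illustration, not tensor2tensor's actual implementation: the function name `early_stop_step` and its arguments are hypothetical, and it assumes the dev loss is checked once per evaluation interval.

```python
def early_stop_step(dev_losses, eval_every=1000, patience_steps=14500):
    """Return the training step at which training would stop, or None.

    dev_losses[i] is the dev loss measured at step (i + 1) * eval_every.
    Training stops at the first evaluation where the best dev loss is
    at least `patience_steps` steps old.
    """
    best_loss = float("inf")
    best_step = 0
    for i, loss in enumerate(dev_losses, start=1):
        step = i * eval_every
        if loss < best_loss:
            # New best dev loss: remember it and reset the patience window.
            best_loss, best_step = loss, step
        elif step - best_step >= patience_steps:
            return step
    return None


# Best loss is reached at step 2000; no improvement afterwards.
losses = [3.0, 2.5] + [2.6] * 20
print(early_stop_step(losses))  # 17000
```

Because the loss is only checked every 1000 steps, the first check that is at least 14500 steps past the best checkpoint falls 15000 steps after it.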
When decoding the test set, we will use a beam size of 4 and an alpha value of 1.0.
The larger the alpha value, the longer the generated translation will be.
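To see why a larger alpha favors longer translations, here is a minimal sketch. It assumes a GNMT-style length penalty, ((5 + length) / 6) ** alpha, applied by dividing the hypothesis log-probability by the penalty; we believe this matches tensor2tensor's beam search, but treat it as an assumption. The two hypotheses and their scores are made up for illustration.

```python
def length_penalty(length, alpha):
    # GNMT-style penalty (assumed): grows with length when alpha > 0.
    return ((5.0 + length) / 6.0) ** alpha


def rescore(log_prob, length, alpha):
    # Log-probabilities are negative, so dividing by a larger penalty
    # moves the score closer to zero, i.e. makes it better.
    return log_prob / length_penalty(length, alpha)


def pick_best(hyps, alpha):
    return max(hyps, key=lambda h: rescore(h["log_prob"], h["length"], alpha))


# Hypothetical beam candidates.
hyps = [
    {"name": "short", "log_prob": -4.0, "length": 5},
    {"name": "long", "log_prob": -5.5, "length": 10},
]

print(pick_best(hyps, alpha=0.0)["name"])  # short
print(pick_best(hyps, alpha=1.0)["name"])  # long
```

With alpha set to 0 the raw log-probability decides and the shorter hypothesis wins; with alpha set to 1.0 the longer hypothesis is penalized less per token and wins.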
processed indicates the version of the processed files. Here is config/processed.ru_zh.json:
It indicates that the training folder is data/raw/train.ru_zh, the dev folder is data/raw/dev.ru_zh, and the test folder is also data/raw/dev.ru_zh, i.e. we use the dev set as the test set.
The preprocessing pipeline will use byte-pair encoding (BPE) with 30000 merge operations.
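For readers unfamiliar with BPE, the sketch below shows what a "merge operation" is, using a toy word list and 3 merges instead of 30000. It is a simplified version of the standard BPE learning procedure, not the preprocessing code used by this repository; `learn_bpe` and the sample words are hypothetical.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges


words = ["low", "low", "low", "lower", "newest", "newest"]
merges = learn_bpe(words, num_merges=3)
print(merges)
```

Each merge adds one new subword symbol to the vocabulary, so the number of merge operations roughly controls the subword vocabulary size.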
step 4: decode_test: decode the test set with all combinations of (beam, alpha)
After step 4, all the decoded results will be in folder data/run/ru_zh.big.single_gpu_tmp/decode:
decode.b4_a1.0.test0.txt: the decoded BPE subwords using beam size 4 and alpha value 1.0.
decode.b4_a1.0.test0.tok: the decoded tokens when we merge the BPE subwords into whole words.
decode.b4_a1.0.test0.char: the decoded utf8 characters of decode.b4_a1.0.test0.tok after removing spaces.
bleu.b4_a1.0.test0.tok: the token level BLEU score.
bleu.b4_a1.0.test0.char: the character level BLEU score.
The reference files are in folder data/run/ru_zh.big.single_gpu_tmp/decode.
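The file names listed above appear to follow a pattern built from the beam size and alpha value. The sketch below reproduces that pattern and the .tok to .char conversion as described (removing spaces). The helper names `decode_outputs` and `tok_to_char` are hypothetical, and the naming pattern is inferred from the examples above rather than taken from the pipeline code.

```python
from itertools import product


def decode_outputs(beam, alpha, test_index=0):
    """Build the expected output file names for one (beam, alpha) pair.

    Pass alpha as a float (e.g. 1.0) so that it is formatted as "1.0".
    """
    tag = f"b{beam}_a{alpha}.test{test_index}"
    return {
        "bpe": f"decode.{tag}.txt",
        "tok": f"decode.{tag}.tok",
        "char": f"decode.{tag}.char",
        "bleu_tok": f"bleu.{tag}.tok",
        "bleu_char": f"bleu.{tag}.char",
    }


def tok_to_char(line):
    # Drop all whitespace so only the characters remain.
    return "".join(line.split())


names = decode_outputs(beam=4, alpha=1.0)
print(names["bpe"])  # decode.b4_a1.0.test0.txt

# Enumerate every (beam, alpha) combination, as the decode step does.
# The value lists here are made up.
for beam, alpha in product([4, 8], [0.6, 1.0]):
    print(decode_outputs(beam, alpha)["bleu_char"])
```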
Note
We have released the dev set on Codalab. You can submit your system outputs on Codalab to get the BLEU score on the released dev set. You can also download the dev set by registering for the competition on Codalab.
Independent Evaluation Script
Folder eval contains the evaluation scripts to calculate the character-level BLEU score:
cd eval
python bleu.py hyp.txt ref.txt
Here, hyp.txt and ref.txt can be either normal Chinese (i.e. without spaces between characters) or character-split Chinese.
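Both input styles can be scored the same way because they reduce to the same character sequence once whitespace is dropped. The sketch below illustrates that normalization; it is not the actual code of bleu.py, and the helper name `to_chars` is hypothetical.

```python
def to_chars(line):
    """Split a line into characters, ignoring any whitespace."""
    return [ch for ch in line if not ch.isspace()]


normal = "我爱你"
char_split = "我 爱 你"

print(to_chars(normal))
print(to_chars(normal) == to_chars(char_split))  # True
```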
Baseline code for WMT 2021 Triangular MT
Updated on 04/07/2021.
The baseline code for the shared task Triangular MT: Using English to improve Russian-to-Chinese machine translation.
NOTE
All scripts should be run from the root folder.
Requirements
A Linux machine with a GPU and CUDA >= 10.0 installed
Setup
Install miniconda on your machine.
Run setup_env.sh in interactive mode.
Registration
To participate, please register for the shared task on Codalab.
Link to Codalab website.
Detailed Configuration
We will use the toolkit tensor2tensor to train a Transformer based NMT system. config/run.ru_zh.big.single_gpu.json lists all the configurations.
Train and Decode
To train a Russian to Chinese NMT system:
1 is the start step and 4 is the end step.
See ‘example.sh’ for detailed examples of running the evaluation script.