Rename BertTransformer to EffectiveTransformer. (#12)
Rename the Python API from bert_transformer to effective_transformer.
Rebuild the Effective Transformer Python shared library (.so). Built with gcc7, cuda10 and tf1.15.3.
Co-authored-by: Cheng CHEN chencheng.kit@bytedance.com
Effective Transformer
Effective Transformer is built on top of NVIDIA's open-source project FasterTransformer and adds many advanced optimizations. Our experiments show that Effective Transformer can significantly reduce execution time and memory consumption, especially for large batch sizes.
Running BERT without Padding
When using BERT to encode a batch of input sequences, we usually treat the input batch as a matrix whose number of columns equals the maximum length of all sequences. NVIDIA FasterTransformer processes batches in which all sequences have roughly the same length very efficiently. However, if the sequence lengths within a batch vary a lot, padding them to the same length wastes a large amount of both memory and computation.
Consider the following case:
This input includes 3 sequences and the maximum length is 5. If we simply treat it as a 3x5 matrix, only 7 out of 15 values are meaningful.
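The waste in this case can be checked with a few lines of NumPy. This is only a sketch: the individual sequence lengths (1, 5, 1) are hypothetical, chosen so that they match the counts stated above (3 sequences, maximum length 5, 7 valid values).

```python
import numpy as np

# Hypothetical lengths that match the counts in the text:
# 3 sequences, maximum length 5, 7 valid values in total.
seq_lens = np.array([1, 5, 1])
max_len = int(seq_lens.max())

# mask[i, j] == 1 when position j of sequence i holds a real token.
mask = (np.arange(max_len)[None, :] < seq_lens[:, None]).astype(np.int32)

valid = int(mask.sum())      # number of meaningful values
total = int(mask.size)       # size of the padded 3x5 matrix
padding_ratio = 1.0 - valid / total

print(valid, total, round(padding_ratio, 3))
```

With these lengths, more than half of the padded matrix holds padding values.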
In Effective Transformer, we still take the input batch as a padded matrix, but padding values are dynamically removed and restored at different calculation stages.
By calculating the prefix sum of the input mask matrix, we can access real inputs in each sequence in a matrix with no padding values. The following figure illustrates how to access valid inputs and dynamically remove and restore padding values during the calculation. All valid inputs are colored in green while padding values are colored in gray.
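The prefix-sum idea can be sketched in NumPy as follows. This is an illustration of the technique, not the project's CUDA implementation; the function names `build_offsets`, `remove_padding` and `restore_padding`, the sequence lengths and the hidden size of 4 are all assumptions made for this example.

```python
import numpy as np

def build_offsets(mask):
    """For each valid token, return its flat index in the padded batch."""
    flat = mask.reshape(-1)
    # Inclusive prefix sum: at a valid position p, prefix[p] - 1 is the
    # row that token occupies in the packed (padding-free) matrix.
    prefix = np.cumsum(flat)
    valid_pos = np.nonzero(flat)[0]
    offsets = np.empty(int(prefix[-1]), dtype=np.int64)
    offsets[prefix[valid_pos] - 1] = valid_pos
    return offsets

def remove_padding(padded, offsets):
    # padded: [batch, max_len, hidden] -> packed: [valid_count, hidden]
    hidden = padded.shape[-1]
    return padded.reshape(-1, hidden)[offsets]

def restore_padding(packed, offsets, batch, max_len):
    # Scatter packed rows back to their padded positions; padding stays zero.
    hidden = packed.shape[-1]
    out = np.zeros((batch * max_len, hidden), dtype=packed.dtype)
    out[offsets] = packed
    return out.reshape(batch, max_len, hidden)

# Hypothetical batch: 3 sequences with lengths 1, 5 and 1, hidden size 4.
seq_lens = np.array([1, 5, 1])
batch, max_len, hidden = 3, 5, 4
mask = (np.arange(max_len)[None, :] < seq_lens[:, None]).astype(np.int32)

rng = np.random.default_rng(0)
padded = rng.standard_normal((batch, max_len, hidden)) * mask[..., None]

offsets = build_offsets(mask)
packed = remove_padding(padded, offsets)
restored = restore_padding(packed, offsets, batch, max_len)
```

Stages that act on each token independently can work on `packed`, while stages that need the batch and sequence layout can work on the restored tensor.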
Environment requirements
Features
Performance
BERT-Base, layers=12, head_num=12, hidden size per head=64 (768 in total)
Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
sequence length generated by
Tesla V100, float16, maximum sequence length=32, average sequence length≈20
Tesla V100, float16, maximum sequence length=64, average sequence length≈40
Tesla V100, float32, maximum sequence length=64, average sequence length≈40
Tesla T4, float16, maximum sequence length=32, average sequence length≈20
Tesla T4, float16, maximum sequence length=64, average sequence length≈40
Run demo
Using the prebuilt Python package requires python3.5+, tensorflow1.15.x and cuda10.0. It was tested on debian9.
Build from source
TF_PATH : path to libtensorflow_framework.so
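If you are unsure where libtensorflow_framework.so lives, a small standard-library helper can search for it. This is a sketch only: `find_tf_framework_dirs` is a hypothetical helper written for this example, the search root is whatever directory you choose (for instance your site-packages directory), and whether TF_PATH expects the containing directory or the file itself should be checked against the build scripts. With TensorFlow installed, `tf.sysconfig.get_lib()` also reports the library directory.

```python
import glob
import os

def find_tf_framework_dirs(search_root):
    """Return sorted directories under search_root that contain
    a file named libtensorflow_framework.so (with any version suffix)."""
    pattern = os.path.join(search_root, "**", "libtensorflow_framework.so*")
    hits = glob.glob(pattern, recursive=True)
    return sorted({os.path.dirname(hit) for hit in hits})
```

For example, `find_tf_framework_dirs("/path/to/site-packages")` returns a list of candidate directories, where the path shown is a placeholder for your own environment.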