# Harmonizing Visual Text Comprehension and Generation

```bibtex
@article{zhao2024harmonizing,
  title={Harmonizing Visual Text Comprehension and Generation},
  author={Zhao, Zhen and Tang, Jingqun and Wu, Binghong and Lin, Chunhui and Wei, Shu and Liu, Hao and Tan, Xin and Zhang, Zhizhong and Huang, Can and Xie, Yuan},
  journal={arXiv preprint arXiv:2407.16364},
  year={2024}
}
```
## Environment

**Step 1:** Set up the environment.

Some packages listed in `requirements.txt`, such as `mmcv` and `flash_attn`, may need to be installed manually.
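A minimal setup sketch, assuming conda and Python 3.10 (the environment name is arbitrary, and the exact `mmcv`/`flash_attn` versions depend on your torch/CUDA combination):

```shell
conda create -n textharmony python=3.10 -y
conda activate textharmony
pip install -r requirements.txt

# If mmcv or flash_attn fail to build from requirements.txt,
# install them separately, for example:
pip install -U openmim && mim install mmcv
pip install flash-attn --no-build-isolation
```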
**Step 2:** Download the pretraining weights.

**Step 3:** Download the model weights of TextHarmony.
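If the weights are hosted on the Hugging Face Hub, they can be fetched with the Hub CLI; the repository id below is a placeholder, not the real one:

```shell
# Hypothetical repo id -- substitute the one given in the TextHarmony repository.
huggingface-cli download <org>/TextHarmony --local-dir ./checkpoints/TextHarmony
```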
## Inference

**Step 1:** Modify `load_from`, `llm_model_path`, `encoder_model_path`, and `pretrained_model_name_or_path` in `example_inference.yaml`.
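The relevant entries in `example_inference.yaml` might look like the following; the paths are placeholders, and the real config contains additional keys that are omitted here:

```yaml
# Placeholder paths -- point these at your local checkpoints.
load_from: ./checkpoints/TextHarmony/model.pth
llm_model_path: ./checkpoints/llm
encoder_model_path: ./checkpoints/visual_encoder
pretrained_model_name_or_path: ./checkpoints/pretrained
```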
**Step 2:** Run the following command:
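The inference command is not preserved in this extract; a typical invocation, assuming a hypothetical entry point named `inference.py`, would be:

```shell
# Hypothetical script name -- check the repository for the actual entry point.
python inference.py --config example_inference.yaml
```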
## Evaluation

### Image comprehension

**Step 1:** Modify `data_root` and `data_path` in `896-moe-eval.yaml`. The structure of `data_path` should be as follows:
**Step 2:** Run the following command:
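The evaluation command is likewise missing from this extract; assuming a hypothetical script named `evaluate.py`, the invocation would look like:

```shell
# Hypothetical script name -- check the repository for the actual entry point.
python evaluate.py --config 896-moe-eval.yaml
```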
### Image generation

**Step 1:** Download AnyText-Benchmark.

**Step 2:** Generate the target images.

**Step 3:** Calculate the results.
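For scoring, prefer the evaluation scripts shipped with AnyText-Benchmark, which report OCR-based metrics on the rendered text. As a hedged sketch of one such metric, the normalized edit distance (NED) between an OCR prediction and the ground-truth string can be computed in pure Python (no repo dependencies assumed):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein distance via a single-row dynamic program.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def ned(pred: str, gt: str) -> float:
    # Normalized edit distance in [0, 1]; lower is better.
    if not pred and not gt:
        return 0.0
    return edit_distance(pred, gt) / max(len(pred), len(gt))
```

For example, `ned("hello", "hello")` is `0.0` and `edit_distance("kitten", "sitting")` is `3`.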
## Training

## Acknowledgment

We thank the authors of MM-Interleaved, TextDiffuser, AnyText, and LoRAMoE for their great work.

## Citation

If you find TextHarmony useful, please cite the paper using the BibTeX entry at the top of this README.