
Controlling Language Difficulty in Dialogues with Linguistic Features


Dilaprix: A metric for quantifying and regulating language difficulty in dialogues using linguistic features.

📌 Introduction

The Dialogue Language Proficiency Index (Dilaprix) is a composite metric that evaluates the linguistic complexity of dialogue utterances based on three categories of features:

  • Readability features (e.g., Flesch-Kincaid Grade Level)
  • Syntactic features (e.g., syntactic tree depth)
  • Lexical features (e.g., simple word ratio)

Dilaprix enables fine-grained control over language difficulty, which is useful for educational dialogue systems, accessibility tools, and language learning applications.

📊 Example: Utterances vs. Dilaprix Scores

| Utterance | Dilaprix |
|---|---|
| Thank you for coming, Lily. Do you like meat? | 0.08 |
| Thank you for coming, Lily. I appreciate your help in the kitchen. To start with, do you like meat? | 0.30 |
| Thank you for coming, Lily. I appreciate your help in the kitchen. To better understand your preferences, may I ask: do you like meat? | 0.55 |
| Ah, excellent, Lily, for you to grace us with your presence in the kitchen. Now, to delve into a gastronomical inquiry: do you have an affinity for meat? | 0.81 |

🔍 Lower Dilaprix = simpler language; Higher Dilaprix = more complex language


🧠 Linguistic Features

Dilaprix integrates the following 11 features:

Readability

  • Flesch Reading Ease ($F_R$): Higher = easier to read.
  • Flesch-Kincaid Grade Level ($F_G$): US grade level estimate.
  • Gunning Fog Index ($G_F$): Based on sentence length and complex words (≥3 syllables).
  • Coleman-Liau Index ($C_L$): Uses character counts instead of syllables.
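
The four readability scores above follow well-known published formulas. The sketch below computes them from raw counts; it is illustrative only and is not taken from this package. How words, sentences, and syllables are counted is an assumption left to the caller, and the function names are hypothetical.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog Index; complex_words = words with three or more syllables."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


def coleman_liau(letters, words, sentences):
    """Coleman-Liau Index, based on letters and sentences per 100 words."""
    letters_per_100 = letters / words * 100.0
    sentences_per_100 = sentences / words * 100.0
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


# Hand-counted values for "Hello! How are you today?" (counts are an assumption
# of this example: 5 words, 2 sentences, 7 syllables, 19 letters, 0 complex words).
print(flesch_reading_ease(5, 2, 7))
print(flesch_kincaid_grade(5, 2, 7))
print(gunning_fog(5, 2, 0))
print(coleman_liau(19, 5, 2))
```

Short utterances can produce values outside the usual ranges (for example, a negative Coleman-Liau score), which is one reason a bounded normalization step is useful downstream.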

Syntax

  • Tree Depth ($T_D$): Max depth of syntactic parse trees.
  • Leaf Node Count ($L_N$): Max number of leaf nodes in any sentence.
  • Non-terminal Diversity ($N_D$): Unique non-terminal tags in parse trees.
  • Subtree Complexity ($S_C$): Max number of sub-trees per sentence.
  • Utterance Length ($U_L$): Total tokens.
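
To make the syntactic features concrete, here is a minimal sketch that derives them from bracketed constituency parses (Penn Treebank style strings). It does not run a parser; the parses are supplied by hand. Several choices are assumptions rather than the package's definitions: depth counts labeled nodes on the longest root-to-leaf path, every labeled node (including part-of-speech tags) counts as a non-terminal and as a sub-tree root, and all helper names are hypothetical.

```python
def parse_tree(bracketed):
    """Parse a bracketed string into nested (label, children) tuples."""
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf (word)
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def walk(node):
    """Yield every labeled node in the tree, root first."""
    yield node
    for child in node[1]:
        if not isinstance(child, str):
            yield from walk(child)


def leaves(node):
    """Count the words under a node."""
    return sum(1 if isinstance(c, str) else leaves(c) for c in node[1])


def depth(node):
    """Number of labeled nodes on the longest root-to-leaf path."""
    return 1 + max((depth(c) for c in node[1] if not isinstance(c, str)), default=0)


def syntax_features(bracketed_sentences):
    """Per-utterance features: maxima over sentences, plus total token count."""
    trees = [parse_tree(s) for s in bracketed_sentences]
    return {
        "tree_depth": max(depth(t) for t in trees),
        "leaf_node_count": max(leaves(t) for t in trees),
        "non_terminal_diversity": max(len({n[0] for n in walk(t)}) for t in trees),
        "subtree_complexity": max(sum(1 for _ in walk(t)) for t in trees),
        "utterance_length": sum(leaves(t) for t in trees),
    }


example = [
    "(INTJ (UH Hello))",
    "(S (NP (PRP I)) (VP (VBP like) (NP (NN meat))))",
]
print(syntax_features(example))
```

Different parsers and tree libraries count depth and sub-trees slightly differently, so absolute values from this sketch should not be compared against values produced by the package.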

Lexicon

  • Simple Word Ratio ($S_W$): Proportion of words in a simple vocabulary list.
  • Intermediate Word Ratio ($I_W$): Proportion in an intermediate vocabulary list.
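
A word-ratio feature is a simple membership count over tokens. The sketch below illustrates the idea with two tiny, made-up vocabulary lists; the real lists used by the package are not shown in this README. Treating the intermediate ratio as cumulative (simple plus intermediate words) follows the wording of the prompt example later in this document and is otherwise an assumption, as is the regex tokenizer.

```python
import re

# Hypothetical vocabulary lists, for illustration only.
SIMPLE_WORDS = {"thank", "you", "for", "do", "like", "meat"}
INTERMEDIATE_WORDS = {"appreciate", "help", "kitchen"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())


def word_ratio(text, vocabulary):
    """Fraction of tokens in `text` that appear in `vocabulary`."""
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(word in vocabulary for word in words) / len(words)


utterance = "I appreciate your help"
simple_ratio = word_ratio(utterance, SIMPLE_WORDS)
intermediate_ratio = word_ratio(utterance, SIMPLE_WORDS | INTERMEDIATE_WORDS)
print(simple_ratio, intermediate_ratio)
```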

📐 Dilaprix Formula

The final score is computed as:

Where:

  • $\mathcal{X} = \{F_R, F_G, G_F, C_L, T_D, L_N, N_D, S_C, U_L, S_W, I_W\}$
  • $\mathcal{X}' = \{F_R, S_W, I_W\}$: features inversely related to difficulty
  • $\alpha_i$, $\beta_i$: 5th and 95th percentiles from a textbook dialogue corpus (used for robust normalization)
  • $\text{clamp}(v, 0, 1)$: ensures output stays in $[0, 1]$
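
The definitions above describe percentile-based normalization with clamping and a set of inverted features. One way these pieces could fit together is sketched below. It assumes that each feature is scaled as $(x_i - \alpha_i) / (\beta_i - \alpha_i)$, clamped to $[0, 1]$, flipped to $1 - s$ for features in $\mathcal{X}'$, and that the per-feature scores are then averaged with equal weight. The averaging and flipping steps are assumptions, not confirmed by this README, and the percentile values are invented for illustration.

```python
# Features where a higher raw value means easier language (the set X').
INVERSE_FEATURES = {"F_R", "S_W", "I_W"}


def clamp(value, low=0.0, high=1.0):
    """Restrict `value` to the closed interval [low, high]."""
    return max(low, min(high, value))


def dilaprix_sketch(features, percentiles):
    """Average of clamped, percentile-normalized feature scores.

    features:    {name: raw value}
    percentiles: {name: (alpha, beta)}, the 5th and 95th percentiles
    """
    scores = []
    for name, value in features.items():
        alpha, beta = percentiles[name]
        score = clamp((value - alpha) / (beta - alpha))
        if name in INVERSE_FEATURES:
            score = 1.0 - score  # assumed inversion for "higher = easier" features
        scores.append(score)
    return sum(scores) / len(scores)


# Invented percentiles and raw values, two features only.
percentiles = {"T_D": (4.0, 16.0), "F_R": (40.0, 100.0)}
features = {"T_D": 10.0, "F_R": 90.0}
print(round(dilaprix_sketch(features, percentiles), 3))
```

Using percentiles rather than the corpus minimum and maximum keeps a few extreme utterances from stretching the scale, and clamping handles inputs that fall outside the percentile range.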

🚀 Get Started

Installation

cd language_difficulty_control
pip install -e .

Usage

from language_difficulty_control import LinguisticAnalyzer

analyzer = LinguisticAnalyzer()
features = analyzer("Hello! How are you today?")
dilaprix = features["dilaprix"]
print(f"Dilaprix: {dilaprix:.2f}")

Output

Dilaprix: 0.06

Language Proficiency Controlled Dialogue Prompt Example

[flesch_reading_ease] for the Flesch-Kincaid Reading Ease;
[flesch_kincaid_grade_level] for Flesch-Kincaid Grade Level;
[gunning_fog] for the Gunning Fog Index;
[coleman_liau] for the Coleman Liau Index;
[tree_depth] The max Depth of the Constituency Parsing Trees of the sentences in your response;
[leaf_node_count] The max number of leaf nodes of the Constituency Parsing Trees of the sentences in your response;
[non_terminal_diversity] The max number of unique tags of the Constituency Parsing Trees of the sentences in your response;
[subtree_complexity] The max number of sub-trees of the Constituency Parsing Trees of the sentences in your response;
[utterance_length] the number of words in your response;
[simple_words_ratio] the ratio of simple words in your response;
[intermediate_words_ratio] the ratio of simple and intermediate words in your response.

You are given a context and dialogue tasks, and are asked to play a role to continue the following conversation naturally.

[DIALOGUE TASKS]
1. Ask Anna if she can play the piano
2. Ask Anna if she can ride a bike
[CURRENT DIALOGUE TASK]
2. Ask Anna if she can ride a bike
[CONTEXT]
Ming: Hi Anna, can you play the piano?
Anna: Yes, I can.
Your reply should consist of two parts:

1. First part should respond to the user kindly based on the context;
2. Second part should carry out the [CURRENT DIALOGUE TASK].

Additionally, your response should abide by the following linguistic features:
[flesch_reading_ease] 86.42
[flesch_kincaid_grade_level] 3.07
[gunning_fog] 3.0
[coleman_liau] 2.99
[tree_depth] 9
[leaf_node_count] 10
[non_terminal_diversity] 14
[subtree_complexity] 22
[utterance_length] 18
[simple_words_ratio] 0.8
[intermediate_words_ratio] 1.0
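
The constraint block at the end of the prompt is a fixed sequence of `[feature_name] value` lines, so it can be generated from a dictionary of target values. The helper below is a hypothetical sketch of that step, not part of the package; it reuses the feature names and the lead sentence from the example above.

```python
FEATURE_ORDER = [
    "flesch_reading_ease",
    "flesch_kincaid_grade_level",
    "gunning_fog",
    "coleman_liau",
    "tree_depth",
    "leaf_node_count",
    "non_terminal_diversity",
    "subtree_complexity",
    "utterance_length",
    "simple_words_ratio",
    "intermediate_words_ratio",
]


def format_constraints(targets):
    """Render target feature values as the constraint block of the prompt."""
    lines = [
        "Additionally, your response should abide by the following linguistic features:"
    ]
    for name in FEATURE_ORDER:
        lines.append(f"[{name}] {targets[name]}")  # KeyError if a target is missing
    return "\n".join(lines)


# Target values copied from the example prompt above.
targets = {
    "flesch_reading_ease": 86.42,
    "flesch_kincaid_grade_level": 3.07,
    "gunning_fog": 3.0,
    "coleman_liau": 2.99,
    "tree_depth": 9,
    "leaf_node_count": 10,
    "non_terminal_diversity": 14,
    "subtree_complexity": 22,
    "utterance_length": 18,
    "simple_words_ratio": 0.8,
    "intermediate_words_ratio": 1.0,
}
constraints = format_constraints(targets)
print(constraints)
```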

📚 Citation

@misc{xu2025controllinglanguagedifficultydialogues,
      title={Controlling Language Difficulty in Dialogues with Linguistic Features}, 
      author={Shuyao Xu and Wenguang Wang and Handong Gao and Wei Kang and Long Qin and Weizhi Wang},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2509.14545}, 
}

📄 License

This project is licensed under the Apache License – see the LICENSE file for details.
