Sequential Hidden Decoding

Scaling Sequence Length in Pretraining

WeChat AI, Tencent

Blog   License   Models


Scale sequence length by n× with only Embedding parameters — same Transformer, more compute per token


Updates

  • 2026-04-01: Released Sequential-Hidden-Decoding-8B-n8-Instruct — our first instruction-tuned model, outperforming Qwen3-8B-Instruct on all 8 benchmarks. Added Chat UI for interactive conversations.
  • 2026-03-10: Initial release of base models (n=2, n=4, n=8) and SGLang inference patch.

Key Idea

Prepare n independent Embedding matrices to encode the same token sequence n times, interleave the results, and feed the n×-length sequence into the same Transformer. Only the last embedding of each token computes the next-token loss, while the preceding embeddings serve as implicit reasoning steps in a continuous latent space.
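The interleaving can be sketched in plain Python. This is an illustrative toy, not the training code; the function name, the dict-based embedding tables, and the toy vectors are all hypothetical.

```python
# Toy sketch of the key idea: n independent embedding tables encode the same
# token sequence, the results are interleaved position-wise, and only every
# n-th position carries the next-token loss.

def interleave_embeddings(tokens, embed_tables):
    """Encode `tokens` once per table and interleave the results.

    tokens:       list of token ids, length T
    embed_tables: list of n mappings from token id -> embedding vector
    returns:      (sequence of length n*T, loss mask of length n*T)
    """
    n = len(embed_tables)
    sequence, loss_mask = [], []
    for tok in tokens:
        for i, table in enumerate(embed_tables):
            sequence.append(table[tok])
            # Only the last of the n embeddings predicts the next token;
            # the first n-1 act as latent "reasoning" positions.
            loss_mask.append(i == n - 1)
    return sequence, loss_mask

# Toy example: n=2, vocab of 3 tokens, 2-dim embeddings.
tables = [
    {0: [0.1, 0.0], 1: [0.2, 0.0], 2: [0.3, 0.0]},  # embedding table 1
    {0: [0.0, 0.1], 1: [0.0, 0.2], 2: [0.0, 0.3]},  # embedding table 2
]
seq, mask = interleave_embeddings([1, 2, 0], tables)
print(len(seq))                # 6 == n * T
print([int(m) for m in mask])  # [0, 1, 0, 1, 0, 1]
```

The n×-length sequence then goes through the unchanged Transformer; the backbone sees more positions (and thus more compute) per input token without any new non-embedding parameters.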


Results

Base Model

Evaluated on Qwen3-8B-Base with progressive Sequential Hidden Decoding scaling (non-thinking, base model):

| Benchmark | # Shots | 8B Baseline | 8B scale n=2 | 8B scale n=4 | 8B scale n=8 |
|---|---|---|---|---|---|
| BBH (EM) | 3-shot | 78.8 | 81.3 | 83.0 | 83.9 |
| MMLU (EM) | 5-shot | 79.8 | 80.9 | 81.9 | 82.2 |
| MBPP+ (Pass@1) | 1-shot | 66.7 | 69.4 | 68.7 | 69.4 |
| MATH (LLM-judge) | 4-shot | 56.0 | 58.2 | 60.0 | 61.1 |
| ARC-C | 25-shot | 93.9 | 94.3 | 94.4 | 94.7 |
| HellaSwag | 10-shot | 79.7 | 83.1 | 85.0 | 85.3 |
| GSM8K | 4-shot | 92.5 | 93.3 | 93.9 | 94.6 |

Instruct Model

Instruction-tuned from Sequential-Hidden-Decoding-8B-n8, compared with:

  • Qwen3-8B-Instruct: the official instruction-tuned model released by the Qwen team.
  • Qwen3-8B SFT: our own SFT baseline trained on Qwen3-8B-Base with exactly the same data as Scale-Seq n=8 SFT, but without Sequential Hidden Decoding. This isolates the effect of scale-seq from the training data.

| Benchmark | Category | Qwen3-8B-Instruct | Qwen3-8B SFT | Scale-Seq n=8 SFT |
|---|---|---|---|---|
| AIME24 | Math | 29.42 | 19.09 | 30.00 |
| AIME25 | Math | 20.58 | 17.67 | 26.25 |
| MATH500 | Math | 85.16 | 84.19 | 89.04 |
| GPQA Diamond | Reasoning | 48.79 | 50.91 | 55.25 |
| CMMLU | Knowledge | 78.20 | 75.05 | 81.15 |
| C-Eval | Knowledge | 78.67 | 76.04 | 83.20 |
| MMLU-Pro | Knowledge | 65.84 | 68.35 | 75.70 |
| SuperGPQA | Knowledge | 36.79 | 41.20 | 48.35 |

Sequential Hidden Decoding n=8 SFT outperforms Qwen3-8B-Instruct on all 8 benchmarks, including competitive math benchmarks (AIME24/25) and knowledge-intensive tasks (MMLU-Pro, SuperGPQA). Sampling: temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5, max_tokens=4096. Judge: GPT-4o.
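As a quick sanity check, the per-benchmark gains over Qwen3-8B-Instruct can be recomputed directly from the table above (the numbers are copied verbatim; the snippet itself is just arithmetic):

```python
# Per-benchmark gain of Scale-Seq n=8 SFT over Qwen3-8B-Instruct,
# using the scores from the instruct-model table above.
scores = {
    # benchmark: (Qwen3-8B-Instruct, Scale-Seq n=8 SFT)
    "AIME24":       (29.42, 30.00),
    "AIME25":       (20.58, 26.25),
    "MATH500":      (85.16, 89.04),
    "GPQA Diamond": (48.79, 55.25),
    "CMMLU":        (78.20, 81.15),
    "C-Eval":       (78.67, 83.20),
    "MMLU-Pro":     (65.84, 75.70),
    "SuperGPQA":    (36.79, 48.35),
}
gains = {b: round(ours - base, 2) for b, (base, ours) in scores.items()}
assert all(g > 0 for g in gains.values())  # wins on all 8 benchmarks
print(max(gains, key=gains.get))           # SuperGPQA (+11.56)
```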


Models

Base Models

All models share the same 8B Transformer backbone — only the Embedding parameters grow:

| Model | Scale | Embedding Params | Training Tokens | Link |
|---|---|---|---|---|
| Sequential-Hidden-Decoding-8B-n2 | n=2 | 1.9B | 75B | HuggingFace |
| Sequential-Hidden-Decoding-8B-n4 | n=4 | 3.1B | 150B | HuggingFace |
| Sequential-Hidden-Decoding-8B-n8 | n=8 | 5.6B | 187B | HuggingFace |

Instruct Models

| Model | Scale | Base Model | Link |
|---|---|---|---|
| Sequential-Hidden-Decoding-8B-n8-Instruct | n=8 | Sequential-Hidden-Decoding-8B-n8 | HuggingFace |

Installation (Inference)

Option 1: Docker Image

docker pull aiweiliu/sglang-scale-seq:v0.5.2rc2-cu126

Option 2: Use the Forked SGLang

git clone https://github.com/exlaw/sglang.git
cd sglang
pip install -e "python[all]"

Option 3: Apply Patch Manually

Apply the patch on top of SGLang:

git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout 4efe844a2
git apply /path/to/hidden_decoding.patch
pip install -e "python[all]"

Serving

Launch the server:

python -m sglang.launch_server \
    --model-path tencent/Sequential-Hidden-Decoding-8B-n8-Instruct \
    --trust-remote-code \
    --tp-size 1 \
    --port 8080 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128
Docker
docker run --gpus all -p 8080:8080 -p 8081:8081 \
  -v /path/to/models:/models \
  aiweiliu/sglang-scale-seq:v0.5.2rc2-cu126 \
  bash -c "python -m sglang.launch_server \
    --model-path /models/Sequential-Hidden-Decoding-8B-n8-Instruct \
    --trust-remote-code \
    --tp-size 1 \
    --port 8080 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128"

Note: Sequential Hidden Decoding models process n×-length sequences internally, so --chunked-prefill-size -1 (disable chunked prefill), --attention-backend fa3, and reduced batch sizes are important for stability and performance. Adjust --tp-size for multi-GPU setups.


Usage

Chat Completions (Instruct Model)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the concept of hidden decoding in simple terms."},
    ],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)

Text Completions (Base Models)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
response = client.completions.create(
    model="default",
    prompt="The meaning of life is",
    max_tokens=128,
    temperature=0,
)
print(response.choices[0].text)

Chat UI

We provide a lightweight, zero-dependency web chat interface for interactive conversations:

python chat_ui.py

Then open http://localhost:8081 in your browser. The Chat UI connects to http://localhost:8080/v1 by default.

Docker (serve model + Chat UI together)
docker run --gpus all -p 8080:8080 -p 8081:8081 \
  -v /path/to/models:/models \
  -v /path/to/chat_ui.py:/app/chat_ui.py \
  aiweiliu/sglang-scale-seq:v0.5.2rc2-cu126 \
  bash -c "python -m sglang.launch_server \
    --model-path /models/Sequential-Hidden-Decoding-8B-n8-Instruct \
    --trust-remote-code \
    --tp-size 1 \
    --port 8080 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128 &
  sleep 5 && python /app/chat_ui.py"

Open http://localhost:8081 — it connects to the model server automatically.

Features:

  • Streaming responses with real-time token speed display
  • Thinking/reasoning block visualization
  • Configurable system prompt, temperature, and max tokens
  • Zero dependencies — pure Python standard library
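
For readers curious what "zero dependencies" looks like in practice, here is a minimal sketch of a stdlib-only chat backend in the same spirit. This is NOT the shipped chat_ui.py; the handler class, the inline page, and the proxy logic are hypothetical illustrations.

```python
# Illustrative zero-dependency chat backend: serve a static page and proxy
# chat requests to the model server, using only the Python standard library.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

MODEL_API = "http://localhost:8080/v1/chat/completions"  # model server
PAGE = b"<!doctype html><title>Chat</title><p>Minimal chat UI placeholder</p>"

class ChatHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the (inline) chat page.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(PAGE)

    def do_POST(self):
        # Forward the browser's JSON body to the OpenAI-compatible API.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        req = Request(MODEL_API, data=body,
                      headers={"Content-Type": "application/json"})
        with urlopen(req) as resp:
            payload = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the console quiet

# To run on port 8081 (the Chat UI's default port):
# HTTPServer(("0.0.0.0", 8081), ChatHandler).serve_forever()
```

A real UI would also stream tokens back to the browser, but the pattern is the same: one stdlib HTTP server in front of the OpenAI-compatible endpoint.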


Patch Contents

The patch adds the qwen3_scale_seq model architecture and modifies the scheduler, batch manager, and CUDA graph runner to handle the expanded sequence length.


Citation

If you find this work useful, please cite our blog post:

@article{hidden_decoding_2026,
  title   = {Sequential Hidden Decoding: Scaling Sequence Length in Pretraining},
  year    = {2026},
  url     = {https://welm.weixin.qq.com/posts/hidden_decoding/}
}

Contact

Sijun Zhang (nepheloturbulence@gmail.com), Aiwei Liu (liuaiwei20@gmail.com)

License

This project is released under the License Terms of Sequential-Hidden-Decoding. The dependent open-source models and software components remain licensed under their respective original licenses — see the LICENSE file for details.
