WhisperJAV
A subtitle generator for Japanese Adult Videos (JAV).
What is the idea
Transformer-based ASR architectures like Whisper suffer significant performance degradation when applied to the spontaneous, noisy domain of JAV. This degradation is driven by specific acoustic and temporal characteristics that fall far outside the statistical distribution of standard training data.
1. The Acoustic Profile
JAV audio is defined by “acoustic hell” and a low Signal-to-Noise Ratio (SNR), characterized by:
Non-Verbal Vocalisations (NVVs): A high density of physiological sounds (heavy breathing, gasps, sighs) and “obscene sounds” that lack clear harmonic structure.
Spectral Mimicry: These vocalizations often possess “curve-like spectrum features” that mimic the formants of fricative consonants or Japanese syllables (e.g., fu), acting as accidental adversarial examples that trick the model into recognizing words where none exist.
Extreme Dynamics: Volatile shifts in audio intensity, ranging from faint whispers (sasayaki) to high-decibel screams, which confuse standard gain control and attention mechanisms.
Linguistic Variance: The prevalence of theatrical onomatopoeia and Role Language (Yakuwarigo) containing exaggerated intonations and slang absent from standard corpora.
2. Temporal Drift and Hallucination
While standard ASR models are typically trained on short, curated clips, JAV content comprises long-form media often exceeding 120 minutes. Research indicates that processing such extended inputs causes contextual drift and error accumulation. Specifically, extended periods of “ambiguous audio” (silence or rhythmic breathing) cause the Transformer’s attention mechanism to collapse, triggering repetitive hallucination loops where the model generates unrelated text to fill the acoustic void.
3. The Pre-processing Paradox & Fine-Tuning Risks
Standard audio engineering intuition—such as aggressive denoising or vocal separation—often fails in this domain. Because Whisper relies on specific log-Mel spectrogram features, generic normalization tools can inadvertently strip high-frequency transients essential for distinguishing consonants, resulting in “domain shift” and erroneous transcriptions. Consequently, audio processing requires a “surgical,” multi-stage approach (like VAD clamping) rather than blanket filtering.
Furthermore, while fine-tuning models on domain-specific data can be effective, it presents a high risk of overfitting. Due to the scarcity of high-quality, ethically sourced JAV datasets, fine-tuned models often become brittle, losing their generalization capabilities and leading to inconsistent “hit or miss” quality outputs.
WhisperJAV is an attempt to address the failure points above. Its inference pipelines apply:
Acoustic Filtering: Deploys scene-based segmentation and VAD clamping under the hypothesis that distinct scenes possess uniform acoustic characteristics, ensuring the model processes coherent audio environments rather than mixed streams [1-3].
Linguistic Adaptation: Normalizes domain-specific terminology and preserves onomatopoeia, specifically correcting dialect-induced tokenization errors (e.g., in Kansai-ben) that standard BPE tokenizers fail to parse [4, 5].
Defensive Decoding: Tunes log-probability thresholding and no_speech_threshold to systematically discard low-confidence outputs (hallucinations), while utilizing regex filters to clean non-lexical markers (e.g., (moans)) from the final subtitle track [6, 7].
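As a rough sketch of the defensive-decoding idea (illustrative only; the model size, thresholds, file name, and regex are example values, not WhisperJAV's defaults), faster-whisper exposes the relevant knobs directly:

```python
# Illustrative sketch: discard low-confidence segments and strip non-lexical markers.
import re
from faster_whisper import WhisperModel

NON_LEXICAL = re.compile(r"\((?:moans|sighs|breathing)\)")  # example markers only

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, _ = model.transcribe(
    "scene_0001.wav",                  # hypothetical per-scene clip
    language="ja",
    beam_size=5,
    log_prob_threshold=-1.0,           # drop segments with low average log-probability
    no_speech_threshold=0.6,           # treat likely non-speech segments as silence
    condition_on_previous_text=False,  # limit hallucinations carrying across segments
    vad_filter=True,                   # clamp decoding to detected speech regions
)

for seg in segments:
    text = NON_LEXICAL.sub("", seg.text).strip()
    if text:
        print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {text}")
```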
Quick Start
GUI (Recommended for most users)
whisperjav-gui
A window opens. Add your files, pick a mode, click Start.
Command Line
# Basic usage
whisperjav video.mp4
# Specify mode and sensitivity
whisperjav audio.mp3 --mode balanced --sensitivity aggressive
# Process a folder
whisperjav /path/to/media_folder --output-dir ./subtitles
Features
Processing Modes
| Mode | Backend | Scene Detection | VAD | Best For |
|------|---------|-----------------|-----|----------|
| faster | stable-ts (turbo) | No | No | Speed priority, clean audio |
| fast | stable-ts | Yes | No | General use, mixed quality |
| balanced | faster-whisper | Yes | Yes | Default. Noisy audio, dialogue-heavy |
| fidelity | OpenAI Whisper | Yes | Yes (Silero) | Maximum accuracy, slower |
| transformers | HuggingFace | Optional | Internal | Japanese-optimized model, customizable |
Sensitivity Settings
Conservative: Higher thresholds, fewer hallucinations. Good for noisy content.
Balanced: Default. Works for most content.
Aggressive: Lower thresholds, catches more dialogue. Good for whisper/ASMR content.
Transformers Mode (New in v1.7)
Uses HuggingFace's kotoba-tech/kotoba-whisper-v2.2 model, which is optimized for Japanese conversational speech.
Transformers-specific options:
--hf-model-id: Model (default: kotoba-tech/kotoba-whisper-v2.2)
--hf-chunk-length: Seconds per chunk (default: 15)
--hf-beam-size: Beam search width (default: 5)
--hf-temperature: Sampling temperature (default: 0.0)
--hf-scene: Scene detection method (none, auditok, silero, semantic)
Two-Pass Ensemble Mode (New in v1.7)
Runs your video through two different pipelines and merges the results. Different models catch different things.
Merge strategies:
smart_merge (default): Intelligent overlap detection
pass1_primary / pass2_primary: Prioritize one pass, fill gaps from the other
full_merge: Combine everything from both passes
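A command-line example (illustrative; it assumes the transformers pipeline is selected with --mode transformers, and the --hf-* values shown are simply the documented defaults):
# Run the Japanese-optimized HuggingFace pipeline with explicit options
whisperjav video.mp4 --mode transformers --hf-scene auditok --hf-chunk-length 15 --hf-beam-size 5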
Speech Enhancement tools (New in v1.7.3)
Pre-processes audio per scene: when selected, enhancement runs on each scene after scene detection.
Note: Use this only for surgical reasons. Any audio processing that alters the mel spectrogram can introduce additional artifacts and hallucinations.
Available backends:
none
ffmpeg-dsp: loudnorm, denoise, compress, highpass, lowpass, deess
clearvoice: MossFormer2_SE_48K (default), FRCRN_SE_16K
zipenhancer: torch (GPU), onnx (CPU)
bs-roformer: vocals, other
Syntax: --pass1-speech-enhancer <backend> or --pass1-speech-enhancer <backend>:<model>
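For example (illustrative; the backend and model names come from the list above):
# Enhance each detected scene with ClearVoice before pass 1 transcription
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer clearvoice:MossFormer2_SE_48K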
GUI Parameter Customization
The GUI has three tabs:
Transcription Mode: Select pipeline, sensitivity, language
Advanced Options: Model override, scene detection method, debug settings
Two-Pass Ensemble: Configure both passes with full parameter customization via JSON editor
The Ensemble tab lets you customize beam size, temperature, VAD thresholds, and other ASR parameters without editing config files.
AI Translation
Generate subtitles and translate them in one step:
# Generate and translate
whisperjav video.mp4 --translate
# Or translate existing subtitles
whisperjav-translate -i subtitles.srt --provider deepseek
Supports DeepSeek (cheap), Gemini (free tier), Claude, GPT-4, and OpenRouter.
Resume Support: If translation is interrupted, just run the same command again. It automatically resumes from where it left off using the .subtrans project file.
What Makes It Work for JAV
Scene Detection
Splits audio at natural breaks instead of forcing fixed-length chunks. This prevents cutting off sentences mid-word.
Three methods are available:
Auditok (default): Energy-based detection, fast and reliable (see the sketch after this list)
Silero: Neural VAD-based detection, better for noisy audio
Semantic (new in v1.7.4): Texture-based clustering using MFCC features, groups acoustically similar segments together
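As a rough illustration of what energy-based splitting does, here is a minimal sketch using the auditok library directly (illustrative only; the file name, thresholds, and durations are example values, not WhisperJAV's internals):

```python
# Illustrative sketch: split audio at natural energy dips rather than fixed intervals.
import auditok

regions = auditok.split(
    "movie_audio.wav",    # hypothetical audio track extracted from the video
    min_dur=0.3,          # shortest region to keep, in seconds
    max_dur=30.0,         # force a split if a region grows past this length
    max_silence=0.8,      # how much in-region silence is tolerated
    energy_threshold=50,  # energy level below which audio counts as silence
)

for i, region in enumerate(regions):
    region.save(f"scene_{i:04d}.wav")  # each scene is then transcribed on its own
    print(f"scene {i}: {region.meta.start:.2f}s -> {region.meta.end:.2f}s")
```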
Voice Activity Detection (VAD)
Identifies when someone is actually speaking vs. background noise or music. Reduces false transcriptions during quiet moments.
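The Silero VAD used by several modes can be exercised on its own roughly like this (an illustrative sketch, not WhisperJAV's internal code; the threshold and file name are example values):

```python
# Illustrative sketch: keep only the regions Silero classifies as speech.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("scene_0001.wav", sampling_rate=16000)
speech = get_speech_timestamps(
    wav,
    model,
    sampling_rate=16000,
    threshold=0.5,                 # speech-probability cut-off
    min_silence_duration_ms=300,   # merge speech separated only by short pauses
)

# Everything outside these windows is skipped, so music and breathing
# between lines never reach the decoder.
for ts in speech:
    print(f"speech: {ts['start'] / 16000:.2f}s -> {ts['end'] / 16000:.2f}s")
```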
Japanese Post-Processing
Hallucination Removal
Whisper sometimes generates repeated text or phrases that weren't spoken. WhisperJAV detects and removes these patterns.
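A minimal sketch of the underlying idea (illustrative only; WhisperJAV's actual detection rules are more involved than this single regex):

```python
# Illustrative sketch: collapse a phrase that loops more than a few times in a row.
import re

def drop_repetition_loops(line: str, max_repeats: int = 3) -> str:
    """Replace 'X X X X ...' style loops with a single X."""
    pattern = re.compile(r"(.{2,}?)(?:\s*\1){" + str(max_repeats) + r",}")
    return pattern.sub(lambda m: m.group(1), line).strip()

print(drop_repetition_loops("ありがとうございました " * 12))
# -> ありがとうございました
```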
Content-Specific Recommendations
Installation
Windows (Recommended)
Best for: Most users, beginners, and those who want a GUI.
Download & Run: Download the WhisperJAV installer and run the .exe.
Follow the Prompts: The installer handles all dependencies (Python, FFmpeg, Git) automatically.
Launch: Open "WhisperJAV" from your Desktop shortcut.
Note: The first launch may take a few minutes as it initializes the engine. GPU is auto-detected; CPU-only mode is used if no compatible GPU is found.
Upgrading? Just run the new installer. Your AI models (~3GB), settings, and cached downloads will be preserved.
macOS (Apple Silicon & Intel)
Best for: M1/M2/M3/M4 users and Intel Mac users.
The install script auto-detects your Mac architecture and handles PyTorch dependencies automatically.
1. Install Prerequisites
# Install Xcode Command Line Tools (required for GUI)
xcode-select --install
# Install Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install system tools
brew install python@3.11 ffmpeg git
GUI Requirement: The Xcode Command Line Tools are required to compile pyobjc, which enables the GUI. Without it, only CLI mode will work.
2. Install WhisperJAV
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav
chmod +x installer/install_linux.sh
# Run the installer (auto-detects Mac architecture)
./installer/install_linux.sh
Intel Macs: The script automatically uses CPU-only mode. Expect slower processing (5-10x) compared to Apple Silicon with MPS acceleration.
Linux (Ubuntu/Debian/Fedora)
Best for: Servers, desktops with NVIDIA GPUs.
The install script auto-detects NVIDIA GPUs and installs the matching CUDA version.
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav
chmod +x installer/install_linux.sh
# Standard Install (auto-detects GPU)
./installer/install_linux.sh
# Or force CPU-only (for servers without GPU)
./installer/install_linux.sh --cpu-only
Performance: A 2-hour video takes ~5-10 minutes on GPU vs ~30-60 minutes on CPU.
Advanced / Developer
Best for: Contributors and Python experts.
Manual pip install
Warning: Manual pip install is risky due to dependency conflicts (NumPy 2.x vs SciPy). We strongly recommend using the scripts above.
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav
# Windows
installer\install_windows.bat --dev
# Mac/Linux
./installer/install_linux.sh --dev
# Or fully manual: create a virtual environment, install PyTorch first
# so hardware acceleration works (pick ONE of the torch commands), then install WhisperJAV
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124   # NVIDIA GPU (CUDA 12.4)
pip install torch torchaudio                                                       # Apple Silicon / default
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu      # CPU only
pip install -e ".[dev]"
Windows source install
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav
installer\install_windows.bat # Auto-detects GPU
installer\install_windows.bat --cpu-only # Force CPU only
installer\install_windows.bat --cuda118 # Force CUDA 11.8
installer\install_windows.bat --cuda124 # Force CUDA 12.4
Prerequisites
Python 3.9-3.12 (3.13+ not compatible with openai-whisper)
FFmpeg in your system PATH
GPU recommended: NVIDIA CUDA, Apple MPS, or AMD ROCm
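A quick way to confirm the prerequisites are visible on your system (the last check only applies after installation and verifies that PyTorch can see a CUDA GPU):
ffmpeg -version
python --version
python -c "import torch; print(torch.cuda.is_available())"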
kotoba-tech/kotoba-whisper-v2.2model, which is optimized for Japanese conversational speech:Transformers-specific options:
--hf-model-id: Model (default:kotoba-tech/kotoba-whisper-v2.2)--hf-chunk-length: Seconds per chunk (default: 15)--hf-beam-size: Beam search width (default: 5)--hf-temperature: Sampling temperature (default: 0.0)--hf-scene: Scene detection method (none,auditok,silero,semantic)Two-Pass Ensemble Mode (New in v1.7)
Runs your video through two different pipelines and merges results. Different models catch different things.
Merge strategies:
smart_merge(default): Intelligent overlap detectionpass1_primary/pass2_primary: Prioritize one pass, fill gaps from otherfull_merge: Combine everything from both passesSpeech Enhancement tools (New in v1.7.3)
Pre-process audio scenes. When selected runs per-scene after scene detection. Note: Only use for surgical reasons. In general any audio processing that may alter mel-spectogram has the potential to introduce more artefacts and hallucination.
Available backends:
noneffmpeg-dsploudnorm,denoise,compress,highpass,lowpass,deessclearvoiceMossFormer2_SE_48K(default),FRCRN_SE_16Kzipenhancertorch(GPU),onnx(CPU)bs-roformervocals,otherSyntax:
--pass1-speech-enhancer <backend>or--pass1-speech-enhancer <backend>:<model>GUI Parameter Customization
Detailed Windows Prerequisites
NVIDIA GPU Setup
FFmpeg
Extract FFmpeg to C:\ffmpeg and add C:\ffmpeg\bin to your PATH.
Python
Download from python.org. Check "Add Python to PATH" during installation.
CLI Reference
Run whisperjav --help for all options.
Troubleshooting
FFmpeg not found: Install FFmpeg and add it to your PATH.
Slow processing / GPU warning: Your PyTorch might be CPU-only. Reinstall with GPU support:
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
model.bin error in faster mode: Enable Windows Developer Mode or run as Administrator, then delete the cached model folder.
Performance
Rough estimate: a 2-hour video takes about 5-10 minutes on GPU and 30-60 minutes on CPU. Exact times vary by mode and hardware.
Contributing
Contributions welcome. See CONTRIBUTING.md for guidelines.
License
MIT License. See LICENSE file.
Citation and credits
Acknowledgments
Disclaimer
This tool generates accessibility subtitles. Users are responsible for compliance with applicable laws regarding the content they process.