# ControlVC Wrapper Setup Guide

## Quick Setup (Recommended)
Run the provided setup script (it handles everything automatically), then activate the environment and run the test:
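A minimal invocation, assuming the script path referenced in Manual Setup and the `.venv310` environment it creates:

```shell
# Automated setup (see Manual Setup for the underlying steps)
bash scripts/setup.sh

# Activate the environment and run the smoke test
source .venv310/bin/activate
python scripts/test_wrapper.py
```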
Expected output ends with `All tests passed! ✓`. Listen to `test_output.wav` — it should be clear, intelligible speech with a different voice from the original speaker.
If you already have the control-vc repo elsewhere, point the scripts at it with the `CONTROL_VC_DIR` environment variable, which `scripts/test_wrapper.py` also respects.
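For example (a sketch; both invocations assume the env var is read as described above):

```shell
# Use an existing clone instead of the default ~/repos/control-vc
CONTROL_VC_DIR=/path/to/control-vc bash scripts/setup.sh
CONTROL_VC_DIR=/path/to/control-vc python scripts/test_wrapper.py
```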
## Manual Setup
Use this if scripts/setup.sh fails at a specific step, or if you need more control.
### Prerequisites

- Python 3.10 (required — fairseq 0.12.2 is incompatible with Python 3.11+)
- PyTorch (CPU is fine; CUDA optional for GPU acceleration)

If you use pyenv, the project includes a `.python-version` file that selects 3.10 automatically.
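With pyenv, that looks roughly like this (commands assume a stock pyenv install):

```shell
pyenv install 3.10     # once, if 3.10 is not already installed
cd ~/UVM-plaid/dp-vc   # .python-version switches this directory to 3.10
python --version       # should report Python 3.10.x
```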
### Step 1: Clone Control-VC Repository
Use the zuruoke fork, which removes pinned library versions to allow newer torchaudio/librosa:
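A sketch of the clone step; the repository URL is inferred from the fork name and should be verified:

```shell
git clone https://github.com/zuruoke/control-vc.git ~/repos/control-vc
```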
### Step 2: Download Checkpoints

Download all required checkpoints from the ControlVC Google Drive: https://drive.google.com/drive/folders/1APVHQFIb1871UhvymdK_oewWKJWrInYK

Place them in `~/repos/control-vc/checkpoints/`:

```
checkpoints/
├── embed_f0stat2/
│   ├── config.json
│   └── g_00400000        # (or g_00350000)
├── 3000000-BL.ckpt       # Required: speaker embedding model
├── hubert_base_ls960.pt  # Strongly recommended (~1.1 GB)
├── km.bin                # Strongly recommended
└── vctk_f0_vq/
    └── g_00400000
```

Without `hubert_base_ls960.pt` and `km.bin`, the wrapper falls back to dummy content codes and produces garbled audio.
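One non-interactive way to fetch the folder is `gdown` (installed in Step 4). This invocation is a sketch; Google Drive quota errors may still force a manual browser download:

```shell
cd ~/repos/control-vc
gdown --folder "https://drive.google.com/drive/folders/1APVHQFIb1871UhvymdK_oewWKJWrInYK" -O checkpoints
```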
### Step 3: Apply Fix for Apple Silicon (MPS)

If you are on an M1/M2/M3 Mac, edit `~/repos/control-vc/fairseq_feature_reader.py` lines 22–26.

Before:

```python
DEVICE = torch.device("cuda" if torch.cuda.is_available()
                      else "mps" if torch.backends.mps.is_available()
                      else "cpu")
```

After:

```python
# Force CPU - MPS does not support all fairseq operators
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```
### Step 4: Install Dependencies

```shell
cd ~/UVM-plaid/dp-vc

# Create and activate a Python 3.10 virtual environment
python3.10 -m venv .venv310
source .venv310/bin/activate

# Install dp-vc package in editable mode
pip install -e .

# fairseq requires an older pip to handle omegaconf metadata
pip install 'pip<24.1'
pip install 'omegaconf==2.0.6'
pip install git+https://github.com/facebookresearch/fairseq.git@v0.12.2

# Audio processing dependencies
pip install torchaudio librosa soundfile joblib amfm_decompy matplotlib gdown
```
Note on fairseq: version 0.12.2 is required for HuBERT support. It requires Python 3.10 — Python 3.11+ changed dataclass validation in a way that breaks fairseq's internal configs.
### Step 4b: Patch fairseq for PyTorch 2.6+

PyTorch 2.6 changed `torch.load` to use `weights_only=True` by default, which blocks loading the HuBERT checkpoint (it contains an `argparse.Namespace`). Apply this one-line fix after installation:

```shell
python - <<'EOF'
import site, pathlib

for d in site.getsitepackages():
    f = pathlib.Path(d) / "fairseq/checkpoint_utils.py"
    if f.exists():
        txt = f.read_text()
        old = 'torch.load(f, map_location=torch.device("cpu"))'
        new = 'torch.load(f, map_location=torch.device("cpu"), weights_only=False)'
        if old in txt:
            f.write_text(txt.replace(old, new))
            print(f"Patched {f}")
        else:
            print(f"Already patched or not found in {f}")
EOF
```
### Step 5: Verify the Setup
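From the activated `.venv310` environment, run the bundled smoke test:

```shell
python scripts/test_wrapper.py
```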
Expected output:

```
==================================================
Testing ControlVC Wrapper
==================================================
1. Initializing wrapper...
[ControlVC] Loaded generator: g_00350000.pth
[ControlVC] Loaded speaker model: 3000000-BL.ckpt
✓ Initialization successful!
2. Testing embedding extraction with examples/trump_0.wav...
✓ Embedding extracted: shape=torch.Size([256, 1])
3. Testing voice conversion...
[ControlVC] Loaded HuBERT and K-means models
✓ Conversion successful: shape=torch.Size([1, 124480])
✓ Saved output to test_output.wav
==================================================
All tests passed! ✓
==================================================
```

Listen to `test_output.wav` — it should be clear, intelligible speech. Garbled audio means HuBERT/k-means is not loading correctly.
## Use with Differential Privacy

```python
from pathlib import Path

from dpvc import ControlVCWrapper, Anonymizer

vc_wrapper = ControlVCWrapper(
    repo_root=Path("~/repos/control-vc").expanduser(),
    device="cpu",
)

# Pass the ControlVC-specific VAE checkpoint
anonymizer = Anonymizer(vc_wrapper, vae_checkpoint_path="examples/controlvc_vae.pt")

anonymizer.anonymize(
    source_file="examples/trump_0.wav",
    output_file="output_anonymized.wav",
    noise_level=1.0,  # higher = more privacy, less speaker diversity
)
```
Note on `noise_level`: At high values (e.g., 2.0), the DP noise dominates the 6-dimensional latent space and all outputs converge to a similar-sounding anonymous voice. Lower values (0.5–1.0) preserve more speaker-to-speaker variation while still providing privacy.
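To build intuition for that note, here is a standalone sketch (NumPy only; the 6-dimensional latent and Gaussian mechanism are stand-ins, not dp-vc's actual implementation) showing how the correlation between clean and noise-perturbed latents, a rough proxy for surviving speaker identity, falls as the noise scale grows:

```python
import numpy as np

# Illustrative only: simulated 6-dim "speaker" latents, not dp-vc's API
rng = np.random.default_rng(0)
latents = rng.normal(size=(10, 6))  # ten distinct speaker latents

corrs = []
for noise_level in (0.5, 1.0, 2.0):
    noisy = latents + rng.normal(scale=noise_level, size=latents.shape)
    # Correlation between clean and noisy coordinates: a crude measure of
    # how much speaker-specific structure survives the added noise
    corrs.append(np.corrcoef(latents.ravel(), noisy.ravel())[0, 1])
    print(f"noise_level={noise_level}: corr with clean latent = {corrs[-1]:.2f}")
```

At `noise_level=2.0` the noise variance is four times the signal variance, so most of the speaker-specific structure is washed out, which matches the "all outputs converge" behavior described above.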
## Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| "Using dummy content codes" / garbled audio | HuBERT or `km.bin` not found | Verify checkpoints; check `import fairseq` works |
| `omegaconf has invalid metadata` | pip >= 24.1 | `pip install 'pip<24.1'` before fairseq |
| MPS device error on Apple Silicon | `fairseq_feature_reader.py` uses MPS | Apply Step 3 fix (or re-run `scripts/setup.sh`) |
| Output sounds the same every run | `noise_level` too high | Try `noise_level=0.5` |
| `gdown` fails during setup | Google Drive quota exceeded | Download checkpoints manually (Step 2) |