
Friday, April 17, 2026

Installing VibeVoice on DGX Spark

https://github.com/microsoft/VibeVoice/tree/main
However, the core TTS (speech synthesis) implementation code has been removed from this repo.
https://github.com/vibevoice-community/VibeVoice/tree/main
https://github.com/dontriskit/VibeVoice-FastAPI/tree/main

$ git clone https://github.com/microsoft/VibeVoice.git
$ cd VibeVoice
$ uv venv --python 3.12
$ source .venv/bin/activate
$ vi pyproject.toml   # remove torch from dependencies
$ uv sync
$ uv pip install --force-reinstall torch \
  --index-url https://download.pytorch.org/whl/cu130
$ uv pip install --upgrade setuptools wheel pip ninja
$ MAX_JOBS=2 uv pip install flash-attn --no-build-isolation
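The pyproject.toml edit above can also be done non-interactively. A minimal sketch with sed, using a hypothetical inline pyproject.toml (the real file has more entries; only the torch line matters here):

```shell
# Hypothetical stand-in for VibeVoice's pyproject.toml; only the torch pin matters.
cat > /tmp/pyproject_demo.toml <<'EOF'
[project]
name = "vibevoice"
requires-python = ">=3.12"
dependencies = [
    "torch",
    "transformers",
]
EOF
# Drop the torch line so `uv sync` leaves torch to the CUDA wheel installed afterwards.
sed -i '/"torch"/d' /tmp/pyproject_demo.toml
```

After this, `uv sync` installs the remaining dependencies and the follow-up `uv pip install --force-reinstall torch --index-url …` supplies the CUDA build.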

microsoft/VibeVoice-ASR

#  --model_path microsoft/VibeVoice-Realtime-0.5B \
#  --model_path microsoft/VibeVoice-1.5B \ xxx
$ python demo/realtime_model_inference_from_file.py \
  --model_path microsoft/VibeVoice-1.5B \
  --speaker_name Carter \
  --txt_path demo/text_examples/1p_vibevoice.txt
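A quick way to sanity-check the generated audio is the stdlib wave module. The snippet below is a self-contained sketch: it synthesizes a one-second silent stand-in instead of reading outputs/1p_vibevoice_generated.wav, and the 24 kHz sample rate is an assumption, not something the demo guarantees.

```shell
python3 - <<'EOF'
import wave

path = "/tmp/demo_generated.wav"   # stand-in for outputs/1p_vibevoice_generated.wav

# Write 1 second of 16-bit mono silence at an assumed 24 kHz rate.
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 24000)

# Read it back and report sample rate and duration in seconds.
with wave.open(path, "rb") as w:
    print(w.getframerate(), w.getnframes() // w.getframerate())
EOF
```

Pointing the read-back half at the real output file tells you at a glance whether the model produced a non-empty wav of plausible length.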

$ vi demo/vibevoice_asr_inference_from_file.py   # change dtype to torch_dtype
        self.model = VibeVoiceASRForConditionalGeneration.from_pretrained(
            model_path,
            torch_dtype=dtype,
            device_map=device if device == "auto" else None,
            attn_implementation=attn_implementation,
            trust_remote_code=True
        )
$ python demo/vibevoice_asr_inference_from_file.py \
  --model_path microsoft/VibeVoice-ASR \
  --audio_files=outputs/1p_vibevoice_generated.wav 


======================================

$ git clone https://github.com/vibevoice-community/VibeVoice.git VibeVoice-community
$ cd VibeVoice-community
$ uv venv --python 3.12
$ source .venv/bin/activate
$ vi pyproject.toml   # remove torch from dependencies; change requires-python to 3.10
$ uv sync
$ uv pip install --force-reinstall torch \
  --index-url https://download.pytorch.org/whl/cu130
$ uv pip install --upgrade setuptools wheel pip ninja
$ MAX_JOBS=4 uv pip install flash-attn --no-build-isolation

#  --model_path vibevoice/VibeVoice-1.5B \
#  --model_path microsoft/VibeVoice-1.5B \ can also be used
#  Multi-speaker speech: a single speaker works, but multiple speakers come out unclear
$ python demo/inference_from_file.py \
  --model_path /mnt/480SSD/models/VibeVoice-1.5B \
  --speaker_name Xinran \
  --txt_path demo/text_examples/2p_yayi.txt

$ vi vibevoice/modular/modeling_vibevoice_asr.py   # comment out is_final_chunk
                    # Encode chunk for acoustic tokenizer (don't sample yet)
                    acoustic_encoder_output = self.model.acoustic_tokenizer.encode(
                        chunk.unsqueeze(1),
                        cache=acoustic_encoder_cache,
                        sample_indices=sample_indices,
                        use_cache=True,
                        # is_final_chunk=is_final,
                    )
                    acoustic_mean_segments.append(acoustic_encoder_output.mean)
                    
                    # Encode chunk for semantic tokenizer (take mean directly)
                    semantic_encoder_output = self.model.semantic_tokenizer.encode(
                        chunk.unsqueeze(1),
                        cache=semantic_encoder_cache,
                        sample_indices=sample_indices,
                        use_cache=True,
                        # is_final_chunk=is_final,
                    )
                    semantic_mean_segments.append(semantic_encoder_output.mean)
$ vi vibevoice/modular/modular_vibevoice_text_tokenizer.py
Referring to the corresponding file in https://github.com/microsoft/VibeVoice/tree/main, add the VibeVoiceASRTextTokenizerFast portion of the code

$ python demo/vibevoice_asr_inference_from_file.py \
  --model_path /mnt/480SSD/models/VibeVoice-ASR \
  --audio_files=outputs/2p_yayi_generated.wav

$ python demo/vibevoice_asr_inference_from_file.py \
  --model_path /mnt/480SSD/models/VibeVoice-ASR \
  --audio_files=outputs/1p_Ch2EN_generated.wav

$ python demo/server.py
$ curl http://localhost:8100/v1/models
$ curl http://localhost:8100/v1/audio/voices
$ curl http://localhost:8100/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F "file=@/home/spark/DiskD/audio_llm/GPT-SoVITS/GPT-SoVITS/samples/output.wav" \
  -F "model=whisper-1"
$ curl http://localhost:8100/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "大家好,今天我們要測試 OpenAI 的語音合成功能。",
    "voice": "Xinran"
  }' \
  --output speech.wav
$ curl http://localhost:8100/v1/audio/voices

==========================================
Build an installable wheel
$ uv pip install build
$ python -m build --wheel
$ ls dist
$ uv pip install path_to/VibeVoice-community/dist/vibevoice-0.1.0-py3-none-any.whl
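A wheel is a plain zip archive, so its contents can be listed before installing. The sketch below builds a stand-in archive to show the idea; for the real build, point python3 -m zipfile -l at the wheel under dist/ instead.

```shell
# Build a tiny stand-in "wheel" (really just a zip of one package directory).
mkdir -p /tmp/whl_demo/vibevoice
echo '__version__ = "0.1.0"' > /tmp/whl_demo/vibevoice/__init__.py
( cd /tmp/whl_demo && python3 -m zipfile -c demo-0.1.0-py3-none-any.whl vibevoice )
# List the archive contents, as you would for dist/vibevoice-0.1.0-py3-none-any.whl.
python3 -m zipfile -l /tmp/whl_demo/demo-0.1.0-py3-none-any.whl
```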

$ python demo/tts_server.py
