Foice: Can I Hear Your Face? Pervasive Attack on Voice Authentication Systems with a Single Face Image
This repository provides a PyTorch implementation of Foice.
Foice is a generative text-to-speech model that synthesizes multiple audio samples of a person's voice from a single image of that person's face, without requiring any recording of their voice.
Feel free to check out our demo video👉: https://drive.google.com/file/d/1Be1fgyDookg839UyV7DJbdBgx-YlD9ge/view?usp=sharing
- face_alignment (install with: pip install face-alignment)
- numpy
- cv2
- torch
- torchvision
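
Before running anything, it can help to confirm that the packages listed above are importable. The following is a minimal sketch, not part of the repository: the helper name `missing_dependencies` is hypothetical, and it only checks that each module can be located, not that the installed version is compatible.

```python
from importlib.util import find_spec

# Import names taken from the dependency list above.
REQUIRED_MODULES = ["face_alignment", "numpy", "cv2", "torch", "torchvision"]


def missing_dependencies(module_names):
    """Return the module names that cannot be found in the current environment."""
    missing = []
    for name in module_names:
        try:
            found = find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-initialised modules.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_dependencies(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

Note that `cv2` is the import name of the OpenCV Python bindings, which are usually installed through the `opencv-python` package.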
| Face-dependent Voice Feature Extractor | Face-independent Voice Feature Generator |
|---|---|
| link | link |
Foice reuses the synthesizer and vocoder from SV2TTS. You can find the pre-trained synthesizer using the link.
Put all pre-trained models in the folder "../F2V_models/".
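
To check that the models ended up in the expected folder, a small script like the one below may be useful. It is a sketch under assumptions: the helper `list_checkpoints` is hypothetical, and the `.pt`/`.pth` extensions are a guess at how the checkpoints are stored, so adjust them to match the files you actually downloaded.

```python
from pathlib import Path

# Folder named in the instructions above, relative to the working directory.
MODEL_DIR = Path("../F2V_models/")

# Assumption: checkpoints use common PyTorch extensions; adjust to the actual files.
CHECKPOINT_SUFFIXES = (".pt", ".pth")


def list_checkpoints(model_dir, suffixes=CHECKPOINT_SUFFIXES):
    """Return sorted checkpoint file names in model_dir, or [] if the folder is missing."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return []
    return sorted(
        p.name for p in model_dir.iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )


if __name__ == "__main__":
    found = list_checkpoints(MODEL_DIR)
    if found:
        print("Found checkpoints:", ", ".join(found))
    else:
        print(f"No checkpoints found in {MODEL_DIR}")
```

Because the path is relative, run the script from the same directory in which you launch the notebook.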
Run End-to-End.ipynb to generate voice recordings from an image.
- Image processing - face alignment: face_alignment
- Backbone text-to-speech model: SV2TTS
- Add pre-trained model
- Add training process