Official implementation for the ECCV 2024 paper, A Simple Baseline for Spoken Language to Sign Language Translation with 3D Avatars.
The objective of this paper is to develop a functional system for translating spoken languages into sign languages, referred to as Spoken2Sign translation. The Spoken2Sign task is orthogonal and complementary to traditional sign language to spoken language (Sign2Spoken) translation. To enable Spoken2Sign translation, we present a simple baseline consisting of three steps: 1) creating a gloss-video dictionary using existing Sign2Spoken benchmarks; 2) estimating a 3D sign for each sign video in the dictionary; 3) training a Spoken2Sign model, which is composed of a Text2Gloss translator, a sign connector, and a rendering module, with the aid of the resulting gloss-3D sign dictionary. The translation results are then displayed through a sign avatar.
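The three steps above can be sketched as a minimal pipeline. All names below are illustrative stand-ins, not the repo's actual API: the real Text2Gloss translator is a trained model, the dictionary holds estimated SMPL-X parameters, and the sign connector is learned rather than a simple concatenation.

```python
def text2gloss(sentence):
    """Stand-in for the Text2Gloss translator: map text to a gloss sequence."""
    toy_lexicon = {"hello": "HELLO", "world": "WORLD"}  # hypothetical lexicon
    return [toy_lexicon.get(w, w.upper()) for w in sentence.split()]

def lookup_3d_signs(glosses, dictionary):
    """Retrieve one estimated 3D sign (a list of pose frames) per gloss."""
    return [dictionary[g] for g in glosses if g in dictionary]

def connect_signs(signs):
    """Stand-in for the sign connector: stitch isolated signs into one motion.
    The real connector inserts interpolated transition frames between signs."""
    motion = []
    for sign in signs:
        motion.extend(sign)
    return motion

# Toy gloss-3D sign dictionary: one frame of fake pose parameters per gloss.
dictionary = {"HELLO": [[0.1, 0.2]], "WORLD": [[0.3, 0.4]]}
glosses = text2gloss("hello world")
motion = connect_signs(lookup_3d_signs(glosses, dictionary))
print(glosses, len(motion))
```

The resulting motion sequence would then be passed to the rendering module to drive the avatar.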
Please run
pip install -r requirements.txt
Alternatively, you may use Docker (strongly recommended):
sudo docker pull rzuo/pose:sing_ISLR_smplx
sudo docker run --gpus all -v /your_data:/data -v /your_code:/workspace -v /your_models:/pretrained_models --name sing_ISLR_smplx --ipc=host -it rzuo/pose:sing_ISLR_smplx /bin/bash
Phoenix-2014T: Please follow https://www-i6.informatik.rwth-aachen.de/~koller/RWTH-PHOENIX-2014-T/.
CSL-Daily: Please follow http://home.ustc.edu.cn/~zhouh156/dataset/csl-daily/.
Note that all raw videos need to be zipped.
WLASL: Please follow https://dxli94.github.io/WLASL/.
MSASL: Please follow https://www.microsoft.com/en-us/research/project/ms-asl/.
Note that all raw videos need to be zipped.
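As noted above, the raw videos must be packed into zip archives. The exact archive layout expected by the dataloaders is not specified here; this is just a minimal sketch of packing a video directory with Python's zipfile module (file names and extensions are illustrative).

```python
import os
import tempfile
import zipfile

def zip_raw_videos(video_dir, out_zip):
    """Pack all video files in video_dir into one uncompressed zip archive."""
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_STORED) as zf:
        for name in sorted(os.listdir(video_dir)):
            if name.endswith((".mp4", ".avi")):  # skip non-video files
                zf.write(os.path.join(video_dir, name), arcname=name)

# Tiny self-contained demo on a temporary directory.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "clip_0001.mp4"), "wb") as f:
        f.write(b"\x00")  # placeholder bytes standing in for a real video
    out = os.path.join(d, "videos.zip")
    zip_raw_videos(d, out)
    with zipfile.ZipFile(out) as zf:
        names = zf.namelist()
print(names)
```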
We use HRNet pre-trained on COCO-WholeBody. Details of the extraction process can be found in this work.
We use a pre-trained CSLR model, TwoStream-SLR, to segment continuous sign videos into a set of isolated sign clips. The pre-trained model checkpoints can be downloaded here. After that, put the checkpoints into the folder data/results. Then run
python gen_segment.py
The segmented signs for Phoenix-2014T and CSL-Daily can be downloaded here.
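TwoStream-SLR's segmentation procedure is more involved than can be shown here, but the basic idea of turning frame-level CSLR predictions into isolated clips can be sketched as grouping consecutive frames that share a predicted gloss. Everything below (the `<blank>` token, the toy frame sequence) is illustrative, not the repo's actual output format.

```python
def segment_by_runs(frame_glosses):
    """Collapse per-frame gloss predictions into (gloss, start, end) clips
    by grouping consecutive frames with the same prediction."""
    clips, start = [], 0
    for i in range(1, len(frame_glosses) + 1):
        if i == len(frame_glosses) or frame_glosses[i] != frame_glosses[start]:
            if frame_glosses[start] != "<blank>":  # drop CTC-style blank runs
                clips.append((frame_glosses[start], start, i - 1))
            start = i
    return clips

frames = ["<blank>", "HELLO", "HELLO", "<blank>", "WORLD", "WORLD", "WORLD"]
print(segment_by_runs(frames))  # [('HELLO', 1, 2), ('WORLD', 4, 6)]
```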
Please download the models (mano, smpl, smplh, and smplx) from here and unzip them into ../../data/models. The structure should be ../../data/mano, ../../data/smpl, and so on.
Rendering 3D avatars relies on Blender (you don't need to download it if you use my docker image) and the SMPL-X add-on. The related add-ons are available here. Please put them into ../../pretrained_models/.
You can download the estimated 3D dictionary (link) and video IDs (link).
To fit videos to the SMPL-X model, run:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 29999 --use_env smplifyx/main.py --config=cfg_files/fit_smplsignx_phoenix.yaml --init_idx=0 --num_per_proc=1500
It is based on the gloss2text part of TwoStream-SLT. Required checkpoints and embeddings are available at [for Phoenix-2014T] and [for CSL-Daily].
cd text2gloss
config_file='configs/T2G.yaml'
python -m torch.distributed.launch --nproc_per_node 8 --master_port 29999 --use_env training.py --config=${config_file}
Then save the T2G predictions:
python prediction.py --config=${config_file}
To assign a confidence score to each 3D sign, we follow a model-based design: we first train an isolated sign language recognition (ISLR) model that takes synthesized 2D keypoints as inputs, and then select the sign with the highest probability as the representative sign for each target gloss. Training details can be found in this repo.
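The selection step can be sketched as follows: for each candidate video of a gloss, take the ISLR model's softmax probability for that gloss and keep the argmax. The dictionary-of-logits interface below is a hypothetical stand-in for the ISLR model's outputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_representative(candidate_logits, gloss_idx):
    """candidate_logits: {video_id: ISLR logits over the gloss vocabulary}.
    Return the video whose predicted probability for gloss_idx is highest."""
    best_vid, best_p = None, -1.0
    for vid, logits in candidate_logits.items():
        p = softmax(logits)[gloss_idx]
        if p > best_p:
            best_vid, best_p = vid, p
    return best_vid

# Two candidate videos for the same gloss (index 0 in a 2-gloss vocabulary).
cands = {"vid_a": [2.0, 0.5], "vid_b": [3.5, 0.1]}
print(pick_representative(cands, 0))  # vid_b: higher confidence for gloss 0
```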
To train the sign connector, please run:
python make_connector_dataset.py && python smplifyx/sign_connector_train.py
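The trained sign connector bridges consecutive isolated signs with transition frames. As a rough illustration of what such a transition looks like, here is a linear blend between the last pose of one sign and the first pose of the next; this is only a stand-in for the learned connector (which, among other things, decides the transition length, and rotations would typically need spherical rather than linear interpolation).

```python
def interpolate_transition(last_pose, next_pose, n_frames):
    """Linearly blend between the last pose vector of one sign and the first
    pose vector of the next, producing n_frames frames (endpoints excluded)."""
    frames = []
    for t in range(1, n_frames + 1):
        a = t / (n_frames + 1)  # blend weight grows from 0 toward 1
        frames.append([(1 - a) * p + a * q for p, q in zip(last_pose, next_pose)])
    return frames

# One in-between frame halfway between two toy 2-D pose vectors.
print(interpolate_transition([0.0, 0.0], [1.0, 2.0], 1))  # [[0.5, 1.0]]
```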
First run prediction.py for the text2gloss translator to store gloss predictions on the dev and test sets:
python text2gloss/prediction.py --config=text2gloss/configs/T2G.yaml
Then perform Spoken2Sign translation by running:
python motion_gen.py --config cfg_files/fit_smplsignx.yaml --num_per_proc=1
The above command generates a sequence of motions represented by SMPL-X parameters; "num_per_proc" specifies the number of videos to generate.
To further visualize the motions as avatars, please run:
conda create --name blender python=3.10 && conda activate blender && pip install bpy==3.4.0 --user && pip install tqdm --user && pip install ConfigArgParse --user && pip install PyYAML --user
python render_avatar.py --config=cfg_files/fit_smplsignx.yaml --num_per_proc=1
The back-translation is based on projected 2D keypoints for a fair comparison with previous works:
python smplx2kps.py --config cfg_files/fit_smplsignx_phoenix.yaml
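The camera model used by smplx2kps.py is not spelled out here, but projecting 3D SMPL-X joints to 2D keypoints generally amounts to a pinhole projection. A minimal sketch, assuming joints are already in camera coordinates with z > 0 and using an illustrative focal length and principal point:

```python
def project_to_2d(joints3d, focal, cx, cy):
    """Pinhole-project 3D joints (x, y, z in camera coordinates, z > 0)
    to 2D pixel keypoints (u, v)."""
    return [(focal * x / z + cx, focal * y / z + cy) for x, y, z in joints3d]

# A joint on the optical axis lands at the principal point (256, 256).
print(project_to_2d([(0.0, 0.0, 2.0), (0.2, -0.1, 2.0)], 1000.0, 256.0, 256.0))
```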
Simply train a keypoint-only sign language translation model that takes the synthesized 2D keypoints as inputs. More details can be found in this repo.
