johntsi committed
Commit cd42135
1 Parent(s): 4b39cbd

Update README.md

Files changed (1)
  1. README.md +9 -3

README.md CHANGED
@@ -245,17 +245,23 @@ tags:

 ZeroSwot is a state-of-the-art zero-shot end-to-end Speech Translation system.

-The model is created by adapting a wav2vec2.0-based encoder to the embedding space of NLLB, using a novel subword compression module and Optimal Transport, while using only ASR data. It thus enables **Speech Translation to all the 200 languages supported by NLLB**. The compression module is a light-weight transformer that takes as input the hidden state of wav2vec2.0 and the corresponding CTC predictions, and compresses them to subword-like embeddings similar to those expected from NLLB and aligns them using Optimal Transport. For inference we simply pass the output of the speech encoder to NLLB encoder.

 For more details please refer to our [paper](https://arxiv.org/abs/2402.10422) and the [original repo](https://github.com/mt-upc/ZeroSwot) built on fairseq.

-This version of ZeroSwot is trained with ASR data from CommonVoice, and adapting [wav2vec2.0-large](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self) to the [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model.

 <div align=center><img src="resources/methodology.png" height="100%" width="100%"/></div>

 ## Usage

-The usage is tested with python 3.9.16 and Transformer v4.41.2. Install also torchaudio and sentencepiece for processing.

 ```bash
 pip install transformers torchaudio sentencepiece
 

 ZeroSwot is a state-of-the-art zero-shot end-to-end Speech Translation system.

+The model is created by adapting a wav2vec2.0-based encoder to the embedding space of NLLB, using a novel subword compression module and Optimal Transport, while using only ASR data. It thus enables **zero-shot end-to-end Speech Translation to all 200 languages supported by NLLB**.

 For more details please refer to our [paper](https://arxiv.org/abs/2402.10422) and the [original repo](https://github.com/mt-upc/ZeroSwot) built on fairseq.

+## Architecture
+
+The compression module is a lightweight transformer that takes as input the hidden states of wav2vec2.0 and the corresponding CTC predictions, compresses them into subword-like embeddings similar to those expected by NLLB, and aligns them using Optimal Transport. At inference we simply pass the output of the speech encoder to the NLLB encoder.

 <div align=center><img src="resources/methodology.png" height="100%" width="100%"/></div>
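
The compression step can be pictured as CTC-style grouping: drop blank frames, merge runs of consecutive identical predictions, and pool the wav2vec2.0 states of each surviving group into one subword-like embedding. The following is a minimal sketch of that grouping with toy values; the blank id of 0 and the 2-dim states are assumptions for illustration, and the actual module is a learned transformer aligned with Optimal Transport, not this mean-pooling heuristic.

```python
# Toy frame-level CTC argmax ids (0 = assumed CTC blank) and per-frame
# hidden states. In ZeroSwot these would be wav2vec2.0 representations;
# the 2-dim vectors here are placeholders.
ctc_ids = [0, 5, 5, 0, 7, 7, 7, 2]
hidden = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [0.0, 0.0],
          [2.0, 2.0], [4.0, 4.0], [6.0, 6.0], [8.0, 8.0]]

# CTC-style collapse: skip blanks, merge consecutive repeats, and record
# which frame indices belong to each predicted subword.
groups = []  # list of (token_id, [frame_indices])
prev = None
for t, tok in enumerate(ctc_ids):
    if tok != 0:
        if tok == prev:
            groups[-1][1].append(t)
        else:
            groups.append((tok, [t]))
    prev = tok

# Mean-pool the frames of each group into one subword-level embedding.
tokens = [tok for tok, _ in groups]
embeddings = [
    [sum(hidden[t][d] for t in idxs) / len(idxs) for d in range(2)]
    for _, idxs in groups
]
print(tokens)      # [5, 7, 2]
print(embeddings)  # [[2.0, 2.0], [4.0, 4.0], [8.0, 8.0]]
```

Here eight frames collapse to three subword-level vectors; in ZeroSwot such compressed embeddings are what the NLLB encoder consumes in place of its usual subword embeddings.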

+## Version
+
+This version of ZeroSwot is trained with ASR data from CommonVoice, adapting [wav2vec2.0-large](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self) to the [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) model.
+
 ## Usage

+The model is tested with Python 3.9.16 and Transformers v4.41.2. Also install torchaudio and sentencepiece for audio and text processing.

 ```bash
 pip install transformers torchaudio sentencepiece