SOTASTREAM: A Streaming Approach to Machine Translation Training

Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer.

GitHub Link

The GitHub link is https://github.com/marian-nmt/sotastream

Introduction

Sotastream is a data augmentation tool designed for training pipelines. It uses infinibatch to create a continuous stream of shuffled training data and offers on-the-fly data manipulation, augmentation, mixing, and sampling. It can be installed from PyPI or GitHub and provides entry points for both module and command-line usage. Developers are encouraged to install it in editable mode so that code edits take effect directly. The project documents several usage examples and pipeline options. Sotastream's development is led by the TextMT Team at Microsoft Translator.

Content

Sotastream is a tool for data augmentation in training pipelines. It uses infinibatch internally to generate an infinite stream of shuffled training data and provides a means for on-the-fly data manipulation, augmentation, mixing, and sampling.
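The idea of an infinite, shuffled, mixed, and augmented stream can be sketched in plain Python. This is only an illustration of the concept, not the Sotastream or infinibatch API: every name below (`infinite_shuffled`, `augment`, `mix`, `uppercase_source`) and the toy tab-separated data are assumptions made up for this example.

```python
import itertools
import random
from typing import Callable, Iterator, List, Sequence


def infinite_shuffled(items: Sequence[str], rng: random.Random) -> Iterator[str]:
    """Yield items forever, reshuffling at the start of every pass."""
    pool = list(items)
    if not pool:
        raise ValueError("cannot stream from an empty data source")
    while True:
        rng.shuffle(pool)
        yield from pool


def augment(stream: Iterator[str], fn: Callable[[str], str],
            prob: float, rng: random.Random) -> Iterator[str]:
    """Apply fn to each line with probability prob, on the fly."""
    for line in stream:
        yield fn(line) if rng.random() < prob else line


def mix(streams: List[Iterator[str]], weights: Sequence[float],
        rng: random.Random) -> Iterator[str]:
    """Draw the next line from one of the streams, sampled by weight."""
    while True:
        (chosen,) = rng.choices(streams, weights=weights, k=1)
        yield next(chosen)


def uppercase_source(line: str) -> str:
    """Hypothetical augmentation: upper-case the source side of a pair."""
    src, tgt = line.split("\t", 1)
    return f"{src.upper()}\t{tgt}"


# Toy parallel data (source<TAB>target), invented for this sketch.
rng = random.Random(0)
clean = ["hello\thallo", "thank you\tdanke", "good night\tgute Nacht"]
noisy = ["see you\tbis bald", "good morning\tguten Morgen"]

stream = mix(
    [
        augment(infinite_shuffled(clean, rng), uppercase_source, 0.5, rng),
        infinite_shuffled(noisy, rng),
    ],
    weights=[0.7, 0.3],
    rng=rng,
)

# The stream never ends, so a consumer takes only as much as it needs.
sample = list(itertools.islice(stream, 10))
for line in sample:
    print(line)
```

Because each stage is a generator, nothing is precomputed or written to disk; a trainer could read lines from such a stream as they are produced, which is the property the description above attributes to Sotastream.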

Alternatives & Similar Tools

Google Gemini: Google's largest and most capable AI model Free

Google Gemini is a multimodal AI model from Google DeepMind that processes text, audio, images, and more. It is reported to perform strongly on AI benchmarks, is optimized for a range of devices, and has been tested for safety and bias in line with responsible AI practices.

Video ReTalking: audio-based lip synchronization for talking-head video editing Open Source

Video ReTalking edits real-world talking-head video according to input audio, producing a high-quality…

UniSim: chat-controlled video and virtual simulation Open Source

Work in simulation first, then transfer the result to the real world to solve complex problems.

LongLLaMA: handles very long text contexts, up to 256,000 tokens Open Source

LongLLaMA is a large language model designed to handle very long text contexts, up to 256,000 tokens. It's based on OpenLLaMA and uses a technique called Focused Transformer (FoT) for training. The repository provides a smaller 3B version of LongLLaMA for free use. It can also be used as a replacement for LLaMA models with shorter contexts.

LLaVA: a model that connects a vision encoder with a language model Open Source

Large Language and Vision Assistant

Ntropy Insights: save 80% on underwriting businesses everywhere Freemium

Use bank data and Ntropy's AI to parse bank feeds and statements, extract revenue and COGS, and automatically re-create a P&L within milliseconds. Works for any industry and any geography.
