Models Directory

Welcome to the Open Source Models Directory! This document provides a comprehensive list of the open-source Large Language Models (LLMs), text embedding, image generation, audio generation, and video generation models supported by FlexStack.

Large Language Models (LLMs)

Gemma

Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. Developed by Google DeepMind and other teams across Google, Gemma is named after the Latin gemma, meaning "precious stone".

Mistral-7B-v0.1

The Mistral-7B-v0.1 Large Language Model (LLM) is a pretrained generative text model with 7 billion parameters.
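
As a minimal sketch of how a model like this can be run, the snippet below loads Mistral-7B-v0.1 through the Hugging Face transformers library; the mistralai/Mistral-7B-v0.1 checkpoint name and the generation settings are assumptions based on the public release, not a FlexStack-specific API.

```python
# Minimal sketch: text generation with Mistral-7B-v0.1 via Hugging Face
# transformers. Checkpoint name and settings are assumptions, not a
# FlexStack API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Open-source models are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```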

Text Embeddings

GTE

The GTE (General Text Embedding) models, crafted by Alibaba DAMO Academy, are advanced text embedding models featuring a multi-stage contrastive learning approach. They're trained using a diverse mixture of datasets from multiple sources, including web pages, academic papers, social media, and code repositories. This model is particularly noted for its performance in a range of NLP and code-related tasks despite its modest parameter count of 110M.
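
As a minimal sketch of computing embeddings with a GTE model, the snippet below uses the sentence-transformers library; the thenlper/gte-base checkpoint name (matching the 110M-parameter description above) is an assumption, not a FlexStack-specific API.

```python
# Minimal sketch: text embeddings with a GTE model via sentence-transformers.
# The thenlper/gte-base checkpoint is an assumption based on the
# 110M-parameter description, not a FlexStack API.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-base")
sentences = ["what is the capital of China?", "Beijing is the capital of China."]

# Normalized embeddings make cosine similarity a simple dot product
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)                             # (2, 768) for gte-base
print(util.cos_sim(embeddings[0], embeddings[1]))   # semantic similarity score
```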

Image Generation Models

Stable-Diffusion-v1-5

The Stable-Diffusion-v1-5 checkpoint was initialized with the weights of the Stable-Diffusion-v1-2 checkpoint and subsequently fine-tuned for 595k steps at resolution 512x512 on "laion-aesthetics v2 5+", with the text-conditioning dropped 10% of the time to improve classifier-free guidance sampling.
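
As a minimal sketch of generating an image with Stable-Diffusion-v1-5, the snippet below uses the diffusers library; the runwayml/stable-diffusion-v1-5 checkpoint name is an assumption based on the public release, not a FlexStack-specific API.

```python
# Minimal sketch: text-to-image with Stable-Diffusion-v1-5 via diffusers.
# The checkpoint name is an assumption, not a FlexStack API.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed public checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# 512x512 is the resolution the checkpoint was fine-tuned at
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image.save("astronaut.png")
```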

Stable Diffusion XL

With Stable Diffusion XL you can now make more realistic images with improved face generation, produce legible text within images, and create more aesthetically pleasing art using shorter prompts.

SDXL-Lightning

SDXL-Lightning is a lightning-fast text-to-image generation model that can generate high-quality 1024px images in a few steps. For more information, refer to the research paper SDXL-Lightning: Progressive Adversarial Diffusion Distillation; the model is open-sourced as part of that research.
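
A minimal sketch of few-step generation with SDXL-Lightning via diffusers follows; the repository and checkpoint file names are assumptions based on the public ByteDance release, not a FlexStack-specific API.

```python
# Minimal sketch: 4-step generation with SDXL-Lightning via diffusers.
# Repository and checkpoint names are assumptions based on the public
# ByteDance/SDXL-Lightning release, not a FlexStack API.
import torch
from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, EulerDiscreteScheduler
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

base = "stabilityai/stable-diffusion-xl-base-1.0"
repo = "ByteDance/SDXL-Lightning"
ckpt = "sdxl_lightning_4step_unet.safetensors"  # assumed distilled 4-step UNet

# Load the distilled UNet weights into a standard SDXL pipeline
unet = UNet2DConditionModel.from_config(base, subfolder="unet").to("cuda", torch.float16)
unet.load_state_dict(load_file(hf_hub_download(repo, ckpt), device="cuda"))
pipe = StableDiffusionXLPipeline.from_pretrained(
    base, unet=unet, torch_dtype=torch.float16, variant="fp16"
).to("cuda")

# The distilled model expects a trailing-timestep scheduler
pipe.scheduler = EulerDiscreteScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)

image = pipe(
    "a cinematic photo of a lighthouse at dusk",
    num_inference_steps=4,
    guidance_scale=0,
).images[0]
image.save("lightning.png")
```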

Audio Generation Models

AudioGen

AudioGen is an autoregressive transformer language model that synthesizes general audio conditioned on text (text-to-audio). Internally, AudioGen operates over discrete representations learnt from the raw waveform using an EnCodec tokenizer. It was presented in the paper AudioGen: Textually Guided Audio Generation.
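
As a minimal sketch of text-to-audio generation with AudioGen, the snippet below uses Meta's audiocraft library; the facebook/audiogen-medium checkpoint name is an assumption based on the public release, not a FlexStack-specific API.

```python
# Minimal sketch: text-to-audio with AudioGen via Meta's audiocraft library.
# The facebook/audiogen-medium checkpoint is an assumption, not a FlexStack API.
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained("facebook/audiogen-medium")
model.set_generation_params(duration=5)  # seconds of audio per sample

descriptions = ["dog barking in the distance", "sirens of an emergency vehicle"]
wav = model.generate(descriptions)  # batch of waveforms at model.sample_rate

for i, one_wav in enumerate(wav):
    # Writes audiogen_0.wav, audiogen_1.wav with loudness normalization
    audio_write(f"audiogen_{i}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```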

MusicGen

The MusicGen model was proposed in the paper Simple and Controllable Music Generation. MusicGen is a single-stage autoregressive Transformer model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. The text descriptions are passed through a frozen text encoder model to obtain a sequence of hidden-state representations. MusicGen is then trained to predict discrete audio tokens, or audio codes, conditioned on these hidden states. These audio tokens are then decoded using an audio compression model, such as EnCodec, to recover the audio waveform.
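
As a minimal sketch of the pipeline described above, the snippet below generates a short music clip with MusicGen through Hugging Face transformers; the facebook/musicgen-small checkpoint name is an assumption based on the public release, not a FlexStack-specific API.

```python
# Minimal sketch: text-conditioned music generation with MusicGen via
# Hugging Face transformers. The facebook/musicgen-small checkpoint is
# an assumption, not a FlexStack API.
import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

inputs = processor(
    text=["lo-fi hip hop beat with mellow piano"], padding=True, return_tensors="pt"
)
# ~256 audio tokens is roughly five seconds of audio
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)

sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
```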

Bark

Bark is a transformer-based text-to-audio model created by Suno. Bark can generate highly realistic, multilingual speech as well as other audio, including music, background noise, and simple sound effects. The model can also produce nonverbal communication such as laughing, sighing, and crying. To support the research community, Suno provides access to pretrained model checkpoints that are ready for inference.
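
As a minimal sketch of speech generation with Bark, the snippet below uses the transformers text-to-speech pipeline; the suno/bark-small checkpoint name is an assumption based on the public release, not a FlexStack-specific API.

```python
# Minimal sketch: speech generation with Bark via the transformers
# text-to-speech pipeline. The suno/bark-small checkpoint is an
# assumption, not a FlexStack API.
import scipy.io.wavfile
from transformers import pipeline

synthesiser = pipeline("text-to-speech", model="suno/bark-small")

# Bark accepts nonverbal cues such as [laughs] directly in the prompt
speech = synthesiser("Hello, my dog is cooler than you! [laughs]",
                     forward_params={"do_sample": True})
scipy.io.wavfile.write("bark_out.wav", rate=speech["sampling_rate"], data=speech["audio"])
```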

Video Generation Models

Text-to-Video Diffusion

This text-to-video generation diffusion model consists of three sub-networks: a text feature extractor, a text-feature-to-video latent-space diffusion model, and a video latent-space to video visual-space decoder. The overall model has about 1.7 billion parameters and supports English input. The diffusion model adopts a UNet3D structure and generates video through an iterative denoising process starting from pure Gaussian-noise video.
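
As a minimal sketch of running a text-to-video diffusion pipeline of this kind, the snippet below uses the diffusers library; the damo-vilab/text-to-video-ms-1.7b checkpoint is an assumption inferred from the 1.7-billion-parameter description, since this directory does not name the model, and it is not a FlexStack-specific API.

```python
# Minimal sketch: text-to-video generation via diffusers. The
# damo-vilab/text-to-video-ms-1.7b checkpoint is an assumption matching
# the 1.7B-parameter description above, not a FlexStack API.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()  # keeps peak GPU memory manageable

# Iterative denoising from Gaussian noise in the video latent space
frames = pipe("a panda playing guitar on times square", num_inference_steps=25).frames[0]
video_path = export_to_video(frames)
print(video_path)
```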
