gLM2_650M
gLM2 is a mixed-modality genomic language model, trained on the OMG Dataset
.
The model encodes a genomic scaffold with both both amino-acid and DNA tokens.
gLM2 is trained at two scales: 150M (available at tattabio/gLM2_150M
) and 650M parameters.
See https://github.com/TattaBio/gLM2 for inference scripts.
Model Description
gLM2 is a transformer encoder trained with the masked language modeling objective.
It encodes a genomic contig as a sequence of protein coding sequences (CDS) and DNA inter-genic sequences (IGS).
CDS elements are tokenized using per-amino acid tokens, and IGS elements are tokenized using byte-pair encoding with a vocabulary size of 4,096.
- To encode the genomic strand, we prepended each genomic element with a special token, either
<+>
or<->
to indicate the positive and negative strands. - To avoid collision between amino acid and nucleotide tokens, the tokenizer expects all amino acids to be uppercase, and all nucleotides to be lowercase.
Getting Started
import torch
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('tattabio/gLM2_650M', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda()
tokenizer = AutoTokenizer.from_pretrained('tattabio/gLM2_650M', trust_remote_code=True)
# A contig with two proteins and an inter-genic sequence.
# NOTE: Nucleotides should always be lowercase, and prepended with `<+>`.
sequence = "<+>MALTKVEKRNRIKRRVRGKISGTQASPRLSVYKSNK<+>aatttaaggaa<->MLGIDNIERVKPGGLELVDRLVAVNRVTKVTKGGRAFGFSAIVVVGNED"
# Tokenize the sequence.
encodings = tokenizer([sequence], return_tensors='pt')
# Extract embeddings.
with torch.no_grad():
embeddings = model(encodings.input_ids.cuda(), output_hidden_states=True).last_hidden_state
Training Data
gLM2 is trained on the OMG
dataset.
To improve the dataset balance and remove near-duplicate examples, the data is tokenized and pruned by applying Semantic Deduplication SemDedup.
We use an embedding distance threshold of 1e-3, resulting in 42% of the dataset being pruned.
Training Details
- Pretraining tokens: 275B
- Context length: 2048
- Masking rate: 30%
- Learning rate: 5e-4
- Optimizer: AdamW (betas = (0.9, 0.95))
- Mixed precision training: bfloat16
- Weight decay: 0.1
Citation
BibTeX:
TODO
- Downloads last month
- 440