Language Models for Molecule Generation

We build AI-enhanced algorithms for molecular design with the goal of reaching superhuman performance. Our aim is to create systems that generate drug candidate molecules with properties specified by medicinal chemists, either in a single step or through iterative optimization. We have developed language models that understand 2D molecular graphs and basic molecular properties, and have combined them with evolutionary algorithms to generate property-conditioned molecules beyond those in existing databases. We are currently extending these models with 3D molecular understanding and developing more realistic, practically relevant benchmarks for molecular design.

2025

Scaling Laws for LLM-based Molecular Optimization Algorithms

We show that, on simple molecular optimization tasks, evolutionary algorithms enhanced with LLM-based molecule generation scale along two axes: LLM size and number of optimization steps.

2025

Towards Molecular Conformer Generation with Language Models

We trained language models that directly generate the 3D structures of drug-like molecules, and we show that results improve with model scale.

2024

Small Molecule Optimization with Large Language Models

Language models with up to 2B parameters (Chemlactica and Chemma), combined with a genetic algorithm, produce state-of-the-art results on most molecular optimization benchmarks.
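
The general shape of a generator-in-the-loop genetic algorithm can be illustrated with a minimal sketch. This is not the published method: the `propose` function below is a hypothetical placeholder for the language-model step (it only mutates one character), the strings are toy sequences rather than valid SMILES, and `toy_oracle` is an invented scoring function standing in for a real property oracle.

```python
import random

ALPHABET = "CNO"  # toy alphabet; real systems operate on SMILES strings


def toy_oracle(s: str) -> float:
    """Hypothetical property score: fraction of 'N' characters in the string."""
    return s.count("N") / len(s)


def propose(parents, rng):
    """Placeholder for the language-model step (assumption, not the real model).

    A real system would prompt a model with high-scoring parents and sample a
    new molecule; here we just change one character of a random parent.
    """
    parent = rng.choice(parents)
    i = rng.randrange(len(parent))
    return parent[:i] + rng.choice(ALPHABET) + parent[i + 1:]


def optimize(oracle, pool_size=8, n_offspring=16, steps=30, length=12, seed=0):
    """Elitist genetic loop: generate offspring, score them, keep the best."""
    rng = random.Random(seed)
    pool = [
        "".join(rng.choice(ALPHABET) for _ in range(length))
        for _ in range(pool_size)
    ]
    history = []  # best score after each optimization step
    for _ in range(steps):
        offspring = [propose(pool, rng) for _ in range(n_offspring)]
        # Deduplicate while preserving order, then keep the top candidates.
        candidates = list(dict.fromkeys(pool + offspring))
        pool = sorted(candidates, key=oracle, reverse=True)[:pool_size]
        history.append(oracle(pool[0]))
    return pool, history


if __name__ == "__main__":
    pool, history = optimize(toy_oracle)
    print("best candidate:", pool[0], "score:", history[-1])
```

Because parents compete with their offspring for a place in the pool, the best score never decreases from one step to the next. In this sketch, swapping `propose` for a call to a trained generator and `toy_oracle` for a real property evaluator are the two places where an actual system would differ.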

2024

BARTSmiles: large-scale generative masked language models for molecular representations

A BART-like encoder-decoder model trained on 1.7 billion SMILES strings. It demonstrated competitive performance after fine-tuning on property prediction, chemical reaction prediction, and retrosynthesis tasks.

2022

Improved molecular representations for property prediction based on VAEs

We show that property predictors operating on the latent space of VAEs can improve downstream performance on related property prediction tasks.