MolecularDiffusion.modules.models.tabasco.chem.convert

Attributes

log

Classes

MoleculeConverter

Bidirectional converter between RDKit molecules and TensorDicts.

Module Contents

class MolecularDiffusion.modules.models.tabasco.chem.convert.MoleculeConverter(atom_names=ATOM_NAMES, atom_color_map=ATOM_COLOR_MAP, dataset_normalizer=2.0)

Bidirectional converter between RDKit molecules and TensorDicts.

Coordinates are optionally centred and divided by dataset_normalizer to improve numerical stability for learning tasks.

Args:
  • atom_names – Allowed element symbols; ‘*’ is treated as a dummy.

  • atom_color_map – Parallel list of RGB colour triples.

  • dataset_normalizer – Value used to scale coordinates; see to_tensor / from_tensor.
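The centring and scaling described above can be illustrated with a minimal sketch. It is not the library's implementation: plain Python lists stand in for tensors, the helper names normalize_coords and rescale_coords are hypothetical, and the centre is taken as the unweighted mean position, which is an assumption (the converter may weight atoms by mass).

```python
from typing import List, Sequence

Coords = List[List[float]]


def normalize_coords(
    coords: Sequence[Sequence[float]], dataset_normalizer: float = 2.0
) -> Coords:
    """Centre coordinates on their mean position, then divide by the normalizer."""
    n_atoms = len(coords)
    # Assumption: unweighted mean per axis (the real converter may use atomic masses).
    centre = [sum(row[axis] for row in coords) / n_atoms for axis in range(3)]
    return [
        [(row[axis] - centre[axis]) / dataset_normalizer for axis in range(3)]
        for row in coords
    ]


def rescale_coords(
    coords: Sequence[Sequence[float]], dataset_normalizer: float = 2.0
) -> Coords:
    """Undo the scaling only; the original centre is not restored."""
    return [[value * dataset_normalizer for value in row] for row in coords]


raw = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
scaled = normalize_coords(raw)      # centred on the origin, then divided by 2.0
restored = rescale_coords(scaled)   # same geometry as raw, but still centred
```

Note that multiplying back by dataset_normalizer recovers the original interatomic distances, not the original absolute position, because the centre offset is discarded during normalisation.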

data_to_atom_array(mol_tensor: tensordict.TensorDict, rescale_coords: bool = True, add_bonds=True, add_hydrogens=True, sanitize=True) -> biotite.structure.AtomArray

Shortcut: TensorDict -> RDKit Mol -> Biotite AtomArray.

from_batch(batch: tensordict.TensorDict, **kwargs) -> List[rdkit.Chem.Mol]

Vectorised wrapper around from_tensor for batched data.

Unconvertible items are returned as None and logged as warnings.
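Because failed items come back as None, callers generally need to filter the result before further processing. The following sketch shows that pattern with a simple per-item loop. The names convert_batch, convert_one and parse_positive are hypothetical stand-ins, and signalling failure through an exception is an assumption; the real method may detect failures differently.

```python
import logging
from typing import Callable, List, Optional, Sequence, TypeVar

log = logging.getLogger(__name__)

T = TypeVar("T")
R = TypeVar("R")


def convert_batch(
    items: Sequence[T], convert_one: Callable[[T], R]
) -> List[Optional[R]]:
    """Apply convert_one to each item, keeping None where conversion fails."""
    results: List[Optional[R]] = []
    for index, item in enumerate(items):
        try:
            results.append(convert_one(item))
        except Exception as exc:  # assumption: failures surface as exceptions
            log.warning("Could not convert item %d: %s", index, exc)
            results.append(None)
    return results


def parse_positive(value: int) -> int:
    """Toy converter that rejects non-positive input."""
    if value <= 0:
        raise ValueError("value must be positive")
    return value


converted = convert_batch([1, -2, 3], parse_positive)
# Drop the failed entries before using the results downstream.
valid = [item for item in converted if item is not None]
```

Keeping the None placeholders in the returned list preserves the alignment between batch indices and outputs, which is useful when results have to be matched back to their inputs.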

from_tensor(mol_tensor: tensordict.TensorDict, rescale_coords: bool = True, sanitize: bool = True, use_openbabel: bool = True)

Inverse of to_tensor.

Parameters:
  • mol_tensor – Unbatched TensorDict.

  • rescale_coords – Multiply coords back by dataset_normalizer.

  • sanitize – Run Chem.SanitizeMol; may fail on exotic molecules.

  • use_openbabel – Toggle OpenBabel bond inference.

Returns:

RDKit Mol or None if any step fails.

mol_to_atom_array(mol: rdkit.Chem.Mol) -> biotite.structure.AtomArray

Convert an RDKit Mol to a Biotite AtomArray (with bonds).

tensor_obj_to_points(tensor_obj: tensordict.TensorDict) -> Tuple[torch.Tensor, torch.Tensor]

Return (coords, atom_type_idx) with padding rows removed.
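A rough sketch of what removing padding rows amounts to is given below, using plain lists in place of tensors. The helper strip_padding is hypothetical, and the convention that True in padding_mask marks a padded row is an assumption; check the mask polarity in the actual implementation.

```python
from typing import List, Sequence, Tuple


def strip_padding(
    coords: Sequence[Sequence[float]],
    atomics: Sequence[Sequence[int]],
    padding_mask: Sequence[bool],
) -> Tuple[List[List[float]], List[int]]:
    """Return (coords, atom_type_idx) for the rows that are not padding."""
    kept_coords: List[List[float]] = []
    kept_types: List[int] = []
    for xyz, one_hot, is_padding in zip(coords, atomics, padding_mask):
        if is_padding:  # assumption: True marks a padded row
            continue
        kept_coords.append(list(xyz))
        # Index of the largest entry, i.e. the argmax of the one-hot row.
        kept_types.append(max(range(len(one_hot)), key=lambda i: one_hot[i]))
    return kept_coords, kept_types


coords = [[0.0, 0.0, 0.0], [0.6, 0.0, 0.0], [0.0, 0.0, 0.0]]
atomics = [[1, 0, 0], [0, 0, 1], [0, 0, 0]]
padding_mask = [False, False, True]

points, atom_type_idx = strip_padding(coords, atomics, padding_mask)
```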

to_tensor(mol: rdkit.Chem.Mol, pad_to_size: int | None = None, normalize_coords: bool = True, remove_hydrogens: bool = True) -> tensordict.TensorDict

Convert an RDKit mol to a TensorDict.

Parameters:
  • mol – Input molecule with 3-D conformer.

  • pad_to_size – If given, output is padded to this atom count.

  • normalize_coords – If True, subtract the centre of mass from the coordinates, then divide them by dataset_normalizer.

  • remove_hydrogens – If True, strip explicit H atoms.

Returns:

TensorDict with keys:

  • coords: (N, 3) float32

  • atomics: (N, n_elements) one-hot

  • padding_mask: (N,) bool (optional)
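To make the output layout concrete, the sketch below builds the three keys from a list of element symbols and coordinates, with a dict of lists standing in for a TensorDict. Everything here is illustrative: encode_atoms and this ATOM_NAMES vocabulary are hypothetical, and the choices to fill padded rows with zeros, to mark padding with True, and to raise when the molecule exceeds pad_to_size are assumptions rather than documented behaviour.

```python
from typing import Dict, List, Optional, Sequence

ATOM_NAMES = ["C", "N", "O", "*"]  # hypothetical vocabulary for illustration


def encode_atoms(
    symbols: Sequence[str],
    coords: Sequence[Sequence[float]],
    pad_to_size: Optional[int] = None,
) -> Dict[str, list]:
    """Build coords, one-hot atomics and an optional padding_mask."""
    n_atoms = len(symbols)
    size = n_atoms if pad_to_size is None else pad_to_size
    if size < n_atoms:
        # Assumption: oversize molecules are rejected rather than truncated.
        raise ValueError("molecule has more atoms than pad_to_size")
    n_pad = size - n_atoms

    atomics: List[List[int]] = []
    for symbol in symbols:
        row = [0] * len(ATOM_NAMES)
        row[ATOM_NAMES.index(symbol)] = 1  # one-hot over the vocabulary
        atomics.append(row)

    out: Dict[str, list] = {
        # Assumption: padded rows are all zeros.
        "coords": [list(xyz) for xyz in coords]
        + [[0.0, 0.0, 0.0] for _ in range(n_pad)],
        "atomics": atomics + [[0] * len(ATOM_NAMES) for _ in range(n_pad)],
    }
    if pad_to_size is not None:
        # Assumption: True marks a padded row.
        out["padding_mask"] = [False] * n_atoms + [True] * n_pad
    return out


encoded = encode_atoms(
    ["C", "O"], [[-0.3, 0.0, 0.0], [0.3, 0.0, 0.0]], pad_to_size=3
)
unpadded = encode_atoms(["N"], [[0.0, 0.0, 0.0]])
```

In this sketch padding_mask is only present when pad_to_size is given, mirroring the "(optional)" note on that key.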

dataset_normalizer = 2.0
MolecularDiffusion.modules.models.tabasco.chem.convert.log