MolecularDiffusion.data.component.dataset

Module Contents

class MolecularDiffusion.data.component.dataset.GraphDataset

Bases: torch.utils.data.Dataset

atom_types()

All atom types.

get_item(index)
get_property(task)
load_csv(csv_file: str, xyz_dir: str, xyz_field: str = 'xyz', smiles_field: str = 'smiles', target_fields: List[str] | None = None, atom_vocab: List[str] = [], node_feature_choice: str | None = None, forbidden_atoms: List[str] = [], verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, **kwargs)

Load the dataset from a csv file.

Parameters:
  • csv_file (str) – file name

  • xyz_dir (str) – directory to store XYZ files

  • xyz_field (str) – name of the XYZ column in the table

  • smiles_field (str, optional) – name of the SMILES column in the table. Use None if there is no SMILES column.

  • target_fields (list of str, optional) – names of the target columns in the table. Default is all columns other than the SMILES column.

  • atom_vocab (list of str, optional) – atom types

  • node_feature_choice (str, optional) – geom features to extract

  • forbidden_atoms (list of str, optional) – forbidden atoms

  • verbose (int, optional) – output verbose level

  • **kwargs
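As a rough illustration of the table that load_csv reads, the sketch below writes a small CSV using the default column names xyz and smiles plus one target column. The file names, the gap column and its values are hypothetical, and it is an assumption that the xyz column holds file names resolved against xyz_dir. The commented-out call also assumes that GraphDataset can be constructed without arguments.

```python
import csv
import os
import tempfile

# Hypothetical rows: "xyz" holds XYZ file names (assumption), "smiles" holds
# SMILES strings, and "gap" is an illustrative target column.
rows = [
    {"xyz": "mol_0.xyz", "smiles": "O", "gap": 0.25},
    {"xyz": "mol_1.xyz", "smiles": "C", "gap": 0.31},
]

workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "dataset.csv")
with open(csv_path, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["xyz", "smiles", "gap"])
    writer.writeheader()
    writer.writerows(rows)

# Read the header back to confirm the column names the loader would look up.
with open(csv_path, newline="") as handle:
    header = next(csv.reader(handle))

# Assumed usage (not executed here; the constructor call is an assumption):
# dataset = GraphDataset()
# dataset.load_csv(csv_path, xyz_dir=workdir, xyz_field="xyz",
#                  smiles_field="smiles", target_fields=["gap"])
```

Passing target_fields explicitly avoids relying on the default column selection.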

load_db(db_path: str, atom_vocab: List[str] = [], node_feature_choice: List[str] | None = None, target_fields: List[str] | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], edge_type: str = 'distance', radius: float = 4.0, n_neigh: int = 5, verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, **kwargs: Any)

Load the dataset from an ASE db file.

Parameters:
  • db_path (str) – path to ASE db file

  • atom_vocab (list of str, optional) – atom types

  • node_feature_choice (list of str, optional) – RDKit atom features to extract

  • target_fields (list of str, optional) – names of the target columns in the table.

  • transform (Callable, optional) – data transformation function

  • max_atom (int, optional) – maximum number of atoms in a molecule

  • with_hydrogen (bool, optional) – whether to add hydrogen atoms

  • forbidden_atoms (list of str, optional) – forbidden atoms

  • edge_type (str, optional) – type of edge used to construct the graph; options: distance, neighbor (default: distance)

  • radius (float, optional) – radius to construct the graph (default: 4.0)

  • n_neigh (int, optional) – number of neighbors to consider (default: 5)

  • verbose (int, optional) – output verbose level

  • null_value (float, optional) – null value for missing context data

  • **kwargs
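The edge_type, radius and n_neigh parameters suggest two common ways of building a molecular graph from coordinates. The sketch below is one plausible reading, not the library's implementation: a distance graph connects atoms closer than a cutoff, and a neighbor graph connects each atom to its nearest atoms. The helper names radius_edges and knn_edges are hypothetical.

```python
import math

def radius_edges(positions, radius=4.0):
    """Connect every ordered pair of distinct atoms closer than `radius`."""
    edges = []
    for i, p in enumerate(positions):
        for j, q in enumerate(positions):
            if i != j and math.dist(p, q) < radius:
                edges.append((i, j))
    return edges

def knn_edges(positions, n_neigh=5):
    """Connect each atom to its `n_neigh` nearest other atoms."""
    edges = []
    for i, p in enumerate(positions):
        others = sorted(
            (j for j in range(len(positions)) if j != i),
            key=lambda j: math.dist(p, positions[j]),
        )
        edges.extend((i, j) for j in others[:n_neigh])
    return edges

# Three collinear atoms spaced 3 units apart along x.
positions = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
distance_graph = radius_edges(positions, radius=4.0)
neighbor_graph = knn_edges(positions, n_neigh=1)
```

With a cutoff, the number of edges per atom varies with local density. With a fixed neighbor count, every atom has the same out-degree regardless of how far away its neighbors are.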

load_npy(coords: torch.Tensor, natoms: torch.Tensor, smiles_list: List[str], targets: Dict[str, List[float | int]], atom_vocab: List[str] = [], node_feature_choice: str | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], edge_type: str = 'distance', radius: float = 4.0, n_neigh: int = 5, verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, **kwargs: Any)

Load the dataset from npy tensors.

Parameters:
  • coords (tensor) – tensor of shape [total_atoms, 5]; each row is [mol_idx, Z, x, y, z]

  • natoms (tensor) – tensor of number of atoms per molecule

  • smiles_list (list of str) – SMILES strings

  • targets (dict of list) – prediction targets

  • atom_vocab (list of str) – atom types

  • node_feature_choice (str, optional) – geom features to extract

  • transform (Callable, optional) – data transformation function

  • max_atom (int, optional) – maximum number of atoms in a molecule

  • with_hydrogen (bool, optional) – whether to include hydrogen atoms

  • forbidden_atoms (list of str, optional) – forbidden atoms

  • edge_type (str, optional) – type of edge to construct the graph

  • radius (float, optional) – radius to construct the graph

  • n_neigh (int, optional) – number of neighbors to consider

  • verbose (int, optional) – output verbose level

  • **kwargs
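The flat [total_atoms, 5] layout described for coords can be assembled from per-molecule atom lists as sketched below. The molecules and their coordinates are illustrative placeholders. Plain Python lists are used so the sketch runs without PyTorch; the conversion to tensors is shown only as a comment.

```python
# Hypothetical molecules: each is a list of (atomic_number, x, y, z) tuples.
molecules = [
    [(8, 0.0, 0.0, 0.117), (1, 0.0, 0.757, -0.471), (1, 0.0, -0.757, -0.471)],
    [(1, 0.0, 0.0, 0.0), (1, 0.0, 0.0, 0.74)],
]

coords_rows = []  # one row per atom: [mol_idx, Z, x, y, z]
natoms_list = []  # one entry per molecule
for mol_idx, atoms in enumerate(molecules):
    natoms_list.append(len(atoms))
    for atomic_number, x, y, z in atoms:
        coords_rows.append([float(mol_idx), float(atomic_number), x, y, z])

# Assumed conversion before calling load_npy (requires PyTorch):
# coords = torch.tensor(coords_rows)
# natoms = torch.tensor(natoms_list)
```

The sum of natoms_list equals the number of rows in coords_rows, which is a cheap consistency check to run before loading.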

load_pickle(pkl_file, verbose=0)

Load the dataset from a pickle file.

Parameters:
  • pkl_file (str) – file name

  • verbose (int, optional) – output verbose level

load_smiles()
load_xyz(xyz_list: List[str], smiles_list: List[str], targets: Dict[str, List[float | int]], atom_vocab: List[str] = [], node_feature_choice: str | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], edge_type: str = 'distance', radius: float = 4.0, n_neigh: int = 5, verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, **kwargs: Any)

Load the dataset from XYZ and targets.

Parameters:
  • xyz_list (list of str) – XYZ file names

  • smiles_list (list of str) – SMILES strings

  • targets (dict of list) – prediction targets

  • atom_vocab (list of str) – atom types

  • node_feature_choice (str, optional) – geom features to extract

  • transform (Callable, optional) – data transformation function

  • max_atom (int, optional) – maximum number of atoms in a molecule (default: 200)

  • with_hydrogen (bool, optional) – whether to add hydrogen atoms

  • forbidden_atoms (list of str, optional) – forbidden atoms

  • edge_type (str, optional) – type of edge used to construct the graph; options: distance, neighbor (default: distance)

  • radius (float, optional) – radius to construct the graph (default: 4.0)

  • n_neigh (int, optional) – number of neighbors to consider (default: 5)

  • verbose (int, optional) – output verbose level

  • **kwargs
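The three inputs xyz_list, smiles_list and targets are parallel collections, so entry i of each must describe the same molecule. The sketch below writes two small files in the standard XYZ layout (atom count, comment line, one row per atom) and builds the aligned inputs. The write_xyz helper, the gap target and all values are hypothetical, and passing full paths in xyz_list is an assumption.

```python
import os
import tempfile

def write_xyz(path, symbols, positions, comment=""):
    """Write one molecule in XYZ format: atom count, comment line, atom rows."""
    with open(path, "w") as handle:
        handle.write(f"{len(symbols)}\n{comment}\n")
        for symbol, (x, y, z) in zip(symbols, positions):
            handle.write(f"{symbol} {x:.6f} {y:.6f} {z:.6f}\n")

# Hypothetical records: (SMILES, element symbols, coordinates, target value).
records = [
    ("O", ["O", "H", "H"],
     [(0.0, 0.0, 0.117), (0.0, 0.757, -0.471), (0.0, -0.757, -0.471)], 0.25),
    ("[H][H]", ["H", "H"], [(0.0, 0.0, 0.0), (0.0, 0.0, 0.74)], 0.31),
]

workdir = tempfile.mkdtemp()
xyz_list, smiles_list, targets = [], [], {"gap": []}
for index, (smiles, symbols, positions, gap) in enumerate(records):
    path = os.path.join(workdir, f"mol_{index}.xyz")
    write_xyz(path, symbols, positions, comment=smiles)
    xyz_list.append(path)
    smiles_list.append(smiles)
    targets["gap"].append(gap)

# Assumed usage (not executed here):
# dataset.load_xyz(xyz_list, smiles_list, targets)
```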

save_pickle(pkl_file, verbose=0)

Save the dataset to a pickle file.

Parameters:
  • pkl_file (str) – file name

  • verbose (int, optional) – output verbose level

property num_atom_type

Number of different atom types.

property num_atoms

Number of atoms in each molecule.

property tasks

List of tasks.

class MolecularDiffusion.data.component.dataset.PointCloudDataset

Bases: torch.utils.data.Dataset

atom_types()

All atom types.

get_item(index)
get_property(task)
get_tabasco_stats()

Get dataset statistics required for TABASCO unconditional sampling.

Returns:

Dictionary with:

  • max_atoms: Maximum number of atoms in dataset

  • num_atom_types: Number of atom types in vocabulary

  • atom_count_histogram: Histogram of molecule sizes

  • all_smiles: List of all SMILES strings
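To make the shape of this result concrete, the sketch below builds a dictionary with the same four keys from plain lists. It is a stand-in written for illustration, not the method's implementation. The toy_stats helper is hypothetical, and representing the histogram as a mapping from molecule size to count is an assumption.

```python
from collections import Counter

def toy_stats(atom_counts, atom_vocab, smiles):
    """Build a dictionary with the four documented keys from plain lists."""
    return {
        "max_atoms": max(atom_counts),
        "num_atom_types": len(atom_vocab),
        # Assumed representation: molecule size -> number of molecules.
        "atom_count_histogram": dict(Counter(atom_counts)),
        "all_smiles": list(smiles),
    }

# Three hypothetical molecules with 3, 5 and 3 atoms.
stats = toy_stats([3, 5, 3], ["H", "C", "O"], ["O", "C", "O"])
```

An unconditional sampler can draw a molecule size from such a histogram before generating that many atoms.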

load_csv(csv_file, xyz_dir, xyz_field='xyz', smiles_field='smiles', target_fields=None, atom_vocab=[], node_feature_choice=None, forbidden_atoms=[], null_value=math.nan, verbose=0, allow_unknown=False, use_ohe_feature=True, **kwargs)

Load the dataset from a csv file.

Parameters:
  • csv_file (str) – file name

  • xyz_dir (str) – directory to store XYZ files

  • xyz_field (str) – name of the XYZ column in the table

  • smiles_field (str, optional) – name of the SMILES column in the table. Use None if there is no SMILES column.

  • target_fields (list of str, optional) – names of the target columns in the table. Default is all columns other than the SMILES column.

  • atom_vocab (list of str, optional) – atom types

  • node_feature_choice (str, optional) – geom features to extract

  • forbidden_atoms (list of str, optional) – forbidden atoms

  • null_value (float, optional) – null value for missing targets (default: math.nan)

  • verbose (int, optional) – output verbose level

  • **kwargs

load_db(db_path: str, atom_vocab: List[str] = [], node_feature_choice: List[str] | None = None, target_fields: List[str] | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], pad_data: bool = False, verbose: int = 0, null_value=math.nan, allow_unknown: bool = False, use_ohe_feature: bool = True, **kwargs: Any)

Load the dataset from an ASE db file.

Parameters:
  • db_path (str) – path to ASE db file

  • atom_vocab (list of str, optional) – atom types

  • node_feature_choice (list of str, optional) – RDKit atom features to extract

  • target_fields (list of str, optional) – names of the target columns in the table.

  • transform (Callable, optional) – data transformation function

  • max_atom (int, optional) – maximum number of atoms in a molecule

  • with_hydrogen (bool, optional) – whether to add hydrogen atoms

  • forbidden_atoms (list of str, optional) – forbidden atoms

  • pad_data (bool, optional) – whether to pad data to max_atom

  • verbose (int, optional) – output verbose level

  • null_value (float, optional) – null value for missing context data

  • **kwargs

load_npy(coords: torch.Tensor, natoms: torch.Tensor, smiles_list: List[str], targets: Dict[str, List[float | int]], atom_vocab: List[str] = [], node_feature_choice: str | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], pad_data: bool = False, verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, **kwargs: Any)

Load the dataset from npy and targets.

Parameters:
  • coords (tensor) – tensor of coordinates

  • natoms (tensor) – tensor of number of atoms

  • smiles_list (list of str) – SMILES strings

  • targets (dict of list) – prediction targets

  • atom_vocab (list of str) – atom types

  • node_feature_choice (str, optional) – geom features to extract

  • transform (Callable, optional) – data transformation function

  • max_atom (int, optional) – maximum number of atoms in a molecule (default: 200)

  • with_hydrogen (bool, optional) – whether to add hydrogen atoms

  • forbidden_atoms (list of str, optional) – forbidden atoms

  • pad_data (bool, optional) – whether to pad data to max_atom

  • verbose (int, optional) – output verbose level

  • **kwargs
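The pad_data option is described only as padding data to max_atom. One common way to do this for point clouds is sketched below: append zero rows up to a fixed length and keep a mask marking the real atoms. This is an assumption about the general technique, not a description of what the loader does internally. The pad_molecule helper is hypothetical, and raising on oversized molecules is a choice made for the sketch.

```python
def pad_molecule(positions, max_atom):
    """Pad (x, y, z) rows with zeros up to `max_atom` and return a 0/1 mask."""
    n_real = len(positions)
    if n_real > max_atom:
        raise ValueError(f"molecule has {n_real} atoms, more than max_atom={max_atom}")
    n_pad = max_atom - n_real
    padded = [list(p) for p in positions] + [[0.0, 0.0, 0.0] for _ in range(n_pad)]
    mask = [1] * n_real + [0] * n_pad
    return padded, mask

# A two-atom molecule padded to four slots.
padded, mask = pad_molecule([(0.0, 0.0, 0.0), (0.0, 0.0, 0.74)], max_atom=4)
```

Fixed-length samples can be stacked into a batch directly, and the mask lets downstream code ignore the padded slots.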

load_pickle(pkl_file, verbose=0, cheap_data=False)

Load the dataset from a pickle file.

Parameters:
  • pkl_file (str) – file name

  • verbose (int, optional) – output verbose level

load_xyz(xyz_list: List[str], smiles_list: List[str], targets: Dict[str, List[float | int]], atom_vocab: List[str] = [], node_feature_choice: str | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], pad_data: bool = False, verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, **kwargs: Any)

Load the dataset from XYZ and targets.

Parameters:
  • xyz_list (list of str) – XYZ file names

  • smiles_list (list of str) – SMILES strings

  • targets (dict of list) – prediction targets

  • atom_vocab (list of str) – atom types

  • node_feature_choice (str, optional) – geom features to extract

  • transform (Callable, optional) – data transformation function

  • max_atom (int, optional) – maximum number of atoms in a molecule (default: 200)

  • with_hydrogen (bool, optional) – whether to add hydrogen atoms

  • pad_data (bool, optional) – whether to pad data to max_atom

  • forbidden_atoms (list of str, optional) – forbidden atoms

  • verbose (int, optional) – output verbose level

  • **kwargs

save_pickle(pkl_file, verbose=0, cheap_data=False)

Save the dataset to a pickle file.

Parameters:
  • pkl_file (str) – file name

  • verbose (int, optional) – output verbose level

property num_atom_type

Number of different atom types.

property num_atoms

Number of atoms in each molecule.

property tasks

List of tasks.

MolecularDiffusion.data.component.dataset.BASE_ATOM_VOCAB = ['H', 'B', 'C', 'N', 'O', 'F', 'Mg', 'Si', 'P', 'S', 'Cl', 'Cu', 'Zn', 'Ge', 'As', 'Se', 'Br', 'Sn', 'I']
MolecularDiffusion.data.component.dataset.Chem = None
MolecularDiffusion.data.component.dataset.hybiridization_map
MolecularDiffusion.data.component.dataset.logger
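As an illustration of how an atom vocabulary such as BASE_ATOM_VOCAB can drive one-hot atom features, the sketch below encodes an element symbol as a vector with one slot per vocabulary entry. The one_hot helper is hypothetical. Reserving an extra final slot for out-of-vocabulary atoms when allow_unknown is set is an assumption about what that flag might mean, not confirmed behaviour.

```python
# Copied from the module attribute listed above.
BASE_ATOM_VOCAB = ['H', 'B', 'C', 'N', 'O', 'F', 'Mg', 'Si', 'P', 'S', 'Cl',
                   'Cu', 'Zn', 'Ge', 'As', 'Se', 'Br', 'Sn', 'I']

def one_hot(symbol, vocab, allow_unknown=False):
    """Return a one-hot list for `symbol`; optionally map unknowns to a last slot."""
    size = len(vocab) + (1 if allow_unknown else 0)
    vector = [0] * size
    if symbol in vocab:
        vector[vocab.index(symbol)] = 1
    elif allow_unknown:
        vector[-1] = 1  # assumed "unknown" slot
    else:
        raise ValueError(f"atom type {symbol!r} is not in the vocabulary")
    return vector

carbon = one_hot("C", BASE_ATOM_VOCAB)
xenon = one_hot("Xe", BASE_ATOM_VOCAB, allow_unknown=True)
```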