MolecularDiffusion.data.component.dataset¶

Attributes¶

`BASE_ATOM_VOCAB`
`Chem`
`hybiridization_map`
`lmdb`
`logger`

Classes¶

`GraphDataset`
`LazyChunkedDataset`	Drop-in replacement for PointCloudDataset that never loads the full dataset into RAM.
`LazyChunkedGraphDataset`	Drop-in replacement for GraphDataset that streams pyG Data objects from disk on demand.
`PointCloudDataset`

Module Contents¶

class MolecularDiffusion.data.component.dataset.GraphDataset¶

Bases: torch.utils.data.Dataset

atom_types()¶: All atom types.

get_item(index)¶

get_property(task)¶

load_csv(csv_file: str, xyz_dir: str, xyz_field: str = 'xyz', smiles_field: str = 'smiles', target_fields: List[str] | None = None, atom_vocab: List[str] = [], node_feature_choice: str | None = None, forbidden_atoms: List[str] = [], verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, **kwargs)¶

Load the dataset from a csv file.

Parameters:

csv_file (str) – file name
xyz_dir (str) – directory containing the XYZ files
xyz_field (str) – name of the XYZ column in the table
smiles_field (str, optional) – name of the SMILES column in the table. Use None if there is no SMILES column.
target_fields (list of str, optional) – name of target columns in the table. Default is all columns other than the SMILES column.
atom_vocab (list of str, optional) – atom types
node_feature_choice (str, optional) – geom features to extract
forbidden_atoms (list of str, optional) – forbidden atoms
verbose (int, optional) – output verbose level
**kwargs
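
As a minimal sketch of the table layout this loader describes, the snippet below parses a small in-memory CSV with the standard library and derives a default set of target columns. The target columns `homo` and `gap`, the file names, and the choice to exclude the XYZ column along with the SMILES column are assumptions made for illustration; the loader's internals are not shown in this reference.

```python
import csv
import io

# Hypothetical table: one row per molecule, with an XYZ file name,
# a SMILES string, and two made-up target columns.
CSV_TEXT = """xyz,smiles,homo,gap
mol_0.xyz,CCO,-0.26,0.31
mol_1.xyz,c1ccccc1,-0.25,0.24
"""


def default_target_fields(header, smiles_field="smiles", xyz_field="xyz"):
    """Return every column that is neither the SMILES nor the XYZ column.

    Dropping the XYZ column here is an assumption of this sketch.
    """
    return [name for name in header if name not in (smiles_field, xyz_field)]


reader = csv.DictReader(io.StringIO(CSV_TEXT))
rows = list(reader)
target_fields = default_target_fields(reader.fieldnames)

# Collect targets as a dict of lists, the shape taken by the `targets`
# argument of load_xyz and load_npy.
targets = {name: [float(row[name]) for row in rows] for name in target_fields}
```

A table like this would then be passed by file name as `csv_file`, with `xyz_dir` pointing at the directory that holds `mol_0.xyz` and `mol_1.xyz`.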

load_db(db_path: str, atom_vocab: List[str] = [], node_feature_choice: List[str] | None = None, target_fields: List[str] | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], edge_type: str = 'distance', radius: float = 4.0, n_neigh: int = 5, verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, chunk_size: int | None = None, chunk_dir: str | None = None, compact: Dict[str, bool] | None = None, use_row_data_features: bool = False, **kwargs: Any)¶

Load the dataset from an ASE db file.

Parameters:

db_path (str) – path to ASE db file
atom_vocab (list of str, optional) – atom types
node_feature_choice (list of str, optional) – RDKit atom features to extract
target_fields (list of str, optional) – name of target columns in the table.
transform (Callable, optional) – data transformation function
max_atom (int, optional) – maximum number of atoms in a molecule
with_hydrogen (bool, optional) – whether to add hydrogen atoms
forbidden_atoms (list of str, optional) – forbidden atoms
edge_type (str, optional) – type of edge used to construct the graph, distance or neighbor (default: distance)
radius (float, optional) – radius to construct the graph (default: 4.0)
n_neigh (int, optional) – number of neighbors to consider (default: 5)
verbose (int, optional) – output verbose level
null_value (float, optional) – null value for missing context data
**kwargs
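
The `edge_type`, `radius` and `n_neigh` parameters suggest two common ways of building a molecular graph from 3D coordinates. The sketch below illustrates both in plain Python, assuming that `distance` means "connect atoms within `radius`" and `neighbor` means "connect each atom to its `n_neigh` nearest atoms"; these semantics and the helper names `radius_edges` and `knn_edges` are assumptions, not the library's implementation.

```python
import math


def radius_edges(positions, radius=4.0):
    """Connect every ordered pair of distinct atoms within `radius` of each other."""
    edges = []
    for i, p in enumerate(positions):
        for j, q in enumerate(positions):
            if i != j and math.dist(p, q) <= radius:
                edges.append((i, j))
    return edges


def knn_edges(positions, n_neigh=5):
    """Connect each atom to its `n_neigh` nearest other atoms."""
    edges = []
    for i, p in enumerate(positions):
        others = sorted(
            (j for j in range(len(positions)) if j != i),
            key=lambda j: math.dist(p, positions[j]),
        )
        edges.extend((i, j) for j in others[:n_neigh])
    return edges


# Four atoms on a line, spaced 1.5 apart (arbitrary units).
positions = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
by_radius = radius_edges(positions, radius=2.0)  # only adjacent atoms connect
by_knn = knn_edges(positions, n_neigh=1)         # one outgoing edge per atom
```

A radius graph grows denser as molecules become more compact, while a nearest-neighbor graph keeps the out-degree fixed, which is one reason an API might expose both.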

load_npy(coords: torch.Tensor, natoms: torch.Tensor, smiles_list: List[str], targets: Dict[str, List[float | int]], atom_vocab: List[str] = [], node_feature_choice: str | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], edge_type: str = 'distance', radius: float = 4.0, n_neigh: int = 5, verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, chunk_size: int | None = None, chunk_dir: str | None = None, compact: Dict[str, bool] | None = None, **kwargs: Any)¶

Load the dataset from npy tensors.

Parameters:

coords (tensor) – tensor of coordinates [total_atoms, 5] with [mol_idx, Z, x, y, z]
natoms (tensor) – tensor of number of atoms per molecule
smiles_list (list of str) – SMILES strings
targets (dict of list) – prediction targets
atom_vocab (list of str) – atom types
node_feature_choice (str, optional) – geom features to extract
transform (Callable, optional) – data transformation function
max_atom (int, optional) – maximum number of atoms in a molecule
with_hydrogen (bool, optional) – whether to include hydrogen atoms
forbidden_atoms (list of str, optional) – forbidden atoms
edge_type (str, optional) – type of edge to construct the graph
radius (float, optional) – radius to construct the graph
n_neigh (int, optional) – number of neighbors to consider
verbose (int, optional) – output verbose level
chunk_size (int, optional) – flush to disk every N molecules to limit RAM.
chunk_dir (str, optional) – directory to write chunk files.
**kwargs
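
The `coords` layout above packs all molecules into one flat table, with `natoms` recording how many rows belong to each molecule. The following sketch builds that layout from per-molecule atom lists using plain Python lists; the two example molecules and the helper `rows_for` are illustrative assumptions, and converting the lists to tensors (for example with `torch.tensor`) is left out so the snippet has no dependencies.

```python
# Two hypothetical molecules given as (atomic_number, x, y, z) tuples.
water = [(8, 0.000, 0.000, 0.117), (1, 0.000, 0.757, -0.469), (1, 0.000, -0.757, -0.469)]
h2 = [(1, 0.0, 0.0, 0.0), (1, 0.0, 0.0, 0.74)]
molecules = [water, h2]

coords = []  # rows of [mol_idx, Z, x, y, z], one row per atom
natoms = []  # number of atoms in each molecule
for mol_idx, atoms in enumerate(molecules):
    natoms.append(len(atoms))
    for z_number, x, y, z in atoms:
        coords.append([float(mol_idx), float(z_number), x, y, z])


def rows_for(mol_idx, coords, natoms):
    """Slice the rows of one molecule out of the flat table using `natoms`."""
    start = sum(natoms[:mol_idx])
    return coords[start:start + natoms[mol_idx]]


h2_rows = rows_for(1, coords, natoms)
```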

load_pickle(pkl_file, verbose=0)¶

Load the dataset from a pickle file.

Parameters:

pkl_file (str) – file name
verbose (int, optional) – output verbose level

load_smiles()¶

load_xyz(xyz_list: List[str], smiles_list: List[str], targets: Dict[str, List[float | int]], atom_vocab: List[str] = [], node_feature_choice: str | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], edge_type: str = 'distance', radius: float = 4.0, n_neigh: int = 5, verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, chunk_size: int | None = None, chunk_dir: str | None = None, compact: Dict[str, bool] | None = None, **kwargs: Any)¶

Load the dataset from XYZ and targets.

Parameters:

xyz_list (list of str) – XYZ file names
smiles_list (list of str) – SMILES strings
targets (dict of list) – prediction targets
atom_vocab (list of str) – atom types
node_feature_choice (str, optional) – geom features to extract
transform (Callable, optional) – data transformation function
max_atom (int, optional) – maximum number of atoms in a molecule (default: 200)
with_hydrogen (bool, optional) – whether to add hydrogen atoms
forbidden_atoms (list of str, optional) – forbidden atoms
edge_type (str, optional) – type of edge used to construct the graph, distance or neighbor (default: distance)
radius (float, optional) – radius to construct the graph (default: 4.0)
n_neigh (int, optional) – number of neighbors to consider (default: 5)
verbose (int, optional) – output verbose level
chunk_size (int, optional) – flush to disk every N molecules to limit RAM.
chunk_dir (str, optional) – directory to write chunk files.
**kwargs
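
The `chunk_size` and `chunk_dir` parameters describe flushing processed molecules to disk every N items so that memory use stays bounded. A minimal sketch of that pattern follows. It uses `pickle` as a stand-in serializer and a hypothetical `chunk_00000.pkl` naming scheme; the library's actual file format and the helper name `write_in_chunks` are assumptions.

```python
import os
import pickle
import tempfile


def write_in_chunks(records, chunk_dir, chunk_size):
    """Buffer records and write one file every `chunk_size` items.

    Returns the list of chunk sizes, in write order.
    """
    os.makedirs(chunk_dir, exist_ok=True)
    buffer, sizes = [], []

    def flush():
        # Hypothetical file name; pickle stands in for the real serializer.
        path = os.path.join(chunk_dir, f"chunk_{len(sizes):05d}.pkl")
        with open(path, "wb") as fh:
            pickle.dump(buffer, fh)
        sizes.append(len(buffer))
        buffer.clear()

    for record in records:
        buffer.append(record)
        if len(buffer) >= chunk_size:
            flush()
    if buffer:  # write the final, possibly smaller, chunk
        flush()
    return sizes


with tempfile.TemporaryDirectory() as tmp:
    # A generator keeps only one chunk's worth of records in memory at a time.
    chunk_sizes = write_in_chunks(({"idx": i} for i in range(7)), tmp, chunk_size=3)
    written = sorted(os.listdir(tmp))
```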

save_pickle(pkl_file, verbose=0)¶

Save the dataset to a pickle file.

Parameters:

pkl_file (str) – file name
verbose (int, optional) – output verbose level

property num_atom_type¶: Number of different atom types.

property num_atoms¶: Number of atoms in each molecule.

property tasks¶: List of tasks.

class MolecularDiffusion.data.component.dataset.LazyChunkedDataset(chunk_dir: str, cache_chunks: int = 2)¶

Bases: torch.utils.data.Dataset

Drop-in replacement for PointCloudDataset that never loads the full dataset into RAM. Chunks are loaded from disk on demand with a small LRU cache.

Parameters:

chunk_dir – directory containing chunk_*.pt files and meta.pt
cache_chunks – how many chunks to keep in memory at once (default 2)
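
To illustrate what "loaded from disk on demand with a small LRU cache" can look like, here is a minimal sketch built on `collections.OrderedDict`. The class name `ChunkCache`, the `loads` counter and the stand-in loader are hypothetical; the real class presumably reads the chunk files in `chunk_dir` instead.

```python
from collections import OrderedDict


class ChunkCache:
    """Keep at most `cache_chunks` loaded chunks, evicting the least recently used."""

    def __init__(self, load_chunk, cache_chunks=2):
        self._load_chunk = load_chunk  # callable: chunk index -> chunk data
        self._capacity = cache_chunks
        self._cache = OrderedDict()
        self.loads = 0                 # number of (simulated) disk reads

    def get(self, chunk_idx):
        if chunk_idx in self._cache:
            self._cache.move_to_end(chunk_idx)  # mark as most recently used
            return self._cache[chunk_idx]
        chunk = self._load_chunk(chunk_idx)
        self.loads += 1
        self._cache[chunk_idx] = chunk
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)     # drop the least recently used
        return chunk


# Stand-in loader; a real one would read a chunk file from disk.
cache = ChunkCache(lambda idx: [f"mol_{idx}_{k}" for k in range(3)], cache_chunks=2)
for idx in (0, 1, 0, 2, 1):
    cache.get(idx)
```

With a capacity of two, the access pattern above triggers four loads: the repeated read of chunk 0 is a cache hit, while chunk 1 has been evicted by the time it is requested again. Sequential or chunk-local sampling therefore benefits most from a small cache.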

atom_types() → List[int]¶

get_item(index: int)¶

get_property(task: str, indices=None) → torch.Tensor¶

atom_vocab: List[str]¶

chunk_dir¶

chunk_paths: List[str]¶

chunk_sizes: List[int]¶

n_atoms: List[int]¶

property num_atoms: torch.Tensor¶

smiles_list: List[str]¶

property targets: dict¶

property tasks: List[str]¶

with_hydrogen: bool¶

class MolecularDiffusion.data.component.dataset.LazyChunkedGraphDataset(chunk_dir: str, cache_chunks: int = 2)¶

Bases: torch.utils.data.Dataset

Drop-in replacement for GraphDataset that streams pyG Data objects from disk on demand.

Each chunk file is a dict with keys:

graph_data_list: List[torch_geometric.data.Data]
n_atoms: List[int]
smiles_list: List[str]
targets: Dict[str, List[float]]
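
Because every chunk holds its own per-molecule lists, a global sample index has to be translated into a chunk number and an offset inside that chunk. The sketch below shows one way to do this from a `chunk_sizes` list using cumulative sums and binary search; the function name `locate` is hypothetical and the class may resolve indices differently.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(index, chunk_sizes):
    """Map a global sample index to (chunk index, offset inside that chunk)."""
    ends = list(accumulate(chunk_sizes))  # cumulative end position of each chunk
    if not ends or not 0 <= index < ends[-1]:
        raise IndexError(index)
    chunk_idx = bisect_right(ends, index)
    start = ends[chunk_idx - 1] if chunk_idx else 0
    return chunk_idx, index - start


chunk_sizes = [3, 3, 1]
first = locate(0, chunk_sizes)   # first item of the first chunk
middle = locate(3, chunk_sizes)  # first item of the second chunk
last = locate(6, chunk_sizes)    # only item of the third chunk
```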

atom_types() → List[int]¶

get_item(index: int)¶

get_property(task: str, indices=None) → torch.Tensor | None¶

atom_vocab: List[str]¶

chunk_dir¶

chunk_paths: List[str]¶

chunk_sizes: List[int]¶

n_atoms: List[int]¶

property num_atom_type: int¶

property num_atoms: torch.Tensor¶

smiles_list: List[str]¶

property targets: dict¶

property tasks: List[str]¶

transform = None¶

with_hydrogen: bool¶

class MolecularDiffusion.data.component.dataset.PointCloudDataset¶

Bases: torch.utils.data.Dataset

atom_types()¶: All atom types.

get_item(index)¶

get_property(task)¶

get_tabasco_stats()¶

Get dataset statistics required for TABASCO unconditional sampling.

Returns:

Dictionary with the following entries:

max_atoms: Maximum number of atoms in dataset
num_atom_types: Number of atom types in vocabulary
atom_count_histogram: Histogram of molecule sizes
all_smiles: List of all SMILES strings
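
As a rough illustration of the statistics listed above, the sketch below computes a dictionary with the same four keys from plain Python lists. Only the key names come from this reference. The function name `summarize`, the inputs, and the representation of the histogram as a `{size: count}` dict are assumptions; the real method may return tensors or arrays.

```python
from collections import Counter


def summarize(n_atoms, smiles_list, atom_vocab):
    """Build a stats dict with the keys documented for get_tabasco_stats."""
    histogram = Counter(n_atoms)  # molecule size -> number of molecules
    return {
        "max_atoms": max(n_atoms),
        "num_atom_types": len(atom_vocab),
        "atom_count_histogram": dict(sorted(histogram.items())),
        "all_smiles": list(smiles_list),
    }


stats = summarize(
    n_atoms=[3, 2, 3],
    smiles_list=["O", "[H][H]", "O"],
    atom_vocab=["H", "C", "N", "O"],
)
```

A histogram of molecule sizes is the kind of statistic an unconditional sampler can use to draw the number of atoms before generating coordinates.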

load_csv(csv_file, xyz_dir, xyz_field='xyz', smiles_field='smiles', target_fields=None, atom_vocab=[], node_feature_choice=None, forbidden_atoms=[], null_value=math.nan, verbose=0, allow_unknown=False, use_ohe_feature=True, **kwargs)¶

Load the dataset from a csv file.

Parameters:

csv_file (str) – file name
xyz_dir (str) – directory containing the XYZ files
xyz_field (str) – name of the XYZ column in the table
smiles_field (str, optional) – name of the SMILES column in the table. Use None if there is no SMILES column.
target_fields (list of str, optional) – name of target columns in the table. Default is all columns other than the SMILES column.
atom_vocab (list of str, optional) – atom types
node_feature_choice (str, optional) – geom features to extract
forbidden_atoms (list of str, optional) – forbidden atoms
null_value (float, optional) – null value for missing targets (default: math.nan)
verbose (int, optional) – output verbose level
**kwargs

load_db(db_path: str, atom_vocab: List[str] = [], node_feature_choice: List[str] | None = None, target_fields: List[str] | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], pad_data: bool = False, verbose: int = 0, null_value=math.nan, allow_unknown: bool = False, use_ohe_feature: bool = True, chunk_size: int | None = None, chunk_dir: str | None = None, use_row_data_features: bool = False, **kwargs: Any)¶

Load the dataset from an ASE db file.

Parameters:

db_path (str) – path to ASE db file
atom_vocab (list of str, optional) – atom types
node_feature_choice (list of str, optional) – RDKit atom features to extract
target_fields (list of str, optional) – name of target columns in the table.
transform (Callable, optional) – data transformation function
max_atom (int, optional) – maximum number of atoms in a molecule
with_hydrogen (bool, optional) – whether to add hydrogen atoms
forbidden_atoms (list of str, optional) – forbidden atoms
pad_data (bool, optional) – whether to pad data to max_atom
verbose (int, optional) – output verbose level
null_value (float, optional) – null value for missing context data
chunk_size (int, optional) – flush to disk every N molecules to limit RAM.
chunk_dir (str, optional) – directory to write chunk files.
**kwargs

load_npy(coords: torch.Tensor, natoms: torch.Tensor, smiles_list: List[str], targets: Dict[str, List[float | int]], atom_vocab: List[str] = [], node_feature_choice: str | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], pad_data: bool = False, verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, chunk_size: int | None = None, chunk_dir: str | None = None, **kwargs: Any)¶

Load the dataset from npy and targets.

Parameters:

coords (tensor) – tensor of coordinates
natoms (tensor) – tensor of number of atoms
smiles_list (list of str) – SMILES strings
targets (dict of list) – prediction targets
atom_vocab (list of str) – atom types
node_feature_choice (str, optional) – geom features to extract
transform (Callable, optional) – data transformation function
max_atom (int, optional) – maximum number of atoms in a molecule (default: 200)
with_hydrogen (bool, optional) – whether to add hydrogen atoms
forbidden_atoms (list of str, optional) – forbidden atoms
pad_data (bool, optional) – whether to pad data to max_atom
verbose (int, optional) – output verbose level
chunk_size (int, optional) – flush to disk every N molecules to limit RAM.
chunk_dir (str, optional) – directory to write chunk files.
**kwargs
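
The `pad_data` option pads each molecule to `max_atom` atoms so that samples share a fixed shape. A minimal sketch of such padding is shown below. The zero padding value, the boolean validity mask and the helper name `pad_molecule` are assumptions for illustration, not the library's documented behavior.

```python
def pad_molecule(positions, max_atom, pad_value=0.0):
    """Pad a list of (x, y, z) rows to `max_atom` rows and return a validity mask."""
    n = len(positions)
    if n > max_atom:
        raise ValueError(f"molecule has {n} atoms, more than max_atom={max_atom}")
    padding = [(pad_value, pad_value, pad_value)] * (max_atom - n)
    mask = [True] * n + [False] * (max_atom - n)  # True marks real atoms
    return list(positions) + padding, mask


padded, mask = pad_molecule([(0.0, 0.0, 0.0), (0.0, 0.0, 0.74)], max_atom=4)
```

A mask like this lets downstream code ignore padded rows when computing losses or statistics.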

load_pickle(pkl_file, verbose=0, cheap_data=False)¶

Load the dataset from a pickle file.

Parameters:

pkl_file (str) – file name
verbose (int, optional) – output verbose level

load_xyz(xyz_list: List[str], smiles_list: List[str], targets: Dict[str, List[float | int]], atom_vocab: List[str] = [], node_feature_choice: str | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], pad_data: bool = False, verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, chunk_size: int | None = None, chunk_dir: str | None = None, compact: Dict[str, bool] | None = None, **kwargs: Any)¶

Load the dataset from XYZ and targets.

Parameters:

xyz_list (list of str) – XYZ file names
smiles_list (list of str) – SMILES strings
targets (dict of list) – prediction targets
atom_vocab (list of str) – atom types
node_feature_choice (str, optional) – geom features to extract
transform (Callable, optional) – data transformation function
max_atom (int, optional) – maximum number of atoms in a molecule (default: 200)
with_hydrogen (bool, optional) – whether to add hydrogen atoms
pad_data (bool, optional) – whether to pad data to max_atom
forbidden_atoms (list of str, optional) – forbidden atoms
verbose (int, optional) – output verbose level
chunk_size (int, optional) – flush to disk every N molecules to limit RAM.
chunk_dir (str, optional) – directory to write chunk files.
**kwargs
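
The `max_atom`, `forbidden_atoms`, `atom_vocab` and `allow_unknown` arguments all read as filters on which molecules enter the dataset. The sketch below shows one plausible interpretation in plain Python. The rule that a molecule is skipped when it is too large, contains a forbidden element, or contains an element outside the vocabulary while `allow_unknown` is false is an assumption of this example, as is the helper name `keep_molecule`.

```python
def keep_molecule(symbols, atom_vocab, forbidden_atoms=(), max_atom=200, allow_unknown=False):
    """Decide whether a molecule, given as a list of element symbols, passes the filters."""
    if len(symbols) > max_atom:
        return False  # too many atoms
    if any(s in forbidden_atoms for s in symbols):
        return False  # contains an explicitly forbidden element
    if not allow_unknown and any(s not in atom_vocab for s in symbols):
        return False  # contains an element outside the vocabulary
    return True


vocab = ["H", "C", "N", "O"]
methane = ["C", "H", "H", "H", "H"]
silane = ["Si", "H", "H", "H", "H"]

kept = keep_molecule(methane, vocab)
dropped_unknown = keep_molecule(silane, vocab)
kept_unknown = keep_molecule(silane, vocab, allow_unknown=True)
dropped_forbidden = keep_molecule(methane, vocab, forbidden_atoms=["C"])
dropped_size = keep_molecule(methane, vocab, max_atom=4)
```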

save_pickle(pkl_file, verbose=0, cheap_data=False)¶

Save the dataset to a pickle file.

Parameters:

pkl_file (str) – file name
verbose (int, optional) – output verbose level

property num_atom_type¶: Number of different atom types.

property num_atoms¶: Number of atoms in each molecule.

property tasks¶: List of tasks.

MolecularDiffusion.data.component.dataset.BASE_ATOM_VOCAB = ['H', 'B', 'C', 'N', 'O', 'F', 'Mg', 'Si', 'P', 'S', 'Cl', 'Cu', 'Zn', 'Ge', 'As', 'Se', 'Br', 'Sn', 'I']¶
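
An atom vocabulary such as this one is typically used to turn element symbols into one-hot feature vectors, which is presumably what the `use_ohe_feature` flag of the loaders controls. The sketch below copies the vocabulary from the line above and encodes a few symbols; the helper `one_hot` is hypothetical and the library's own encoding may differ.

```python
# Copied from the module attribute documented above.
BASE_ATOM_VOCAB = ['H', 'B', 'C', 'N', 'O', 'F', 'Mg', 'Si', 'P', 'S',
                   'Cl', 'Cu', 'Zn', 'Ge', 'As', 'Se', 'Br', 'Sn', 'I']


def one_hot(symbol, vocab=BASE_ATOM_VOCAB):
    """Return a list with a single 1 at the position of `symbol` in `vocab`."""
    index = vocab.index(symbol)  # raises ValueError for symbols outside the vocabulary
    return [1 if i == index else 0 for i in range(len(vocab))]


encoded = [one_hot(s) for s in ("C", "O", "H")]
```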

MolecularDiffusion.data.component.dataset.Chem = None¶

MolecularDiffusion.data.component.dataset.hybiridization_map¶

MolecularDiffusion.data.component.dataset.lmdb = None¶

MolecularDiffusion.data.component.dataset.logger¶