MolecularDiffusion.data.component.dataset¶
Attributes¶
Classes¶
LazyChunkedDataset – Drop-in replacement for PointCloudDataset that never loads the full dataset into RAM.
LazyChunkedGraphDataset – Drop-in replacement for GraphDataset that streams pyG Data objects from disk on demand.
Module Contents¶
- class MolecularDiffusion.data.component.dataset.GraphDataset¶
Bases: torch.utils.data.Dataset
- atom_types()¶
All atom types.
- get_item(index)¶
- get_property(task)¶
- load_csv(csv_file: str, xyz_dir: str, xyz_field: str = 'xyz', smiles_field: str = 'smiles', target_fields: List[str] | None = None, atom_vocab: List[str] = [], node_feature_choice: str | None = None, forbidden_atoms: List[str] = [], verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, **kwargs)¶
Load the dataset from a csv file.
- Parameters:
csv_file (str) – file name
xyz_dir (str) – directory to store XYZ files
xyz_field (str) – name of the XYZ column in the table
smiles_field (str, optional) – name of the SMILES column in the table. Use None if there is no SMILES column.
target_fields (list of str, optional) – names of the target columns in the table. Default is all columns other than the SMILES column.
node_feature_choice (str, optional) – geom features to extract
verbose (int, optional) – output verbose level
**kwargs
- load_db(db_path: str, atom_vocab: List[str] = [], node_feature_choice: List[str] | None = None, target_fields: List[str] | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], edge_type: str = 'distance', radius: float = 4.0, n_neigh: int = 5, verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, chunk_size: int | None = None, chunk_dir: str | None = None, compact: Dict[str, bool] | None = None, use_row_data_features: bool = False, **kwargs: Any)¶
Load the dataset from an ASE db file.
- Parameters:
db_path (str) – path to ASE db file
node_feature_choice (list of str, optional) – RDKit atom features to extract
target_fields (list of str, optional) – name of target columns in the table.
transform (Callable, optional) – data transformation function
max_atom (int, optional) – maximum number of atoms in a molecule
with_hydrogen (bool, optional) – whether to add hydrogen atoms
edge_type (str, optional) – type of edge used to construct the graph: distance or neighbor (default: distance)
radius (float, optional) – radius to construct the graph (default: 4.0)
n_neigh (int, optional) – number of neighbors to consider (default: 5)
verbose (int, optional) – output verbose level
null_value (float, optional) – null value for missing context data
**kwargs
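The edge_type, radius, and n_neigh options suggest two ways of connecting atoms: a distance cutoff or a fixed number of nearest neighbours. The sketch below illustrates both ideas in plain Python. The helper names radius_edges and knn_edges are hypothetical, and the library's actual graph construction may differ (for example in directionality, self-loops, or tie-breaking).

```python
import math

def radius_edges(coords, radius=4.0):
    """Directed edges (i, j) for every pair of distinct atoms within `radius`."""
    edges = []
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            if i != j and math.dist(a, b) <= radius:
                edges.append((i, j))
    return edges

def knn_edges(coords, n_neigh=5):
    """Directed edges from each atom to its `n_neigh` nearest other atoms."""
    edges = []
    for i, a in enumerate(coords):
        # Sort the other atoms by distance to atom i (ties keep index order).
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.dist(a, coords[j]),
        )
        edges.extend((i, j) for j in others[:n_neigh])
    return edges

# Three collinear atoms spaced 3 units apart (illustrative coordinates).
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
r_edges = radius_edges(coords, radius=4.0)
k_edges = knn_edges(coords, n_neigh=1)
```

With a cutoff of 4.0 the two end atoms are not connected to each other, whereas the nearest-neighbour variant always gives every atom exactly n_neigh outgoing edges regardless of distance.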
- load_npy(coords: torch.Tensor, natoms: torch.Tensor, smiles_list: List[str], targets: Dict[str, List[float | int]], atom_vocab: List[str] = [], node_feature_choice: str | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], edge_type: str = 'distance', radius: float = 4.0, n_neigh: int = 5, verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, chunk_size: int | None = None, chunk_dir: str | None = None, compact: Dict[str, bool] | None = None, **kwargs: Any)¶
Load the dataset from npy tensors.
- Parameters:
coords (tensor) – tensor of coordinates with shape [total_atoms, 5]; each row is [mol_idx, Z, x, y, z]
natoms (tensor) – tensor of number of atoms per molecule
node_feature_choice (str, optional) – geom features to extract
transform (Callable, optional) – data transformation function
max_atom (int, optional) – maximum number of atoms in a molecule
with_hydrogen (bool, optional) – whether to include hydrogen atoms
edge_type (str, optional) – type of edge to construct the graph
radius (float, optional) – radius to construct the graph
n_neigh (int, optional) – number of neighbors to consider
verbose (int, optional) – output verbose level
chunk_size (int, optional) – flush to disk every N molecules to limit RAM.
chunk_dir (str, optional) – directory to write chunk files.
**kwargs
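The coords layout described above packs all molecules into one flat table with one row per atom. The following plain-Python sketch shows how such rows and the matching natoms counts could be assembled. The helper pack_molecules is hypothetical, zero-based molecule indices are an assumption, and the real method expects torch tensors, so the lists would still need to be wrapped (for example with torch.tensor) before use.

```python
def pack_molecules(molecules):
    """Flatten molecules into rows of [mol_idx, Z, x, y, z] plus per-molecule atom counts.

    `molecules` is a list of molecules, each a list of (Z, x, y, z) tuples.
    """
    coords, natoms = [], []
    for mol_idx, atoms in enumerate(molecules):  # assumption: mol_idx starts at 0
        natoms.append(len(atoms))
        for z_num, x, y, z in atoms:
            coords.append([float(mol_idx), float(z_num), x, y, z])
    return coords, natoms

# Illustrative geometries: water and molecular hydrogen.
water = [(8, 0.000, 0.000, 0.117), (1, 0.000, 0.757, -0.469), (1, 0.000, -0.757, -0.469)]
h2 = [(1, 0.0, 0.0, 0.0), (1, 0.0, 0.0, 0.74)]
coords, natoms = pack_molecules([water, h2])
```

Here coords has five rows (three for water, two for hydrogen) and natoms records how many consecutive rows belong to each molecule.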
- load_pickle(pkl_file, verbose=0)¶
Load the dataset from a pickle file.
- load_smiles()¶
- load_xyz(xyz_list: List[str], smiles_list: List[str], targets: Dict[str, List[float | int]], atom_vocab: List[str] = [], node_feature_choice: str | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], edge_type: str = 'distance', radius: float = 4.0, n_neigh: int = 5, verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, chunk_size: int | None = None, chunk_dir: str | None = None, compact: Dict[str, bool] | None = None, **kwargs: Any)¶
Load the dataset from XYZ and targets.
- Parameters:
node_feature_choice (str, optional) – geom features to extract
transform (Callable, optional) – data transformation function
max_atom (int, optional) – maximum number of atoms in a molecule (default: 200)
with_hydrogen (bool, optional) – whether to add hydrogen atoms
edge_type (str, optional) – type of edge used to construct the graph: distance or neighbor (default: distance)
radius (float, optional) – radius to construct the graph (default: 4.0)
n_neigh (int, optional) – number of neighbors to consider (default: 5)
verbose (int, optional) – output verbose level
chunk_size (int, optional) – flush to disk every N molecules to limit RAM.
chunk_dir (str, optional) – directory to write chunk files.
**kwargs
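The chunk_size and chunk_dir options describe flushing processed molecules to disk every N items so that the whole dataset never has to sit in RAM. A minimal sketch of that pattern follows. It uses pickle and the hypothetical file name chunk_00000.pkl purely as stand-ins; the on-disk format the library actually writes is not shown here.

```python
import os
import pickle
import tempfile

def _flush(buffer, chunk_dir, index):
    """Write one buffer to disk and return the path of the new chunk file."""
    path = os.path.join(chunk_dir, f"chunk_{index:05d}.pkl")  # hypothetical naming
    with open(path, "wb") as fh:
        pickle.dump(buffer, fh)
    return path

def write_in_chunks(records, chunk_dir, chunk_size):
    """Write `records` to `chunk_dir`, flushing a file every `chunk_size` items."""
    os.makedirs(chunk_dir, exist_ok=True)
    paths, buffer = [], []
    for record in records:
        buffer.append(record)
        if len(buffer) >= chunk_size:
            paths.append(_flush(buffer, chunk_dir, len(paths)))
            buffer = []  # start a fresh buffer so memory use stays bounded
    if buffer:  # flush the final, possibly smaller, chunk
        paths.append(_flush(buffer, chunk_dir, len(paths)))
    return paths

with tempfile.TemporaryDirectory() as tmp:
    paths = write_in_chunks(range(7), tmp, chunk_size=3)
    chunk_sizes = []
    for p in paths:
        with open(p, "rb") as fh:
            chunk_sizes.append(len(pickle.load(fh)))
```

Seven records with chunk_size=3 produce three files, the last one holding the single leftover record.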
- save_pickle(pkl_file, verbose=0)¶
Save the dataset to a pickle file.
- property num_atom_type¶
Number of different atom types.
- property num_atoms¶
Number of atoms in each molecule.
- property tasks¶
List of tasks.
- class MolecularDiffusion.data.component.dataset.LazyChunkedDataset(chunk_dir: str, cache_chunks: int = 2)¶
Bases: torch.utils.data.Dataset
Drop-in replacement for PointCloudDataset that never loads the full dataset into RAM. Chunks are loaded from disk on demand with a small LRU cache.
- Parameters:
chunk_dir – directory containing chunk_*.pt files and meta.pt
cache_chunks – how many chunks to keep in memory at once (default 2)
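To make the role of cache_chunks concrete, here is a small sketch of a least-recently-used chunk cache. ChunkCache and its load_chunk callable are hypothetical names introduced for illustration; this is one common way to build such a cache, not necessarily how the class is implemented.

```python
from collections import OrderedDict

class ChunkCache:
    """Keep at most `cache_chunks` loaded chunks, evicting the least recently used."""

    def __init__(self, load_chunk, cache_chunks=2):
        self._load_chunk = load_chunk   # callable: chunk index -> chunk contents
        self._capacity = cache_chunks
        self._cache = OrderedDict()     # ordered from least to most recently used
        self.loads = 0                  # how many times the loader was invoked

    def get(self, chunk_idx):
        if chunk_idx in self._cache:
            self._cache.move_to_end(chunk_idx)  # mark as most recently used
            return self._cache[chunk_idx]
        chunk = self._load_chunk(chunk_idx)
        self.loads += 1
        self._cache[chunk_idx] = chunk
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)     # drop the least recently used chunk
        return chunk

# Stand-in loader; a real one would read a chunk file from disk.
cache = ChunkCache(lambda i: [i * 10 + k for k in range(3)], cache_chunks=2)
for idx in [0, 1, 0, 2, 1]:
    cache.get(idx)
```

In this access pattern the repeated request for chunk 0 is served from memory, while chunk 1 has to be reloaded after being evicted to make room for chunk 2. Sequential access therefore benefits most from a small cache; fully random access across many chunks would trigger frequent reloads.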
- get_property(task: str, indices=None) torch.Tensor¶
- chunk_dir¶
- property num_atoms: torch.Tensor¶
- class MolecularDiffusion.data.component.dataset.LazyChunkedGraphDataset(chunk_dir: str, cache_chunks: int = 2)¶
Bases: torch.utils.data.Dataset
Drop-in replacement for GraphDataset that streams pyG Data objects from disk on demand.
- Each chunk file is a dict with keys:
graph_data_list: List[torch_geometric.data.Data]
n_atoms: List[int]
smiles_list: List[str]
targets: Dict[str, List[float]]
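When samples are spread across chunk files like this, a global dataset index has to be translated into a chunk number and an offset inside that chunk. The sketch below does this with a binary search over cumulative chunk sizes. It assumes the per-chunk sizes are known up front, and the function name locate is hypothetical; the library may resolve indices differently.

```python
from bisect import bisect_right
from itertools import accumulate

def locate(index, chunk_sizes):
    """Map a global sample index to (chunk_idx, offset_within_chunk)."""
    ends = list(accumulate(chunk_sizes))  # cumulative end index of each chunk
    if not ends or not 0 <= index < ends[-1]:
        raise IndexError(index)
    chunk_idx = bisect_right(ends, index)  # first chunk whose end exceeds index
    start = ends[chunk_idx - 1] if chunk_idx else 0
    return chunk_idx, index - start

chunk_sizes = [3, 3, 1]
positions = [locate(i, chunk_sizes) for i in range(7)]
```

Once the chunk number is known, the offset can be used to read the matching entries of graph_data_list, n_atoms, smiles_list, and each list in targets.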
- get_property(task: str, indices=None) torch.Tensor | None¶
- chunk_dir¶
- property num_atoms: torch.Tensor¶
- transform = None¶
- class MolecularDiffusion.data.component.dataset.PointCloudDataset¶
Bases: torch.utils.data.Dataset
- atom_types()¶
All atom types.
- get_item(index)¶
- get_property(task)¶
- get_tabasco_stats()¶
Get dataset statistics required for TABASCO unconditional sampling.
- Returns:
Dictionary with the following entries:
max_atoms: Maximum number of atoms in dataset
num_atom_types: Number of atom types in vocabulary
atom_count_histogram: Histogram of molecule sizes
all_smiles: List of all SMILES strings
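As an illustration of what these statistics contain, the sketch below computes a dictionary with the same four keys from plain Python lists. The function dataset_stats is hypothetical, and representing the histogram as a size-to-count mapping is an assumption; the actual method may return tensors or a different histogram layout.

```python
from collections import Counter

def dataset_stats(n_atoms, atom_vocab, smiles):
    """Summarise molecule sizes; keys mirror those listed for get_tabasco_stats()."""
    return {
        "max_atoms": max(n_atoms),
        "num_atom_types": len(atom_vocab),
        # Assumption: histogram stored as {molecule size: number of molecules}.
        "atom_count_histogram": dict(Counter(n_atoms)),
        "all_smiles": list(smiles),
    }

# Illustrative inputs: atom counts per molecule, a small vocabulary, and SMILES strings.
stats = dataset_stats([3, 2, 3, 5], ["H", "C", "N", "O"], ["O", "[H][H]", "O", "C"])
```

A size histogram of this kind is what lets an unconditional sampler first draw a molecule size with realistic frequency before generating atoms.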
- load_csv(csv_file, xyz_dir, xyz_field='xyz', smiles_field='smiles', target_fields=None, atom_vocab=[], node_feature_choice=None, forbidden_atoms=[], null_value=math.nan, verbose=0, allow_unknown=False, use_ohe_feature=True, **kwargs)¶
Load the dataset from a csv file.
- Parameters:
csv_file (str) – file name
xyz_dir (str) – directory to store XYZ files
xyz_field (str) – name of the XYZ column in the table
smiles_field (str, optional) – name of the SMILES column in the table. Use None if there is no SMILES column.
target_fields (list of str, optional) – names of the target columns in the table. Default is all columns other than the SMILES column.
node_feature_choice (str, optional) – geom features to extract
null_value (float, optional) – null value for missing targets (default: math.nan)
verbose (int, optional) – output verbose level
**kwargs
- load_db(db_path: str, atom_vocab: List[str] = [], node_feature_choice: List[str] | None = None, target_fields: List[str] | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], pad_data: bool = False, verbose: int = 0, null_value=math.nan, allow_unknown: bool = False, use_ohe_feature: bool = True, chunk_size: int | None = None, chunk_dir: str | None = None, use_row_data_features: bool = False, **kwargs: Any)¶
Load the dataset from an ASE db file.
- Parameters:
db_path (str) – path to ASE db file
node_feature_choice (list of str, optional) – RDKit atom features to extract
target_fields (list of str, optional) – name of target columns in the table.
transform (Callable, optional) – data transformation function
max_atom (int, optional) – maximum number of atoms in a molecule
with_hydrogen (bool, optional) – whether to add hydrogen atoms
pad_data (bool, optional) – whether to pad data to max_atom
verbose (int, optional) – output verbose level
null_value (float, optional) – null value for missing context data
chunk_size (int, optional) – flush to disk every N molecules to limit RAM.
chunk_dir (str, optional) – directory to write chunk files.
**kwargs
- load_npy(coords: torch.Tensor, natoms: torch.Tensor, smiles_list: List[str], targets: Dict[str, List[float | int]], atom_vocab: List[str] = [], node_feature_choice: str | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], pad_data: bool = False, verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, chunk_size: int | None = None, chunk_dir: str | None = None, **kwargs: Any)¶
Load the dataset from npy and targets.
- Parameters:
coords (tensor) – tensor of coordinates
natoms (tensor) – tensor of number of atoms
node_feature_choice (str, optional) – geom features to extract
transform (Callable, optional) – data transformation function
max_atom (int, optional) – maximum number of atoms in a molecule (default: 200)
with_hydrogen (bool, optional) – whether to add hydrogen atoms
pad_data (bool, optional) – whether to pad data to max_atom
verbose (int, optional) – output verbose level
chunk_size (int, optional) – flush to disk every N molecules to limit RAM.
chunk_dir (str, optional) – directory to write chunk files.
**kwargs
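The pad_data option pads each molecule to max_atom entries so that all samples share one shape. A minimal sketch of that idea, in plain Python, is shown below. The helper pad_to_max_atom, the zero pad value, the boolean mask, and the error raised for oversized molecules are all assumptions for illustration, not confirmed library behaviour.

```python
def pad_to_max_atom(positions, max_atom, pad_value=0.0):
    """Pad a list of (x, y, z) rows to `max_atom` rows and return a validity mask."""
    n = len(positions)
    if n > max_atom:
        # Assumption: oversized molecules are rejected; a loader might skip them instead.
        raise ValueError(f"molecule has {n} atoms, more than max_atom={max_atom}")
    n_pad = max_atom - n
    padded = [list(p) for p in positions] + [[pad_value] * 3 for _ in range(n_pad)]
    mask = [True] * n + [False] * n_pad  # True marks real atoms, False marks padding
    return padded, mask

padded, mask = pad_to_max_atom([(0.0, 0.0, 0.0), (0.0, 0.0, 0.74)], max_atom=4)
```

Keeping a mask alongside the padded coordinates lets downstream code ignore the filler rows when computing losses or statistics.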
- load_pickle(pkl_file, verbose=0, cheap_data=False)¶
Load the dataset from a pickle file.
- load_xyz(xyz_list: List[str], smiles_list: List[str], targets: Dict[str, List[float | int]], atom_vocab: List[str] = [], node_feature_choice: str | None = None, transform: Callable | None = None, max_atom: int = 200, with_hydrogen: bool = True, forbidden_atoms: List[str] = [], pad_data: bool = False, verbose: int = 0, allow_unknown: bool = False, use_ohe_feature: bool = True, chunk_size: int | None = None, chunk_dir: str | None = None, compact: Dict[str, bool] | None = None, **kwargs: Any)¶
Load the dataset from XYZ and targets.
- Parameters:
node_feature_choice (str, optional) – geom features to extract
transform (Callable, optional) – data transformation function
max_atom (int, optional) – maximum number of atoms in a molecule (default: 200)
with_hydrogen (bool, optional) – whether to add hydrogen atoms
pad_data (bool, optional) – whether to pad data to max_atom
verbose (int, optional) – output verbose level
chunk_size (int, optional) – flush to disk every N molecules to limit RAM.
chunk_dir (str, optional) – directory to write chunk files.
**kwargs
- save_pickle(pkl_file, verbose=0, cheap_data=False)¶
Save the dataset to a pickle file.
- property num_atom_type¶
Number of different atom types.
- property num_atoms¶
Number of atoms in each molecule.
- property tasks¶
List of tasks.
- MolecularDiffusion.data.component.dataset.BASE_ATOM_VOCAB = ['H', 'B', 'C', 'N', 'O', 'F', 'Mg', 'Si', 'P', 'S', 'Cl', 'Cu', 'Zn', 'Ge', 'As', 'Se', 'Br', 'Sn', 'I']¶
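The loaders accept atom_vocab, allow_unknown, and use_ohe_feature arguments, which points to atom types being one-hot encoded against a vocabulary such as the one above. The sketch below shows one way that encoding could look. The function one_hot is hypothetical, and reserving an extra trailing slot for unknown elements when allow_unknown is set is an assumption about the behaviour, not something this page confirms.

```python
# Vocabulary copied from the module attribute listed above.
BASE_ATOM_VOCAB = ['H', 'B', 'C', 'N', 'O', 'F', 'Mg', 'Si', 'P', 'S', 'Cl',
                   'Cu', 'Zn', 'Ge', 'As', 'Se', 'Br', 'Sn', 'I']

def one_hot(symbol, vocab=BASE_ATOM_VOCAB, allow_unknown=False):
    """One-hot encode an element symbol against `vocab`.

    Assumption: with allow_unknown=True, one extra final slot represents any
    element outside the vocabulary; otherwise unknown elements raise an error.
    """
    size = len(vocab) + (1 if allow_unknown else 0)
    vec = [0] * size
    if symbol in vocab:
        vec[vocab.index(symbol)] = 1
    elif allow_unknown:
        vec[-1] = 1
    else:
        raise ValueError(f"element {symbol!r} is not in the atom vocabulary")
    return vec

carbon = one_hot('C')
xenon = one_hot('Xe', allow_unknown=True)
```

Carbon sits at index 2 of the 19-element vocabulary, while xenon, which is not listed, lands in the extra slot of a 20-element vector.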
- MolecularDiffusion.data.component.dataset.Chem = None¶
- MolecularDiffusion.data.component.dataset.hybiridization_map¶
- MolecularDiffusion.data.component.dataset.lmdb = None¶
- MolecularDiffusion.data.component.dataset.logger¶