MolecularDiffusion.data.component.pharmacophore¶
Pharmacophore Dataset for ShEPhERD integration.
Faithful port of shepherd/src/shepherd/datasets.py HeteroDataset. Handles loading, processing, and forward-noising of 4 molecular modalities:
- x1: Molecular structure (atoms + bonds)
- x2: Shape (surface point cloud)
- x3: Electrostatics (surface point cloud + ESP)
- x4: Pharmacophores (points + directions)
Forward-noising is applied per-sample in __getitem__ using the noise schedule from the ShEPhERD checkpoint hyperparameters.
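As a rough illustration of per-sample forward noising, the sketch below applies the common Gaussian form x_t = signal_scale * x0 + noise_scale * eps to a toy set of coordinates. The names forward_noise, signal_scale, and noise_scale are hypothetical; how they map onto alpha_dash_t and sigma_dash_t (for example, whether a square root is taken) depends on how the ShEPhERD schedule defines those quantities, which this page does not state.

```python
import math
import random


def forward_noise(x0, signal_scale, noise_scale, rng):
    """Noise a list of 3D points: x_t = signal_scale * x0 + noise_scale * eps."""
    # Draw one standard-normal value per coordinate.
    eps = [[rng.gauss(0.0, 1.0) for _ in point] for point in x0]
    x_t = [
        [signal_scale * v + noise_scale * e for v, e in zip(point, eps_point)]
        for point, eps_point in zip(x0, eps)
    ]
    return x_t, eps


rng = random.Random(0)
coords = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]  # toy two-atom geometry
signal_scale = 0.8  # assumed value standing in for the schedule's signal coefficient at t
noise_scale = math.sqrt(1.0 - signal_scale ** 2)  # variance-preserving choice (assumption)
noised, eps = forward_noise(coords, signal_scale, noise_scale, rng)
```

Returning eps alongside the noised sample is convenient when the noise itself is the regression target during training.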
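Attributes¶
Chem |
logger |
o3d |
Classes¶
PharmacophoreDataModule | DataModule wrapper for PharmacophoreDataset.
PharmacophoreDataset | Dataset for ShEPhERD pharmacophore diffusion.
Functions¶
build_default_noise_schedule([T]) | Build the default ShEPhERD cosine+linear noise schedule.
pharmacophore_collate_fn(batch) | Collate HeteroData samples into a batch.
sample_timestep_biased(ts) | Biased timestep sampling matching ShEPhERD: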
Module Contents¶
- class MolecularDiffusion.data.component.pharmacophore.PharmacophoreDataModule(pkl_files: List[str], root: str = 'data/pharmacophore_cache', dataset_name: str = 'pharmacophore', num_samples: int | None = None, data_fraction: float = 1.0, batch_size: int = 32, num_workers: int = 0, train_ratio: float = 0.8, task_type: str = 'diffusion_pharmacophore', checkpoint_path: str | None = None, **kwargs)¶
DataModule wrapper for PharmacophoreDataset.
Loads molblocks and charges from the pkl files, builds the noise schedule (from the checkpoint or the default), and creates train/validation/test splits.
- load()¶
Load pkl data and create dataset with proper noise schedule.
- batch_size = 32¶
- checkpoint_path = None¶
- collate_fn¶
- data_fraction = 1.0¶
- dataset_name = 'pharmacophore'¶
- kwargs¶
- num_samples = None¶
- num_workers = 0¶
- pkl_files¶
- root = 'data/pharmacophore_cache'¶
- task_type = 'diffusion_pharmacophore'¶
- test_set = None¶
- train_ratio = 0.8¶
- train_set = None¶
- valid_set = None¶
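The interaction of data_fraction and train_ratio can be pictured with the sketch below. It is only an illustration: split_indices is a hypothetical helper, and the choices to shuffle with a fixed seed and to divide the non-training remainder evenly between validation and test are assumptions, not behavior confirmed by this page.

```python
import random


def split_indices(n, train_ratio=0.8, data_fraction=1.0, seed=0):
    """Return (train, valid, test) index lists for a dataset of size n."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Keep only the requested fraction of the data.
    idx = idx[: int(n * data_fraction)]
    n_train = int(len(idx) * train_ratio)
    # Assumption: the remainder is shared evenly between validation and test.
    n_valid = (len(idx) - n_train) // 2
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]


train_idx, valid_idx, test_idx = split_indices(100)
```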
- class MolecularDiffusion.data.component.pharmacophore.PharmacophoreDataset(molblocks_and_charges: List, noise_schedule_dict: Dict[str, Dict], compute_x1: bool = True, compute_x2: bool = False, compute_x3: bool = True, compute_x4: bool = True, recenter_x1: bool = True, add_virtual_node_x1: bool = True, remove_noise_COM_x1: bool = True, atom_types_x1: List = None, charge_types_x1: List = None, bond_types_x1: List = None, scale_atom_features_x1: float = 0.25, scale_bond_features_x1: float = 1.0, formal_charge_diffusion: bool = True, independent_timesteps_x2: bool = False, recenter_x2: bool = False, add_virtual_node_x2: bool = True, remove_noise_COM_x2: bool = False, num_points_x2: int = 75, independent_timesteps_x3: bool = False, recenter_x3: bool = False, add_virtual_node_x3: bool = True, remove_noise_COM_x3: bool = False, num_points_x3: int = 75, scale_node_features_x3: float = 2.0, independent_timesteps_x4: bool = False, recenter_x4: bool = False, add_virtual_node_x4: bool = True, remove_noise_COM_x4: bool = False, max_node_types_x4: int = 10, scale_node_features_x4: float = 2.0, scale_vector_features_x4: float = 2.0, multivectors: bool = False, check_accessibility: bool = False, probe_radius: float = 0.6)¶
Bases: torch.utils.data.Dataset
Dataset for ShEPhERD pharmacophore diffusion.
Each __getitem__ call returns a torch_geometric.data.HeteroData with forward-noised fields for x1, x2, x3, and x4, matching the HeteroDataset in shepherd/src/shepherd/datasets.py.
- get_x1_data(mol, t, alpha_dash_t, sigma_dash_t)¶
- get_x2_data(radii, atom_centers, num_points, recenter, add_virtual_node, remove_noise_COM, t, alpha_dash_t, sigma_dash_t, virtual_node_pos=None)¶
- get_x3_data_electrostatics_only(charges, charge_centers, data, pos, virtual_node_mask, t, alpha_dash_t, sigma_dash_t)¶
- get_x4_data(mol, recenter, add_virtual_node, remove_noise_COM, t, alpha_dash_t, sigma_dash_t, virtual_node_pos=None)¶
- add_virtual_node_x1 = True¶
- add_virtual_node_x2 = True¶
- add_virtual_node_x3 = True¶
- add_virtual_node_x4 = True¶
- atom_types_x1 = [None, 'H', 'C', 'N', 'O', 'F', 'Cl', 'Br', 'I', 'S', 'P', 'Si']¶
- bond_types_x1 = [None, 'SINGLE', 'DOUBLE', 'TRIPLE', 'AROMATIC']¶
- charge_types_x1¶
- check_accessibility = False¶
- compute_x1 = True¶
- compute_x2 = False¶
- compute_x3 = True¶
- compute_x4 = True¶
- formal_charge_diffusion = True¶
- independent_timesteps_x2 = False¶
- independent_timesteps_x3 = False¶
- independent_timesteps_x4 = False¶
- max_node_types_x4 = 10¶
- molblocks_and_charges¶
- multivectors = False¶
- noise_schedule_dict¶
- num_points_x2 = 75¶
- num_points_x3 = 75¶
- probe_radius = 0.6¶
- recenter_x1 = True¶
- recenter_x2 = False¶
- recenter_x3 = False¶
- recenter_x4 = False¶
- remove_noise_COM_x1 = True¶
- remove_noise_COM_x2 = False¶
- remove_noise_COM_x3 = False¶
- remove_noise_COM_x4 = False¶
- scale_atom_features_x1 = 0.25¶
- scale_bond_features_x1 = 1.0¶
- scale_node_features_x3 = 2.0¶
- scale_node_features_x4 = 2.0¶
- scale_vector_features_x4 = 2.0¶
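The recenter_* and remove_noise_COM_* flags are not described on this page. One plausible reading, sketched below, is that recentering subtracts the centroid of the coordinates and that removing the noise COM subtracts the mean of the sampled noise, so that noising does not translate the sample. Both interpretations, the unweighted centroid, and the helper names are assumptions.

```python
import random


def centroid(points):
    """Unweighted mean position of a list of 3D points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(3)]


def subtract_centroid(points):
    """Shift points so that their centroid sits at the origin."""
    c = centroid(points)
    return [[p[k] - c[k] for k in range(3)] for p in points]


rng = random.Random(0)
coords = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 5.0, 2.0]]

# Assumed meaning of recenter_x1=True: coordinates are centered before noising.
centered = subtract_centroid(coords)

# Assumed meaning of remove_noise_COM_x1=True: the noise has zero mean per axis.
noise = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in coords]
noise_com_free = subtract_centroid(noise)
```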
- MolecularDiffusion.data.component.pharmacophore.build_default_noise_schedule(T=400)¶
Build the default ShEPhERD cosine+linear noise schedule. Matches shepherd/training/parameters/params_x1x3x4_diffusion_mosesaq_20240824.py
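For orientation, the sketch below computes the standard cosine schedule known from the improved-DDPM literature for T=400. It is not the ShEPhERD schedule itself: it covers only a generic cosine component, the offset s=0.008 is the conventional default rather than a value taken from the ShEPhERD parameters, and how the linear component enters is not specified here. The function name cosine_alpha_bar is hypothetical.

```python
import math


def cosine_alpha_bar(T=400, s=0.008):
    """Cumulative signal fraction alpha_bar(t) for t = 0..T under a cosine schedule."""
    def f(t):
        return math.cos(((t / T + s) / (1 + s)) * math.pi / 2) ** 2

    return [f(t) / f(0) for t in range(T + 1)]


alpha_bar = cosine_alpha_bar(T=400)
# Matching noise scale under a variance-preserving convention (assumption).
sigma = [math.sqrt(max(0.0, 1.0 - a)) for a in alpha_bar]
```

In this form alpha_bar starts at 1 (no noise) and decays towards 0 at t=T, so larger t corresponds to a noisier sample.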
- MolecularDiffusion.data.component.pharmacophore.pharmacophore_collate_fn(batch: List[torch_geometric.data.HeteroData]) → torch_geometric.data.HeteroData¶
Collate HeteroData samples into a batch. Uses torch_geometric’s built-in Batch.from_data_list for HeteroData.
- MolecularDiffusion.data.component.pharmacophore.sample_timestep_biased(ts)¶
Biased timestep sampling matching ShEPhERD:
- 7.5% from high-noise end (t=1..50)
- 75% from middle (t=50..250)
- 17.5% from low-noise start (t=250..400)
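A minimal sketch of this kind of interval-biased sampling is shown below, using the probabilities and ranges listed above. The function name sample_timestep_biased_sketch is hypothetical, and because the listed ranges overlap at t=50 and t=250, the boundary handling (t < 50, 50 <= t <= 250, t > 250) is an assumption.

```python
import random


def sample_timestep_biased_sketch(ts, rng):
    """Pick an interval with fixed probabilities, then a timestep uniformly within it."""
    first = [t for t in ts if t < 50]
    middle = [t for t in ts if 50 <= t <= 250]
    last = [t for t in ts if t > 250]
    bucket = rng.choices([first, middle, last], weights=[0.075, 0.75, 0.175], k=1)[0]
    return rng.choice(bucket)


rng = random.Random(0)
ts = list(range(1, 401))
samples = [sample_timestep_biased_sketch(ts, rng) for _ in range(20000)]
mid_fraction = sum(50 <= t <= 250 for t in samples) / len(samples)
```

Over many draws, mid_fraction should land close to the 75% weight assigned to the middle interval.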
- MolecularDiffusion.data.component.pharmacophore.Chem = None¶
- MolecularDiffusion.data.component.pharmacophore.logger¶
- MolecularDiffusion.data.component.pharmacophore.o3d = None¶