MolecularDiffusion.data.component.pharmacophore

Pharmacophore Dataset for ShEPhERD integration.

Faithful port of shepherd/src/shepherd/datasets.py HeteroDataset. Handles loading, processing, and forward-noising of 4 molecular modalities:

- x1: Molecular structure (atoms + bonds)
- x2: Shape (surface point cloud)
- x3: Electrostatics (surface point cloud + ESP)
- x4: Pharmacophores (points + directions)

Forward-noising is applied per-sample in __getitem__ using the noise schedule from the ShEPhERD checkpoint hyperparameters.
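As a rough illustration of what per-sample forward-noising involves, the sketch below applies the generic variance-preserving step `x_t = a * x_0 + b * eps` to a list of 3D points, optionally removing the center of mass of the noise. The helper name `forward_noise_positions` and its parameters `signal_scale` and `noise_scale` are hypothetical. How they relate to the `alpha_dash_t` and `sigma_dash_t` arguments of this module is not stated on this page, so treat this as a sketch under those assumptions rather than the module's implementation.

```python
import random


def forward_noise_positions(pos, signal_scale, noise_scale,
                            remove_noise_com=True, rng=None):
    """Hypothetical helper: noise a list of [x, y, z] points at one timestep."""
    rng = rng or random.Random()
    # Draw isotropic Gaussian noise, one 3-vector per point.
    noise = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in pos]
    if remove_noise_com and noise:
        # Subtract the mean noise vector so the noise has zero center of mass.
        com = [sum(n[k] for n in noise) / len(noise) for k in range(3)]
        noise = [[n[k] - com[k] for k in range(3)] for n in noise]
    # Generic forward step: x_t = signal_scale * x_0 + noise_scale * eps.
    noised = [
        [signal_scale * p[k] + noise_scale * n[k] for k in range(3)]
        for p, n in zip(pos, noise)
    ]
    return noised, noise
```

Returning the noise alongside the noised points mirrors the common training setup in which the model is asked to predict the noise that was added.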

Attributes

`Chem`
`logger`
`o3d`

Classes

`PharmacophoreDataModule`	DataModule wrapper for PharmacophoreDataset.
`PharmacophoreDataset`	Dataset for ShEPhERD pharmacophore diffusion.

Functions

`build_default_noise_schedule`([T])	Build the default ShEPhERD cosine+linear noise schedule.
`pharmacophore_collate_fn`(→ torch_geometric.data.HeteroData)	Collate HeteroData samples into a batch.
`sample_timestep_biased`(ts)	Biased timestep sampling matching ShEPhERD.

Module Contents

class MolecularDiffusion.data.component.pharmacophore.PharmacophoreDataModule(pkl_files: List[str], root: str = 'data/pharmacophore_cache', dataset_name: str = 'pharmacophore', num_samples: int | None = None, data_fraction: float = 1.0, batch_size: int = 32, num_workers: int = 0, train_ratio: float = 0.8, task_type: str = 'diffusion_pharmacophore', checkpoint_path: str | None = None, **kwargs)

DataModule wrapper for PharmacophoreDataset.

Loads molblocks and charges from pkl files, builds the noise schedule (from a checkpoint or the default), and creates train/val/test splits.

load()

Load pkl data and create the dataset with the proper noise schedule.

batch_size = 32

checkpoint_path = None

collate_fn

data_fraction = 1.0

dataset_name = 'pharmacophore'

kwargs

num_samples = None

num_workers = 0

pkl_files

root = 'data/pharmacophore_cache'

task_type = 'diffusion_pharmacophore'

test_set = None

train_ratio = 0.8

train_set = None

valid_set = None

class MolecularDiffusion.data.component.pharmacophore.PharmacophoreDataset(molblocks_and_charges: List, noise_schedule_dict: Dict[str, Dict], compute_x1: bool = True, compute_x2: bool = False, compute_x3: bool = True, compute_x4: bool = True, recenter_x1: bool = True, add_virtual_node_x1: bool = True, remove_noise_COM_x1: bool = True, atom_types_x1: List = None, charge_types_x1: List = None, bond_types_x1: List = None, scale_atom_features_x1: float = 0.25, scale_bond_features_x1: float = 1.0, formal_charge_diffusion: bool = True, independent_timesteps_x2: bool = False, recenter_x2: bool = False, add_virtual_node_x2: bool = True, remove_noise_COM_x2: bool = False, num_points_x2: int = 75, independent_timesteps_x3: bool = False, recenter_x3: bool = False, add_virtual_node_x3: bool = True, remove_noise_COM_x3: bool = False, num_points_x3: int = 75, scale_node_features_x3: float = 2.0, independent_timesteps_x4: bool = False, recenter_x4: bool = False, add_virtual_node_x4: bool = True, remove_noise_COM_x4: bool = False, max_node_types_x4: int = 10, scale_node_features_x4: float = 2.0, scale_vector_features_x4: float = 2.0, multivectors: bool = False, check_accessibility: bool = False, probe_radius: float = 0.6)

Bases: torch.utils.data.Dataset

Dataset for ShEPhERD pharmacophore diffusion.

Each __getitem__ call returns a torch_geometric.data.HeteroData with forward-noised fields for x1, x2, x3, and x4, exactly matching shepherd/src/shepherd/datasets.py HeteroDataset.

get_x1_data(mol, t, alpha_dash_t, sigma_dash_t)

get_x2_data(radii, atom_centers, num_points, recenter, add_virtual_node, remove_noise_COM, t, alpha_dash_t, sigma_dash_t, virtual_node_pos=None)

get_x3_data_electrostatics_only(charges, charge_centers, data, pos, virtual_node_mask, t, alpha_dash_t, sigma_dash_t)

get_x4_data(mol, recenter, add_virtual_node, remove_noise_COM, t, alpha_dash_t, sigma_dash_t, virtual_node_pos=None)

add_virtual_node_x1 = True

add_virtual_node_x2 = True

add_virtual_node_x3 = True

add_virtual_node_x4 = True

atom_types_x1 = [None, 'H', 'C', 'N', 'O', 'F', 'Cl', 'Br', 'I', 'S', 'P', 'Si']

bond_types_x1 = [None, 'SINGLE', 'DOUBLE', 'TRIPLE', 'AROMATIC']

charge_types_x1

check_accessibility = False

compute_x1 = True

compute_x2 = False

compute_x3 = True

compute_x4 = True

formal_charge_diffusion = True

independent_timesteps_x2 = False

independent_timesteps_x3 = False

independent_timesteps_x4 = False

max_node_types_x4 = 10

molblocks_and_charges

multivectors = False

noise_schedule_dict

num_points_x2 = 75

num_points_x3 = 75

probe_radius = 0.6

recenter_x1 = True

recenter_x2 = False

recenter_x3 = False

recenter_x4 = False

remove_noise_COM_x1 = True

remove_noise_COM_x2 = False

remove_noise_COM_x3 = False

remove_noise_COM_x4 = False

scale_atom_features_x1 = 0.25

scale_bond_features_x1 = 1.0

scale_node_features_x3 = 2.0

scale_node_features_x4 = 2.0

scale_vector_features_x4 = 2.0

MolecularDiffusion.data.component.pharmacophore.build_default_noise_schedule(T=400)

Build the default ShEPhERD cosine+linear noise schedule. Matches shepherd/training/parameters/params_x1x3x4_diffusion_mosesaq_20240824.py.
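For orientation, the sketch below computes a plain cosine schedule of the kind commonly used in diffusion models, over the same default horizon `T=400`. It is not the ShEPhERD cosine+linear schedule, whose exact blending is not described on this page. The function name `cosine_alpha_bar`, the offset `s=0.008`, and the derived `signal_scale` and `noise_scale` lists are assumptions made for illustration only.

```python
import math


def cosine_alpha_bar(T=400, s=0.008):
    """Standard cosine schedule: cumulative signal retention for t = 0..T."""
    def f(t):
        # Squared cosine ramp from (almost) 1 at t=0 down to ~0 at t=T.
        return math.cos(((t / T + s) / (1.0 + s)) * math.pi / 2.0) ** 2

    f0 = f(0)
    return [f(t) / f0 for t in range(T + 1)]


alpha_bar = cosine_alpha_bar(T=400)
# Variance-preserving scales: signal_scale**2 + noise_scale**2 == 1 at every t.
signal_scale = [math.sqrt(a) for a in alpha_bar]
noise_scale = [math.sqrt(1.0 - a) for a in alpha_bar]
```

In this sketch almost all signal is kept at small `t` and almost none at `t = T`; a schedule loaded from a checkpoint may follow a different curve.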

MolecularDiffusion.data.component.pharmacophore.pharmacophore_collate_fn(batch: List[torch_geometric.data.HeteroData]) → torch_geometric.data.HeteroData

Collate HeteroData samples into a batch. Uses torch_geometric's built-in Batch.from_data_list for HeteroData.

MolecularDiffusion.data.component.pharmacophore.sample_timestep_biased(ts)

Biased timestep sampling matching ShEPhERD:

- 7.5% from high-noise end (t=1..50)
- 75% from middle (t=50..250)
- 17.5% from low-noise start (t=250..400)
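The bucket probabilities and ranges listed above can be sketched as a two-stage draw: pick a bucket by weight, then pick a timestep uniformly inside it. The name `sample_timestep_biased_sketch` is hypothetical, the sketch hardcodes the three ranges instead of taking the `ts` argument, and treating both range bounds as inclusive is an assumption. The real function may handle the shared boundaries at t=50 and t=250 differently.

```python
import random

# (probability, low, high) per bucket, copied from the description above.
BUCKETS = [(0.075, 1, 50), (0.75, 50, 250), (0.175, 250, 400)]


def sample_timestep_biased_sketch(rng=None):
    """Hypothetical sketch: weighted bucket choice, then a uniform draw."""
    rng = rng or random.Random()
    weights = [w for w, _, _ in BUCKETS]
    # Stage 1: choose one bucket according to its probability mass.
    _, low, high = rng.choices(BUCKETS, weights=weights, k=1)[0]
    # Stage 2: uniform integer within the bucket (inclusive bounds assumed).
    return rng.randint(low, high)
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, which helps when comparing the empirical distribution against the stated percentages.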

MolecularDiffusion.data.component.pharmacophore.Chem = None

MolecularDiffusion.data.component.pharmacophore.logger

MolecularDiffusion.data.component.pharmacophore.o3d = None