Tutorial: Scoring Function Design

This tutorial covers how to design and configure scoring functions for RL runs, and how to write custom scoring components. For the mechanics of running RL, see the RL tutorial.

Formulating a Scoring Function

A scoring function is a weighted combination of components. Each component computes one property per SMILES, maps the raw value to [0, 1] via a transform, and contributes to an aggregated total score.

Step 1: Identify your objectives

List the properties your molecules must satisfy. Separate them into:

Hard constraints (must-pass): structural alerts, reactive groups, forbidden substructures → use custom_alerts or MatchingSubstructure
Soft objectives (optimise toward): QED, LogP, MW, docking score → weighted components with transforms

Start with as few components as possible. Each additional component dilutes the signal from the others — especially with geometric_mean aggregation.
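
A rough way to see the dilution is to ask how much the total improves when a single component is fully optimised. The sketch below assumes an unweighted geometric mean and a made-up scenario in which every component starts at 0.5; it is an illustration, not REINVENT4's aggregation code.

```python
import math


def geometric_mean(scores):
    """Unweighted geometric mean of component scores in [0, 1]."""
    return math.prod(scores) ** (1.0 / len(scores))


def gain_from_fixing_one(n_components):
    """Factor by which the total rises when one component goes from 0.5 to 1.0."""
    before = geometric_mean([0.5] * n_components)
    after = geometric_mean([1.0] + [0.5] * (n_components - 1))
    return after / before


# The more components there are, the less a single improvement moves the total.
gains = {n: gain_from_fixing_one(n) for n in (2, 4, 8)}
for n, gain in gains.items():
    print(f"{n} components: total rises by a factor of {gain:.3f}")
```

Under these assumptions the gain is 2^(1/n), so it shrinks towards 1 as components are added.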

Step 2: Choose transforms

Transforms map raw values to [0, 1]. Choose based on the shape of your objective:

Transform	Use when	Key params
`sigmoid`	Higher is better (e.g. similarity, QED)	`low`, `high`, `k`
`reverse_sigmoid`	Lower is better (e.g. LogP, rotatable bonds)	`low`, `high`, `k`
`double_sigmoid`	Value should stay within a range (e.g. MW 200–500)	`low`, `high`, `coef_div`, `coef_si`, `coef_se`
`step`	Hard window — 1 inside, 0 outside (e.g. stereocenters ≤ 3)	`low`, `high`

For sigmoid and reverse_sigmoid, k controls steepness: higher k = sharper transition. For double_sigmoid, coef_si and coef_se control the steepness of the left and right edges; coef_div is the normalisation divisor (typically set to high).
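
The shapes of sigmoid, reverse_sigmoid and step can be sketched in plain Python. This is not REINVENT4's implementation: the logistic form, the way k is scaled by the width of the [low, high] window, and the inclusive bounds of step are all assumptions made for illustration. double_sigmoid is left out.

```python
import math


def sigmoid(x, low, high, k):
    """Rises from near 0 below `low` to near 1 above `high`; 0.5 at the midpoint."""
    midpoint = (low + high) / 2.0
    slope = 10.0 * k / (high - low)  # assumed scaling, for illustration only
    # tanh form of the logistic function; numerically safe for any x
    return 0.5 * (1.0 + math.tanh(0.5 * slope * (x - midpoint)))


def reverse_sigmoid(x, low, high, k):
    """Mirror image of `sigmoid`: lower raw values score higher."""
    return 1.0 - sigmoid(x, low, high, k)


def step(x, low, high):
    """1 inside the window, 0 outside (bounds assumed inclusive)."""
    return 1.0 if low <= x <= high else 0.0


for raw in (0, 1, 2.5, 4, 5):
    print(raw,
          round(sigmoid(raw, 0, 5, 0.5), 3),
          round(reverse_sigmoid(raw, 0, 5, 0.5), 3),
          step(raw, 0, 3))
```

Printing a handful of raw values like this is a quick check that low, high and k put the transition where you expect before the transform goes into a configuration file.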

Step 3: Set weights

Weights are relative — they scale each component’s contribution before aggregation. A component with weight = 2.0 counts twice as much as one with weight = 1.0. With geometric_mean aggregation, a component scoring 0 pulls the total to 0 regardless of weight.
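
The effect of weights can be sketched with the textbook weighted means. The formulas below are assumed definitions, not taken from the REINVENT4 source, and the function names are hypothetical.

```python
import math


def weighted_arithmetic_mean(scores, weights):
    """sum(w_i * s_i) / sum(w_i)"""
    return sum(w * s for s, w in zip(scores, weights)) / sum(weights)


def weighted_geometric_mean(scores, weights):
    """prod(s_i ** w_i) ** (1 / sum(w_i))"""
    product = math.prod(s ** w for s, w in zip(scores, weights))
    return product ** (1.0 / sum(weights))


scores = [0.9, 0.4]

# Doubling the weight of the second component is equivalent to counting it twice.
doubled = weighted_geometric_mean(scores, [1.0, 2.0])
repeated = weighted_geometric_mean([0.9, 0.4, 0.4], [1.0, 1.0, 1.0])
print(doubled, repeated)

# A zero score dominates the geometric mean, however small its weight.
print(weighted_geometric_mean([0.9, 0.0], [5.0, 0.1]))
print(weighted_arithmetic_mean([0.9, 0.0], [5.0, 0.1]))
```

With the geometric form a single zero removes all reward for that molecule, which is why must-pass criteria are usually better expressed as filters than as heavily weighted components.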

Recommended approach:

Set all weights to 1.0 initially.
Run a scoring-only job on a test set (see the Scoring tutorial) to inspect the distribution of each component.
Adjust weights iteratively based on which objectives are being under- or over-optimised.

Step 4: Validate before RL

Always run a scoring job on a representative SMILES set before starting RL. This confirms transforms are correctly shaped and weights are sensible — without spending GPU time.
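
One way to inspect the result of such a scoring job is to summarise the transformed scores per component and flag any component that is saturated near 0 or 1. The sketch below uses made-up numbers; loading the real scoring output is left out because its column layout is not described here, and the `summarise` helper and the 0.8 cut-off are hypothetical.

```python
import math
from statistics import median


def summarise(scores, eps=0.05):
    """Summarise transformed scores (expected in [0, 1]) for one component."""
    valid = [s for s in scores if not math.isnan(s)]
    summary = {"n_valid": len(valid), "n_failed": len(scores) - len(valid)}
    if valid:
        summary["median"] = median(valid)
        summary["frac_near_0"] = sum(s <= eps for s in valid) / len(valid)
        summary["frac_near_1"] = sum(s >= 1.0 - eps for s in valid) / len(valid)
    return summary


# Made-up transformed scores, for illustration only.
transformed = {
    "QED": [0.35, 0.52, 0.61, 0.48, float("nan")],
    "MW": [0.00, 0.01, 0.02, 0.00, 0.03],  # nearly everything scores 0
}

report = {name: summarise(values) for name, values in transformed.items()}
for name, stats in report.items():
    saturated = max(stats.get("frac_near_0", 0.0), stats.get("frac_near_1", 0.0))
    flag = "  <- check transform parameters" if saturated > 0.8 else ""
    print(name, stats, flag)
```

A component whose scores pile up at one end gives the agent almost no gradient to follow, so it is worth widening or shifting its transform before starting RL.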

Built-in Scoring Components

All built-in components live in reinvent_plugins/components/. The TOML name is the class name (case-insensitive).

Physico-chemical (RDKit)

TOML name	Property
`QED`	Drug-likeness score (0–1)
`SlogP`	Crippen LogP
`MolecularWeight`	Molecular weight (Da)
`TPSA`	Topological polar surface area
`HBondAcceptors`	H-bond acceptors (Lipinski)
`HBondDonors`	H-bond donors (Lipinski)
`NumRotBond`	Rotatable bonds
`NumRings`	Total rings
`NumAromaticRings`	Aromatic rings
`NumAliphaticRings`	Aliphatic rings
`NumHeavyAtoms`	Heavy atom count
`Csp3`	Fraction of sp3 carbons
`SAScore`	Synthetic accessibility (1–10, lower = easier)
`PMI`	Principal moment of inertia (`npr1` or `npr2` via `params.property`)

Similarity and substructure

TOML name	Description
`TanimotoDistance`	Tanimoto similarity to a reference SMILES (`params.smiles`, `params.radius`)
`GroupCount`	Count of a SMARTS pattern (`params.smarts`); filter applied before other components
`MatchingSubstructure`	Penalty multiplier if SMARTS is present (`params.smarts`)
`custom_alerts`	Global filter — zeros total score if any SMARTS matches (`params.smarts` list)
`MMP`	Matched molecular pair similarity to a reference

External

TOML name	Description
`DockStream`	Docking via DockStream
`Maize`	Generic workflow runner (docking, solubility, etc.)
`ChemProp`	D-MPNN QSAR models
`ExternalProcess`	Run any external executable; communicates via JSON on stdin/stdout
`REST`	Generic REST API interface

Writing a Custom Scoring Component

How components are discovered

REINVENT4 scans the reinvent_plugins.components namespace for all files whose name starts with comp_. No manual registration is needed — placing the file in the right location is sufficient.

File location: reinvent_plugins/components/comp_<yourname>.py

Subdirectories are also scanned, so reinvent_plugins/components/MyTool/comp_mytool.py works too.
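
The general idea of prefix-based discovery can be sketched with the standard library. This is not REINVENT4's loader: the `discover_components` helper and the `demo_plugins` package are hypothetical, and the demo uses regular packages with `__init__.py` files so that `pkgutil.walk_packages` can recurse into them.

```python
import importlib
import pkgutil
import sys
import tempfile
from pathlib import Path


def discover_components(package_name, prefix="comp_"):
    """Return dotted names of modules under `package_name` whose file name starts with `prefix`."""
    package = importlib.import_module(package_name)
    found = []
    for info in pkgutil.walk_packages(package.__path__, package.__name__ + "."):
        module_file = info.name.rsplit(".", 1)[-1]
        if module_file.startswith(prefix):
            found.append(info.name)
    return sorted(found)


# Build a throwaway package that mimics the layout described above.
root = Path(tempfile.mkdtemp())
components = root / "demo_plugins" / "components"
(components / "MyTool").mkdir(parents=True)
for directory in (root / "demo_plugins", components, components / "MyTool"):
    (directory / "__init__.py").write_text("")
(components / "comp_nitrogen.py").write_text("")
(components / "helpers.py").write_text("")  # ignored: no comp_ prefix
(components / "MyTool" / "comp_mytool.py").write_text("")

sys.path.insert(0, str(root))
importlib.invalidate_caches()
modules = discover_components("demo_plugins.components")
print(modules)
```

The practical consequence is the same as in the text: a file that does not start with comp_ is never considered, however it is otherwise written.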

Interface

A component consists of two classes in the same file:

A parameters dataclass tagged with @add_tag("__parameters") — holds all user-configurable inputs. All fields must be List (even if only one endpoint is used), because the framework supports multiple endpoints per component.
A component class tagged with @add_tag("__component") — implements __init__ and __call__.

By default, the __call__ method receives a list of SMILES strings; if it is decorated with @molcache, it receives a list of Chem.Mol objects instead. In both cases it must return a ComponentResults object.

Component tags:

@add_tag("__component") — standard scoring component
@add_tag("__component", "filter") — global filter; zeros total score if this component scores 0
@add_tag("__component", "penalty") — penalty; multiplied against total score
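
Read literally, the three tags combine as follows: standard components are aggregated, any filter that scores 0 zeros the result, and penalty scores are multiplied in. The sketch below follows that description only; the `combine` function is hypothetical and the order of operations inside REINVENT4 may differ.

```python
def combine(aggregated, filter_scores=(), penalty_scores=()):
    """Apply filter and penalty scores to an already aggregated total."""
    # A single failing filter zeros the total score.
    if any(f == 0 for f in filter_scores):
        return 0.0
    total = aggregated
    # Each penalty is multiplied against the total.
    for p in penalty_scores:
        total *= p
    return total


print(combine(0.8))                                     # no filter, no penalty
print(combine(0.8, filter_scores=[1.0], penalty_scores=[0.5]))
print(combine(0.8, filter_scores=[0.0], penalty_scores=[1.0]))
```

Because filters and penalties act on the total rather than entering the weighted aggregation, their effect does not depend on the weights of the other components.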

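Failures: use np.nan for molecules that could not be scored. Do not use 0, because 0 is a valid score and would make a failed molecule indistinguishable from a genuinely poor one.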
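
A toy example shows why the distinction matters. Here lists of atomic numbers stand in for parsed molecules and `None` stands in for a molecule that failed to parse; both are simplifications so that the sketch runs without RDKit, and the function name is hypothetical.

```python
import math


def nitrogen_count_or_nan(atomic_numbers):
    """Count nitrogens; return NaN when the molecule could not be parsed."""
    if atomic_numbers is None:
        return math.nan
    return float(sum(1 for z in atomic_numbers if z == 7))


batch = [
    [6, 6, 8],  # no nitrogen: a legitimate raw value of 0
    [6, 7, 7],  # two nitrogens
    None,       # failed parse
]
scores = [nitrogen_count_or_nan(entry) for entry in batch]

failed = [i for i, s in enumerate(scores) if math.isnan(s)]
true_zeros = [i for i, s in enumerate(scores) if s == 0.0]
print(scores, failed, true_zeros)
```

Had the failure been reported as 0, the first and third entries would look identical to everything downstream.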

Minimal example

"""Scores molecules by the number of nitrogen atoms (example)."""

__all__ = ["NitrogenCount"]

from typing import List

import numpy as np
from pydantic.dataclasses import dataclass
from rdkit import Chem

from .component_results import ComponentResults
from reinvent_plugins.mol_cache import molcache
from .add_tag import add_tag


@add_tag("__parameters")
@dataclass
class Parameters:
    # All fields must be List — one entry per endpoint
    # No params needed for this example, but the class must exist
    pass


@add_tag("__component")
class NitrogenCount:
    def __init__(self, params: Parameters):
        pass  # no parameters to read

    @molcache  # converts SMILES list to Chem.Mol list before calling __call__
    def __call__(self, mols: List[Chem.Mol]) -> ComponentResults:
        scores = []
        for mol in mols:
            if mol is None:
                scores.append(np.nan)
            else:
                n = sum(1 for a in mol.GetAtoms() if a.GetAtomicNum() == 7)
                scores.append(float(n))

        return ComponentResults([np.array(scores, dtype=float)])

Using the component in TOML

Once the file is in place, use the class name as the component key:

[[scoring.component]]
[scoring.component.NitrogenCount]
[[scoring.component.NitrogenCount.endpoint]]
name = "N count"
weight = 1.0
transform.type = "reverse_sigmoid"
transform.low = 0
transform.high = 5
transform.k = 0.5

Example with parameters

If your component needs user-supplied values (e.g. a reference SMILES or a file path):

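# Imports as in the minimal example above

@add_tag("__parameters")
@dataclass
class Parameters:
    threshold: List[float]   # one per endpoint


@add_tag("__component")
class MyComponent:
    def __init__(self, params: Parameters):
        self.threshold = params.threshold[0]  # index 0 = first endpoint

    def _score(self, mol: Chem.Mol) -> float:
        # Placeholder raw property (hypothetical): replace with your own calculation
        return float(mol.GetNumHeavyAtoms())

    @molcache
    def __call__(self, mols: List[Chem.Mol]) -> ComponentResults:
        scores = []
        for mol in mols:
            if mol is None:
                scores.append(np.nan)  # failed molecule: NaN, not 0
            else:
                scores.append(1.0 if self._score(mol) > self.threshold else 0.0)
        return ComponentResults([np.array(scores, dtype=float)])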

In TOML:

[[scoring.component]]
[scoring.component.MyComponent]
[[scoring.component.MyComponent.endpoint]]
name = "My score"
weight = 1.0
params.threshold = [0.5]   # must be a list
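
The reason every parameter field is a list becomes clearer with more than one endpoint. The sketch below uses stand-in classes so that it runs without REINVENT4 installed: `ComponentResults` here is a simplified placeholder, not the library's class, and `ThresholdComponent` is hypothetical. It assumes the list handed to ComponentResults holds one score array per endpoint, which is how the single-element list in the examples above reads.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Parameters:
    threshold: List[float]  # one entry per endpoint


@dataclass
class ComponentResults:  # simplified stand-in, not the library's definition
    scores: List[List[float]]  # one score list per endpoint


class ThresholdComponent:
    def __init__(self, params: Parameters):
        self.thresholds = params.threshold  # keep the value of every endpoint

    def __call__(self, raw_values: List[float]) -> ComponentResults:
        # Score the same raw values once per endpoint, each with its own threshold.
        per_endpoint = [
            [1.0 if value > threshold else 0.0 for value in raw_values]
            for threshold in self.thresholds
        ]
        return ComponentResults(per_endpoint)


component = ThresholdComponent(Parameters(threshold=[0.3, 0.7]))
results = component([0.2, 0.5, 0.9])
print(results.scores)
```

Reading only `params.threshold[0]`, as the single-endpoint example does, is fine as long as the component is never configured with a second endpoint.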