Full pipeline (ProQSAR)

The proqsar.qsar.ProQSAR class provides a single-call, opinionated end-to-end QSAR workflow that chains the core modules (standardization, featurization, preprocessing, splitting, feature selection, model development, hyperparameter optimisation and evaluation) into a reproducible experiment.

This example demonstrates a reproducible run: training + optimisation followed by inference with prediction intervals and applicability-domain flags.

Training & optimisation (reproducible)

import pandas as pd
from proqsar.qsar import ProQSAR
from proqsar.Config.config import Config

# small reproducible demo dataset (public)
url = "https://raw.githubusercontent.com/Medicine-Artificial-Intelligence/ProQSAR/main/Data/testcase.csv"
data = pd.read_csv(url).iloc[:50, :]   # use small sample for quick demo
data['id'] = data.index

# centralised config: change splitter / optimizer settings here
cfgs = Config(
    splitter={'option': 'scaffold'},
    optimizer={'n_trials': 50}
)

# create pipeline — set random_state for reproducibility
pipeline = ProQSAR(
    activity_col='pChEMBL',
    id_col='id',
    smiles_col='Smiles',
    n_jobs=4,
    project_name='Demo',
    scoring_target='r2',
    n_splits=5,
    n_repeats=5,
    config=cfgs,
    random_state=42
)

# run the full training + optimisation pipeline (alpha used for internal statistical tests)
result = pipeline.run_all(pd.DataFrame(data), alpha=0.05)

Notes

  • For full reproducibility, ensure: a fixed random_state; an identical software environment (same Python/RDKit/NumPy versions); and a fixed Optuna seed when running long studies (set it in Config/optimizer if supported).

  • Use a small n_trials and few CV folds for quick debugging; increase both for production.

Pipeline summary (object repr)

After successful training the pipeline prints a concise summary. Example:

┌────────────────────────────────────────────────────────────────────┐
│ ProQSAR Pipeline                                                   │
├────────────────────────────────────────────────────────────────────┤
│ Project: Demo                                                      │
│ Save Dir: Project/Demo                                             │
│ Selected feature: 'RDK5'                                           │
│ Fitted: True                                                       │
│ Models registered: 1                                               │
│ Selected model: 'XGBRegressor'                                     │
│ CV (XGBRegressor): 0.683 ± 0.298                                   │
│ n_jobs: 4                                                          │
│ scoring_target: 'r2'                                               │
│ Optimizer: enabled    ConfPred: enabled    AD: enabled             │
└────────────────────────────────────────────────────────────────────┘

Key fields

  • Selected feature: name of the selected feature set (e.g. RDK5).

  • Selected model: model family chosen after benchmarking.

  • CV (...): cross-validated score ± standard deviation, measured with scoring_target.

  • Optimizer, ConfPred, AD: flags for hyperparameter search, conformal prediction and applicability-domain checks.
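The CV line can be read as the mean and spread of the per-fold scores. The following is a minimal sketch, assuming the summary aggregates the n_splits × n_repeats fold scores as mean ± sample standard deviation; this section does not say which standard deviation ProQSAR reports, and the fold scores below are made-up illustration values, not ProQSAR output.

```python
import statistics

def summarize_cv(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical R² scores, one per fold (illustration only).
fold_scores = [0.9, 0.7, 0.5, 0.8, 0.6]
mean, std = summarize_cv(fold_scores)
print(f"CV: {mean:.3f} ± {std:.3f}")
```

A standard deviation that is large relative to the mean, as in the 0.683 ± 0.298 example on a 50-compound sample, suggests that the score depends strongly on which compounds land in each fold.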

Inference (predict using Inference)

Use the proqsar.infer.Inference wrapper to run predictions with the same output format returned by pipeline.predict() (point predictions, conformal prediction intervals and applicability-domain flags). The wrapper additionally stores metadata about the last run and prints a compact summary when the Inference object is printed.

import pandas as pd
from proqsar.infer import Inference

# load test / new set
url = "https://raw.githubusercontent.com/Medicine-Artificial-Intelligence/ProQSAR/main/Data/testcase.csv"
test = pd.read_csv(url)
# optional: create id column if you prefer an explicit id in output
test["id"] = test.index

# wrap the trained pipeline (inplace controls whether input DF may be modified)
infer = Inference(pipeline, inplace=True)

# run inference (id_key=None => index will be used; ground_truth optional)
preds_df = infer.run(
    test,
    smiles_key="Smiles",
    id_key=None,             # None means infer will pass-through the index as `id`
    ground_truth="pChEMBL",  # provide if you want observed activity in output
    alpha=0.05               # 95% prediction intervals
)

# preds_df is the same-format table produced by pipeline.predict(...)
print(preds_df.head())

Example output (first rows)

id    pChEMBL    Predicted value   Prediction Interval (alpha=0.05)   Applicability domain
0     7.698970   6.720965          [4.429, 8.584]                     in
1     6.576754   7.760520          [5.270, 9.386]                     in
2     5.970000   6.018114          [4.426, 8.542]                     in
3     5.602060   5.681627          [3.624, 7.740]                     in
4     5.397940   5.718508          [3.785, 7.902]                     in
...
49    5.761954   5.681627          [3.624, 7.740]                     in

Column explanations

  • id — input sample identifier (passed-through). When id_key=None, the DataFrame index is used and passed through as id.

  • pChEMBL — observed activity if present (passed-through / used for evaluation).

  • Predicted value — model point prediction (mean/median depending on estimator/wrapping).

  • Prediction Interval (alpha=0.05) — conformal prediction interval for the chosen alpha.

  • Applicability domain — in/out flag indicating whether the sample lies within the model’s AD.
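One way to sanity-check the interval column is to measure how often the observed activity falls inside its interval. The sketch below uses only the rows visible in the example output above; the helper names empirical_coverage and mean_width are hypothetical and not part of ProQSAR.

```python
# Rows copied from the example output above: (observed, lower, upper).
rows = [
    (7.698970, 4.429, 8.584),
    (6.576754, 5.270, 9.386),
    (5.970000, 4.426, 8.542),
    (5.602060, 3.624, 7.740),
    (5.397940, 3.785, 7.902),
    (5.761954, 3.624, 7.740),
]

def empirical_coverage(rows):
    """Fraction of observed values that fall inside their interval."""
    hits = sum(lower <= observed <= upper for observed, lower, upper in rows)
    return hits / len(rows)

def mean_width(rows):
    """Average interval width (upper - lower)."""
    return sum(upper - lower for _, lower, upper in rows) / len(rows)

coverage = empirical_coverage(rows)
width = mean_width(rows)
print(f"coverage={coverage:.2f}, mean width={width:.3f}")
```

Because this demo runs inference on the same file that supplied the training rows, coverage measured this way is likely optimistic; a held-out set gives a more honest check against the 1 - alpha target.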

Printing the Inference object (compact summary)

After running Inference.run(), printing the Inference object displays a compact summary for the last run (row count, AD split, prediction statistics and quantiles). Example:

┌────────────────────────────────────────────────────────────────────────┐
│ Inference (ProQSAR)                                                    │
├────────────────────────────────────────────────────────────────────────┤
│ Project: Demo                                                          │
│ Save Dir: Project/Demo                                                 │
│ Selected feature: 'RDK5'                                               │
│ Last run (rows): 50                                                    │
│ Applicability domain column: Applicability domain                      │
│ AD: in=50 (100.00%)  out=0                                             │
│ Predictions — mean: 5.978  std: 1.208  nan%: 0.00%                     │
│ Quantiles (10/50/90): 4.619 / 5.730 / 7.778                            │
└────────────────────────────────────────────────────────────────────────┘
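The fields in this summary are ordinary descriptive statistics, so they can be recomputed from any predictions table for cross-checking. The following is a rough sketch with the standard library only; how ProQSAR computes its std and quantiles internally is not stated here, so the sample standard deviation and linear-interpolation quantiles are assumptions, and summarize_predictions is a hypothetical helper.

```python
import math
import statistics

def summarize_predictions(preds, ad_flags):
    """Summarise predictions: AD split, mean/std, NaN share, 10/50/90 quantiles."""
    valid = [p for p in preds if not math.isnan(p)]
    n_in = sum(flag == "in" for flag in ad_flags)
    # quantiles(n=10) returns the nine cut points at 10%, 20%, ..., 90%.
    deciles = statistics.quantiles(valid, n=10, method="inclusive")
    return {
        "rows": len(preds),
        "ad_in": n_in,
        "ad_out": len(ad_flags) - n_in,
        "mean": statistics.mean(valid),
        "std": statistics.stdev(valid),
        "nan_pct": 100.0 * (len(preds) - len(valid)) / len(preds),
        "q10": deciles[0],
        "q50": deciles[4],
        "q90": deciles[8],
    }

# Hypothetical values for illustration (not ProQSAR output).
preds = [5.0, 6.0, 7.0, 8.0, float("nan")]
flags = ["in", "in", "in", "out", "in"]
summary = summarize_predictions(preds, flags)
print(summary)
```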

Notes

  • inplace: when True the input DataFrame may be modified in-place; set inplace=False to preserve the original.

  • alpha: miscoverage level for conformal prediction; intervals target 1 - alpha coverage (e.g. 0.05 → 95% prediction intervals).

  • If you encounter a KeyError for smiles_key or id_key, verify the input DataFrame column names and pass the correct keys to Inference.run().
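To make the role of alpha concrete, here is a generic split-conformal sketch using absolute residuals from a calibration set. This section does not describe which conformal method ProQSAR uses, so treat this as an illustration of the general idea rather than the library's implementation; the function name and residual values are hypothetical.

```python
import math

def conformal_halfwidth(calib_residuals, alpha):
    """Half-width q so that [pred - q, pred + q] targets 1 - alpha coverage."""
    scores = sorted(abs(r) for r in calib_residuals)
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]

# Hypothetical calibration residuals (observed - predicted), illustration only.
residuals = [0.1, -0.4, 0.3, -0.2, 0.8, -0.6, 0.5, 0.05, -0.9, 0.7,
             0.15, -0.35, 0.25, -0.45, 0.55, -0.65, 0.75, -0.85, 0.95, 1.2]

for alpha in (0.05, 0.2):
    q = conformal_halfwidth(residuals, alpha)
    print(f"alpha={alpha}: coverage target={1 - alpha:.0%}, half-width={q}")
```

A smaller alpha asks for higher coverage and therefore yields wider intervals; with very few calibration points the requested coverage may not be attainable at all, which the sketch signals with an infinite half-width.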

Reproducibility checklist

  • Fix random_state in the pipeline and any downstream components that accept a seed.

  • Pin environment versions (Python, RDKit, scikit-learn, XGBoost, Optuna).

  • Save artifacts from runs (pipeline.save_dir) — the folder includes model, CV results, Optuna study and plots.

  • When comparing runs, keep alpha / CV settings / optimizer budget identical.
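Recording the environment alongside the run artifacts makes the checklist above easier to audit later. The following is a small sketch using only the standard library; capture_environment is a hypothetical helper, and the distribution names in the list are assumptions that may differ from what your environment installs.

```python
import json
import platform
from importlib import metadata

def capture_environment(packages):
    """Return Python and package versions; missing packages map to None."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Distribution names are assumptions; adjust them to your environment.
env = capture_environment(
    ["rdkit", "scikit-learn", "xgboost", "optuna", "numpy", "pandas"]
)
print(json.dumps(env, indent=2))
```

Writing this JSON into the run folder (for example under pipeline.save_dir) keeps the version record next to the model and CV results it belongs to.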

Troubleshooting

  • If predictions are unexpectedly constant or many samples are marked out in AD, inspect:

      - preprocessing logs (duplicates / missing / low-variance steps),
      - feature generation (are features identical?),
      - applicability-domain thresholds / distance metric settings.

  • If Optuna optimisation produces noisy outcomes, increase n_trials and/or use a deterministic sampler/seed.
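The first checks above can be automated before digging into logs. The sketch below flags near-constant predictions, the share of samples outside the AD, and feature columns that never vary; diagnose is a hypothetical helper working on plain lists, not a ProQSAR function, and the toy inputs are made up.

```python
def diagnose(preds, ad_flags, features, tol=1e-6):
    """Flag near-constant predictions, out-of-AD share and constant feature columns."""
    spread = max(preds) - min(preds)
    out_share = sum(flag == "out" for flag in ad_flags) / len(ad_flags)
    # A column is constant when every row holds the same value.
    n_cols = len(features[0])
    constant_cols = [
        j for j in range(n_cols) if len({row[j] for row in features}) == 1
    ]
    return {
        "constant_predictions": spread <= tol,
        "out_of_ad_share": out_share,
        "constant_feature_columns": constant_cols,
    }

# Hypothetical toy inputs (illustration only).
preds = [5.68, 5.68, 5.68, 5.68]
flags = ["in", "out", "out", "in"]
features = [[1, 0, 3], [1, 1, 3], [1, 0, 3], [1, 1, 3]]
report = diagnose(preds, flags, features)
print(report)
```

Many constant feature columns alongside constant predictions point towards featurization or preprocessing rather than the model itself.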