Full pipeline (ProQSAR)
The proqsar.qsar.ProQSAR class provides a single-call, opinionated end-to-end QSAR workflow that
chains the core modules (standardization, featurization, preprocessing, splitting, feature selection,
model development, hyperparameter optimisation and evaluation) into a reproducible experiment.
This example demonstrates a reproducible run: training + optimisation followed by inference with prediction intervals and applicability-domain flags.
Training & optimisation (reproducible)
import pandas as pd
from proqsar.qsar import ProQSAR
from proqsar.Config.config import Config
# small reproducible demo dataset (public)
url = "https://raw.githubusercontent.com/Medicine-Artificial-Intelligence/ProQSAR/main/Data/testcase.csv"
data = pd.read_csv(url).iloc[:50, :] # use small sample for quick demo
data['id'] = data.index
# centralised config: change splitter / optimizer settings here
cfgs = Config(
splitter={'option': 'scaffold'},
optimizer={'n_trials': 50}
)
# create pipeline — set random_state for reproducibility
pipeline = ProQSAR(
activity_col='pChEMBL',
id_col='id',
smiles_col='Smiles',
n_jobs=4,
project_name='Demo',
scoring_target='r2',
n_splits=5,
n_repeats=5,
config=cfgs,
random_state=42
)
# run the full training + optimisation pipeline (alpha used for internal statistical tests)
result = pipeline.run_all(pd.DataFrame(data), alpha=0.05)
Notes:
- For full reproducibility ensure: fixed random_state; deterministic platform (same Python/RDKit/NumPy versions); and a fixed Optuna seed when running long studies (set in Config/optimizer if supported).
- Use a small n_trials and CV folds for quick debugging; increase for production.
Pipeline summary (object repr)
After successful training the pipeline prints a concise summary. Example:
┌────────────────────────────────────────────────────────────────────┐
│ ProQSAR Pipeline │
├────────────────────────────────────────────────────────────────────┤
│ Project: Demo │
│ Save Dir: Project/Demo │
│ Selected feature: 'RDK5' │
│ Fitted: True │
│ Models registered: 1 │
│ Selected model: 'XGBRegressor' │
│ CV (XGBRegressor): 0.683 ± 0.298 │
│ n_jobs: 4 │
│ scoring_target: 'r2' │
│ Optimizer: enabled ConfPred: enabled AD: enabled │
└────────────────────────────────────────────────────────────────────┘
Key fields:
- Selected feature — name of the selected feature set (e.g. RDK5).
- Selected model — model family chosen after benchmarking.
- CV (...) — cross-validated score ± std (scoring_target).
- Optimizer, ConfPred, AD — flags for hyperparameter search, conformal prediction, and applicability-domain checks.
Inference (predict using Inference)
Use the proqsar.infer.Inference wrapper to run predictions with the same
output format returned by pipeline.predict() (point predictions, conformal
prediction intervals and applicability-domain flags). The wrapper additionally
stores metadata about the last run and prints a compact summary when the
Inference object is printed.
import pandas as pd
from proqsar.infer import Inference
# load test / new set
url = "https://raw.githubusercontent.com/Medicine-Artificial-Intelligence/ProQSAR/main/Data/testcase.csv"
test = pd.read_csv(url)
# optional: create id column if you prefer an explicit id in output
test["id"] = test.index
# wrap the trained pipeline (inplace controls whether input DF may be modified)
infer = Inference(pipeline, inplace=True)
# run inference (id_key=None => index will be used; ground_truth optional)
preds_df = infer.run(
test,
smiles_key="Smiles",
id_key=None, # None means infer will pass-through the index as `id`
ground_truth="pChEMBL", # provide if you want observed activity in output
alpha=0.05 # 95% prediction intervals
)
# preds_df is the same-format table produced by pipeline.predict(...)
print(preds_df.head())
Example output (first rows)
id pChEMBL Predicted value Prediction Interval (alpha=0.05) Applicability domain
0 7.698970 6.720965 [4.429, 8.584] in
1 6.576754 7.760520 [5.270, 9.386] in
2 5.970000 6.018114 [4.426, 8.542] in
3 5.602060 5.681627 [3.624, 7.740] in
4 5.397940 5.718508 [3.785, 7.902] in
...
49 5.761954 5.681627 [3.624, 7.740] in
Column explanations
id— input sample identifier (passed-through). Whenid_key=None, the DataFrame index is used and passed through asid.pChEMBL— observed activity if present (passed-through / used for evaluation).Predicted value— model point prediction (mean/median depending on estimator/wrapping).Prediction Interval (alpha=0.05)— conformal prediction interval for the chosenalpha.Applicability domain— in/out flag indicating whether the sample lies within the model’s AD.
Printing the Inference object (compact summary)
After running Inference.run(), printing the Inference object
displays a compact summary for the last run (row count, AD split, prediction
statistics and quantiles). Example:
┌────────────────────────────────────────────────────────────────────────┐
│ Inference (ProQSAR) │
├────────────────────────────────────────────────────────────────────────┤
│ Project: Demo │
│ Save Dir: Project/Demo │
│ Selected feature: 'RDK5' │
│ Last run (rows): 50 │
│ Applicability domain column: Applicability domain │
│ AD: in=50 (100.00%) out=0 │
│ Predictions — mean: 5.978 std: 1.208 nan%: 0.00% │
│ Quantiles (10/50/90): 4.619 / 5.730 / 7.778 │
└────────────────────────────────────────────────────────────────────────┘
Notes
inplace: whenTruethe input DataFrame may be modified in-place; setinplace=Falseto preserve the original.alpha: conformal prediction level (e.g.0.05→ 95% PI).If you encounter a
KeyErrorforsmiles_keyorid_key, verify the input DataFrame column names and pass the correct keys toInference.run().
Reproducibility checklist
Fix
random_statein the pipeline and any downstream components that accept a seed.Pin environment versions (Python, RDKit, scikit-learn, xgboost/optuna)
Save artifacts from runs (
pipeline.save_dir) — the folder includes model, CV results, Optuna study and plots.When comparing runs, keep
alpha/ CV settings / optimizer budget identical.
Troubleshooting
If predictions are unexpectedly constant or many samples are marked
outin AD, inspect: - preprocessing logs (duplicates / missing / low-variance steps), - feature generation (are features identical?), - applicability-domain thresholds / distance metric settings.If Optuna optimisation produces noisy outcomes, increase
n_trialsand/or use a deterministic sampler/seed.